Message boards :
CMS Application :
Multi-core VM
Author | Message |
---|---|
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 850,198 RAC: 343 |
"When we moved to plan classes to address the memory issue, we did not specify max_threads so it defaulted to 1 CPU. This value has now been set to 32 and it is working for Theory. It should also work for CMS but this has not been tested."
Tested with a 3-core VM. I first detached from and reattached to the dev project (https URL) to start clean and, without using an app_config.xml, set up 1 job with 3 CPUs. A VM was created with 3 cores and 2048 MB base RAM (the default for a single core). As reported before, the memory is too low for 3 CMS jobs and the automatic RAM adjustment is not working.
Result: https://lhcathome.cern.ch/vLHCathome-dev/result.php?resultid=290496
StartLog:
12/01/16 16:37:05 ******************************************************
12/01/16 16:37:05 ** condor_startd (CONDOR_STARTD) STARTING UP
12/01/16 16:37:05 ** /usr/sbin/condor_startd
12/01/16 16:37:05 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
12/01/16 16:37:05 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
12/01/16 16:37:05 ** $CondorVersion: 8.4.8 Jun 30 2016 BuildID: 373513 $
12/01/16 16:37:05 ** $CondorPlatform: x86_64_RedHat6 $
12/01/16 16:37:05 ** PID = 4194
12/01/16 16:37:05 ** Log last touched time unavailable (No such file or directory)
12/01/16 16:37:05 ******************************************************
12/01/16 16:37:05 Using config source: /etc/condor/condor_config
12/01/16 16:37:05 Using local config sources:
12/01/16 16:37:05 /etc/condor/config.d/10_security.config
12/01/16 16:37:05 /etc/condor/config.d/14_network.config
12/01/16 16:37:05 /etc/condor/config.d/20_workernode.config
12/01/16 16:37:05 /etc/condor/config.d/30_lease.config
12/01/16 16:37:05 /etc/condor/config.d/35_cms.config
12/01/16 16:37:05 /etc/condor/config.d/40_ccb.config
12/01/16 16:37:05 /etc/condor/condor_config.local
12/01/16 16:37:05 config Macros = 156, Sorted = 156, StringBytes = 6020, TablesBytes = 5712
12/01/16 16:37:05 CLASSAD_CACHING is ENABLED
12/01/16 16:37:05 Daemon Log is logging: D_ALWAYS D_ERROR
12/01/16 16:37:05 Daemoncore: Listening at <10.0.2.15:36375> on TCP (ReliSock).
12/01/16 16:37:05 DaemonCore: command socket at <10.0.2.15:36375?addrs=10.0.2.15-36375&noUDP>
12/01/16 16:37:05 DaemonCore: private command socket at <10.0.2.15:36375?addrs=10.0.2.15-36375>
12/01/16 16:37:17 CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#1360863
12/01/16 16:37:17 HibernationSupportedStates invalid '' in ad from hibernation plugin /usr/libexec/condor/condor_power_state
12/01/16 16:37:17 VM-gahp server reported an internal error
12/01/16 16:37:17 VM universe will be tested to check if it is available
12/01/16 16:37:17 History file rotation is enabled.
12/01/16 16:37:17 Maximum history file size is: 20971520 bytes
12/01/16 16:37:17 Number of rotated history files is: 2
12/01/16 16:37:17 Allocating auto shares for slot type 0: Cpus: auto, Memory: auto, Swap: auto, Disk: auto
slot type 0: Cpus: 1.000000, Memory: 1000, Swap: 33.33%, Disk: 33.33%
slot type 0: Cpus: 1.000000, Memory: 1000, Swap: 33.33%, Disk: 33.33%
slot type 0: Cpus: 1.000000, Memory: 1000, Swap: 33.33%, Disk: 33.33%
12/01/16 16:37:17 slot1: New machine resource allocated
12/01/16 16:37:17 Setting up slot pairings
12/01/16 16:37:17 slot2: New machine resource allocated
12/01/16 16:37:17 Setting up slot pairings
12/01/16 16:37:17 slot3: New machine resource allocated
12/01/16 16:37:17 Setting up slot pairings
12/01/16 16:37:17 CronJobList: Adding job 'mips'
12/01/16 16:37:17 CronJobList: Adding job 'kflops'
12/01/16 16:37:17 CronJob: Initializing job 'mips' (/usr/libexec/condor/condor_mips)
12/01/16 16:37:17 CronJob: Initializing job 'kflops' (/usr/libexec/condor/condor_kflops)
12/01/16 16:37:17 slot1: State change: IS_OWNER is false
12/01/16 16:37:17 slot1: Changing state: Owner -> Unclaimed
12/01/16 16:37:17 State change: RunBenchmarks is TRUE
12/01/16 16:37:17 slot1: Changing activity: Idle -> Benchmarking
12/01/16 16:37:17 BenchMgr:StartBenchmarks()
12/01/16 16:37:17 slot2: State change: IS_OWNER is false
12/01/16 16:37:17 slot2: Changing state: Owner -> Unclaimed
12/01/16 16:37:17 State change: RunBenchmarks is TRUE
12/01/16 16:37:17 slot2: Changing activity: Idle -> Benchmarking
12/01/16 16:37:17 slot2: Changing activity: Benchmarking -> Idle
12/01/16 16:37:17 slot3: State change: IS_OWNER is false
12/01/16 16:37:17 slot3: Changing state: Owner -> Unclaimed
12/01/16 16:37:17 State change: RunBenchmarks is TRUE
12/01/16 16:37:17 slot3: Changing activity: Idle -> Benchmarking
12/01/16 16:37:17 slot3: Changing activity: Benchmarking -> Idle
12/01/16 16:37:37 State change: benchmarks completed
12/01/16 16:37:37 slot1: Changing activity: Benchmarking -> Idle |
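For anyone who wants to check quickly how many slots the startd created and what memory figure each one received, the "slot type" entries in the StartLog above can be pulled out with a small script. This is only a sketch: the regular expression and the `slot_allocations` helper are made up for illustration and are matched against the log text as posted here, not against any official HTCondor log format definition.

```python
import re

# Matches resolved entries such as:
#   slot type 0: Cpus: 1.000000, Memory: 1000, Swap: 33.33%, Disk: 33.33%
# The "Cpus: auto" request line does not match because "auto" is not numeric.
SLOT_LINE = re.compile(r"slot type (\d+): Cpus: ([\d.]+), Memory: (\d+),")


def slot_allocations(log_text):
    """Return a list of (cpus, memory) tuples, one per resolved slot entry."""
    return [(float(cpus), int(mem)) for _, cpus, mem in SLOT_LINE.findall(log_text)]


# Excerpt copied from the StartLog in the post above.
sample = (
    "Allocating auto shares for slot type 0: Cpus: auto, Memory: auto, "
    "Swap: auto, Disk: auto "
    "slot type 0: Cpus: 1.000000, Memory: 1000, Swap: 33.33%, Disk: 33.33% "
    "slot type 0: Cpus: 1.000000, Memory: 1000, Swap: 33.33%, Disk: 33.33% "
    "slot type 0: Cpus: 1.000000, Memory: 1000, Swap: 33.33%, Disk: 33.33% "
)

slots = slot_allocations(sample)
total_memory = sum(mem for _, mem in slots)
print(len(slots), "slots, total memory figure:", total_memory)
```

For the excerpt above this reports three single-CPU slots, each with the memory figure 1000 exactly as printed in the log.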
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 330,819 RAC: 107 |
I have set the base memory to 0 and then 2 GB per core. Hopefully that will work, and then we can work on optimizing. |
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 850,198 RAC: 343 |
"I have set the base memory to 0 and then 2 GB per core. Hopefully that will work, and then we can work on optimizing." That makes no difference. Same as before: computation error after 12½ minutes of runtime. During that period only ~600 MB of RAM was used. With the next task I got, I manually increased the VM RAM (without using an app_config) to 5688 MB, and after 9 minutes 2 cmsRun processes started (for Ivan: 3138 & 3144). I've seen before that the 3rd one will start 20 minutes later (job 1789). Your RAM settings are not reaching BOINC's client_state.xml. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
The bare minimum RAM requirements (to the nearest 128 MB) are: 1 CORE: 1152 MB; 2 CORE: 2176 MB; 4 CORE: 5248 MB. This is just for starting up at all. You might want to add a little more to make sure it finishes. |
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 850,198 RAC: 343 |
"2 CORE : 2176 MB" Typo? A 4-core VM needs more than twice what a dual-core needs? My 3-core VM, with 3 cmsRun processes running, is now using 3676240k. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
No typo. Try it for yourself. EDIT: 4 core will not start with 5120 MB |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
"I have set the base memory to 0 and then 2 GB per core. Hopefully that will work, and then we can work on optimizing." I believe this can be set to 1536 MB for up to at least 16 cores. 2 GB will prevent the 4th task from running on an 8 GB machine. |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 330,819 RAC: 107 |
I have set the base memory to 1128 MB and then 1128 MB per core. I will investigate myself tomorrow if this doesn't work. |
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 850,198 RAC: 343 |
I tried a second task with your new settings, without success. https://lhcathome.cern.ch/vLHCathome-dev/result.php?resultid=290649 The size of the RAM is a secondary problem at the moment. Getting the right RAM assigned to the VM is more important now. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
"I have set the base memory to 1128 MB and then 1128 MB per core. I will investigate myself tomorrow if this doesn't work." This still does not solve the issue of not being able to run a 4th task on 8 GB of memory. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Revised to the nearest 32 MB. The bare minimum RAM requirements are: 1 CORE: 1152 MB > 1056 MB; 2 CORE: 2176 MB > 2080 MB; 4 CORE: 5248 MB > 5184 MB (confirmed: task finishes with a valid result). This is just for starting up at all. You might want to add a little more to make sure it finishes. |
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 850,198 RAC: 343 |
"With the next task I got, I manually increased the VM-RAM (without using an app_config) to 5688 MB and after 9 minutes 2 cmsRun's started (for Ivan: 3138 & 3144)." This task, with manually adjusted RAM and 3 cmsRun processes running simultaneously, has finished: https://lhcathome.cern.ch/vLHCathome-dev/result.php?resultid=290591 The maximum RAM usage within this 3-core VM was 4930 MB and no swap was used. |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 330,819 RAC: 107 |
It is working for me on my machine. The VM has 3384 MB (1128 base + 1128 x 2 CPUs). It is now running two jobs and I will leave it to finish. http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=290697 |
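The "base + per core" arithmetic used in this thread can be written down as a small sketch and compared against the bare-minimum figures reported earlier (the values rounded to the nearest 128 MB). The function name `vm_memory_mb` and the dictionary are illustrative only; how the server actually computes the value is not shown in the thread.

```python
def vm_memory_mb(ncpus, base_mb=1128, per_core_mb=1128):
    """VM memory in MB for a given core count: base plus a fixed amount per core."""
    return base_mb + per_core_mb * ncpus


# Bare-minimum start-up figures reported in the thread (nearest 128 MB).
REPORTED_MIN_MB = {1: 1152, 2: 2176, 4: 5248}

for cores, minimum in REPORTED_MIN_MB.items():
    allocated = vm_memory_mb(cores)
    print(f"{cores} core(s): allocated {allocated} MB, "
          f"reported minimum {minimum} MB, enough: {allocated >= minimum}")

# The earlier setting (base 0, 2 GB per core) for a 3-core VM:
print(vm_memory_mb(3, base_mb=0, per_core_mb=2048))
```

With the 1128/1128 setting a 2-CPU VM comes out at 3384 MB, matching the figure in the post above, and each of the three reported core counts is covered by that formula.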
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 850,198 RAC: 343 |
"It is working for me on my machine. The VM has 3384 MB (1128 base + 1128 x 2 CPUs). It is now running two jobs and I will leave it to finish." I suppose you don't use an app_config.xml, so which CMS_year_mm_dd.xml do you have and what are its contents? |
Send message Joined: 28 Jul 16 Posts: 479 Credit: 394,720 RAC: 58 |
"This still does not solve the issue of not being able to run a 4th task on 8 GB of memory." I guess this could be due to a feature of the BOINC client. The client sums up the RAM requests of all currently running BOINC tasks (RAM1 = sum_of_rsc_memory_bound_currently_running) and compares this value with RAM2 = installed_RAM * ram_max_used_xxx_pct. If RAM2 - RAM1 < rsc_memory_bound_of_task_ready_to_start, an additional task can't start. In the case of CMS: rsc_memory_bound = 2.33 GB; physical RAM = 8 GB; RAM requested for 3 tasks = 6.99 GB => fits in 8 GB; RAM requested for 4 tasks = 9.32 GB => too much for 8 GB. Edit: 8 GB only if ram_max_used_xxx_pct is set to 100% (via the GUI). Otherwise less. |
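The check described in the post above can be sketched in a few lines. This follows the poster's description (RAM2 - RAM1 compared with the next task's rsc_memory_bound), not the actual BOINC client source; the function name `can_start` and its parameters are invented for the example.

```python
def can_start(installed_ram_gb, ram_max_used_pct, running_bounds_gb, next_bound_gb):
    """Return True if one more task fits under the client's RAM limit.

    RAM2 = installed RAM * allowed percentage
    RAM1 = sum of rsc_memory_bound of all currently running tasks
    The new task may start only if RAM2 - RAM1 >= its own rsc_memory_bound.
    """
    ram2 = installed_ram_gb * ram_max_used_pct / 100.0
    ram1 = sum(running_bounds_gb)
    return ram2 - ram1 >= next_bound_gb


CMS_BOUND_GB = 2.33  # rsc_memory_bound quoted in the post above

# 8 GB host with the RAM preference set to 100%.
third_fits = can_start(8, 100, [CMS_BOUND_GB] * 2, CMS_BOUND_GB)
fourth_fits = can_start(8, 100, [CMS_BOUND_GB] * 3, CMS_BOUND_GB)
print("3rd task can start:", third_fits)
print("4th task can start:", fourth_fits)
```

With the preference below 100% the usable RAM2 shrinks accordingly, so even fewer tasks fit, which is the point of the "Edit" remark.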
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Swap file usage inside the VM is irrelevant if it is performed in RAM (cache). Only "true" disk swapping causes a significant slowdown. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I think you are correct. However, the task issuer determines the memory requirement. |
Send message Joined: 28 Jul 16 Posts: 479 Credit: 394,720 RAC: 58 |
RAM request/usage is controlled via a couple of variables: 1. Via <memory_size_mb>1234</memory_size_mb> in the file CMS_yyyy_mm_dd.xml; 2. Via <cmdline>--memory_size_mb 2345</cmdline> in the file sched_reply_xxx.xml; 3. Via <rsc_memory_bound>1234567890.000000</rsc_memory_bound> in the file sched_reply_xxx.xml; 4. Via <cmdline>--memory_size_mb 2345</cmdline> in the file app_config.xml. Higher rankings (if they exist) overrule lower rankings. It is therefore no surprise that after a project reset we end up with VMs using 2048 MB, since this is the value in the most recent CMS_yyyy_mm_dd.xml. If a formula is used on the server to determine the RAM request of the VM, and this value is set via <cmdline>--memory_size_mb 2345</cmdline>, then <rsc_memory_bound> should also be set accordingly. This is the responsibility of the project developers. |
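A rough sketch of the override order listed above, for the VM size only. It assumes that "higher rankings" means the higher-numbered entries in that list win, and it leaves out <rsc_memory_bound>, which is given in bytes and concerns what the client reserves rather than the size of the VM. The helper names are hypothetical and this is not how the BOINC client or the wrapper is actually implemented.

```python
import re


def memory_from_cmdline(cmdline):
    """Extract the number after --memory_size_mb from a cmdline string, or None."""
    if not cmdline:
        return None
    match = re.search(r"--memory_size_mb\s+(\d+)", cmdline)
    return int(match.group(1)) if match else None


def effective_vm_memory_mb(job_xml_mb, sched_reply_cmdline=None, app_config_cmdline=None):
    """Pick the VM memory in MB; later sources override earlier ones when present.

    job_xml_mb          -- <memory_size_mb> from CMS_yyyy_mm_dd.xml (rank 1)
    sched_reply_cmdline -- <cmdline> from sched_reply_xxx.xml (rank 2)
    app_config_cmdline  -- <cmdline> from app_config.xml (rank 4)
    """
    candidates = [
        job_xml_mb,
        memory_from_cmdline(sched_reply_cmdline),
        memory_from_cmdline(app_config_cmdline),
    ]
    present = [value for value in candidates if value is not None]
    return present[-1]


# After a project reset with nothing else set, the job XML value is used.
print(effective_vm_memory_mb(2048))
# A value sent by the server overrides the job XML.
print(effective_vm_memory_mb(2048, sched_reply_cmdline="--memory_size_mb 3384"))
# A local app_config.xml overrides both.
print(effective_vm_memory_mb(2048, "--memory_size_mb 3384", "--memory_size_mb 5688"))
```

The three calls cover the cases discussed in the thread: the 2048 MB default, a server-side value, and a manual local override.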
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Sounds good. The app_config.xml overrules the actual usage, but not necessarily the amount BOINC thinks is used. |
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 850,198 RAC: 343 |
"1. Via <memory_size_mb>1234</memory_size_mb> in the file CMS_yyyy_mm_dd.xml" I don't get a <cmdline>--memory_size_mb from the project, and since no app_config.xml is used, the default 2048 MB is also used for mt_tasks. Q: Why is the --memory_size_mb parameter not sent to my system? |
©2024 CERN