Multi-core VM
Author | Message |
---|---|
Joined: 13 Feb 15 Posts: 1185 Credit: 850,198 RAC: 343 |
With 7 cores:
Allocating auto shares for slot type 0: Cpus: auto, Memory: auto, Swap: auto, Disk: auto
slot type 0: Cpus: 1.000000, Memory: 1928, Swap: 14.29%, Disk: 14.29%
x7: (3 + 10.5) GB / 7 = 1928 MB per slot.
Does someone know more about it, or was it a coincidence that the default 'unlimited' number of CPUs was 6 before Laurence/Nils changed it?
The 6-core VM is running and working OK; however, I still have the unanswered question of why the cmsRuns are started pairwise at 20-minute intervals. |
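The arithmetic in the post above can be reproduced with a short sketch. It assumes that the VM's total memory follows a "3 GB base plus 1.5 GB per core" rule, that 1 GB is counted as 1000 MB, and that the per-slot share is truncated to whole MB; these assumptions were chosen because they reproduce the logged value of 1928, not because they are confirmed project settings. The function and constant names are made up for illustration.

```python
# Assumption: TotalMemory = 3 GB + N * 1.5 GB, shared evenly by N slots.
# Assumption: 1 GB = 1000 MB and the share is truncated to whole MB.
BASE_MB = 3000
PER_CORE_MB = 1500


def memory_per_slot_mb(n_cores: int) -> int:
    """Return the even memory share per slot, in MB, for an N-core VM."""
    total_mb = BASE_MB + n_cores * PER_CORE_MB
    return total_mb // n_cores


for n in (6, 7):
    print(f"{n} cores: {memory_per_slot_mb(n)} MB per slot")
```

For 7 cores this gives the 1928 MB seen in the slot log line.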
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
"I still have the unanswered question of why the cmsRuns are started pairwise at 20-minute intervals." One possible reason is so that they do not all finish at the same time and choke the upload. I think that is not such a bad thing. CMS jobs all have pretty much the same duration, unlike Theory. |
Joined: 22 Apr 16 Posts: 673 Credit: 1,918,671 RAC: 2,026 |
BOINC 7.6.22: press Ctrl+Shift+O and select mem_usage_debug as a flag to see how much memory is used. It is shown in the BOINC messages. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 1,887 |
"I still have the unanswered question of why the cmsRuns are started pairwise at 20-minute intervals." I'm seeing this behaviour too. I've got two 6-CPU tasks running. They both started by running two jobs, and after 20 minutes they both started another two. It would seem that there are several tuning parameters that we are not fully aware of. |
Joined: 13 Feb 15 Posts: 1185 Credit: 850,198 RAC: 343 |
"The 6-core VM is running and working OK; however, I still have the unanswered question of why the cmsRuns are started pairwise at 20-minute intervals."
The first job (9550) of the six that started returned after 2 hours of runtime with error code 134.
http://dashb-cms-job.cern.ch/dashboard/request.py/detailView?schedulerJobId=https://glidein.cern.ch/9550/161129:200356:ireid:crab:BPH-RunIISummer15GS-00046:BB_1
It did only 3000 events and then printed, 10,000 times:
EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0
EvtGen:Will take first kinematically allowed decay in the decay table
EvtGen:Could not decay:pi0 with mass:0 will throw event away!
ending with
EvtGen:Your event has been rejected 10000 times!
EvtGen:Will now abort.
Complete process id is 7758 status is 134
The VM allocates 7896MB RAM and the VM itself is now using about 6737MB. No swap is used at all. |
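Status 134 fits the "Will now abort" line if it is read as a shell-style exit status, where values above 128 conventionally mean "terminated by signal (status - 128)" and signal 6 is SIGABRT on Linux. Whether the job wrapper reports its status in exactly this convention is an assumption; the helper name below is made up for illustration.

```python
import signal


def decode_exit_status(status: int):
    """Split a shell-style exit status into (kind, value).

    Assumption: statuses above 128 encode 128 + signal number.
    """
    if status > 128:
        return ("signal", status - 128)
    return ("exit", status)


kind, value = decode_exit_status(134)
try:
    # Signal names are platform dependent; this lookup targets Linux.
    name = signal.Signals(value).name if kind == "signal" else ""
except ValueError:
    name = "unknown"
print(kind, value, name)
```

Under that reading, the job did not fail with an ordinary error code; the process aborted itself after EvtGen gave up on the event.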
Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 1,887 |
"The 6-core VM is running and working OK; however, I still have the unanswered question of why the cmsRuns are started pairwise at 20-minute intervals." I'm afraid that failure mode is a "feature" of the underlying software, as far as I can make out. Now admittedly we are pushing it hard, looking for extremely rare decays, but if it were my software I'd be looking to understand, and preferably fix, that behaviour. Just sayin'. |
Joined: 28 Jul 16 Posts: 479 Credit: 394,720 RAC: 58 |
"Thanks, Laurence. Do we know where the 3 GB + N*1.5 GB for TotalMemory comes from?" 3 GB + x for a single core? Are you sure the 3 GB does not already include the RAM usage of the first core? At least the most recent tests point in this direction. ALT+F3 (top) of the currently running task shows a maximum RAM usage of 0.9 GB for each cmsRun. Most of them start around 0.55 GB. If we spend 1.5 * cmsRun for the OS of the VM (including internal cache), we end up at roughly 1.4 GB, plus 0.9 GB per core:
1 core: 2.3 GB
2 cores: 3.2 GB
3 cores: 4.1 GB
4 cores: 5.0 GB
...
By the way: 2.33 GB (WP size) is the current setting of the single-core app on the production server. The objective should be
- to spend enough RAM to run the cmsRun tasks without problems
- to keep the RAM requirement low so that more hosts can fulfill it |
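The estimate in the post above can be put next to the "3 GB + N*1.5 GB" rule it questions. This is only a sketch: the 1.4 GB base and 0.9 GB per cmsRun are the poster's measured figures, whole MB (1 GB = 1000 MB) are used to avoid rounding noise, and the function names are made up for illustration.

```python
def current_rule_mb(n_cores: int) -> int:
    """TotalMemory as 3 GB + N * 1.5 GB (the rule being questioned)."""
    return 3000 + 1500 * n_cores


def proposed_estimate_mb(n_cores: int) -> int:
    """Roughly 1.4 GB for the VM's OS plus 0.9 GB per cmsRun."""
    return 1400 + 900 * n_cores


for n in range(1, 5):
    est = proposed_estimate_mb(n)
    cur = current_rule_mb(n)
    print(f"{n} core(s): estimate {est / 1000:.1f} GB, "
          f"current rule {cur / 1000:.1f} GB, "
          f"difference {(cur - est) / 1000:.1f} GB")
```

The gap between the two grows with every added core, which is the point of the question about where the rule comes from.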
Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 1,887 |
"Thanks, Laurence. Do we know where the 3 GB + N*1.5 GB for TotalMemory comes from?" Well, but you have to look at total usage (I think...). It's hard for me to look at these things from home at the weekend, but when I checked my 1-CPU tasks yesterday their total memory usage was nearly 2 GB. This won't scale linearly, of course, but I can't really check until tomorrow afternoon. (My WiFi bandwidth is dominated by my weekly laptop backup just at the moment.) |
Joined: 28 Jul 16 Posts: 479 Credit: 394,720 RAC: 58 |
"... but you have to look at total usage (I think...). ..." "top's_total" shows the physical RAM usage including the RAM used for the cache. Your "apps_and_os total" is "cache - top's_total". Edit: Sorry, "top's_total - cache" of course. Edit 2: Sorry again. It should be:
"top's_total" -> physical RAM usable by the OS
"top's_used" -> physical RAM usage including the RAM used for the cache
"apps_and_os used" = "top's_used - cache" |
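The corrected definitions from "Edit 2" can be sketched against /proc/meminfo-style input. The sample numbers are invented purely for illustration, "used" is taken as MemTotal - MemFree (an assumption about how the VM's top computes it), and buffers are ignored because the post only subtracts the cache.

```python
# Hypothetical sample values, not measurements from any real VM.
SAMPLE_MEMINFO = """\
MemTotal:        8000000 kB
MemFree:         1000000 kB
Buffers:          200000 kB
Cached:          3000000 kB
"""


def parse_meminfo(text: str) -> dict:
    """Parse 'Key:   value kB' lines into a {key: value_in_kB} dict."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            values[key.strip()] = int(rest.split()[0])
    return values


def apps_and_os_used_kb(info: dict) -> int:
    """'top's_used - cache': RAM used by applications and the OS."""
    used = info["MemTotal"] - info["MemFree"]  # includes the cache
    return used - info["Cached"]


info = parse_meminfo(SAMPLE_MEMINFO)
print(apps_and_os_used_kb(info), "kB used by apps and OS")
```

On a running Linux guest the same function could be fed the contents of /proc/meminfo instead of the sample string.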
Joined: 28 Jul 16 Posts: 479 Credit: 394,720 RAC: 58 |
It seems that it is not (or not only) the memory setting that leaves some slots idle. I configured 1 VM with 3 cores and 7.5 GB RAM on the first host and 1 VM with 3 cores and 4.5 GB RAM on the other host. Both VMs started with 2 tasks and, after a delay of 20 minutes, each started a 3rd task. So far so good (or bad, because of the 20-minute delay). After some hours I observed idle slots for longer periods on both VMs. Sometimes 1 idle slot, sometimes 2. At the moment the 4.5 GB VM is running 3 tasks while the 7.5 GB VM is running only 1 task. |
Joined: 13 Feb 15 Posts: 1185 Credit: 850,198 RAC: 343 |
I paused the 6-core VM for over 2 hours and all 6 cmsRuns were killed:
12/04/16 21:44:28 CCBListener: no activity from CCB server in 8180s; assuming connection is dead.
12/04/16 21:44:28 CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
12/04/16 21:44:29 Starter pid 8791 exited with status 2
12/04/16 21:44:29 slot1: State change: starter exited
12/04/16 21:44:32 Starter pid 9135 exited with status 2
12/04/16 21:44:32 slot2: State change: starter exited
12/04/16 21:44:34 Starter pid 9897 exited with status 2
12/04/16 21:44:34 slot3: State change: starter exited
12/04/16 21:44:35 Starter pid 8312 exited with status 2
12/04/16 21:44:35 slot5: State change: starter exited
12/04/16 21:45:09 Starter pid 10316 exited with status 2
12/04/16 21:45:09 slot6: State change: starter exited
12/04/16 21:45:10 Starter pid 9550 exited with status 2
12/04/16 21:45:10 slot4: State change: starter exited |
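A log excerpt like the one above can be summarised with a few regular expressions. This is a sketch only: the patterns are written against the lines quoted in the post, and pairing each "Starter pid" line with the slot line that follows it is an assumption about the log layout.

```python
import re

LOG = """\
12/04/16 21:44:28 CCBListener: no activity from CCB server in 8180s; assuming connection is dead.
12/04/16 21:44:29 Starter pid 8791 exited with status 2
12/04/16 21:44:29 slot1: State change: starter exited
12/04/16 21:44:32 Starter pid 9135 exited with status 2
12/04/16 21:44:32 slot2: State change: starter exited
12/04/16 21:44:34 Starter pid 9897 exited with status 2
12/04/16 21:44:34 slot3: State change: starter exited
12/04/16 21:44:35 Starter pid 8312 exited with status 2
12/04/16 21:44:35 slot5: State change: starter exited
12/04/16 21:45:09 Starter pid 10316 exited with status 2
12/04/16 21:45:09 slot6: State change: starter exited
12/04/16 21:45:10 Starter pid 9550 exited with status 2
12/04/16 21:45:10 slot4: State change: starter exited
"""

IDLE_RE = re.compile(r"no activity from CCB server in (\d+)s")
EXIT_RE = re.compile(r"Starter pid (\d+) exited with status (\d+)")
SLOT_RE = re.compile(r"(slot\d+): State change: starter exited")


def summarize(log: str):
    """Return (idle_seconds, [(slot, pid, status), ...])."""
    idle_seconds = None
    exits = []
    pending = None  # (pid, status) waiting for its slot line
    for line in log.splitlines():
        if (m := IDLE_RE.search(line)):
            idle_seconds = int(m.group(1))
        elif (m := EXIT_RE.search(line)):
            pending = (int(m.group(1)), int(m.group(2)))
        elif (m := SLOT_RE.search(line)) and pending is not None:
            exits.append((m.group(1), pending[0], pending[1]))
            pending = None
    return idle_seconds, exits


idle, exits = summarize(LOG)
hours, rest = divmod(idle, 3600)
print(f"idle for {hours} h {rest // 60} min; {len(exits)} starters exited")
for slot, pid, status in exits:
    print(slot, pid, status)
```

The 8180 s of inactivity corresponds to 2 h 16 min, which matches the "over 2 hours" pause, and all six starters exit with status 2 within about 40 seconds of each other.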
Joined: 8 Apr 15 Posts: 759 Credit: 11,771,013 RAC: 3,085 |
Besides the typical VB tasks not starting or not getting the credentials/HTCondor ping when they have to deal with my DSL at dial-up speed... If I forget to d/l a new multi-core task before the last one is finished, it makes me d/l that .vdi AGAIN. So like just now, when it said "no heartbeat" and that task got removed, and I didn't have a new one waiting and suspended, I have to d/l this 254.72MB .vdi once again... and at this time of day it will take FIVE hours to do this. Of course, if I wait 12 hours and it is past midnight, I could do it in maybe 30 mins. It sure would be nice if the VB tasks would sit there and wait for the entire d/l that I watch on the VM Console... instead of becoming an Error after 11 mins. Mad Scientist For Life |
Joined: 28 Jul 16 Posts: 479 Credit: 394,720 RAC: 58 |
"... and this time of day it will take FIVE hours to do this. ..." Less than 10 seconds from my local squid. We had that squid discussion in another thread. My (BOINC) squid is installed on an old, reactivated laptop with only 2 GB RAM and a slow hard disk. I have used squids for several years, as I had similar problems with a very slow internet connection. It is also available for Windows. |
Joined: 8 Apr 15 Posts: 759 Credit: 11,771,013 RAC: 3,085 |
"... and this time of day it will take FIVE hours to do this. ..." As usual I am running a row of these 2-core tasks that won't get to the Credential point, so they reach that "VM Heartbeat file specified, but missing" point and crash... this after running 5 straight with no problems. I have 6 computers here, all on Windows OS, and this laptop has an SSD, 8GB RAM and 8 cores, so it isn't the problem. I even watch the network with Speccy and the speed is fine, yet this keeps happening. At the same time I run 15 LHC VB tasks every day, all valid, and still about 10 VB tasks at vLHC. I am on the Squid site right now and checking it out, but the last thing I need is more to deal with... Mad Scientist For Life |
Joined: 13 Feb 15 Posts: 1185 Credit: 850,198 RAC: 343 |
"I paused the 6-core VM for over 2 hours and all 6 cmsRuns were killed:" I thought we had fixed suspending a task for a much longer period:
12/05/16 09:00:32 (pid:10508) Lost connection to shadow, waiting 7200 secs for reconnect |
©2024 CERN