Multi-core VM

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1180 Credit: 815,336 RAC: 431
Message 4437 - Posted: 4 Dec 2016, 8:35:19 UTC Last modified: 4 Dec 2016, 8:57:27 UTC

With 7 cores:

Allocating auto shares for slot type 0: Cpus: auto, Memory: auto, Swap: auto, Disk: auto
slot type 0: Cpus: 1.000000, Memory: 1928, Swap: 14.29%, Disk: 14.29%
x7

(3 + 10.5) GB / 7 ≈ 1928 MB per slot

Does someone know more about it, or is it a coincidence that the default 'unlimited' number of CPUs was 6 before Laurence/Nils changed it?

6-core VM is running and working OK, however I still have the unanswered question why the cmsRun's are started pairwise with 20 minutes interval.
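
The slot arithmetic in the post above can be checked with a short sketch. It assumes that the 10.5 in "(3+10.5)/7" is 7 cores times 1.5 GB, that 1 GB is counted as 1000 MB, and that the result is truncated to whole MB; the function names are hypothetical and this is not how HTCondor itself computes auto shares.

```python
def per_slot_memory_mb(cores, base_gb=3.0, per_core_gb=1.5, mb_per_gb=1000):
    """Total memory (base + per-core part) split evenly over the slots.

    Assumptions: 1 GB = 1000 MB and truncation to whole MB.
    """
    total_mb = (base_gb + cores * per_core_gb) * mb_per_gb
    return int(total_mb // cores)


def auto_share_percent(slots):
    """Equal share of a resource per slot, in percent, rounded to 2 decimals."""
    return round(100 / slots, 2)


print(per_slot_memory_mb(7), auto_share_percent(7))
```

With 7 slots this reproduces both the "Memory: 1928" value and the 14.29% swap and disk shares quoted from the log.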

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Message 4438 - Posted: 4 Dec 2016, 9:33:10 UTC - in response to Message 4437.

> I still have the unanswered question why the cmsRun's are started pairwise with 20 minutes interval.

One possible reason is so that they do not all finish at the same time and choke the upload.
I think that is not such a bad thing.
CMS jobs are all pretty much the same duration, unlike Theory.

maeax Joined: 22 Apr 16 Posts: 664 Credit: 1,791,291 RAC: 3,164
Message 4440 - Posted: 4 Dec 2016, 10:59:49 UTC

In BOINC 7.6.22, press Ctrl+Shift+O and select mem_usage_debug as a flag to see how much memory is used.
It is shown in the BOINC messages.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1129 Credit: 7,874,101 RAC: 172
Message 4441 - Posted: 4 Dec 2016, 12:09:42 UTC - in response to Message 4438.

>> I still have the unanswered question why the cmsRun's are started pairwise with 20 minutes interval.
> One possible reason is so that they do not all finish at the same time and choke the upload.
> I think that is not such a bad thing.
> CMS jobs are all pretty much the same duration, unlike Theory.

I'm seeing this behaviour too. I've got two 6-CPU tasks running. They both started running two jobs, and after 20 minutes they both started another two. It would seem that there are several tuning parameters that we are not fully aware of.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1180 Credit: 815,336 RAC: 431
Message 4442 - Posted: 4 Dec 2016, 12:11:56 UTC - in response to Message 4437.

> 6-core VM is running and working OK, however I still have the unanswered question why the cmsRun's are started pairwise with 20 minutes interval.

The first job (9550) of the six that started returned after 2 hours of runtime with error code 134.
http://dashb-cms-job.cern.ch/dashboard/request.py/detailView?schedulerJobId=https://glidein.cern.ch/9550/161129:200356:ireid:crab:BPH-RunIISummer15GS-00046:BB_1

It processed only 3000 events and then printed the following 10,000 times:

EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0
EvtGen:Will take first kinematically allowed decay in the decay table
EvtGen:Could not decay:pi0 with mass:0 will throw event away!

ending with

EvtGen:Your event has been rejected 10000 times!
EvtGen:Will now abort.
Complete
process id is 7758 status is 134

The VM allocates 7896 MB RAM and the VM itself is now using about 6737 MB. No swap used at all.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1129 Credit: 7,874,101 RAC: 172
Message 4443 - Posted: 4 Dec 2016, 16:44:52 UTC - in response to Message 4442.

>> 6-core VM is running and working OK, however I still have the unanswered question why the cmsRun's are started pairwise with 20 minutes interval.
> The first job (9550) of the six that started returned after 2 hours of runtime with error code 134.
> http://dashb-cms-job.cern.ch/dashboard/request.py/detailView?schedulerJobId=https://glidein.cern.ch/9550/161129:200356:ireid:crab:BPH-RunIISummer15GS-00046:BB_1
> It processed only 3000 events and then printed the following 10,000 times:
> EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0
> EvtGen:Will take first kinematically allowed decay in the decay table
> EvtGen:Could not decay:pi0 with mass:0 will throw event away!
> ending with
> EvtGen:Your event has been rejected 10000 times!
> EvtGen:Will now abort.
> Complete
> process id is 7758 status is 134
> The VM allocates 7896 MB RAM and the VM itself is now using about 6737 MB. No swap used at all.

I'm afraid that failure mode is a "feature" of the underlying software, as far as I can make out. Now admittedly we are pushing it hard, looking for extremely rare decays, but if it were my software I'd be looking to understand, and preferably fix, that behaviour. Just sayin'.

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 473 Credit: 389,411 RAC: 62
Message 4444 - Posted: 4 Dec 2016, 17:30:20 UTC - in response to Message 4435.

> Thanks, Laurence. Do we know where the 3 GB + N*1.5 GB for TotalMemory comes from?

3 GB + x for a single core?
Are you sure the 3 GB does not already include the RAM usage of the first core?
At least the most recent tests point in this direction.

ALT+F3 (top) of the currently running task shows a maximum RAM usage of 0.9 GB for each cmsRun. Most of them start around 0.55 GB.
If we budget the equivalent of 1.5 cmsRun processes (1.5 x 0.9 GB) for the OS of the VM (including internal cache), we end up at roughly 1.4 GB.
Plus 0.9 GB per core.

1 core: 2.3 GB
2 cores: 3.2 GB
3 cores: 4.1 GB
4 cores: 5.0 GB
...

By the way: 2.33 GB (WP size) is the current setting of the single-core app on the production server.

The objective should be
- to spend enough RAM to run the cmsRun tasks without problems
- to keep the RAM requirement low so that more hosts can fulfill it
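
The two sizing rules discussed in this post can be put side by side in a short sketch. The constants are taken from the post (about 1.4 GB for the VM's OS plus 0.9 GB per cmsRun, versus 3 GB + N*1.5 GB); the function names are hypothetical, and neither function is the project's actual configuration.

```python
OS_OVERHEAD_GB = 1.4   # roughly 1.5 cmsRun processes' worth, as estimated in the post
PER_CMSRUN_GB = 0.9    # observed maximum per cmsRun in top
FORMULA_BASE_GB = 3.0  # base term of the "3 GB + N*1.5 GB" rule
FORMULA_PER_CORE_GB = 1.5


def estimated_vm_memory_gb(cores):
    """Measured-usage estimate: OS overhead plus 0.9 GB per core."""
    return round(OS_OVERHEAD_GB + cores * PER_CMSRUN_GB, 1)


def formula_vm_memory_gb(cores):
    """The '3 GB + N*1.5 GB' rule quoted in the thread."""
    return FORMULA_BASE_GB + cores * FORMULA_PER_CORE_GB


for n in range(1, 5):
    print(n, estimated_vm_memory_gb(n), formula_vm_memory_gb(n))
```

The estimate grows by 0.9 GB per core while the quoted rule grows by 1.5 GB, so the gap between the two widens with every additional core.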

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1129 Credit: 7,874,101 RAC: 172
Message 4445 - Posted: 4 Dec 2016, 18:18:45 UTC - in response to Message 4444.

>> Thanks, Laurence. Do we know where the 3 GB + N*1.5 GB for TotalMemory comes from?
> 3 GB + x for a single core?
> Are you sure the 3 GB does not already include the RAM usage of the first core?
> At least the most recent tests point in this direction.
> ALT+F3 (top) of the currently running task shows a maximum RAM usage of 0.9 GB for each cmsRun. Most of them start around 0.55 GB.
> If we budget the equivalent of 1.5 cmsRun processes (1.5 x 0.9 GB) for the OS of the VM (including internal cache), we end up at roughly 1.4 GB.
> Plus 0.9 GB per core.
> 1 core: 2.3 GB
> 2 cores: 3.2 GB
> 3 cores: 4.1 GB
> 4 cores: 5.0 GB
> ...
> By the way: 2.33 GB (WP size) is the current setting of the single-core app on the production server.
> The objective should be
> - to spend enough RAM to run the cmsRun tasks without problems
> - to keep the RAM requirement low so that more hosts can fulfill it

Tja, but you have to look at total usage (I think...). It's hard for me to look at these things from home at the weekend, but when I checked my 1-CPU tasks yesterday their total memory usage was nearly 2 GB. This won't scale linearly, of course, but I can't really check until tomorrow afternoon. (My WiFi bandwidth is dominated by my weekly laptop backup just at the moment.)

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 473 Credit: 389,411 RAC: 62
Message 4446 - Posted: 4 Dec 2016, 18:39:59 UTC - in response to Message 4445. Last modified: 4 Dec 2016, 18:59:46 UTC

> ... but you have to look at total usage (I think...). ...

"top's_total" shows the physical RAM usage including the RAM used for the cache.
Your "apps_and_os total" is "cache - top's_total".

Edit: Sorry, "top's_total - cache" of course.

Edit 2: Sorry again. It should be:
"top's_total" -> physical RAM usable by the OS
"top's_used" -> physical RAM usage including the RAM used for the cache
"apps_and_os used" = "top's_used - cache"
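
The final definition in this post ("apps_and_os used" = "top's_used - cache") can be written as a small sketch. The class name, field names and the sample numbers are hypothetical and only illustrate the subtraction; they are not readings from any VM in this thread.

```python
from dataclasses import dataclass


@dataclass
class TopMemory:
    """Memory figures as read from top's summary lines, in MB."""
    total_mb: float   # physical RAM usable by the OS
    used_mb: float    # physical RAM in use, including the cache
    cache_mb: float   # RAM used for the cache

    def apps_and_os_used_mb(self):
        """RAM used by applications and the OS, excluding the cache."""
        return self.used_mb - self.cache_mb


# Hypothetical sample values, chosen only to show the calculation.
sample = TopMemory(total_mb=4096, used_mb=3500, cache_mb=1200)
print(sample.apps_and_os_used_mb())
```

The point of the correction is that the cache is subtracted from the used figure, not from the total, so a VM whose "used" value looks close to its allocation may still need far less for the jobs themselves.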

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 473 Credit: 389,411 RAC: 62
Message 4447 - Posted: 4 Dec 2016, 19:48:40 UTC

It seems that it is not (or not only) the memory setting that leaves some slots idle.
I configured 1 VM with 3 cores and 7.5 GB RAM on the first host and 1 VM with 3 cores and 4.5 GB RAM on the other host.
Both VMs started with 2 tasks, and after a delay of 20 minutes each started a 3rd task.
So far so good (or bad, because of the 20-minute delay).

After some hours I observed idle slots for longer periods on both VMs. Sometimes 1 idle slot, sometimes 2.
At the moment the 4.5 GB VM is running 3 tasks while the 7.5 GB VM is running only 1 task.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1180 Credit: 815,336 RAC: 431
Message 4449 - Posted: 4 Dec 2016, 21:00:07 UTC

I paused the 6-core VM for over 2 hours and all 6 cmsRun's were killed:

12/04/16 21:44:28 CCBListener: no activity from CCB server in 8180s; assuming connection is dead.
12/04/16 21:44:28 CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
12/04/16 21:44:29 Starter pid 8791 exited with status 2
12/04/16 21:44:29 slot1: State change: starter exited
12/04/16 21:44:32 Starter pid 9135 exited with status 2
12/04/16 21:44:32 slot2: State change: starter exited
12/04/16 21:44:34 Starter pid 9897 exited with status 2
12/04/16 21:44:34 slot3: State change: starter exited
12/04/16 21:44:35 Starter pid 8312 exited with status 2
12/04/16 21:44:35 slot5: State change: starter exited
12/04/16 21:45:09 Starter pid 10316 exited with status 2
12/04/16 21:45:09 slot6: State change: starter exited
12/04/16 21:45:10 Starter pid 9550 exited with status 2
12/04/16 21:45:10 slot4: State change: starter exited
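
A log excerpt like the one in this post can be summarised with a short sketch. It assumes that each "State change: starter exited" line refers to the most recent "Starter pid ... exited" line before it, which matches the excerpt but is not a documented guarantee; the function name and regular expressions are hypothetical and tailored to these lines only.

```python
import re

LOG = """\
12/04/16 21:44:28 CCBListener: no activity from CCB server in 8180s; assuming connection is dead.
12/04/16 21:44:29 Starter pid 8791 exited with status 2
12/04/16 21:44:29 slot1: State change: starter exited
12/04/16 21:44:32 Starter pid 9135 exited with status 2
12/04/16 21:44:32 slot2: State change: starter exited
12/04/16 21:44:34 Starter pid 9897 exited with status 2
12/04/16 21:44:34 slot3: State change: starter exited
12/04/16 21:44:35 Starter pid 8312 exited with status 2
12/04/16 21:44:35 slot5: State change: starter exited
12/04/16 21:45:09 Starter pid 10316 exited with status 2
12/04/16 21:45:09 slot6: State change: starter exited
12/04/16 21:45:10 Starter pid 9550 exited with status 2
12/04/16 21:45:10 slot4: State change: starter exited
"""

IDLE_RE = re.compile(r"no activity from CCB server in (\d+)s")
STARTER_RE = re.compile(r"Starter pid (\d+) exited with status (\d+)")
SLOT_RE = re.compile(r"(slot\d+): State change: starter exited")


def summarize(log_text):
    """Return (idle_seconds, {slot: (pid, exit_status)}) for a log excerpt."""
    idle_seconds = None
    exits = {}
    pending = None  # last starter exit not yet matched to a slot (assumption)
    for line in log_text.splitlines():
        idle = IDLE_RE.search(line)
        starter = STARTER_RE.search(line)
        slot = SLOT_RE.search(line)
        if idle:
            idle_seconds = int(idle.group(1))
        elif starter:
            pending = (int(starter.group(1)), int(starter.group(2)))
        elif slot and pending is not None:
            exits[slot.group(1)] = pending
            pending = None
    return idle_seconds, exits


idle_seconds, exits = summarize(LOG)
print(f"idle for {idle_seconds / 3600:.2f} h, {len(exits)} starters exited")
```

The 8180 s reported by CCBListener is a little over two and a quarter hours, which is consistent with the VM having been paused "for over 2 hours".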

Magic Quantum Mechanic Joined: 8 Apr 15 Posts: 750 Credit: 11,600,962 RAC: 1,722
Message 4450 - Posted: 4 Dec 2016, 21:21:37 UTC

Besides the typical VB not starting, or not getting the credentials/HTCondor ping when it has to deal with my DSL with the speed of dial-up: if I forget to d/l a new multi-core task before the last one is finished, it makes me d/l that .vdi AGAIN.

So just now, when it said "no heartbeat" and that task got removed and I didn't have a new one waiting and suspended, I have to once again d/l this 254.72 MB .vdi, and at this time of day it will take FIVE hours to do this.

Of course, if I wait 12 hours and it is past midnight, I could do it in maybe 30 minutes.

It sure would be nice if the VB tasks would sit there and wait for the entire d/l that I watch on the VM Console, instead of becoming an Error after 11 minutes.

Mad Scientist For Life

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 473 Credit: 389,411 RAC: 62
Message 4452 - Posted: 4 Dec 2016, 21:42:12 UTC - in response to Message 4450.

> ... and this time of day it will take FIVE hours to do this. ...
> ... if I wait 12 hours and it is past midnight I could do it in maybe 30mins. ...

Less than 10 seconds from my local squid.

We had that squid discussion in another thread.
My (BOINC) squid is installed on an old, reactivated laptop with only 2 GB RAM and a slow hard disk.
I have used squids for several years, as I had similar problems with a very slow internet connection.
It is also available for Windows.

Magic Quantum Mechanic Joined: 8 Apr 15 Posts: 750 Credit: 11,600,962 RAC: 1,722
Message 4453 - Posted: 5 Dec 2016, 3:36:40 UTC - in response to Message 4452.

>> ... and this time of day it will take FIVE hours to do this. ...
>> ... if I wait 12 hours and it is past midnight I could do it in maybe 30mins. ...
> Less than 10 seconds from my local squid.
> We had that squid discussion in another thread.
> My (BOINC) squid is installed on an old, reactivated laptop with only 2 GB RAM and a slow hard disk.
> I have used squids for several years, as I had similar problems with a very slow internet connection.
> It is also available for Windows.

As usual, I am running a row of these 2-core tasks that won't get to the credential point, so they hit that "VM Heartbeat file specified, but missing" point and crash, and this after running 5 straight with no problems.

I have 6 computers here, all on Windows, and this laptop has an SSD, 8 GB RAM and 8 cores, so it isn't the problem.

I even watch the network running with Speccy and the speed is fine, yet this keeps happening.

At the same time I run 15 LHC VB tasks every day, all valid, and still about 10 VB tasks at vLHC.

I am on the Squid site right now and checking it out, but the last thing I need is more to deal with ......

Mad Scientist For Life

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1180 Credit: 815,336 RAC: 431
Message 4456 - Posted: 5 Dec 2016, 8:17:25 UTC - in response to Message 4449.

> I paused the 6-core VM for over 2 hours and all 6 cmsRun's were killed:
> 12/04/16 21:44:28 CCBListener: no activity from CCB server in 8180s; assuming connection is dead.
> 12/04/16 21:44:28 CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.

I thought we had fixed suspending a task for a much longer period:

12/05/16 09:00:32 (pid:10508) Lost connection to shadow, waiting 7200 secs for reconnect

Development for LHC@home