Message boards : CMS Application : Multi-core VM

Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 431
Message 4437 - Posted: 4 Dec 2016, 8:35:19 UTC
Last modified: 4 Dec 2016, 8:57:27 UTC

With 7 cores:
Allocating auto shares for slot type 0: Cpus: auto, Memory: auto, Swap: auto, Disk: auto
slot type 0: Cpus: 1.000000, Memory: 1928, Swap: 14.29%, Disk: 14.29%
x7

(3 + 10.5) GB / 7 ≈ 1928 MB
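A rough sketch of that arithmetic, assuming the "3 GB + N*1.5 GB" TotalMemory rule mentioned elsewhere in this thread is counted in decimal MB and the per-slot share is truncated to whole MB (both are guesses that happen to reproduce the logged value; the function names are just for illustration):

```python
# Assumption: TotalMemory = 3 GB + N * 1.5 GB, expressed in decimal MB,
# then split evenly over N slots and truncated to whole MB.
BASE_MB = 3000
PER_CORE_MB = 1500


def slot_memory_mb(n_cores: int) -> int:
    """Memory per slot in MB for an N-core VM under the assumed rule."""
    total_mb = BASE_MB + n_cores * PER_CORE_MB
    return total_mb // n_cores


def slot_share_percent(n_cores: int) -> float:
    """Even share of swap/disk per slot, as a percentage."""
    return round(100 / n_cores, 2)


print(slot_memory_mb(7))      # 1928
print(slot_share_percent(7))  # 14.29
```

Both numbers match the "slot type 0" line above for 7 cores, which is at least consistent with the rule, though it does not prove that this is how the shares are actually computed.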

Did someone know more about it, or is it a coincidence that the default 'unlimited' number of CPUs was 6 before Laurence/Nils changed it?

The 6-core VM is running and working OK; however, I still have the unanswered question of why the cmsRun processes are started pairwise at 20-minute intervals.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4438 - Posted: 4 Dec 2016, 9:33:10 UTC - in response to Message 4437.  

I still have the unanswered question of why the cmsRun processes are started pairwise at 20-minute intervals.



One possible reason is so that they do not all finish at the same time and choke the upload. I think that is not such a bad thing.

CMS-jobs are all pretty much the same duration, unlike Theory.

maeax
Joined: 22 Apr 16
Posts: 664
Credit: 1,791,291
RAC: 3,164
Message 4440 - Posted: 4 Dec 2016, 10:59:49 UTC

BOINC 7.6.22:
Press Ctrl+Shift+O.
Select mem_usage_debug as a flag to see how much memory is used.
The usage is shown in the BOINC messages.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,874,101
RAC: 172
Message 4441 - Posted: 4 Dec 2016, 12:09:42 UTC - in response to Message 4438.  

I still have the unanswered question of why the cmsRun processes are started pairwise at 20-minute intervals.

One possible reason is so that they do not all finish at the same time and choke the upload. I think that is not such a bad thing.

CMS-jobs are all pretty much the same duration, unlike Theory.

I'm seeing this behaviour too. I've got two 6-CPU tasks running. They both started running two jobs, and after 20 minutes they both started another two. It would seem that there are several tuning parameters that we are not fully aware of.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 431
Message 4442 - Posted: 4 Dec 2016, 12:11:56 UTC - in response to Message 4437.  

The 6-core VM is running and working OK; however, I still have the unanswered question of why the cmsRun processes are started pairwise at 20-minute intervals.

The first job (9550) of the six that were started returned after 2 hours of runtime with error code 134.
http://dashb-cms-job.cern.ch/dashboard/request.py/detailView?schedulerJobId=https://glidein.cern.ch/9550/161129:200356:ireid:crab:BPH-RunIISummer15GS-00046:BB_1

It did only 3000 events and then 10,000 times

EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0
EvtGen:Will take first kinematically allowed decay in the decay table
EvtGen:Could not decay:pi0 with mass:0 will throw event away!


ending with

EvtGen:Your event has been rejected 10000 times!
EvtGen:Will now abort.
Complete
process id is 7758 status is 134
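
For what it's worth, a status of 134 fits the "Will now abort" line: by the usual shell convention, a status above 128 means 128 plus the number of the signal that ended the process. A small sketch of that decoding follows; it assumes the wrapper reports a shell-style status, and the helper name is made up for illustration:

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (values above 128 mean 128 + signal number)."""
    if status > 128:
        signum = status - 128
        try:
            # Signal names are platform dependent; on Linux, signal 6 is SIGABRT.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown"
        return f"killed by signal {signum} ({name})"
    return f"exited with code {status}"


print(describe_exit_status(134))
```

On a Linux host this reports signal 6, which is what abort() raises.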


The VM allocates 7896 MB RAM and the VM itself is now using about 6737 MB.
No swap used at all.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,874,101
RAC: 172
Message 4443 - Posted: 4 Dec 2016, 16:44:52 UTC - in response to Message 4442.  

The 6-core VM is running and working OK; however, I still have the unanswered question of why the cmsRun processes are started pairwise at 20-minute intervals.

The first job (9550) of the six that were started returned after 2 hours of runtime with error code 134.
http://dashb-cms-job.cern.ch/dashboard/request.py/detailView?schedulerJobId=https://glidein.cern.ch/9550/161129:200356:ireid:crab:BPH-RunIISummer15GS-00046:BB_1

It did only 3000 events and then 10,000 times

EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0
EvtGen:Will take first kinematically allowed decay in the decay table
EvtGen:Could not decay:pi0 with mass:0 will throw event away!


ending with

EvtGen:Your event has been rejected 10000 times!
EvtGen:Will now abort.
Complete
process id is 7758 status is 134


The VM allocates 7896 MB RAM and the VM itself is now using about 6737 MB.
No swap used at all.

I'm afraid that failure mode is a "feature" of the underlying software, as far as I can make out. Now admittedly we are pushing it hard, looking for extremely rare decays, but if it were my software I'd be looking to understand, and preferably fix, that behaviour. Just sayin'.

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 473
Credit: 389,411
RAC: 62
Message 4444 - Posted: 4 Dec 2016, 17:30:20 UTC - in response to Message 4435.  

Thanks, Laurence. Do we know where the 3 GB + N*1.5 GB for TotalMemory comes from?

3 GB +x for a singlecore?
Are you sure the 3 GB does not already include the RAM usage of the first core?
At least the most recent tests point in this direction.

ALT+F3 (top) for the currently running task shows a maximum RAM usage of 0.9 GB for each cmsRun.
Most of them start around 0.55 GB.

If we allow 1.5 × the per-cmsRun figure for the OS of the VM (including its internal cache), that comes to roughly 1.4 GB.
Plus 0.9 GB per core.

1 core: 2.3 GB
2 cores: 3.2 GB
3 cores: 4.1 GB
4 cores: 5.0 GB
...
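
The table above can be reproduced with a few lines. This is only a sketch of the estimate: the 0.9 GB per cmsRun and the roughly 1.4 GB for the OS are the figures from this post, the "3 GB + N*1.5 GB" rule is the one quoted at the top, and the function names are invented for illustration:

```python
CMSRUN_PEAK_GB = 0.9   # observed maximum per cmsRun (from top)
OS_OVERHEAD_GB = 1.4   # roughly 1.5 x 0.9 GB for the VM's OS and cache


def estimated_vm_ram_gb(n_cores: int) -> float:
    """Estimate from this post: OS overhead plus 0.9 GB per core."""
    return round(OS_OVERHEAD_GB + n_cores * CMSRUN_PEAK_GB, 1)


def quoted_total_memory_gb(n_cores: int) -> float:
    """The quoted rule: 3 GB + N * 1.5 GB."""
    return 3.0 + 1.5 * n_cores


for n in range(1, 5):
    print(f"{n} core(s): estimate {estimated_vm_ram_gb(n)} GB, "
          f"quoted rule {quoted_total_memory_gb(n)} GB")
```

The gap between the two grows with the core count, so the choice matters more for the larger VMs.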

By the way:
2.33 GB (WP size) is the current setting of the singlecore app on the production server.


The objective should be
- to spend enough RAM to run the cmsRun tasks without problems
- to keep the RAM requirement low so that more hosts can fulfill it

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,874,101
RAC: 172
Message 4445 - Posted: 4 Dec 2016, 18:18:45 UTC - in response to Message 4444.  

Thanks, Laurence. Do we know where the 3 GB + N*1.5 GB for TotalMemory comes from?

3 GB +x for a singlecore?
Are you sure the 3 GB does not already include the RAM usage of the first core?
At least the most recent tests point in this direction.

ALT+F3 (top) for the currently running task shows a maximum RAM usage of 0.9 GB for each cmsRun.
Most of them start around 0.55 GB.

If we allow 1.5 × the per-cmsRun figure for the OS of the VM (including its internal cache), that comes to roughly 1.4 GB.
Plus 0.9 GB per core.

1 core: 2.3 GB
2 cores: 3.2 GB
3 cores: 4.1 GB
4 cores: 5.0 GB
...

By the way:
2.33 GB (WP size) is the current setting of the singlecore app on the production server.


The objective should be
- to spend enough RAM to run the cmsRun tasks without problems
- to keep the RAM requirement low so that more hosts can fulfill it

Tja, but you have to look at total usage (I think...). It's hard for me to look at these things from home at the weekend, but when I checked my 1-CPU tasks yesterday their total memory usage was nearly 2 GB. This won't scale linearly, of course, but I can't really check until tomorrow afternoon. (My WiFi bandwidth is dominated by my weekly laptop backup just at the moment.)

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 473
Credit: 389,411
RAC: 62
Message 4446 - Posted: 4 Dec 2016, 18:39:59 UTC - in response to Message 4445.  
Last modified: 4 Dec 2016, 18:59:46 UTC

... but you have to look at total usage (I think...). ...

"top's_total" shows the physical RAM usage including the RAM used for the cache.
Your "apps_and_os total" is "cache - top's_total".

Edit:
Sorry, "top's_total - cache" of course.

Edit 2:
Sorry again.
It should be:
"top's_total" -> physical RAM usable by the OS
"top's_used" -> physical RAM usage including the RAM used for the cache
"apps_and_os used" = "top's_used - cache"
ID: 4446 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
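
A minimal sketch of the "used minus cache" calculation from Edit 2 in message 4446, working from /proc/meminfo-style text rather than from top's screen. It assumes "cache" means Buffers plus Cached, and the sample numbers are made up, not taken from a real VM:

```python
def parse_meminfo_kb(text: str) -> dict:
    """Parse lines such as 'MemTotal:  8000000 kB' into a dict of kB values."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def apps_and_os_used_kb(info: dict) -> int:
    """RAM used by applications and the OS, excluding cache (assumed: Buffers + Cached)."""
    used = info["MemTotal"] - info["MemFree"]
    cache = info["Buffers"] + info["Cached"]
    return used - cache


# Hypothetical sample values, for illustration only.
SAMPLE = """MemTotal:  8000000 kB
MemFree:   1000000 kB
Buffers:    200000 kB
Cached:    2800000 kB"""

print(apps_and_os_used_kb(parse_meminfo_kb(SAMPLE)))  # 4000000
```

Inside a running VM the same function could be fed the contents of /proc/meminfo instead of the sample string.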
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 473
Credit: 389,411
RAC: 62
Message 4447 - Posted: 4 Dec 2016, 19:48:40 UTC

It seems that it is not (or not only) the memory setting that leaves some slots idle.

I configured 1 VM with 3 cores and 7.5 GB RAM on the first host and 1 VM with 3 cores and 4.5 GB RAM on the other host.

Both VMs started with 2 tasks and after a delay of 20 minutes they started a 3rd task (each).
So far so good (or bad because of the 20 min delay).

After some hours I observed idle slots for longer periods on both VMs. Sometimes 1 idle slot, sometimes 2.
At the moment the 4.5 GB VM is running 3 tasks while the 7.5 GB VM is running only 1 task.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 431
Message 4449 - Posted: 4 Dec 2016, 21:00:07 UTC

I paused the 6-core VM for over 2 hours and all 6 cmsRun processes were killed:

12/04/16 21:44:28 CCBListener: no activity from CCB server in 8180s; assuming connection is dead.
12/04/16 21:44:28 CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
12/04/16 21:44:29 Starter pid 8791 exited with status 2
12/04/16 21:44:29 slot1: State change: starter exited
12/04/16 21:44:32 Starter pid 9135 exited with status 2
12/04/16 21:44:32 slot2: State change: starter exited
12/04/16 21:44:34 Starter pid 9897 exited with status 2
12/04/16 21:44:34 slot3: State change: starter exited
12/04/16 21:44:35 Starter pid 8312 exited with status 2
12/04/16 21:44:35 slot5: State change: starter exited
12/04/16 21:45:09 Starter pid 10316 exited with status 2
12/04/16 21:45:09 slot6: State change: starter exited
12/04/16 21:45:10 Starter pid 9550 exited with status 2
12/04/16 21:45:10 slot4: State change: starter exited

Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 750
Credit: 11,600,962
RAC: 1,722
Message 4450 - Posted: 4 Dec 2016, 21:21:37 UTC

Besides the typical VB tasks not starting, or not getting the credentials/HTCondor ping, when they have to deal with my DSL at dial-up speed......

If I forget to d/l a new multi-core before the last one is finished it makes me d/l that .vdi AGAIN

So, like just now, when it said "no heartbeat" and that task got removed, and I didn't have a new one waiting and suspended, I have to once again d/l this 254.72 MB .vdi ..... and at this time of day it will take FIVE hours to do this.

Of course if I wait 12 hours and it is past midnight I could do it in maybe 30mins.
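
To put rough numbers on those two download times, here is a sketch that converts them into an effective line speed. It assumes decimal units (1 MB = 8000 kbit) and takes the 254.72 MB, 5 hours and 30 minutes from this post at face value; the function name is just for illustration:

```python
VDI_MB = 254.72  # size of the .vdi as stated above


def throughput_kbit_s(size_mb: float, seconds: float) -> float:
    """Effective throughput in kbit/s, assuming decimal units (1 MB = 8000 kbit)."""
    return round(size_mb * 8000 / seconds, 1)


print(throughput_kbit_s(VDI_MB, 5 * 3600))  # 113.2
print(throughput_kbit_s(VDI_MB, 30 * 60))   # 1132.1
```

So the daytime figure is about a tenth of the after-midnight one.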

It sure would be nice if the VB tasks would sit there and wait for the entire d/l that I watch on the VM Console.... instead of becoming an Error after 11 mins.
Mad Scientist For Life

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 473
Credit: 389,411
RAC: 62
Message 4452 - Posted: 4 Dec 2016, 21:42:12 UTC - in response to Message 4450.  

... and this time of day it will take FIVE hours to do this. ...
... if I wait 12 hours and it is past midnight I could do it in maybe 30mins. ...

Less than 10 seconds from my local squid.
We had that squid discussion in another thread.

My (BOINC)-squid is installed on an old, reactivated laptop with only 2 GB RAM and a slow hard disk.
I have used squid for several years, as I had similar problems with a very slow internet connection.
It is also available for Windows.

Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 750
Credit: 11,600,962
RAC: 1,722
Message 4453 - Posted: 5 Dec 2016, 3:36:40 UTC - in response to Message 4452.  

... and this time of day it will take FIVE hours to do this. ...
... if I wait 12 hours and it is past midnight I could do it in maybe 30mins. ...

Less than 10 seconds from my local squid.
We had that squid discussion in another thread.

My (BOINC)-squid is installed on an old, reactivated laptop with only 2 GB RAM and a slow hard disk.
I have used squid for several years, as I had similar problems with a very slow internet connection.
It is also available for Windows.


As usual I am running a row of these 2-core tasks that won't get to the Credential point, so they hit that "VM Heartbeat file specified, but missing" point and crash...... this after running 5 straight with no problems.

I have 6 computers here, all on Windows OS, and this laptop has an SSD, 8 GB RAM and 8 cores, so it isn't the problem.

I even watch the network running with Speccy and the speed is fine yet this keeps happening.

At the same time I run 15 LHC VB tasks every day, all Valids, and still about 10 VB tasks at vLHC.

I am on the Squid site right now and checking it out but the last thing I need is more to deal with ......
Mad Scientist For Life

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 431
Message 4456 - Posted: 5 Dec 2016, 8:17:25 UTC - in response to Message 4449.  

I paused the 6-core VM for over 2 hours and all 6 cmsRun processes were killed:

12/04/16 21:44:28 CCBListener: no activity from CCB server in 8180s; assuming connection is dead.
12/04/16 21:44:28 CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.


I thought we had fixed suspending a task for a much longer period:

12/05/16 09:00:32 (pid:10508) Lost connection to shadow, waiting 7200 secs for reconnect
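
A quick check of the two numbers in these logs, as a sketch: it assumes the same 7200 s reconnect window also applied during the pause that produced the 8180 s gap, which the logs quoted here do not confirm.

```python
from datetime import timedelta

RECONNECT_WINDOW_S = 7200  # from "waiting 7200 secs for reconnect"
IDLE_S = 8180              # from "no activity from CCB server in 8180s"


def outlasts_window(idle_s: int, window_s: int = RECONNECT_WINDOW_S) -> bool:
    """True if the idle period is longer than the reconnect window."""
    return idle_s > window_s


print(timedelta(seconds=IDLE_S))                       # 2:16:20
print(outlasts_window(IDLE_S))                         # True
print(timedelta(seconds=IDLE_S - RECONNECT_WINDOW_S))  # 0:16:20
```

If that assumption holds, the pause overran the window by a bit more than 16 minutes, which would be enough to explain the killed jobs.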


©2024 CERN