Message boards : CMS Application : Multi-core VM
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1182
Credit: 815,866
RAC: 245
Message 4385 - Posted: 1 Dec 2016, 15:56:00 UTC - in response to Message 4384.  

When we moved to plan classes to address the memory issue, we did not specify max_threads so it defaulted to 1 CPU. This value has now been set to 32 and it is working for Theory. It should also work for CMS but this has not been tested.

Tested with a 3-core VM.

First I detached from and reattached to the dev project (https URL) to start clean, without using an app_config.xml.

I set it up for 1 job with 3 CPUs.
A VM was created with 3 cores and 2048 MB of base RAM (the default for a single core).

As reported before, the memory is too low for 3 CMS jobs and the automatic RAM adjustment is not working.

Result: https://lhcathome.cern.ch/vLHCathome-dev/result.php?resultid=290496

StartLog:
12/01/16 16:37:05 ******************************************************
12/01/16 16:37:05 ** condor_startd (CONDOR_STARTD) STARTING UP
12/01/16 16:37:05 ** /usr/sbin/condor_startd
12/01/16 16:37:05 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
12/01/16 16:37:05 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
12/01/16 16:37:05 ** $CondorVersion: 8.4.8 Jun 30 2016 BuildID: 373513 $
12/01/16 16:37:05 ** $CondorPlatform: x86_64_RedHat6 $
12/01/16 16:37:05 ** PID = 4194
12/01/16 16:37:05 ** Log last touched time unavailable (No such file or directory)
12/01/16 16:37:05 ******************************************************
12/01/16 16:37:05 Using config source: /etc/condor/condor_config
12/01/16 16:37:05 Using local config sources:
12/01/16 16:37:05 /etc/condor/config.d/10_security.config
12/01/16 16:37:05 /etc/condor/config.d/14_network.config
12/01/16 16:37:05 /etc/condor/config.d/20_workernode.config
12/01/16 16:37:05 /etc/condor/config.d/30_lease.config
12/01/16 16:37:05 /etc/condor/config.d/35_cms.config
12/01/16 16:37:05 /etc/condor/config.d/40_ccb.config
12/01/16 16:37:05 /etc/condor/condor_config.local
12/01/16 16:37:05 config Macros = 156, Sorted = 156, StringBytes = 6020, TablesBytes = 5712
12/01/16 16:37:05 CLASSAD_CACHING is ENABLED
12/01/16 16:37:05 Daemon Log is logging: D_ALWAYS D_ERROR
12/01/16 16:37:05 Daemoncore: Listening at <10.0.2.15:36375> on TCP (ReliSock).
12/01/16 16:37:05 DaemonCore: command socket at <10.0.2.15:36375?addrs=10.0.2.15-36375&noUDP>
12/01/16 16:37:05 DaemonCore: private command socket at <10.0.2.15:36375?addrs=10.0.2.15-36375>
12/01/16 16:37:17 CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#1360863
12/01/16 16:37:17 HibernationSupportedStates invalid '' in ad from hibernation plugin /usr/libexec/condor/condor_power_state
12/01/16 16:37:17 VM-gahp server reported an internal error
12/01/16 16:37:17 VM universe will be tested to check if it is available
12/01/16 16:37:17 History file rotation is enabled.
12/01/16 16:37:17 Maximum history file size is: 20971520 bytes
12/01/16 16:37:17 Number of rotated history files is: 2
12/01/16 16:37:17 Allocating auto shares for slot type 0: Cpus: auto, Memory: auto, Swap: auto, Disk: auto
slot type 0: Cpus: 1.000000, Memory: 1000, Swap: 33.33%, Disk: 33.33%
slot type 0: Cpus: 1.000000, Memory: 1000, Swap: 33.33%, Disk: 33.33%
slot type 0: Cpus: 1.000000, Memory: 1000, Swap: 33.33%, Disk: 33.33%
12/01/16 16:37:17 slot1: New machine resource allocated
12/01/16 16:37:17 Setting up slot pairings
12/01/16 16:37:17 slot2: New machine resource allocated
12/01/16 16:37:17 Setting up slot pairings
12/01/16 16:37:17 slot3: New machine resource allocated
12/01/16 16:37:17 Setting up slot pairings
12/01/16 16:37:17 CronJobList: Adding job 'mips'
12/01/16 16:37:17 CronJobList: Adding job 'kflops'
12/01/16 16:37:17 CronJob: Initializing job 'mips' (/usr/libexec/condor/condor_mips)
12/01/16 16:37:17 CronJob: Initializing job 'kflops' (/usr/libexec/condor/condor_kflops)
12/01/16 16:37:17 slot1: State change: IS_OWNER is false
12/01/16 16:37:17 slot1: Changing state: Owner -> Unclaimed
12/01/16 16:37:17 State change: RunBenchmarks is TRUE
12/01/16 16:37:17 slot1: Changing activity: Idle -> Benchmarking
12/01/16 16:37:17 BenchMgr:StartBenchmarks()
12/01/16 16:37:17 slot2: State change: IS_OWNER is false
12/01/16 16:37:17 slot2: Changing state: Owner -> Unclaimed
12/01/16 16:37:17 State change: RunBenchmarks is TRUE
12/01/16 16:37:17 slot2: Changing activity: Idle -> Benchmarking
12/01/16 16:37:17 slot2: Changing activity: Benchmarking -> Idle
12/01/16 16:37:17 slot3: State change: IS_OWNER is false
12/01/16 16:37:17 slot3: Changing state: Owner -> Unclaimed
12/01/16 16:37:17 State change: RunBenchmarks is TRUE
12/01/16 16:37:17 slot3: Changing activity: Idle -> Benchmarking
12/01/16 16:37:17 slot3: Changing activity: Benchmarking -> Idle
12/01/16 16:37:37 State change: benchmarks completed
12/01/16 16:37:37 slot1: Changing activity: Benchmarking -> Idle

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,449
RAC: 238
Message 4386 - Posted: 1 Dec 2016, 16:35:19 UTC - in response to Message 4385.  

I have set the base memory to 0 and then 2GB per core. Hopefully that will work then we can work on optimizing.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1182
Credit: 815,866
RAC: 245
Message 4388 - Posted: 1 Dec 2016, 17:45:59 UTC - in response to Message 4386.  
Last modified: 1 Dec 2016, 18:00:11 UTC

I have set the base memory to 0 and then 2GB per core. Hopefully that will work then we can work on optimizing.

That makes no difference. Same as before: a computation error after 12½ minutes of runtime. During that period only ~600 MB of RAM was used.

With the next task I got, I manually increased the VM-RAM (without using an app_config) to 5688 MB and after 9 minutes 2 cmsRun's started (for Ivan: 3138 & 3144).
I've seen before that the 3rd one will start 20 minutes later (job 1789).

Your RAM setting is not reaching BOINC's client_state.xml.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4390 - Posted: 1 Dec 2016, 18:10:39 UTC
Last modified: 1 Dec 2016, 18:13:14 UTC

The bare minimum RAM requirements are (to the nearest 128MB)

1 CORE : 1152 MB

2 CORE : 2176 MB

4 CORE : 5248 MB

This is just enough to start up at all. You might want to add a little more to make sure it finishes.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1182
Credit: 815,866
RAC: 245
Message 4391 - Posted: 1 Dec 2016, 18:17:11 UTC - in response to Message 4390.  

2 CORE : 2176 MB

4 CORE : 5248 MB

Typo? 4 core more than twice a dual core?

My 3-core VM, with 3 cmsRun processes running, is now using 3676240k.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4392 - Posted: 1 Dec 2016, 18:19:58 UTC - in response to Message 4391.  
Last modified: 1 Dec 2016, 18:23:34 UTC

No typo.
Try it for yourself.

EDIT: 4 core will not start with 5120 MB

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4394 - Posted: 1 Dec 2016, 18:37:20 UTC - in response to Message 4386.  

I have set the base memory to 0 and then 2GB per core. Hopefully that will work then we can work on optimizing.



I believe this can be set to 1536 MB per core, up to at least 16 cores.

2 GB will prevent the 4th task from running on an 8 GB machine.

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,449
RAC: 238
Message 4395 - Posted: 1 Dec 2016, 19:33:48 UTC - in response to Message 4394.  

I have set the base memory to 1128 MB and then 1128 MB per core. I will investigate myself tomorrow if this doesn't work.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1182
Credit: 815,866
RAC: 245
Message 4397 - Posted: 1 Dec 2016, 20:42:22 UTC

I tried a second task with your new settings without success.

https://lhcathome.cern.ch/vLHCathome-dev/result.php?resultid=290649

The amount of RAM is a secondary problem at the moment.
Getting the right RAM assigned to the VM is more important now.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4399 - Posted: 2 Dec 2016, 8:55:08 UTC - in response to Message 4395.  

I have set the base memory to 1128 MB and then 1128 MB per core. I will investigate myself tomorrow if this doesn't work.


This still does not solve the issue of not being able to run a 4th task on 8 GB of memory.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4400 - Posted: 2 Dec 2016, 9:00:54 UTC

The bare minimum RAM requirements, revised to the nearest 32 MB (earlier figure to the nearest 128 MB -> revised figure):

1 CORE : 1152 MB -> 1056 MB

2 CORE : 2176 MB -> 2080 MB

4 CORE : 5248 MB -> 5184 MB (confirmed: the task finishes with a valid result)

This is just enough to start up at all. You might want to add a little more to make sure it finishes.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1182
Credit: 815,866
RAC: 245
Message 4401 - Posted: 2 Dec 2016, 9:03:46 UTC - in response to Message 4388.  

With the next task I got, I manually increased the VM-RAM (without using an app_config) to 5688 MB and after 9 minutes 2 cmsRun's started (for Ivan: 3138 & 3144).
I've seen before that the 3rd one will start 20 minutes later (job 1789).

This task, with manually adjusted RAM and 3 cmsRun processes running simultaneously, has finished:

https://lhcathome.cern.ch/vLHCathome-dev/result.php?resultid=290591

The maximum RAM-usage within this 3-core VM was 4930MB and no swap was used.

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,449
RAC: 238
Message 4403 - Posted: 2 Dec 2016, 9:13:49 UTC - in response to Message 4395.  

It is working for me on my machine. The VM has 3384 MB (1128 base + 1128 x 2 CPUs). It is now running two jobs and I will leave it to finish.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=290697
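
The arithmetic behind the 3384 MB figure can be sketched as below. This is only an illustration of the base-plus-per-core formula described in this thread, not the server's actual code. The names `vm_memory_mb` and `REPORTED_MIN_MB` are made up for the example, and the minimum values are simply the ones reported in Message 4400, not project-confirmed limits.

```python
# Hypothetical helper: VM RAM = base amount + per-core amount * number of CPUs.
def vm_memory_mb(base_mb: int, per_core_mb: int, ncpus: int) -> int:
    return base_mb + per_core_mb * ncpus


# Minimum RAM (MB) needed for the VM to start, as reported in Message 4400,
# keyed by core count. Treated here as an assumption, not a guarantee.
REPORTED_MIN_MB = {1: 1056, 2: 2080, 4: 5184}

if __name__ == "__main__":
    for ncpus, minimum in sorted(REPORTED_MIN_MB.items()):
        assigned = vm_memory_mb(1128, 1128, ncpus)
        status = "ok" if assigned >= minimum else "too low"
        print(f"{ncpus} core(s): assigned {assigned} MB, "
              f"reported minimum {minimum} MB -> {status}")
```

With a base of 1128 MB and 1128 MB per core, every core count listed in Message 4400 ends up above its reported minimum, provided the value actually reaches the VM.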

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1182
Credit: 815,866
RAC: 245
Message 4404 - Posted: 2 Dec 2016, 9:34:59 UTC - in response to Message 4403.  

It is working for me on my machine. The VM has 3384 MB (1128 base + 1128 x 2 CPUs). It is now running two jobs and I will leave it to finish.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=290697

I suppose you don't use an app_config.xml, so which CMS_year_mm_dd.xml do you have and what are its contents?

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 475
Credit: 389,411
RAC: 28
Message 4405 - Posted: 2 Dec 2016, 9:51:57 UTC - in response to Message 4399.  
Last modified: 2 Dec 2016, 9:58:17 UTC

This still does not solve the issue of not being able to run a 4th task on 8 GB of memory.

I guess this could be due to a feature of the BOINC client.

The client sums up the RAM requests of all currently running BOINC tasks (RAM1 = sum_of_rsc_memory_bound_currently_running) and compares this value with RAM2 = installed_RAM * ram_max_used_xxx_pct.


If RAM2 - RAM1 < rsc_memory_bound_of_task_ready_to_start, an additional task can't start.

In the case of CMS:
rsc_memory_bound = 2.33 GB
physical RAM = 8 GB
RAM requested for 3 tasks = 6.99 GB => fits in 8GB
RAM requested for 4 tasks = 9.32 GB => too much for 8 GB

Edit: 8 GB only if ram_max_used_xxx_pct is set to 100% (via the GUI). Otherwise less.
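
The check described above can be written out as a small sketch. It only mirrors the rule as stated in this post and is not taken from the BOINC client source; the function name `can_start` and its parameters are invented for the illustration.

```python
from typing import Sequence

GB = 1024 ** 3  # bytes per GB, used only to keep the numbers readable


def can_start(installed_ram: float, ram_max_used_pct: float,
              running_bounds: Sequence[float], next_bound: float) -> bool:
    """Return True if one more task fits under the rule described in the post."""
    ram1 = sum(running_bounds)                       # RAM requested by running tasks
    ram2 = installed_ram * ram_max_used_pct / 100.0  # RAM BOINC is allowed to use
    return ram2 - ram1 >= next_bound


if __name__ == "__main__":
    bound = 2.33 * GB  # rsc_memory_bound quoted in the post for CMS
    for already_running in range(4):
        fits = can_start(8 * GB, 100.0, [bound] * already_running, bound)
        print(f"{already_running} running -> next task can start: {fits}")
```

On an 8 GB host with the limit at 100%, the sketch lets a task start while 0, 1 or 2 tasks are running and refuses it once 3 are running, which matches the 6.99 GB versus 9.32 GB figures above.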

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4406 - Posted: 2 Dec 2016, 9:52:31 UTC - in response to Message 4401.  

Swap file usage inside the VM is irrelevant if the swapping happens in RAM (cache).

Only "true" disk swapping causes a significant slowdown.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4407 - Posted: 2 Dec 2016, 9:54:38 UTC - in response to Message 4405.  

I think you are correct.

However, the task issuer determines the memory requirement.

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 475
Credit: 389,411
RAC: 28
Message 4408 - Posted: 2 Dec 2016, 10:33:20 UTC

RAM request/usage is controlled via a couple of variables.

1. Via <memory_size_mb>1234</memory_size_mb> in the file CMS_yyyy_mm_dd.xml
2. Via <cmdline>--memory_size_mb 2345</cmdline> in the file sched_reply_xxx.xml
3. Via <rsc_memory_bound>1234567890.000000</rsc_memory_bound> in the file sched_reply_xxx.xml
4. Via <cmdline>--memory_size_mb 2345</cmdline> in the file app_config.xml

Higher-numbered entries (if they exist) overrule lower-numbered ones.
It is therefore no surprise that after a project reset we end up with VMs using 2048 MB since this is the value in the most recent CMS_yyyy_mm_dd.xml.

If a formula is used on the server to determine the RAM request of the VM, and this value is set via <cmdline>--memory_size_mb 2345</cmdline>, then <rsc_memory_bound> should also be set accordingly.
This is the responsibility of the project developers.
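
A rough sketch of the precedence described in the list above, for the VM memory size only. It assumes that a higher number in the list wins, and it leaves out item 3 (<rsc_memory_bound>) on the assumption that it feeds BOINC's scheduling decision rather than the VM size. The source labels and the function name `effective_memory_mb` are made up for the illustration; this is not vboxwrapper or BOINC code.

```python
from typing import Dict, Optional

# Places where a memory_size_mb value may come from, lowest rank first,
# following the numbering used in the post (items 1, 2 and 4).
SOURCES = (
    "CMS_yyyy_mm_dd.xml",       # <memory_size_mb> in the VM job description
    "sched_reply cmdline",      # --memory_size_mb sent by the server
    "app_config.xml cmdline",   # --memory_size_mb set locally by the user
)


def effective_memory_mb(settings: Dict[str, int]) -> Optional[int]:
    """Return the value from the highest-ranked source that is present."""
    result = None
    for source in SOURCES:
        value = settings.get(source)
        if value is not None:
            result = value  # a later (higher-ranked) source overrules earlier ones
    return result


if __name__ == "__main__":
    # After a project reset only the job description file supplies a value.
    print(effective_memory_mb({"CMS_yyyy_mm_dd.xml": 2048}))
    # A local app_config.xml entry overrules it.
    print(effective_memory_mb({"CMS_yyyy_mm_dd.xml": 2048,
                               "app_config.xml cmdline": 5688}))
```

Under these assumptions, a host that receives no --memory_size_mb from the server and has no app_config.xml falls back to the 2048 MB from the job description file.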

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4409 - Posted: 2 Dec 2016, 10:44:15 UTC - in response to Message 4408.  
Last modified: 2 Dec 2016, 10:44:50 UTC

Sounds good.

The app_config.xml overrules the actual usage, but not necessarily the amount BOINC thinks is used.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1182
Credit: 815,866
RAC: 245
Message 4410 - Posted: 2 Dec 2016, 11:10:46 UTC - in response to Message 4408.  

1. Via <memory_size_mb>1234</memory_size_mb> in the file CMS_yyyy_mm_dd.xml
2. Via <cmdline>--memory_size_mb 2345</cmdline> in the file sched_reply_xxx.xml

I don't get a <cmdline>--memory_size_mb from the project, and since no app_config.xml is used, the default 2048 MB is also used for mt_tasks.

Q: Why is the --memory_size_mb parameter not sent to my system?


©2024 CERN