Message boards : CMS Application : Multi-core VM

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,397
RAC: 234
Message 4411 - Posted: 2 Dec 2016, 12:41:50 UTC - in response to Message 4404.  

I suppose you don't use an app_config.xml, so which CMS_year_mm_dd.xml do you have, and what are its contents?


CMS_2016_03_22.xml with CMS_2016_10_31.vdi
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1182
Credit: 815,528
RAC: 214
Message 4412 - Posted: 2 Dec 2016, 13:57:42 UTC - in response to Message 4411.  

I suppose you don't use an app_config.xml, so which CMS_year_mm_dd.xml do you have, and what are its contents?


CMS_2016_03_22.xml with CMS_2016_10_31.vdi

I had, and still have, the same files.
I reattached with the old, non-secure URL and now I get <cmdline> --memory_size_mb 3384</cmdline> with my work request (2 CPUs set).

It looks like the old and new URLs are not interchangeable.
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1182
Credit: 815,528
RAC: 214
Message 4413 - Posted: 2 Dec 2016, 15:34:37 UTC

This whole part is missing in the sched_reply.xml when attaching with the new URL https://lhcathome.cern.ch/vLHCathome-dev/

<app>
    <name>CMS</name>
    <user_friendly_name>CMS Simulation</user_friendly_name>
    <non_cpu_intensive>0</non_cpu_intensive>
    <fraction_done_exact>0</fraction_done_exact>
</app>
<file_info>
    <name>vboxwrapper_26196_windows_x86_64.exe</name>
    <url>http://lhcathomedev.cern.ch/vLHCathome-dev/download/vboxwrapper_26196_windows_x86_64.exe</url>
    <executable/>
    <file_signature>
428d85c43d7e1a7530937ee738375f8c3604e727f6e5a7a16b0ba6256836046d
1e07a5706106299a306210b41bfcba3406f1f594797747e5b8f70dcaf0a1a85c
cf472d88aefe910e3ab367f0accf899dcafa7e8ab30bca14e6bc0dfbca0de4d5
c3c8ce370d5dadb5138acab6b9b3b32f0903744d4d4148edbccad9f8a592b58a
.
    </file_signature>
    <nbytes>1370920</nbytes>
</file_info>
<file_info>
    <name>CMS_2016_03_22.xml</name>
    <url>http://lhcathomedev.cern.ch/vLHCathome-dev/download/CMS_2016_03_22.xml</url>
    <file_signature>
27c5337125b46e12a0962231ec20aefd6b12c40d604826ac4da8a5657dbdb29c
c2d315484e7f2ef1c4d18324e7aac95e566d9f878b9a8b02b59c5bfae9296511
3542fa6969ab661ce3f43287b83a9675116dfb78661717e3f85759db7631ea90
3bd1085404be01f5eb9aec4862a332558f15cef893243746190bd87cc57a0705
.
    </file_signature>
    <nbytes>577</nbytes>
</file_info>
<file_info>
    <name>CMS_2016_10_31.vdi</name>
    <url>http://lhcathomedev.cern.ch/vLHCathome-dev/download/CMS_2016_10_31.vdi</url>
    <gzipped_url>http://lhcathomedev.cern.ch/vLHCathome-dev/download/CMS_2016_10_31.vdi.gz</gzipped_url>
    <file_signature>
847e52ae60465ee5ca009ea01e1b9ce7285264e6a449777643dcce9f15fa40c9
415ee2de817841186d83640b8b15b1d21199bb248832f70190a8d8977f626213
9b059a8003e32e5c1df130f4f8639a72b7a85d3b433ecc3263cfc1507725aac9
9f293f2b5ccd6a2b6346dfc905b686f11947faaf843f2e30dac429a1f5e81842
.
    </file_signature>
    <nbytes>1730150400</nbytes>
    <gzipped_nbytes>665579700</gzipped_nbytes>
</file_info>
<file_info>
    <name>vboxwrapper_26196_windows_x86_64.pdb</name>
    <url>http://lhcathomedev.cern.ch/vLHCathome-dev/download/vboxwrapper_26196_windows_x86_64.pdb</url>
    <executable/>
    <file_signature>
5e9905841873b9e3a47e154b042b5e14ea13ac7847d2fc3d1d0aca1044b4b374
23988a98d8b4b7e0721bb04c51c878e20387bfc8f79a5761f874a3e2ca7a69d2
609942e0e69ed16879bff9200e69409aa7686f733a0667c530951a00ebd5f53f
7f2804b785c204a8d228deb7e983a8a28875ab4b1829c0847a79c32ed489c71d
.
    </file_signature>
    <nbytes>6310912</nbytes>
</file_info>
<app_version>
    <app_name>CMS</app_name>
    <version_num>4770</version_num>
    <api_version>7.7.0</api_version>
<file_ref>
    <file_name>vboxwrapper_26196_windows_x86_64.exe</file_name>
    <main_program/>
</file_ref>
<file_ref>
    <file_name>CMS_2016_03_22.xml</file_name>
    <open_name>vbox_job.xml</open_name>
</file_ref>
<file_ref>
    <file_name>CMS_2016_10_31.vdi</file_name>
    <open_name>vm_image.vdi</open_name>
    <copy_file/>
</file_ref>
<file_ref>
    <file_name>vboxwrapper_26196_windows_x86_64.pdb</file_name>
</file_ref>
    <dont_throttle/>
    <needs_network/>
    <platform>windows_x86_64</platform>
    <plan_class>vbox64_mt_mcore_cms</plan_class>
    <avg_ncpus>2.000000</avg_ncpus>
    <max_ncpus>2.000000</max_ncpus>
    <flops>30991330939.967888</flops>
    <cmdline> --memory_size_mb 3384</cmdline>
</app_version>
<workunit>
    <rsc_fpops_est>1000000000005000.000000</rsc_fpops_est>
    <rsc_fpops_bound>2000000000000000000.000000</rsc_fpops_bound>
    <rsc_memory_bound>3548381184.000000</rsc_memory_bound>
    <rsc_disk_bound>8000000000.000000</rsc_disk_bound>
    <name>CMS_9704_1480644979.690483</name>
    <app_name>CMS</app_name>
</workunit>
<result>
<report_deadline>1481291397</report_deadline>
<wu_name>CMS_9704_1480644979.690483</wu_name>
<name>CMS_9704_1480644979.690483_0</name>
			<report_immediately/>
	
    <platform>windows_x86_64</platform>
    <version_num>4770</version_num>
    <plan_class>vbox64_mt_mcore_cms</plan_class>
</result>
<code_sign_key>
1024
d26a9d6cba06f561aabe6dab5d76b59a087d8c84b6445082d44059429a2f5c2e
f51f5ae57167c8afda52df605193dce8088016d284967af06532ac4056056d33
b19b863b1cbe0278aeedb9fe509f1a73dd50cc82e73724066e4f58b52f299abc
107391364c19db2dc7999611c745b9956cdc4ba43ca9f581b169eb51977b08d5
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000010001
.
</code_sign_key>
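A minimal sketch of how the RAM request could be pulled out of such a reply, assuming the text of sched_reply.xml has already been read into a string. The function name `extract_memory_size_mb` is hypothetical, and a regular expression is used instead of an XML parser because the fragment quoted above has no single root element.

```python
import re


def extract_memory_size_mb(sched_reply_text):
    """Return the --memory_size_mb value from the first <cmdline>, or None.

    Hypothetical helper: it only looks at the first <cmdline> element
    and ignores everything else in the reply.
    """
    cmdline = re.search(r"<cmdline>(.*?)</cmdline>", sched_reply_text, re.DOTALL)
    if cmdline is None:
        return None
    match = re.search(r"--memory_size_mb\s+(\d+)", cmdline.group(1))
    return int(match.group(1)) if match else None


# Shortened copy of the <app_version> block quoted in the post above.
sample = """
<app_version>
    <app_name>CMS</app_name>
    <plan_class>vbox64_mt_mcore_cms</plan_class>
    <avg_ncpus>2.000000</avg_ncpus>
    <cmdline> --memory_size_mb 3384</cmdline>
</app_version>
"""

print(extract_memory_size_mb(sample))
```

A reply whose `<cmdline>` carries no `--memory_size_mb` option makes the helper return `None`, which is a quick way to spot a reply that does not set the VM memory.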
maeax
Joined: 22 Apr 16
Posts: 667
Credit: 1,807,614
RAC: 2,394
Message 4414 - Posted: 2 Dec 2016, 15:59:26 UTC
Last modified: 2 Dec 2016, 16:00:35 UTC

This task ended after TWO minutes (https is being used):

2016-12-02 16:54:04 (6072): Guest Log: [INFO] Reading volunteer information
2016-12-02 16:54:04 (6072): Guest Log: [INFO] Volunteer: maeax (378) Host: 1377
2016-12-02 16:54:04 (6072): Guest Log: [INFO] VMID: 6d0ae20b-f23e-4d5d-b5ca-600a8fb1d26c
2016-12-02 16:54:04 (6072): Guest Log: [INFO] Requesting an X509 credential from vLHC@home
2016-12-02 16:54:04 (6072): Guest Log: [INFO] Requesting an X509 credential from LHC@home
2016-12-02 16:54:04 (6072): Guest Log: [INFO] Requesting an X509 credential from vLHC@home-dev
2016-12-02 16:54:14 (6072): Guest Log: [INFO] CMS application starting. Check log files.
2016-12-02 16:54:14 (6072): Guest Log: [DEBUG] HTCondor ping
2016-12-02 16:54:14 (6072): Guest Log: [DEBUG] 139
2016-12-02 16:54:14 (6072): Guest Log: [DEBUG] 12/02/16 16:54:01 recognized DC_NOP as command name, using command 60011.
2016-12-02 16:54:14 (6072): Guest Log: [ERROR] Could not ping HTCondor.
2016-12-02 16:54:14 (6072): Guest Log: [INFO] Shutting Down.
2016-12-02 16:54:14 (6072): VM Completion File Detected.
2016-12-02 16:54:14 (6072): VM Completion Message: Could not ping HTCondor.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 475
Credit: 389,411
RAC: 28
Message 4415 - Posted: 2 Dec 2016, 16:21:58 UTC - in response to Message 4413.  

This whole part is missing in the sched_reply.xml when attaching with the new URL https://lhcathome.cern.ch/vLHCathome-dev/

...

Those parts are not always included in the request/reply.
They are sent only if something has changed since the last communication between the client and the server.
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1182
Credit: 815,528
RAC: 214
Message 4416 - Posted: 2 Dec 2016, 16:57:17 UTC - in response to Message 4415.  
Last modified: 2 Dec 2016, 17:00:36 UTC

This whole part is missing in the sched_reply.xml when attaching with the new URL https://lhcathome.cern.ch/vLHCathome-dev/

...

Those parts are not always included in the request/reply.
They are sent only if something has changed since the last communication between the client and the server.

That's probably right, so I reattached for the 20th time and looked at the differences between the last lines of the app_version part using the old and new URL:

OLD
<app_version>
..
..
<dont_throttle/>
<needs_network/>
<platform>windows_x86_64</platform>
<plan_class>vbox64_mt_mcore_cms</plan_class>
<avg_ncpus>2.000000</avg_ncpus>
<max_ncpus>2.000000</max_ncpus>
<flops>30991330939.967888</flops>
<cmdline> --memory_size_mb 3384</cmdline>
</app_version>

NEW
<app_version>
..
..
<dont_throttle/>
<needs_network/>
<platform>windows_x86_64</platform>
<plan_class>vbox64_mt_mcore_cms</plan_class>
<avg_ncpus>2.000000</avg_ncpus>
<max_ncpus>2.000000</max_ncpus>
<flops>30740549607.500664</flops>
<cmdline>--nthreads 2.000000</cmdline>
</app_version>

Conclusion: those who get the right RAM assigned are using the old (non-secure) URL, Laurence included.
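The comparison above can be repeated mechanically. This is a minimal sketch, assuming the two `<app_version>` blocks have been saved as strings with the elided `..` lines left out; the helper names `app_version_fields` and `diff_fields` are hypothetical.

```python
import xml.etree.ElementTree as ET

OLD = """
<app_version>
    <dont_throttle/>
    <needs_network/>
    <platform>windows_x86_64</platform>
    <plan_class>vbox64_mt_mcore_cms</plan_class>
    <avg_ncpus>2.000000</avg_ncpus>
    <max_ncpus>2.000000</max_ncpus>
    <flops>30991330939.967888</flops>
    <cmdline> --memory_size_mb 3384</cmdline>
</app_version>
"""

NEW = """
<app_version>
    <dont_throttle/>
    <needs_network/>
    <platform>windows_x86_64</platform>
    <plan_class>vbox64_mt_mcore_cms</plan_class>
    <avg_ncpus>2.000000</avg_ncpus>
    <max_ncpus>2.000000</max_ncpus>
    <flops>30740549607.500664</flops>
    <cmdline>--nthreads 2.000000</cmdline>
</app_version>
"""


def app_version_fields(xml_text):
    """Map each child tag of <app_version> to its stripped text."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: (child.text or "").strip() for child in root}


def diff_fields(old, new):
    """Return {tag: (old_value, new_value)} for tags whose values differ."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


differences = diff_fields(app_version_fields(OLD), app_version_fields(NEW))
for tag, (old_value, new_value) in differences.items():
    print(f"{tag}: {old_value!r} -> {new_value!r}")
```

Only `cmdline` and `flops` differ; the `flops` change is just a benchmark estimate, so the `cmdline` line is the one that matters for the RAM assignment.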
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,397
RAC: 234
Message 4417 - Posted: 2 Dec 2016, 19:35:53 UTC - in response to Message 4416.  

Thanks for pointing this out. It is because the http and https sites are on essentially different servers and the plan class was missing. I have added it so hopefully https will work now.
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1182
Credit: 815,528
RAC: 214
Message 4418 - Posted: 2 Dec 2016, 22:10:19 UTC - in response to Message 4417.  

Thanks for pointing this out. It is because the http and https sites are on essentially different servers and the plan class was missing. I have added it so hopefully https will work now.

I tested with 5 cores. The RAM assigned is 6768 MB, which agrees with your formula (1+5)*1128 MB.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 475
Credit: 389,411
RAC: 28
Message 4419 - Posted: 3 Dec 2016, 10:59:03 UTC

I have been running 1 CMS-dev WU (2 cores) on each of my 2 hosts for 11 hours, and I am confident they will finish successfully.
The server sent the RAM request correctly (as defined), but I limited the VMs to 2944 MB to check whether this is enough.

My experience regarding the RAM requirement is as follows:

Base (1 core): 2176 MB (2.125 GB)
Each additional core: 768 MB

1 core WU: 2176 MB
2 core WU: 2944 MB

Lower settings work, but the VM starts using more and more internal swap (this is not necessarily bad).
Higher values lead to more free (wasted) RAM inside the VM.
These numbers are slightly lower than the current formula on the server but higher than Rasputin42's bare minimum.
They reflect the RAM requirements of the most recent WUs. Future WUs may have different needs.
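To put the two sets of numbers side by side, here is a minimal sketch. `server_ram_mb` encodes the (1 + cores) * 1128 MB formula mentioned earlier in the thread, and `observed_ram_mb` encodes the base and per-core values listed above; both function names are hypothetical, and only the 1-core and 2-core observed values were actually measured.

```python
def server_ram_mb(cores):
    """RAM the server currently requests: (1 + cores) * 1128 MB."""
    return (1 + cores) * 1128


def observed_ram_mb(cores):
    """RAM found sufficient in practice: 2176 MB base plus 768 MB per extra core."""
    return 2176 + (cores - 1) * 768


# Only 1 and 2 cores were tested; higher core counts would be an extrapolation.
for cores in (1, 2):
    server = server_ram_mb(cores)
    observed = observed_ram_mb(cores)
    print(f"{cores} core(s): server {server} MB, observed {observed} MB, "
          f"difference {server - observed} MB")
```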
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,718
RAC: 266
Message 4424 - Posted: 3 Dec 2016, 13:28:32 UTC

Ah, I think I may have spotted why my multi-core tasks don't get assigned CMS jobs. I just tried again, and while the 2x8 slots were active on the Condor server I did an analysis of one of the jobs in the queue:

1736464.000: Run analysis summary. Of 744 machines,
16 are rejected by your job's requirements
72 reject your job because of their own requirements
654 match and are already running your jobs
0 match but are serving other users
2 are available to run your job

The Requirements expression for your job is:

( ( ( target.IS_GLIDEIN isnt true ) ||
( target.GLIDEIN_CMSSite isnt undefined ) ) &&
( GLIDEIN_REQUIRED_OS is "rhel6" || OpSysMajorVer is 6 ) ) &&
( ( Memory >= 1 ) && ( Disk >= 1 ) ) && ( TARGET.Arch == "X86_64" ) &&
( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) &&
( TARGET.Memory >= RequestMemory ) && ( TARGET.HasFileTransfer )

Your job defines the following attributes:

RequestDisk = 1
RequestMemory = 2000

The Requirements expression for your job reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[1]         744  target.GLIDEIN_CMSSite isnt undefined
[4]         744  OpSysMajorVer is 6
[7]         744  Memory >= 1
[8]         744  Disk >= 1
[11]        744  TARGET.Arch == "X86_64"
[13]        744  TARGET.OpSys == "LINUX"
[15]        744  TARGET.Disk >= RequestDisk
[17]        728  TARGET.Memory >= RequestMemory

Suggestions:

   Condition                                                                         Machines Matched  Suggestion
   ---------                                                                         ----------------  ----------
1  ( GLIDEIN_REQUIRED_OS is "rhel6" || OpSysMajorVer is 6 )                          0                 REMOVE
2  ( ( Memory >= 1 ) && ( Disk >= 1 ) )                                              0                 REMOVE
3  ( TARGET.Memory >= 2000 )                                                         728
4  ( ( target.IS_GLIDEIN isnt true ) || ( target.GLIDEIN_CMSSite isnt undefined ) )  744
5  ( TARGET.Arch == "X86_64" )                                                       744
6  ( TARGET.OpSys == "LINUX" )                                                       744
7  ( TARGET.Disk >= 1 )                                                              744
8  ( TARGET.HasFileTransfer )                                                        744

You should be able to see that there were 744 slots available, but only 728 of them would accept the job; 16 were rejected for not having >= 2000 MB of memory available, and those are my 16! Now, if I look at one of my slots:

[cms005@lcggwms02:~] > condor_status -l slot8@9-1054-20424.9-1054-20424|sort|grep -i memory
DetectedMemory = 9907
Memory = 1875
TotalMemory = 15000
TotalSlotMemory = 1875
TotalVirtualMemory = 11193544
VirtualMemory = 1399193


and compare it with a slot that has a running job:

[cms005@lcggwms02:~] > condor_status -l 9819-67287-6433.9819-67287-6433|sort|grep -i memory
DetectedMemory = 1996
Memory = 3000
TotalMemory = 3000
TotalSlotMemory = 3000
TotalVirtualMemory = 3089540
VirtualMemory = 3089540


then indeed my slot does not have the requested memory! Now I just have to find out how to change that...
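The failing condition is just `TARGET.Memory >= RequestMemory`. A minimal sketch of that single check, using the two `Memory` values quoted above; the function name `slot_matches` is hypothetical, and the real HTCondor matchmaking evaluates the full Requirements expression, not only this clause.

```python
REQUEST_MEMORY_MB = 2000  # RequestMemory defined by the job

# Memory attribute (MB) of the two slots inspected with condor_status above.
SLOTS = {
    "slot8@9-1054-20424.9-1054-20424": 1875,
    "9819-67287-6433.9819-67287-6433": 3000,
}


def slot_matches(slot_memory_mb, request_memory_mb=REQUEST_MEMORY_MB):
    """Mirror the clause TARGET.Memory >= RequestMemory."""
    return slot_memory_mb >= request_memory_mb


for name, memory_mb in SLOTS.items():
    verdict = "matches" if slot_matches(memory_mb) else "rejected"
    print(f"{name}: Memory = {memory_mb} MB -> {verdict}")
```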
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1182
Credit: 815,528
RAC: 214
Message 4425 - Posted: 3 Dec 2016, 14:33:30 UTC - in response to Message 4424.  

Ah, I think I may have spotted why my multi-core tasks don't get assigned CMS jobs..
.
.
.. Now I just have to find out how to change that...

If you like, I could try an 8-core CMS-task on my Windows machine and see whether my task will also exit without running a job?
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 475
Credit: 389,411
RAC: 28
Message 4426 - Posted: 3 Dec 2016, 14:43:38 UTC - in response to Message 4424.  

Once upon a time a couple of volunteers started a discussion about the correct RAM settings for CMS WUs.
It seems your contribution could shed some light on this if you translate it into a RAM formula that the BOINC system can use.

What we need are values for the following variables:
- Minimum RAM for 1 slot/core (including basic system of the VM)
- Additional RAM for each additional slot/core

;-)
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4427 - Posted: 3 Dec 2016, 18:02:31 UTC
Last modified: 3 Dec 2016, 18:45:21 UTC

I think the "optimal" memory size is midway between the bare minimum and the amount the VM takes when given plenty.

For a single-core task that would be about 2560 MB.

Comments?

EDIT: Scrap that idea, no good.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,718
RAC: 266
Message 4428 - Posted: 3 Dec 2016, 18:21:35 UTC - in response to Message 4426.  

OK, I set my preference to one CPU and have two tasks running. The Condor slot memory is

[cms005@lcggwms02:~] > condor_status -l 9-1054-10500.9-1054-10500|grep -i memory
VirtualMemory = 3302048
MachineResources = "Cpus Memory Disk Swap"
TotalMemory = 4500
TotalSlotMemory = 4500
TotalVirtualMemory = 3302048
DetectedMemory = 2200
Memory = 4500


Meanwhile, John Greer is running 6 slots at once, and details for one of his slots are

[cms005@lcggwms02:~] > condor_status -l slot6@314-1207-3364.314-1207-3364|grep -i memory
VirtualMemory = 1486757
MachineResources = "Cpus Memory Disk Swap"
TotalMemory = 12000
TotalSlotMemory = 2000
TotalVirtualMemory = 8920544
DetectedMemory = 7687
Memory = 2000


captainjack has a 4-slot task running on all cores

[cms005@lcggwms02:~] > condor_status -l slot4@287-1548-31691.287-1548-31691|grep -i memory
VirtualMemory = 1661374
MachineResources = "Cpus Memory Disk Swap"
TotalMemory = 9000
TotalSlotMemory = 2250
TotalVirtualMemory = 6645496
DetectedMemory = 5465
Memory = 2250


...and OLI is running 4 cores but only one is busy:
[cms005@lcggwms02:~] > condor_status -l slot3@222-361-5738.222-361-5738|grep -i memory
VirtualMemory = 1661374
MachineResources = "Cpus Memory Disk Swap"
TotalMemory = 9000
TotalSlotMemory = 2250
TotalVirtualMemory = 6645496
DetectedMemory = 5465
Memory = 2250


Curiously, his three idle slots have the same memory figures, so there must be some other constraint stopping them from running.

Finally, David Duvall has 4 slots all running:
[cms005@lcggwms02:~] > condor_status -l slot1@217-1483-19635.217-1483-19635|grep -i memory
VirtualMemory = 1661374
MachineResources = "Cpus Memory Disk Swap"
TotalMemory = 9000
TotalSlotMemory = 2250
TotalVirtualMemory = 6645496
DetectedMemory = 5465
Memory = 2250

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,718
RAC: 266
Message 4429 - Posted: 3 Dec 2016, 18:26:48 UTC - in response to Message 4428.  
Last modified: 3 Dec 2016, 18:28:25 UTC

Looks like the formula 3 GB + 1.5 GB/core is being applied somewhere...
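A minimal sketch that checks this guess against the `condor_status` figures quoted in this thread, assuming 3 GB and 1.5 GB are meant as 3000 MB and 1500 MB (the quoted values are round thousands). The function name `total_memory_mb` is hypothetical.

```python
# (slots, TotalMemory in MB, per-slot Memory in MB) as quoted from condor_status.
OBSERVED = [
    (1, 4500, 4500),
    (4, 9000, 2250),
    (6, 12000, 2000),
    (8, 15000, 1875),
]


def total_memory_mb(slots):
    """Guessed formula: 3000 MB plus 1500 MB per slot/core."""
    return 3000 + 1500 * slots


for slots, total, per_slot in OBSERVED:
    predicted = total_memory_mb(slots)
    print(f"{slots} slot(s): TotalMemory {total} MB, predicted {predicted} MB, "
          f"per slot {predicted // slots} MB (reported {per_slot} MB)")
```

All four cases agree with the formula, and dividing the total equally between the slots reproduces the reported per-slot `Memory` values.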
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1182
Credit: 815,528
RAC: 214
Message 4431 - Posted: 3 Dec 2016, 20:25:22 UTC

I tried an 8-core CMS VM without success. The VM was allocated 10152 MB of RAM.

StartLog:
12/03/16 21:07:45 ******************************************************
12/03/16 21:07:45 ** condor_startd (CONDOR_STARTD) STARTING UP
12/03/16 21:07:45 ** /usr/sbin/condor_startd
12/03/16 21:07:45 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
12/03/16 21:07:45 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
12/03/16 21:07:45 ** $CondorVersion: 8.4.8 Jun 30 2016 BuildID: 373513 $
12/03/16 21:07:45 ** $CondorPlatform: x86_64_RedHat6 $
12/03/16 21:07:45 ** PID = 4341
12/03/16 21:07:45 ** Log last touched time unavailable (No such file or directory)
12/03/16 21:07:45 ******************************************************
12/03/16 21:07:45 Using config source: /etc/condor/condor_config
12/03/16 21:07:45 Using local config sources:
12/03/16 21:07:45 /etc/condor/config.d/10_security.config
12/03/16 21:07:45 /etc/condor/config.d/14_network.config
12/03/16 21:07:45 /etc/condor/config.d/20_workernode.config
12/03/16 21:07:45 /etc/condor/config.d/30_lease.config
12/03/16 21:07:45 /etc/condor/config.d/35_cms.config
12/03/16 21:07:45 /etc/condor/config.d/40_ccb.config
12/03/16 21:07:45 /etc/condor/condor_config.local
12/03/16 21:07:45 config Macros = 156, Sorted = 156, StringBytes = 6021, TablesBytes = 5712
12/03/16 21:07:45 CLASSAD_CACHING is ENABLED
12/03/16 21:07:45 Daemon Log is logging: D_ALWAYS D_ERROR
12/03/16 21:07:45 Daemoncore: Listening at <10.0.2.15:56167> on TCP (ReliSock).
12/03/16 21:07:45 DaemonCore: command socket at <10.0.2.15:56167?addrs=10.0.2.15-56167&noUDP>
12/03/16 21:07:45 DaemonCore: private command socket at <10.0.2.15:56167?addrs=10.0.2.15-56167>
12/03/16 21:07:55 CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#1384525
12/03/16 21:07:56 HibernationSupportedStates invalid '' in ad from hibernation plugin /usr/libexec/condor/condor_power_state
12/03/16 21:07:56 VM-gahp server reported an internal error
12/03/16 21:07:56 VM universe will be tested to check if it is available
12/03/16 21:07:56 History file rotation is enabled.
12/03/16 21:07:56 Maximum history file size is: 20971520 bytes
12/03/16 21:07:56 Number of rotated history files is: 2
12/03/16 21:07:56 Allocating auto shares for slot type 0: Cpus: auto, Memory: auto, Swap: auto, Disk: auto
slot type 0: Cpus: 1.000000, Memory: 1875, Swap: 12.50%, Disk: 12.50%
slot type 0: Cpus: 1.000000, Memory: 1875, Swap: 12.50%, Disk: 12.50%
slot type 0: Cpus: 1.000000, Memory: 1875, Swap: 12.50%, Disk: 12.50%
slot type 0: Cpus: 1.000000, Memory: 1875, Swap: 12.50%, Disk: 12.50%
slot type 0: Cpus: 1.000000, Memory: 1875, Swap: 12.50%, Disk: 12.50%
slot type 0: Cpus: 1.000000, Memory: 1875, Swap: 12.50%, Disk: 12.50%
slot type 0: Cpus: 1.000000, Memory: 1875, Swap: 12.50%, Disk: 12.50%
slot type 0: Cpus: 1.000000, Memory: 1875, Swap: 12.50%, Disk: 12.50%
12/03/16 21:07:56 slot1: New machine resource allocated
12/03/16 21:07:56 Setting up slot pairings
12/03/16 21:07:56 slot2: New machine resource allocated
12/03/16 21:07:56 Setting up slot pairings
12/03/16 21:07:56 slot3: New machine resource allocated
12/03/16 21:07:56 Setting up slot pairings
12/03/16 21:07:56 slot4: New machine resource allocated
12/03/16 21:07:56 Setting up slot pairings
12/03/16 21:07:56 slot5: New machine resource allocated
12/03/16 21:07:56 Setting up slot pairings
12/03/16 21:07:56 slot6: New machine resource allocated
12/03/16 21:07:56 Setting up slot pairings
12/03/16 21:07:56 slot7: New machine resource allocated
12/03/16 21:07:56 Setting up slot pairings
12/03/16 21:07:56 slot8: New machine resource allocated
12/03/16 21:07:56 Setting up slot pairings
12/03/16 21:07:56 CronJobList: Adding job 'mips'
12/03/16 21:07:56 CronJobList: Adding job 'kflops'
12/03/16 21:07:56 CronJob: Initializing job 'mips' (/usr/libexec/condor/condor_mips)
12/03/16 21:07:56 CronJob: Initializing job 'kflops' (/usr/libexec/condor/condor_kflops)
12/03/16 21:07:56 slot1: State change: IS_OWNER is false
12/03/16 21:07:56 slot1: Changing state: Owner -> Unclaimed
12/03/16 21:07:56 State change: RunBenchmarks is TRUE
12/03/16 21:07:56 slot1: Changing activity: Idle -> Benchmarking
12/03/16 21:07:56 BenchMgr:StartBenchmarks()
12/03/16 21:07:56 slot2: State change: IS_OWNER is false
12/03/16 21:07:56 slot2: Changing state: Owner -> Unclaimed
12/03/16 21:07:56 State change: RunBenchmarks is TRUE
12/03/16 21:07:56 slot2: Changing activity: Idle -> Benchmarking
12/03/16 21:07:56 slot2: Changing activity: Benchmarking -> Idle
12/03/16 21:07:56 slot3: State change: IS_OWNER is false
12/03/16 21:07:56 slot3: Changing state: Owner -> Unclaimed
12/03/16 21:07:56 State change: RunBenchmarks is TRUE
12/03/16 21:07:56 slot3: Changing activity: Idle -> Benchmarking
12/03/16 21:07:56 slot3: Changing activity: Benchmarking -> Idle
12/03/16 21:07:56 slot4: State change: IS_OWNER is false
12/03/16 21:07:56 slot4: Changing state: Owner -> Unclaimed
12/03/16 21:07:56 State change: RunBenchmarks is TRUE
12/03/16 21:07:56 slot4: Changing activity: Idle -> Benchmarking
12/03/16 21:07:56 slot4: Changing activity: Benchmarking -> Idle
12/03/16 21:07:56 slot5: State change: IS_OWNER is false
12/03/16 21:07:56 slot5: Changing state: Owner -> Unclaimed
12/03/16 21:07:56 State change: RunBenchmarks is TRUE
12/03/16 21:07:56 slot5: Changing activity: Idle -> Benchmarking
12/03/16 21:07:56 slot5: Changing activity: Benchmarking -> Idle
12/03/16 21:07:56 slot6: State change: IS_OWNER is false
12/03/16 21:07:56 slot6: Changing state: Owner -> Unclaimed
12/03/16 21:07:56 State change: RunBenchmarks is TRUE
12/03/16 21:07:56 slot6: Changing activity: Idle -> Benchmarking
12/03/16 21:07:56 slot6: Changing activity: Benchmarking -> Idle
12/03/16 21:07:56 slot7: State change: IS_OWNER is false
12/03/16 21:07:56 slot7: Changing state: Owner -> Unclaimed
12/03/16 21:07:56 State change: RunBenchmarks is TRUE
12/03/16 21:07:56 slot7: Changing activity: Idle -> Benchmarking
12/03/16 21:07:56 slot7: Changing activity: Benchmarking -> Idle
12/03/16 21:07:56 slot8: State change: IS_OWNER is false
12/03/16 21:07:56 slot8: Changing state: Owner -> Unclaimed
12/03/16 21:07:56 State change: RunBenchmarks is TRUE
12/03/16 21:07:56 slot8: Changing activity: Idle -> Benchmarking
12/03/16 21:07:56 slot8: Changing activity: Benchmarking -> Idle
12/03/16 21:08:26 State change: RunBenchmarks is TRUE
12/03/16 21:08:26 slot1: Changing activity: Benchmarking -> Idle
12/03/16 21:08:26 State change: benchmarks completed
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1182
Credit: 815,528
RAC: 214
Message 4432 - Posted: 3 Dec 2016, 20:43:37 UTC - in response to Message 4429.  

Looks like the formula 3 GB + 1.5 GB/core is being applied somewhere...

Do you mean for the VM itself or something inside the VM?

For CMS, BOINC uses the formula 1128 MB + #cores*1128 MB, so 3384 MB for a dual-core task.
For Theory, BOINC uses the formula 630 MB + #cores*100 MB, so 1030 MB for a 4-core task.
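Both formulas have the same shape, a base amount plus a per-core amount. A minimal sketch with the values stated above; the table `FORMULAS` and the function `boinc_vm_ram_mb` are hypothetical names for illustration and not part of BOINC itself.

```python
# application -> (base MB, MB per core), as stated in the post above
FORMULAS = {
    "CMS": (1128, 1128),
    "Theory": (630, 100),
}


def boinc_vm_ram_mb(app, cores):
    """RAM assigned to the VM: base + cores * per_core."""
    base_mb, per_core_mb = FORMULAS[app]
    return base_mb + cores * per_core_mb


print(boinc_vm_ram_mb("CMS", 2))     # dual-core CMS task
print(boinc_vm_ram_mb("CMS", 8))     # 8-core CMS task
print(boinc_vm_ram_mb("Theory", 4))  # 4-core Theory task
```

The 8-core CMS result equals the 10152 MB that was allocated to the 8-core VM tried earlier in the thread.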
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,718
RAC: 266
Message 4433 - Posted: 3 Dec 2016, 22:14:21 UTC - in response to Message 4432.  

Looks like the formula 3 GB + 1.5 GB/core is being applied somewhere...

Do you mean for the VM itself or something inside the VM?

For CMS, BOINC uses the formula 1128 MB + #cores*1128 MB, so 3384 MB for a dual-core task.
For Theory, BOINC uses the formula 630 MB + #cores*100 MB, so 1030 MB for a 4-core task.

I was looking at the TotalMemory statistic. For the cases I quoted, that scales as the formula, and your previous post verifies what I found for eight cores, 1875 MB/slot => 15 GB total.

12/03/16 21:07:56 Allocating auto shares for slot type 0: Cpus: auto, Memory: auto, Swap: auto, Disk: auto
slot type 0: Cpus: 1.000000, Memory: 1875, Swap: 12.50%, Disk: 12.50%
slot type 0: Cpus: 1.000000, Memory: 1875, Swap: 12.50%, Disk: 12.50%
slot type 0: Cpus: 1.000000, Memory: 1875, Swap: 12.50%, Disk: 12.50%
slot type 0: Cpus: 1.000000, Memory: 1875, Swap: 12.50%, Disk: 12.50%
slot type 0: Cpus: 1.000000, Memory: 1875, Swap: 12.50%, Disk: 12.50%
slot type 0: Cpus: 1.000000, Memory: 1875, Swap: 12.50%, Disk: 12.50%
slot type 0: Cpus: 1.000000, Memory: 1875, Swap: 12.50%, Disk: 12.50%
slot type 0: Cpus: 1.000000, Memory: 1875, Swap: 12.50%, Disk: 12.50%


Since we've established that the Condor ClassAd requires >= 2000 MB for a job to run, the maximum possible at present is 6 cores/task.
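The 6-core limit follows from the arithmetic. A minimal sketch, assuming TotalMemory is 3000 MB + 1500 MB per core and is split equally between one slot per core, as the figures in this thread suggest; the function names and the search limit of 64 cores are hypothetical.

```python
REQUEST_MEMORY_MB = 2000  # minimum slot memory the jobs require


def slot_memory_mb(cores):
    """Per-slot memory when (3000 + 1500 * cores) MB is split equally."""
    return (3000 + 1500 * cores) // cores


def max_usable_cores(request_mb=REQUEST_MEMORY_MB, limit=64):
    """Largest core count whose slots still satisfy the memory request."""
    return max(c for c in range(1, limit + 1) if slot_memory_mb(c) >= request_mb)


for cores in (4, 6, 7, 8):
    print(f"{cores} cores: {slot_memory_mb(cores)} MB per slot")
print("maximum cores per task:", max_usable_cores())
```

At 6 cores each slot gets exactly 2000 MB; from 7 cores upward the per-slot share drops below the request, so no job matches.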

How and why these limits are set is a whole other story. Doing an Alt-F3 in the VM window of my running 1-CPU VM shows that the total memory utilisation is very close to 2 GB. Tomorrow, when my current tasks have expired, I'll submit some 6-core tasks and see what they actually use.
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,397
RAC: 234
Message 4434 - Posted: 3 Dec 2016, 22:45:42 UTC - in response to Message 4433.  
Last modified: 3 Dec 2016, 22:46:25 UTC

The Condor ClassAd requirement of >= 2000 MB is a remnant of the general 2 GB per core rule of thumb that is used in WLCG. The issue is that HTCondor automatically assigns 1 job slot per core and then splits the memory equally between them. This leaves us with less than 2 GB per slot, which is why we don't get any jobs in some multicore VMs. So if we are to optimize the memory usage, we also need to either relax that requirement to reflect what we can actually run on, or go back to having 2 GB per core.

As an aside, when we are happy with multicore, we can add it to lhc@home as the standard way of working, just as ATLAS has done.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,718
RAC: 266
Message 4435 - Posted: 3 Dec 2016, 22:57:19 UTC - in response to Message 4434.  

Thanks, Laurence. Do we know where the 3 GB + N*1.5 GB for TotalMemory comes from?


©2024 CERN