Message boards : CMS Application : Multi-core VM
Author | Message |
---|---|
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
Nobody else seems to be running the multi-core CMS application at the moment. I revisited it last week, having initially had start-up problems, and have worked my way up to 2x8-core tasks on one of my 20-core Xeons. But I see no other machines running like this:

    [cms005@lcggwms02:~] > condor_status|grep slot
    slot1@9-1054-19383 LINUX X86_64 Claimed Busy 1.000 3937 0+02:13:20
    slot2@9-1054-19383 LINUX X86_64 Claimed Busy 1.000 3937 0+02:13:21
    slot3@9-1054-19383 LINUX X86_64 Claimed Busy 0.990 3937 0+01:52:49
    slot4@9-1054-19383 LINUX X86_64 Claimed Busy 1.000 3937 0+01:52:50
    slot5@9-1054-19383 LINUX X86_64 Claimed Busy 1.000 3937 0+01:33:16
    slot6@9-1054-19383 LINUX X86_64 Claimed Busy 1.000 3937 0+01:33:17
    slot7@9-1054-19383 LINUX X86_64 Claimed Busy 0.990 3937 0+01:13:11
    slot8@9-1054-19383 LINUX X86_64 Claimed Busy 1.050 3937 0+01:13:04
    slot1@9-1054-22251 LINUX X86_64 Claimed Busy 1.040 3937 0+01:55:46
    slot2@9-1054-22251 LINUX X86_64 Claimed Busy 1.060 3937 0+01:55:47
    slot3@9-1054-22251 LINUX X86_64 Claimed Busy 1.050 3937 0+01:35:55
    slot4@9-1054-22251 LINUX X86_64 Claimed Busy 1.040 3937 0+01:35:56
    slot5@9-1054-22251 LINUX X86_64 Claimed Busy 1.050 3937 0+01:15:41
    slot6@9-1054-22251 LINUX X86_64 Claimed Busy 1.120 3937 0+00:55:49
    slot7@9-1054-22251 LINUX X86_64 Claimed Busy 1.040 3937 0+01:15:42
    slot8@9-1054-22251 LINUX X86_64 Claimed Busy 1.060 3937 0+00:55:42

My app_config.xml file is

    <app_config>
      <project_max_concurrent>2</project_max_concurrent>
      <app>
        <name>ATLAS</name>
        <max_concurrent>1</max_concurrent>
      </app>
      <app>
        <name>ALICE</name>
        <max_concurrent>1</max_concurrent>
      </app>
      <app>
        <name>CMS</name>
        <max_concurrent>2</max_concurrent>
      </app>
      <app_version>
        <app_name>CMS</app_name>
        <plan_class>vbox64_mt_mcore_cms</plan_class>
        <avg_ncpus>8.000000</avg_ncpus>
        <cmdline>--nthreads 8.000000</cmdline>
        <cmdline>--memory_size_mb 20480</cmdline>
      </app_version>
      <app>
        <name>LHCb</name>
        <max_concurrent>1</max_concurrent>
      </app>
      <app>
        <name>Theory</name>
        <max_concurrent>1</max_concurrent>
      </app>
    </app_config>

My problems may have been trying to start too many VMs at once -- they complained of not being able to find the boot image. With just two VMs I'm not seeing that problem now. Is anyone else in a position to re-try the multi-core VM again now? |
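For anyone wanting to spot multi-core hosts in a longer listing, here is a rough sketch that groups `condor_status`-style lines by machine and counts the slots on each. It only assumes that the first field has the form `slotN@machine`, as in the output quoted in the post; the function names and the `min_slots` threshold are made up for illustration.

```python
from collections import defaultdict

# A few lines copied from the condor_status output quoted in the post.
SAMPLE = """\
slot1@9-1054-19383 LINUX X86_64 Claimed Busy 1.000 3937 0+02:13:20
slot8@9-1054-19383 LINUX X86_64 Claimed Busy 1.050 3937 0+01:13:04
slot1@9-1054-22251 LINUX X86_64 Claimed Busy 1.040 3937 0+01:55:46
"""


def slots_per_machine(text):
    """Return {machine: slot_count} for lines whose first field is slotN@machine."""
    counts = defaultdict(int)
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that is not a slot entry.
        if not fields or not fields[0].startswith("slot") or "@" not in fields[0]:
            continue
        machine = fields[0].split("@", 1)[1]
        counts[machine] += 1
    return dict(counts)


def multicore_machines(text, min_slots=2):
    """Keep only machines that advertise at least min_slots slots (hypothetical threshold)."""
    return {m: n for m, n in slots_per_machine(text).items() if n >= min_slots}


if __name__ == "__main__":
    for machine, n in sorted(slots_per_machine(SAMPLE).items()):
        print(machine, n)
```

Fed the full listing from the post, this would report two machines with eight slots each, which matches the 2x8-core setup described.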
Joined: 28 Jul 16 Posts: 516 Credit: 400,710 RAC: 70 |
Nobody else seems to be running the multi-core CMS application at the moment. ... I would, but I still get only an old version (v47.30) from the project server. :-( See: https://lhcathome.cern.ch/vLHCathome-dev/results.php?hostid=1464 |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
Nobody else seems to be running the multi-core CMS application at the moment. ... Have you tried a project reset? That should force downloading a new image when you resume/update. |
Joined: 13 Feb 15 Posts: 1217 Credit: 906,817 RAC: 1,469 |
Is anyone else in a position to re-try the multi-core VM again now? I gave it a try after resetting the project. On a host with 8 threads and 16 GB of RAM, I configured it to run 2 multi-core VMs with 4 cores and 6144 MB each. However, I got 2 tasks of version CMS Simulation v47.40 (vbox64_mt_mcore) windows_x86_64, and the plan class vbox64_mt_mcore_cms is unknown: vLHCathome-dev 20 Oct 16:41:52 Entry in app_config.xml for app 'CMS', plan class 'vbox64_mt_mcore_cms' doesn't match any app versions. I retried with plan class vbox64_mt_mcore and both 4-core VMs started. Four minutes in now and cvmfs2 is rather busy; TTYL |
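The "doesn't match any app versions" message comes down to an (app name, plan class) pair in app_config.xml that the client has no app version for. A small sketch of that comparison follows. The `AVAILABLE` set is an assumption filled in by hand from the task description in the post (CMS Simulation v47.40, vbox64_mt_mcore); the function name is hypothetical and this is not how the BOINC client itself is implemented.

```python
import xml.etree.ElementTree as ET

# Trimmed-down app_config.xml using the plan class that triggered the message.
APP_CONFIG = """\
<app_config>
  <app_version>
    <app_name>CMS</app_name>
    <plan_class>vbox64_mt_mcore_cms</plan_class>
    <avg_ncpus>4</avg_ncpus>
  </app_version>
</app_config>
"""

# Assumption: (app, plan class) pairs the host actually received, entered by hand.
AVAILABLE = {("CMS", "vbox64_mt_mcore")}


def unmatched_entries(app_config_xml, available):
    """Return the (app_name, plan_class) pairs in app_config.xml not found in available."""
    root = ET.fromstring(app_config_xml)
    missing = []
    for app_version in root.findall("app_version"):
        name = (app_version.findtext("app_name") or "").strip()
        plan = (app_version.findtext("plan_class") or "").strip()
        if (name, plan) not in available:
            missing.append((name, plan))
    return missing


if __name__ == "__main__":
    for name, plan in unmatched_entries(APP_CONFIG, AVAILABLE):
        print(f"app '{name}', plan class '{plan}' has no matching app version")
```

Changing the plan class in the config to `vbox64_mt_mcore`, as done in the post, makes the list of unmatched entries empty.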
Joined: 28 Jul 16 Posts: 516 Credit: 400,710 RAC: 70 |
Have you tried a project reset? ... I have. See: https://lhcathome.cern.ch/vLHCathome-dev/forum_thread.php?id=302&postid=4183 The same happened today in the regular project. My host downloaded CMS Simulation v47.42 although only v47.50 is listed: https://lhcathome.cern.ch/vLHCathome/result.php?resultid=6645841 At least the regular WU has been running normally for 1.5 h now. |
Joined: 13 Feb 15 Posts: 1217 Credit: 906,817 RAC: 1,469 |
I retried with plan class vbox64_mt_mcore and both 4-core VM's started. Both VMs have now been running for over 45 minutes, each with 4 CMS jobs: jobs 5321, 5322, 5371 and 5372 in one VM, and jobs 5369, 5370, 5319 and 5320 in the other. On both VMs, 2 jobs started about 20 minutes later than the first 2. |
Joined: 22 Apr 16 Posts: 724 Credit: 2,134,635 RAC: 4,994 |
With JOBS:1 and CPUS:2 set in the -dev preferences: I reset the project before the task started. The application was v47.40. https://lhcathome.cern.ch/vLHCathome-dev/result.php?resultid=276361 It exited with error 207 after a few minutes. |
Joined: 13 Feb 15 Posts: 1217 Credit: 906,817 RAC: 1,469 |
JOBS:1 CPUS:2 in -dev-preferences: As far as I know, a CMS-mt task will only run when using an app_config.xml with the right settings in the app_version part. |
Joined: 13 Feb 15 Posts: 1217 Credit: 906,817 RAC: 1,469 |
On both VM's 2 jobs started about 20 minutes later than the first 2. Due to my own fault (exhausting the host memory) one task crashed -> https://lhcathome.cern.ch/vLHCathome-dev/result.php?resultid=276244 |
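A quick way to avoid exhausting host memory is to add up the VM allocations before raising the number of concurrent tasks. The sketch below uses the 6144 MB per VM mentioned earlier in the thread; the host RAM figure and the reserve kept back for the operating system are assumptions chosen purely for illustration.

```python
def vms_fit_in_memory(host_ram_mb, vm_ram_mb, n_vms, reserve_mb=2048):
    """Return True if n_vms VMs of vm_ram_mb each fit, leaving reserve_mb for the host.

    reserve_mb is a hypothetical safety margin, not a BOINC or VirtualBox setting.
    """
    return n_vms * vm_ram_mb + reserve_mb <= host_ram_mb


def max_vms(host_ram_mb, vm_ram_mb, reserve_mb=2048):
    """Largest number of VMs that fits under the same assumptions."""
    usable = host_ram_mb - reserve_mb
    return max(0, usable // vm_ram_mb)


if __name__ == "__main__":
    HOST_RAM_MB = 16384  # assumed host size, for illustration only
    VM_RAM_MB = 6144     # per-VM memory used in the thread
    print(vms_fit_in_memory(HOST_RAM_MB, VM_RAM_MB, 2))
    print(max_vms(HOST_RAM_MB, VM_RAM_MB))
```

Anything else running on the host (other BOINC projects, a desktop session) eats into the same budget, so the reserve may need to be larger in practice.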
Joined: 28 Jul 16 Posts: 516 Credit: 400,710 RAC: 70 |
As far as I know, a CMS-mt task will only run when using an app_config.xml with the right settings in the app_version part. false: If you focus on "using an app_config.xml". Apps without an app_config.xml run with standard settings. true: If you focus on "with the right settings in the app_version part". Wrong settings may lead to scheduler requests that can't be served. |
Joined: 22 Apr 16 Posts: 724 Credit: 2,134,635 RAC: 4,994 |
Hello CP, I have no app_config.xml. With JOBS:1 and TASKS:1 it did run, but that was weeks ago. I will test again today. The message is the same as in my first post, but the original is in German; in English: Edit: <message> The ring 2 stack is in use. (0xcf) - exit code 207 (0xcf) </message> |
Joined: 28 Jul 16 Posts: 516 Credit: 400,710 RAC: 70 |
Next try: Fri 21 Oct 2016 10:06:44 CEST | vLHCathome-dev | Resetting project |
Joined: 28 Jul 16 Posts: 516 Credit: 400,710 RAC: 70 |
Got 1 WU running. But still v47.30 on only 1 core with 2 GB RAM and without an app_config.xml. I will cancel the WU as v47.50 should be tested. |
Joined: 13 Feb 15 Posts: 1217 Credit: 906,817 RAC: 1,469 |
I'm running another multi-core VM with 4 cores and 6144 MB of memory using an app_config.xml. I'm still getting application v47.40. Again I noticed that 2 jobs start immediately and the other 2 jobs start 20 minutes later. No idea why. Inside the VM there was 3 GB of RAM free at that point. Now, with 4 jobs running, there is still 1.3 GB of memory free. |
Joined: 22 Apr 16 Posts: 724 Credit: 2,134,635 RAC: 4,994 |
On both computers this combination finished successfully, but with version 47.40: https://lhcathome.cern.ch/vLHCathome-dev/result.php?resultid=277159 https://lhcathome.cern.ch/vLHCathome-dev/result.php?resultid=276983 |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Would all volunteers who are unable to run tasks with more than one core please make a short post? It seems the "Max # CPUs" setting in the preferences is completely ignored. I am running ATLAS multi-core tasks without any problems. Apparently some volunteers can, others cannot. Even with an app_config.xml it is still not possible. (BOINC 7.6.33) |
Joined: 22 Apr 16 Posts: 724 Credit: 2,134,635 RAC: 4,994 |
I have tested multi-core CMS (2 CPUs). The task ended after about 12 minutes. BOINC 7.6.22, VirtualBox 5.1.10, computer ID 1165 https://lhcathome.cern.ch/vLHCathome-dev/result.php?resultid=290410 I reset the project before getting the new task. |
Joined: 22 Apr 16 Posts: 724 Credit: 2,134,635 RAC: 4,994 |
Have no app_config.xml. RDP-Port 51630:
ALT+F2 Running job output should appear here as first line (no work?)
ALT+F3 Python and CVMFS2 with most cpu shown
Finished after about 12 minutes.

    2016-12-01 10:45:29 (1268): Guest Log: [INFO] Reading volunteer information
    2016-12-01 10:45:29 (1268): Guest Log: [INFO] Volunteer: maeax (378) Host: 1165
    2016-12-01 10:45:29 (1268): Guest Log: [INFO] VMID: 6d0ae20b-f23e-4d5d-b5ca-600a8fb1d26c
    2016-12-01 10:45:29 (1268): Guest Log: [INFO] Requesting an X509 credential from vLHC@home
    2016-12-01 10:45:33 (1268): Guest Log: [INFO] Requesting an X509 credential from LHC@home
    2016-12-01 10:45:33 (1268): Guest Log: [INFO] Requesting an X509 credential from vLHC@home-dev
    2016-12-01 10:45:33 (1268): Guest Log: [INFO] CMS application starting. Check log files.
    2016-12-01 10:45:33 (1268): Guest Log: [DEBUG] HTCondor ping
    2016-12-01 10:45:33 (1268): Guest Log: [DEBUG] 0
    2016-12-01 10:55:40 (1268): Guest Log: [ERROR] Condor exited after 613s without running a job.
    2016-12-01 10:55:40 (1268): Guest Log: [INFO] Shutting Down.
    2016-12-01 10:55:40 (1268): VM Completion File Detected.
    2016-12-01 10:55:40 (1268): VM Completion Message: Condor exited after 613s without running a job. .
    2016-12-01 10:55:40 (1268): Powering off VM.
    2016-12-01 10:55:44 (1268): Successfully stopped VM.
    2016-12-01 10:55:49 (1268): Deregistering VM. (boinc_83dffbc3ff2390d3, slot#3)
    2016-12-01 10:55:49 (1268): Removing virtual disk drive(s) from VM.
    2016-12-01 10:55:49 (1268): Removing network bandwidth throttle group from VM.
    2016-12-01 10:55:49 (1268): Removing storage controller(s) from VM.
    2016-12-01 10:55:50 (1268): Removing VM from VirtualBox.
    10:55:55 (1268): called boinc_finish(206)
|
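When checking many results for this failure, the telling line is the `[ERROR] Condor exited after ...s without running a job` entry. Below is a rough sketch for pulling the idle time out of a task's stderr text. The regular expression is written against the wording in the log above only, and the function name is made up; other wrapper versions may phrase the message differently.

```python
import re

# Excerpt of the log quoted in the post.
LOG = """\
2016-12-01 10:45:33 (1268): Guest Log: [DEBUG] HTCondor ping
2016-12-01 10:55:40 (1268): Guest Log: [ERROR] Condor exited after 613s without running a job.
2016-12-01 10:55:40 (1268): Guest Log: [INFO] Shutting Down.
"""

# Matches the error wording seen in this thread; an assumption for other versions.
IDLE_EXIT = re.compile(r"\[ERROR\] Condor exited after (\d+)s without running a job")


def idle_exit_seconds(log_text):
    """Return the number of seconds Condor waited before giving up, or None if absent."""
    match = IDLE_EXIT.search(log_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    seconds = idle_exit_seconds(LOG)
    if seconds is not None:
        print(f"VM shut down after {seconds}s without receiving a job")
```

A result of about ten minutes, as here, suggests the VM itself booted and reached HTCondor but was never handed any work.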
Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 19 |
When we moved to plan classes to address the memory issue, we did not specify max_threads so it defaulted to 1 CPU. This value has now been set to 32 and it is working for Theory. It should also work for CMS but this has not been tested. |
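To illustrate why a missing max_threads pinned everything to one core, here is a toy model that simply caps the requested core count at the plan class limit and at the number of host CPUs. This is an assumption about the overall effect only; the real BOINC scheduler logic is more involved and is not reproduced here, and the function name is hypothetical.

```python
def effective_ncpus(requested, host_cpus, max_threads=1):
    """Toy model: cores a task ends up with, given a plan class max_threads cap.

    max_threads defaults to 1 to mirror the unset case described in the post.
    """
    return max(1, min(requested, host_cpus, max_threads))


if __name__ == "__main__":
    # With max_threads unset, a request for 8 cores on a 20-core host yields 1.
    print(effective_ncpus(8, 20))
    # With max_threads set to 32, the same request is honoured.
    print(effective_ncpus(8, 20, max_threads=32))
```

Under this model, the "Max # CPUs" preference would appear to be ignored whenever the plan class cap is lower than the value requested, which is consistent with the reports earlier in the thread.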
©2025 CERN