New Version v47.50
Author | Message |
---|---|
Joined: 12 Sep 14 Posts: 1052 Credit: 294,071 RAC: 0 |
This new version should scale the memory with the number of cores. The base memory is 2GB and 1GB extra will be added per core. |
Joined: 28 Jul 16 Posts: 434 Credit: 376,220 RAC: 0 |
I ran 1 WU with the following setup:
|
Joined: 13 Feb 15 Posts: 1156 Credit: 757,368 RAC: 6 |
The memory extension is not working. It sticks to 2048 MB. Tested with 2 cores. |
Joined: 28 Jul 16 Posts: 434 Credit: 376,220 RAC: 0 |
I ran 1 WU with the following setup: I just saw that my host got the old v47.30 app instead of the new v47.50. Both are listed here. |
Joined: 13 Feb 15 Posts: 1156 Credit: 757,368 RAC: 6 |
I just saw that my host got the old v47.30 app instead of the new v47.50. ... and for Windows I got the 47.40 (vbox64_mt_mcore) instead of the new 47.50 (vbox64_mt_mcore_cms). |
Joined: 8 Apr 15 Posts: 674 Credit: 11,131,508 RAC: 1,793 |
I just saw that my host got the old v47.30 app instead of the new v47.50. I got CMS Simulation v47.50 (vbox64_mt_mcore_cms) windows_x86_64 on one Win 10 host, but all I got was errors, so I will try it on another host and see if I have better luck today. http://lhcathomedev.cern.ch/vLHCathome-dev/results.php?userid=192 |
Joined: 13 Feb 15 Posts: 1156 Credit: 757,368 RAC: 6 |
I got CMS Simulation v47.50 (vbox64_mt_mcore_cms) windows_x86_64 on one Win 10 host, but all I got was errors, so I will try it on another host and see if I have better luck today. It looks like the VMs are not booting properly. Also, the setup does not seem to be working, or are you using an app_config.xml? Setting Memory Size for VM. (3000MB) Setting CPU Count for VM. (1) |
Joined: 12 Sep 14 Posts: 1052 Credit: 294,071 RAC: 0 |
Thanks. I have just deprecated the older versions. |
Joined: 20 Jan 15 Posts: 1126 Credit: 7,849,048 RAC: 9 |
Laurence, I had a task complete successfully ~0500 GMT this morning; since then they have all failed. I see several possible errors, e.g.
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:11 enter Daemons::UpdateCollector
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:11 Trying to update collector <130.246.180.120:9623>
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:11 Attempting to send update via TCP to collector lcggwms02.gridpp.rl.ac.uk <130.246.180.120:9623>
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:11 exit Daemons::UpdateCollector
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:35 Got SIGTERM. Performing graceful shutdown.
...
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 05:57:33 State change: benchmarks completed
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 05:57:33 slot1: Changing activity: Benchmarking -> Idle
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:35 No resources have been claimed for 300 seconds
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:35 Shutting down Condor on this machine.
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:35 Got SIGTERM. Performing graceful shutdown.
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:35 shutdown graceful
...
2016-10-11 06:02:38 (25369): Guest Log: 10/11/16 06:02:35 CronJobList: Deleting all jobs
2016-10-11 06:02:38 (25369): Guest Log: 10/11/16 06:02:35 All resources are free, exiting.
2016-10-11 06:02:38 (25369): Guest Log: 10/11/16 06:02:35 **** condor_startd (condor_STARTD) pid 4300 EXITING WITH STATUS 0
2016-10-11 06:02:38 (25369): Guest Log: [ERROR] No jobs were available to run.
2016-10-11 06:02:38 (25369): Guest Log: [INFO] Shutting Down.
2016-10-11 06:02:38 (25369): VM Completion File Detected.
2016-10-11 06:02:38 (25369): VM Completion Message: No jobs were available to run.
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=272506 I had a couple of failures this morning on vLHC@home on the same machine, but it seems to be successfully running two jobs at the moment. https://lhcathome.cern.ch/vLHCathome/result.php?resultid=6608306 I haven't seen anything untoward on the Condor server; the tasks aren't getting as far as requesting -- or at least getting! -- jobs from the server. |
Joined: 20 Jan 15 Posts: 1126 Credit: 7,849,048 RAC: 9 |
Magic, I see you have a job running on 1482 at present. |
Joined: 20 Jan 15 Posts: 1126 Credit: 7,849,048 RAC: 9 |
Thanks. I have just deprecated the older versions. I just noticed I've been getting 47.30 since 0752 UTC. |
Joined: 28 Jul 16 Posts: 434 Credit: 376,220 RAC: 0 |
I got 2 tasks v47.30. 1 before and 1 after a project reset. Both
|
Joined: 13 Feb 15 Posts: 1156 Credit: 757,368 RAC: 6 |
I got a task CMS Simulation v47.40 (vbox64_mt_mcore), but on the applications site only 47.50 (vbox64_mt_mcore_cms) is available. In my prefs I configured 1 task with 2 cores. The VM created has only 1 core and 2048 MB base memory.
<app_version>
<app_name>CMS</app_name>
<version_num>4740</version_num>
<platform>windows_x86_64</platform>
<avg_ncpus>1.000000</avg_ncpus>
<max_ncpus>2.000000</max_ncpus>
<flops>26823773813.494980</flops>
<plan_class>vbox64_mt_mcore</plan_class>
<api_version>7.7.0</api_version>
<cmdline>--memory_size_mb 2048</cmdline>
If you want to run a VM with 2 cores, it is avg_ncpus that should be 2, not max_ncpus. |
Joined: 20 Mar 15 Posts: 242 Credit: 860,312 RAC: 14 |
The 12/18 hr timeout still causes a fair bit of wasted time (and errors and frustration), but tasks behave oddly sometimes. In the course of shifting stuff to the new servers, I noticed this:
CMS 47.50 (mt, 1 core, took 3 GB) Host# 553. CMS Task# 274259. From a very long stderr:
Day/Time
14/03.08.14 Wrapper start. (Boinc task started)
14/03.15.53 Job 6492 start.
14/06.15.33 Job finished with 0.
14/06.18.12 Job 7054 start.
14/06.57.02 VM stopped. (Normal host shutdown time)
15/01.04.44 Wrapper start. (Normal host startup time)
15/01.04.46 Job finished with 0.
15/01.04.46 Job 7054 start.
15/04.45.36 Job finished with 134.
15/04.45.46 Job 755 start.
15/06.57.02 VM stopped. (Normal shutdown time)
16/01.04.44 Wrapper start. (Normal startup time)
16/01.04.46 Job finished with 134.
16/01.04.46 Job 755 start.
16/06.12.17 Job finished with 0.
16/06.14.15 Job 5986 start.
16/06.57.02 VM stopped. (Normal shutdown time)
17/01.04.43 Wrapper start. (Normal startup time)
17/01.04.45 Job finished with 0.
17/01.04.45 Job 5986 start.
17/03.22.00 Boinc finish. (Boinc task finished OK)
Locating the jobs on Dashboard is a bit cumbersome; it's not obvious which CRAB task they each belong to, but it looks like 6492, 7054 and 755 all finished OK, first attempt. 5986 failed (61311), no second attempt.
Hosts are normally started by their BIOS clock and shut down by a cron job using the boinccmd "quit" command. After 3 mins the host is powered off ('nuther cron job), so the process should be fairly graceful.
Total time ca 15h53. The shutdown time is clearly not counting towards the timeout - a welcome change, but the stopping and resuming doesn't seem to work well, with the job in hand restarted (and what is exit code 134?). The last job was cut short. Is this how it's supposed to work? I seem to remember that the "production" standard was performance equal to the old T4T... I don't think we're quite there yet. |
Joined: 20 Jan 15 Posts: 1126 Credit: 7,849,048 RAC: 9 |
From an e-mail I sent yesterday: OK, the 134 stage-out errors seem to occur mainly when the pythia job fails with an error:
...
== CMSSW: EvtGen:Could not decay:pi0 with mass:0 will throw event away!
== CMSSW: EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0
== CMSSW: EvtGen:Will take first kinematically allowed decay in the decay table
== CMSSW: EvtGen:Could not decay:pi0 with mass:0 will throw event away!
== CMSSW: EvtGen:Your event has been rejected 10000 times!
== CMSSW: EvtGen:Will now abort.
== CMSSW: Complete
== CMSSW: process id is 6188 status is 134
======== CMSSW OUTPUT FINSHING ========
...
Job wrapper did not finish successfully (exit code 134). Setting that same exit code for the stageout wrapper.
Stageout wrapper finished with exit code 134. Will report failure to Dashboard.
... |
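
On the "what is exit code 134?" question raised above: under the usual Unix shell convention, a status above 128 means the process was killed by signal (status - 128), and 134 - 128 = 6 is SIGABRT on Linux. That would fit the "EvtGen:Will now abort." line in the log, though the thread itself does not confirm that this is how CMSSW arrives at 134. A minimal Python sketch of the decoding follows; the function name `describe_exit_status` and the small signal table are illustrative assumptions, not part of BOINC, Condor or CMSSW.

```python
# Hypothetical helper: decode a shell-style exit status.
# Assumes the common convention "128 + N" = terminated by signal N,
# with Linux signal numbers (hard-coded so the result is platform-independent).
LINUX_SIGNALS = {
    6: "SIGABRT",   # raised by abort()
    9: "SIGKILL",
    11: "SIGSEGV",
    15: "SIGTERM",
}


def describe_exit_status(status: int) -> str:
    """Return a human-readable description of an exit status."""
    if status > 128:
        signum = status - 128
        name = LINUX_SIGNALS.get(signum, "unknown signal")
        return f"terminated by {name} (signal {signum})"
    return f"exited with code {status}"


if __name__ == "__main__":
    for code in (0, 134):
        print(code, "->", describe_exit_status(code))
```

Read this way, the "Job finished with 134" entries in the timeline would correspond to the pythia job aborting itself rather than to the VM being stopped and resumed.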
©2023 CERN