Thread 'New Version v47.50'

Author	Message
Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 4168 - Posted: 7 Oct 2016, 9:57:56 UTC
This new version should scale the memory with the number of cores. The base memory is 2 GB, and 1 GB extra will be added per core.
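The scaling rule described above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the choices of 1000 MB per GB and of counting every core (including the first) are assumptions, picked because a log line quoted later in this thread shows 3000MB for a 1-core VM. The project's actual implementation is not shown in the thread.

```python
def vm_memory_mb(cores: int, base_mb: int = 2000, per_core_mb: int = 1000) -> int:
    """Return the assumed VM memory size in MB for a given core count.

    Assumption: memory = base + per_core * cores, with 1 GB taken as 1000 MB.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return base_mb + per_core_mb * cores


if __name__ == "__main__":
    # Print the assumed memory size for a few core counts.
    for n in (1, 2, 3, 4):
        print(n, "core(s):", vm_memory_mb(n), "MB")
```

Under these assumptions, a VM that still reports 2048 MB with 2 or 3 cores has not picked up the scaling.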

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 544 Credit: 400,710 RAC: 0	Message 4169 - Posted: 7 Oct 2016, 14:10:33 UTC - in response to Message 4168.
I ran 1 WU with the following setup:
- 3 cores (configured via the project webpage)
- no app_config.xml
Result (https://lhcathome.cern.ch/vLHCathome-dev/result.php?resultid=271086):
- The VM started with 3 cores and 2 GB RAM
- After 7.5 minutes I got an error 207 (EXIT_NO_SUB_TASKS)

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,863 RAC: 78	Message 4170 - Posted: 7 Oct 2016, 17:57:59 UTC - in response to Message 4168.
The memory extension is not working. It sticks to 2048MB. Tested it with 2 cores.

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 544 Credit: 400,710 RAC: 0	Message 4171 - Posted: 8 Oct 2016, 7:03:00 UTC - in response to Message 4169.
> I ran 1 WU with the following setup:
> - 3 cores (configured via the project webpage)
> - no app_config.xml
> Result (https://lhcathome.cern.ch/vLHCathome-dev/result.php?resultid=271086):
> - The VM started with 3 cores and 2 GB RAM
> - After 7.5 minutes I got an error 207 (EXIT_NO_SUB_TASKS)
I just saw that my host got the old v47.30 app instead of the new v47.50. Both are listed here.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,863 RAC: 78	Message 4172 - Posted: 8 Oct 2016, 7:35:42 UTC - in response to Message 4171.
> I just saw that my host got the old v47.30 app instead of the new v47.50. Both are listed here.
... and for Windows I got the 47.40 (vbox64_mt_mcore) instead of the new 47.50 (vbox64_mt_mcore_cms).

Magic Quantum Mechanic Joined: 8 Apr 15 Posts: 1000 Credit: 17,859,706 RAC: 19,223	Message 4173 - Posted: 8 Oct 2016, 20:16:08 UTC - in response to Message 4172.
>> I just saw that my host got the old v47.30 app instead of the new v47.50. Both are listed here.
> ... and for Windows I got the 47.40 (vbox64_mt_mcore) instead of the new 47.50 (vbox64_mt_mcore_cms).
I got CMS Simulation v47.50 (vbox64_mt_mcore_cms) windows_x86_64 on one Win 10 host, but all I got was errors, so I will try it on another host and see if I have better luck today.
http://lhcathomedev.cern.ch/vLHCathome-dev/results.php?userid=192
Mad Scientist For Life

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,863 RAC: 78	Message 4174 - Posted: 9 Oct 2016, 8:14:59 UTC - in response to Message 4173.
> I got CMS Simulation v47.50 (vbox64_mt_mcore_cms) windows_x86_64 on one Win 10 OS but all I got was Errors so I will try it on another host and see if I have better luck today.
It looks like the VMs are not booting properly. Also, the setup does not seem to be working, or are you using an app_config.xml?
Setting Memory Size for VM. (3000MB)
Setting CPU Count for VM. (1)

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 4179 - Posted: 11 Oct 2016, 8:10:44 UTC - in response to Message 4171.
Thanks. I have just deprecated the older versions.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 4180 - Posted: 11 Oct 2016, 8:47:16 UTC - in response to Message 4179.
Laurence, I had a task complete successfully ~0500 GMT this morning; since then they have all failed. I see several possible errors, e.g.
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:11 enter Daemons::UpdateCollector
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:11 Trying to update collector <130.246.180.120:9623>
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:11 Attempting to send update via TCP to collector lcggwms02.gridpp.rl.ac.uk <130.246.180.120:9623>
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:11 exit Daemons::UpdateCollector
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:35 Got SIGTERM. Performing graceful shutdown.
...
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 05:57:33 State change: benchmarks completed
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 05:57:33 slot1: Changing activity: Benchmarking -> Idle
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:35 No resources have been claimed for 300 seconds
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:35 Shutting down Condor on this machine.
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:35 Got SIGTERM. Performing graceful shutdown.
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:35 shutdown graceful
...
2016-10-11 06:02:38 (25369): Guest Log: 10/11/16 06:02:35 CronJobList: Deleting all jobs
2016-10-11 06:02:38 (25369): Guest Log: 10/11/16 06:02:35 All resources are free, exiting.
2016-10-11 06:02:38 (25369): Guest Log: 10/11/16 06:02:35 **** condor_startd (condor_STARTD) pid 4300 EXITING WITH STATUS 0
2016-10-11 06:02:38 (25369): Guest Log: [ERROR] No jobs were available to run.
2016-10-11 06:02:38 (25369): Guest Log: [INFO] Shutting Down.
2016-10-11 06:02:38 (25369): VM Completion File Detected.
2016-10-11 06:02:38 (25369): VM Completion Message: No jobs were available to run.
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=272506
I had a couple of failures this morning on vLHC@home on the same machine but it seems to be successfully running two jobs at the moment.
https://lhcathome.cern.ch/vLHCathome/result.php?resultid=6608306
I haven't seen anything untoward on the condor server, the tasks aren't getting as far as requesting -- or at least getting! -- jobs from the server.
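When a stderr is long, the guest-side `[ERROR]` lines can be pulled out with a small filter. The sketch below is only an illustration: the function name and the regular expression are assumptions based on the log lines quoted in this post, not part of any BOINC or vboxwrapper tool.

```python
import re

# Assumed layout, taken from the lines quoted in the post:
# "<date> <time> (<pid>): Guest Log: <message>"
GUEST_LOG = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+)\): Guest Log: (.*)$"
)


def guest_errors(lines):
    """Return (timestamp, message) pairs for guest log lines tagged [ERROR]."""
    errors = []
    for line in lines:
        match = GUEST_LOG.match(line.strip())
        if match and match.group(3).startswith("[ERROR]"):
            message = match.group(3)[len("[ERROR]"):].strip()
            errors.append((match.group(1), message))
    return errors


# Sample lines copied from the post above.
SAMPLE = [
    "2016-10-11 06:02:38 (25369): Guest Log: 10/11/16 06:02:35 All resources are free, exiting.",
    "2016-10-11 06:02:38 (25369): Guest Log: [ERROR] No jobs were available to run.",
    "2016-10-11 06:02:38 (25369): Guest Log: [INFO] Shutting Down.",
    "2016-10-11 06:02:38 (25369): VM Completion File Detected.",
]

if __name__ == "__main__":
    for timestamp, message in guest_errors(SAMPLE):
        print(timestamp, message)
```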

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 4181 - Posted: 11 Oct 2016, 9:48:14 UTC - in response to Message 4173. Last modified: 11 Oct 2016, 9:48:44 UTC
Magic, I see you have a job running on 1482 at present.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 4182 - Posted: 11 Oct 2016, 10:09:55 UTC - in response to Message 4179.
> Thanks. I have just deprecated the older versions
I just noticed I've been getting 47.30 since 0752 UTC.

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 544 Credit: 400,710 RAC: 0	Message 4183 - Posted: 11 Oct 2016, 13:59:15 UTC
I got 2 tasks v47.30, 1 before and 1 after a project reset. Both:
- run 3 cores (as configured)
- use 2 GB RAM
- run into an error 207 after a few minutes
https://lhcathome.cern.ch/vLHCathome-dev/result.php?resultid=273547
https://lhcathome.cern.ch/vLHCathome-dev/result.php?resultid=273607

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,863 RAC: 78	Message 4184 - Posted: 11 Oct 2016, 14:46:18 UTC
I got a task CMS Simulation v47.40 (vbox64_mt_mcore), but on the applications site only 47.50 (vbox64_mt_mcore_cms) is available. In my prefs I configured 1 task with 2 cores. The VM created has only 1 core and 2048 MB base memory.
<app_version>
    <app_name>CMS</app_name>
    <version_num>4740</version_num>
    <platform>windows_x86_64</platform>
    <avg_ncpus>1.000000</avg_ncpus>
    <max_ncpus>2.000000</max_ncpus>
    <flops>26823773813.494980</flops>
    <plan_class>vbox64_mt_mcore</plan_class>
    <api_version>7.7.0</api_version>
    <cmdline>--memory_size_mb 2048</cmdline>
If you want to run a VM with 2 cores, it is avg_ncpus that should be 2, not max_ncpus.
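A quick way to check what the client was actually told is to parse the app_version fragment and compare avg_ncpus with the number of cores configured. The Python sketch below is an illustration only: the function name is hypothetical, and the fragment is the one quoted in the post with a closing tag added so that it parses (the real entry in the client state file contains further elements that are not shown in the post).

```python
import xml.etree.ElementTree as ET

# Fragment from the post; the closing tag is added here only so it parses.
APP_VERSION_XML = """<app_version>
    <app_name>CMS</app_name>
    <version_num>4740</version_num>
    <platform>windows_x86_64</platform>
    <avg_ncpus>1.000000</avg_ncpus>
    <max_ncpus>2.000000</max_ncpus>
    <flops>26823773813.494980</flops>
    <plan_class>vbox64_mt_mcore</plan_class>
    <api_version>7.7.0</api_version>
    <cmdline>--memory_size_mb 2048</cmdline>
</app_version>"""


def summarize_app_version(xml_text):
    """Extract the core counts and the memory option from an app_version element."""
    root = ET.fromstring(xml_text)
    args = root.findtext("cmdline", default="").split()
    memory_mb = None
    if "--memory_size_mb" in args:
        idx = args.index("--memory_size_mb")
        if idx + 1 < len(args):
            memory_mb = int(args[idx + 1])
    return {
        "plan_class": root.findtext("plan_class"),
        "avg_ncpus": float(root.findtext("avg_ncpus")),
        "max_ncpus": float(root.findtext("max_ncpus")),
        "memory_size_mb": memory_mb,
    }


summary = summarize_app_version(APP_VERSION_XML)

if __name__ == "__main__":
    configured_cores = 2  # value set in the project preferences, per the post
    print(summary)
    if summary["avg_ncpus"] != configured_cores:
        print("avg_ncpus does not match the configured core count")
```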

m Volunteer tester Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 4197 - Posted: 19 Oct 2016, 14:23:40 UTC Last modified: 19 Oct 2016, 14:35:59 UTC
The 12/18 hr timeout still causes a fair bit of wasted time (and errors and frustration), but tasks behave oddly sometimes. In the course of shifting stuff to the new servers, I noticed this:
CMS 47.50 (mt, 1 core, took 3 GB) Host# 553. CMS Task# 274259. From a very long stderr:
Day/Time     Event
14/03.08.14  Wrapper start. (Boinc task started)
14/03.15.53  Job 6492 start.
14/06.15.33  Job finished with 0.
14/06.18.12  Job 7054 start.
14/06.57.02  VM stopped. (Normal host shutdown time)
15/01.04.44  Wrapper start. (Normal host startup time)
15/01.04.46  Job finished with 0.
15/01.04.46  Job 7054 start.
15/04.45.36  Job finished with 134.
15/04.45.46  Job 755 start.
15/06.57.02  VM stopped. (Normal shutdown time)
16/01.04.44  Wrapper start. (Normal startup time)
16/01.04.46  Job finished with 134.
16/01.04.46  Job 755 start.
16/06.12.17  Job finished with 0.
16/06.14.15  Job 5986 start.
16/06.57.02  VM stopped. (Normal shutdown time)
17/01.04.43  Wrapper start. (Normal startup time)
17/01.04.45  Job finished with 0.
17/01.04.45  Job 5986 start.
17/03.22.00  Boinc finish. (Boinc task finished OK)
Locating the jobs on Dashboard is a bit cumbersome; it's not obvious which CRAB task they each belong to, but it looks like 6492, 7054 and 755 all finished OK at the first attempt. 5986 failed (61311) with no second attempt.
Hosts are normally started by their BIOS clock and shut down by a cron job using the boinccmd "quit" command. After 3 mins the host is powered off ('nuther cron job), so the process should be fairly graceful. Total time ca 15h53.
The shutdown time is clearly not counting towards the timeout - a welcome change, but the stopping and resuming doesn't seem to work well, with the job in hand restarted. What is exit code 134? The last job was cut short. Is this how it's supposed to work?
I seem to remember that the "production" standard was performance equal to the old T4T... I don't think we're quite there yet.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 4198 - Posted: 19 Oct 2016, 14:40:14 UTC - in response to Message 4197.
From an e-mail I sent yesterday:
OK, the 134 stage-out errors seem to be mainly when the pythia job fails with an error:
...
== CMSSW: EvtGen:Could not decay:pi0 with mass:0 will throw event away!
== CMSSW: EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0
== CMSSW: EvtGen:Will take first kinematically allowed decay in the decay table
== CMSSW: EvtGen:Could not decay:pi0 with mass:0 will throw event away!
== CMSSW: EvtGen:Your event has been rejected 10000 times!
== CMSSW: EvtGen:Will now abort.
== CMSSW: Complete
== CMSSW: process id is 6188 status is 134
======== CMSSW OUTPUT FINSHING ========
...
Job wrapper did not finish successfully (exit code 134). Setting that same exit code for the stageout wrapper.
Stageout wrapper finished with exit code 134. Will report failure to Dashboard.
...
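On the "what is exit code 134?" question: a common shell convention reports a process killed by signal N as exit status 128 + N. Assuming the job wrapper follows that convention (the thread does not confirm it), 134 corresponds to signal 6, which is SIGABRT on Linux and would fit the "Will now abort" line in the log. A minimal Python sketch, with hypothetical function names:

```python
import signal


def signal_from_exit_status(status):
    """Return N if the status fits the assumed 128 + N convention, else None."""
    if 128 < status < 256:
        return status - 128
    return None


def signal_name(signum):
    """Return the platform's name for a signal number, or None if unknown."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        return None


if __name__ == "__main__":
    number = signal_from_exit_status(134)
    # The name depends on the platform the script runs on.
    print(number, signal_name(number))
```

Exit statuses of 128 or below are returned as None here, because they are ordinary exit codes rather than signal terminations under this convention.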

Development for LHC@home