Thread 'New Version v47.22'

Author	Message
Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 3766 - Posted: 21 Jul 2016, 13:52:29 UTC
This version updates HTCondor to version 8.4.8, which supports the MAX_TIME_SKIP configuration attribute, so we no longer have to use a binary-patched library to support suspend/resume.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,302 RAC: 63	Message 3767 - Posted: 21 Jul 2016, 17:18:49 UTC
What are the limits to the suspend period? I suspended a new v2.02 mt_mcore task with 4 jobs after just 8 minutes of run time. I'll wait 1hr3m and see what happens with the jobs after the resume.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,302 RAC: 63	Message 3768 - Posted: 21 Jul 2016, 18:37:39 UTC - in response to Message 3767.
After the resume (the task was suspended at 18:53 with LAIM off, so the VM was saved to disk as a snapshot), the 4 jobs continued and the connection was re-established. The StarterLog files for slots 1, 2, 3 and 4 all have similar contents:
07/21/16 18:45:23 (pid:4254) Create_Process succeeded, pid=4265
07/21/16 19:56:44 (pid:4254) CCBListener: no activity from CCB server in 3981s; assuming connection is dead.
07/21/16 19:56:44 (pid:4254) CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
07/21/16 19:56:44 (pid:4254) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
07/21/16 19:56:44 (pid:4254) IO: Failed to read packet header
07/21/16 19:56:44 (pid:4254) Lost connection to shadow, waiting 86300 secs for reconnect
07/21/16 19:57:45 (pid:4254) CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#1165792
07/21/16 20:01:56 (pid:4254) Accepted request to reconnect from <:0>
07/21/16 20:01:56 (pid:4254) Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_458838>
07/21/16 20:01:56 (pid:4254) Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_458838>
07/21/16 20:01:56 (pid:4254) Recovered connection to shadow after 312 seconds
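For anyone who wants to check their own StarterLog the same way, here is a minimal sketch that reads the timestamps and the durations HTCondor reports in the excerpt above. The helper names (parse, longest_gap, reported_seconds) are made up for this illustration and are not part of HTCondor; the sample lines are copied from the log in this post. Note that the 86300 seconds in the "waiting ... for reconnect" line is just under 24 hours (86400 s).

```python
import re
from datetime import datetime

# Four lines copied from the StarterLog excerpt in the post above.
SAMPLE = """\
07/21/16 18:45:23 (pid:4254) Create_Process succeeded, pid=4265
07/21/16 19:56:44 (pid:4254) CCBListener: no activity from CCB server in 3981s; assuming connection is dead.
07/21/16 19:56:44 (pid:4254) Lost connection to shadow, waiting 86300 secs for reconnect
07/21/16 20:01:56 (pid:4254) Recovered connection to shadow after 312 seconds
"""

# Timestamp, "(pid:N)", then the message text.
STAMP = re.compile(r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) \(pid:\d+\) (.*)$")


def parse(text):
    """Return a list of (datetime, message) tuples for lines matching STAMP."""
    entries = []
    for line in text.splitlines():
        match = STAMP.match(line.strip())
        if match:
            when = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
            entries.append((when, match.group(2)))
    return entries


def longest_gap(entries):
    """Longest silence, in seconds, between two consecutive log entries."""
    gaps = [(b[0] - a[0]).total_seconds() for a, b in zip(entries, entries[1:])]
    return max(gaps) if gaps else 0.0


def reported_seconds(entries, pattern):
    """First integer captured by `pattern` in any message, or None."""
    for _, message in entries:
        match = re.search(pattern, message)
        if match:
            return int(match.group(1))
    return None


entries = parse(SAMPLE)
print("longest gap between entries:", longest_gap(entries), "s")
print("CCB idle time reported:",
      reported_seconds(entries, r"no activity from CCB server in (\d+)s"), "s")
print("reconnect window:",
      reported_seconds(entries, r"waiting (\d+) secs for reconnect"), "s")
print("shadow recovered after:",
      reported_seconds(entries, r"Recovered connection to shadow after (\d+) seconds"), "s")
```

The longest gap in the log is only an upper bound on the suspension, since it also includes the run time between the last entry and the moment the task was suspended.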

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 3770 - Posted: 21 Jul 2016, 22:45:40 UTC - in response to Message 3767.
The limit is 24h.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 3771 - Posted: 22 Jul 2016, 5:06:09 UTC Last modified: 22 Jul 2016, 5:21:17 UTC
Since about 5.44 UTC there seem to be no jobs available.
BTW, BOINC tasks should not error out if no work is available.
EDIT: And this message shows at the top of stderr:
<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
The ring 2 stack is in use. (0xcf) - exit code 207 (0xcf)
</message>

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,302 RAC: 63	Message 3772 - Posted: 22 Jul 2016, 5:21:07 UTC - in response to Message 3770. Last modified: 22 Jul 2016, 5:51:53 UTC
I did the 'overnight' suspend test. The resume seems to have been successful.
07/21/16 21:51:45 (pid:84374) Running job as user nobody
07/21/16 21:51:45 (pid:84374) Create_Process succeeded, pid=84384
07/22/16 07:07:51 (pid:84374) CCBListener: no activity from CCB server in 29767s; assuming connection is dead.
07/22/16 07:07:51 (pid:84374) CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
07/22/16 07:07:51 (pid:84374) condor_read() failed: recv(fd=10) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
07/22/16 07:07:51 (pid:84374) IO: Failed to read packet header
07/22/16 07:07:51 (pid:84374) Lost connection to shadow, waiting 86300 secs for reconnect
07/22/16 07:08:51 (pid:84374) CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#1168778
07/22/16 07:12:46 (pid:84374) Accepted request to reconnect from <:0>
07/22/16 07:12:46 (pid:84374) Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_459802>
07/22/16 07:12:46 (pid:84374) Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_459802>
07/22/16 07:12:46 (pid:84374) Recovered connection to shadow after 295 seconds
After the 4 jobs finished, no new job was requested. Maybe that is because no jobs were available, but more likely (it worked that way in the past) it is because the (overnight) wall-clock time is treated as elapsed time for the VM, so the 12 hours are over.
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=226241

Ben Segal Volunteer moderator Volunteer developer Volunteer tester Joined: 12 Sep 14 Posts: 65 Credit: 544 RAC: 0	Message 3773 - Posted: 22 Jul 2016, 6:55:02 UTC - in response to Message 3771.
> Since about 5.44 UTC there seem to be no jobs available.
> BTW, BOINC tasks should not error out if no work is available.
> EDIT: And this message shows at the top of stderr:
> <core_client_version>7.6.22</core_client_version>
> <![CDATA[
> <message>
> The ring 2 stack is in use. (0xcf) - exit code 207 (0xcf)
> </message>
Yes, there was a problem over the last few hours with both job supply and result processing on the Theory application. This has now been fixed, so things should clear up steadily.
@CP, this also explains your lack of jobs after resuming...

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,302 RAC: 63	Message 3774 - Posted: 22 Jul 2016, 9:02:19 UTC - in response to Message 3773.
> @CP, this also explains your lack of jobs after resuming...
Hi Ben, I don't think so. With a newly started task, I got new jobs right away.
07:53:14 +0200 2016-07-22 [INFO] New Job Starting in slot2
07:53:14 +0200 2016-07-22 [INFO] New Job Starting in slot1
07:57:12 +0200 2016-07-22 [INFO] New Job Starting in slot4
07:57:12 +0200 2016-07-22 [INFO] New Job Starting in slot3
As explained, the VM treats paused/suspended time as elapsed time. A VM started at e.g. 06:00, paused at 07:00 and resumed at e.g. 18:30 will finish the currently saved jobs but not request new ones, because the 12 hours are over; it then drains until the last job has finished (or until 18 hours are reached).
Not a real problem, but nice to know. At least VMs now survive longer periods of suspension. Well done.
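The arithmetic in the example above can be sketched as follows. This is only an illustration under stated assumptions: the 12-hour and 18-hour thresholds come from the post, while the function name vm_phase, its phase labels and the count_suspended switch are hypothetical and do not describe how the VM actually implements its timers.

```python
from datetime import datetime, timedelta

# Thresholds as described in the post; how the VM enforces them
# internally is not shown in this thread.
JOB_REQUEST_WINDOW = timedelta(hours=12)  # after this, no new jobs are requested
HARD_LIMIT = timedelta(hours=18)          # draining ends here at the latest


def vm_phase(started, now, suspended=timedelta(0), count_suspended=True):
    """Classify a VM by age (hypothetical helper, for illustration only).

    With count_suspended=True the suspended time counts towards the age,
    which is the behaviour described in the post.
    """
    age = now - started
    if not count_suspended:
        age -= suspended
    if age >= HARD_LIMIT:
        return "shut down"
    if age >= JOB_REQUEST_WINDOW:
        return "drain"
    return "request jobs"


# The example from the post: started 06:00, paused 07:00, resumed 18:30.
day = datetime(2016, 7, 22)
started = day.replace(hour=6)
paused = day.replace(hour=7)
resumed = day.replace(hour=18, minute=30)
suspended = resumed - paused  # 11 h 30 min of suspension

# Wall-clock age at resume is 12 h 30 min, so the VM only drains.
print(vm_phase(started, resumed))
# If suspension were excluded, the age would be 1 h and jobs would be requested.
print(vm_phase(started, resumed, suspended, count_suspended=False))
```

With the suspension counted, the VM is already past the 12-hour window at resume even though it has only run for one hour.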

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 3775 - Posted: 22 Jul 2016, 9:50:52 UTC - in response to Message 3774.
I will put the new image into production as soon as I can.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 3776 - Posted: 22 Jul 2016, 10:00:51 UTC - in response to Message 3775.
Please defragment the image before deployment (for conventional HDDs). Obviously, this only helps if the volunteer keeps the image defragmented as well.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 3780 - Posted: 22 Jul 2016, 13:16:18 UTC
<app_version>
  <app_name>Theory</app_name>
  <plan_class>vbox64_mt_mcore</plan_class>
  <avg_ncpus>c</avg_ncpus>
  <cmdline>--nthreads t</cmdline>
  <cmdline>--memory_size_mb mmm</cmdline>
</app_version>
This has absolutely no effect. I guess the wrapper cannot handle this.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,302 RAC: 63	Message 3781 - Posted: 22 Jul 2016, 14:49:14 UTC - in response to Message 3775.
> I will put the new image into production as soon as I can.
I repeated my longer suspend test with about 4 hours of suspension, but this time all 4 saved jobs were aborted and new ones were started.
07/22/16 11:01:17 (pid:123913) Create_Process succeeded, pid=123923
07/22/16 16:24:33 (pid:123913) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
07/22/16 16:24:33 (pid:123913) IO: Failed to read packet header
07/22/16 16:24:33 (pid:123913) Lost connection to shadow, waiting 86300 secs for reconnect
07/22/16 16:24:33 (pid:123913) CCBListener: no activity from CCB server in 12484s; assuming connection is dead.
07/22/16 16:24:33 (pid:123913) CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
07/22/16 16:25:34 (pid:123913) CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#1184889
07/22/16 16:29:47 (pid:123913) Accepted request to reconnect from <:0>
07/22/16 16:29:47 (pid:123913) Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_466837>
07/22/16 16:29:47 (pid:123913) Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_466837>
07/22/16 16:29:47 (pid:123913) Recovered connection to shadow after 314 seconds
07/22/16 16:29:47 (pid:123913) Got SIGTERM. Performing graceful shutdown.
07/22/16 16:29:47 (pid:123913) ShutdownGraceful all jobs.
07/22/16 16:29:47 (pid:123913) Process exited, pid=123923, signal=15

m Volunteer tester Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 3782 - Posted: 22 Jul 2016, 15:14:59 UTC - in response to Message 3774.
> As explained, the VM treats paused/suspended time as elapsed time. A VM started at e.g. 06:00, paused at 07:00 and resumed at e.g. 18:30 will finish the currently saved jobs but not request new ones, because the 12 hours are over; it then drains until the last job has finished (or until 18 hours are reached).
> Not a real problem, but nice to know. At least VMs now survive longer periods of suspension. Well done.
It's a problem here, admittedly much worse with CMS than Theory. If a VM starts but hasn't completed a job before the host shuts down, the restart is >18h later and the task often fails with error 206, "Condor exited without running a job". Like this.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 3783 - Posted: 22 Jul 2016, 18:29:10 UTC
Again, no jobs available since about 17.00 UTC.

Development for LHC@home