New Version v47.22
Author | Message |
---|---|
Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
This version updates HTCondor to version 8.4.8, which supports the MAX_TIME_SKIP configuration attribute, so we no longer have to use a binary-patched library to support suspend/resume. |
Joined: 13 Feb 15 Posts: 1188 Credit: 857,561 RAC: 22 |
What are the limits to the suspend period? I suspended a new v2.02 mt_mcore task with 4 jobs after just 8 minutes of run time. I'll wait 1hr3m and see what happens to the jobs after the resume. |
Joined: 13 Feb 15 Posts: 1188 Credit: 857,561 RAC: 22 |
After the resume (the task was suspended at 18:53 with LAIM off, so the VM was saved into a snapshot on disk) the 4 jobs continued and the connection was re-established. All four StarterLog files (slots 1, 2, 3, 4) have similar contents:
07/21/16 18:45:23 (pid:4254) Create_Process succeeded, pid=4265
07/21/16 19:56:44 (pid:4254) CCBListener: no activity from CCB server in 3981s; assuming connection is dead.
07/21/16 19:56:44 (pid:4254) CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
07/21/16 19:56:44 (pid:4254) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
07/21/16 19:56:44 (pid:4254) IO: Failed to read packet header
07/21/16 19:56:44 (pid:4254) Lost connection to shadow, waiting 86300 secs for reconnect
07/21/16 19:57:45 (pid:4254) CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#1165792
07/21/16 20:01:56 (pid:4254) Accepted request to reconnect from <:0>
07/21/16 20:01:56 (pid:4254) Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_458838>
07/21/16 20:01:56 (pid:4254) Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_458838>
07/21/16 20:01:56 (pid:4254) Recovered connection to shadow after 312 seconds |
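For anyone who wants to check how long a VM was actually frozen, the gap between consecutive StarterLog timestamps can be measured. This is only a minimal sketch: it assumes the `MM/DD/YY HH:MM:SS` prefix seen in the excerpt above, and `largest_gap` is a hypothetical helper name, not part of HTCondor or BOINC.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the StarterLog excerpt, e.g. "07/21/16 18:45:23".
STAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ")


def largest_gap(lines):
    """Return (seconds, before, after) for the longest pause between log entries."""
    stamps = []
    for line in lines:
        m = STAMP.match(line)
        if m:  # lines without a timestamp prefix are skipped
            stamps.append(datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S"))
    best = (0.0, None, None)
    for prev, cur in zip(stamps, stamps[1:]):
        gap = (cur - prev).total_seconds()
        if gap > best[0]:
            best = (gap, prev, cur)
    return best


# Three lines taken from the excerpt above.
log = [
    "07/21/16 18:45:23 (pid:4254) Create_Process succeeded, pid=4265",
    "07/21/16 19:56:44 (pid:4254) CCBListener: no activity from CCB server in 3981s; assuming connection is dead.",
    "07/21/16 20:01:56 (pid:4254) Recovered connection to shadow after 312 seconds",
]
seconds, before, after = largest_gap(log)
print(f"{seconds:.0f}s between {before:%H:%M:%S} and {after:%H:%M:%S}")
# prints "4281s between 18:45:23 and 19:56:44"
```

The second gap (19:56:44 to 20:01:56) is 312 s, which matches the "Recovered connection to shadow after 312 seconds" line.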
Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
The limit is 24h |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Since about 05:44 UTC there seem to be no jobs available. BTW, BOINC tasks should not error out if no work is available. EDIT: And this message shows at the top of stderr. <core_client_version>7.6.22</core_client_version> |
Joined: 13 Feb 15 Posts: 1188 Credit: 857,561 RAC: 22 |
I did the 'overnight' suspend test. The resume seems to be successful.
07/21/16 21:51:45 (pid:84374) Running job as user nobody
07/21/16 21:51:45 (pid:84374) Create_Process succeeded, pid=84384
07/22/16 07:07:51 (pid:84374) CCBListener: no activity from CCB server in 29767s; assuming connection is dead.
07/22/16 07:07:51 (pid:84374) CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
07/22/16 07:07:51 (pid:84374) condor_read() failed: recv(fd=10) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
07/22/16 07:07:51 (pid:84374) IO: Failed to read packet header
07/22/16 07:07:51 (pid:84374) Lost connection to shadow, waiting 86300 secs for reconnect
07/22/16 07:08:51 (pid:84374) CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#1168778
07/22/16 07:12:46 (pid:84374) Accepted request to reconnect from <:0>
07/22/16 07:12:46 (pid:84374) Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_459802>
07/22/16 07:12:46 (pid:84374) Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_459802>
07/22/16 07:12:46 (pid:84374) Recovered connection to shadow after 295 seconds
After the 4 jobs finished, no new job was requested. Maybe this is because no jobs were available, but more likely (it worked that way in the past) it is because the (overnight) wall-clock time is treated as elapsed time for the VM, so the 12 hours are over.
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=226241 |
Joined: 12 Sep 14 Posts: 65 Credit: 544 RAC: 0 |
"Since about 05:44 UTC there seem to be no jobs available." Yes, there was a problem over the last few hours with both job supply and result processing on the Theory Application. This has now been fixed, so things should clear up steadily. @CP, this also explains your lack of jobs after resuming... |
Joined: 13 Feb 15 Posts: 1188 Credit: 857,561 RAC: 22 |
"@CP, this also explains your lack of jobs after resuming..." Hi Ben, I don't think so. With a newly started task, I got new jobs right away.
07:53:14 +0200 2016-07-22 [INFO] New Job Starting in slot2
07:53:14 +0200 2016-07-22 [INFO] New Job Starting in slot1
07:57:12 +0200 2016-07-22 [INFO] New Job Starting in slot4
07:57:12 +0200 2016-07-22 [INFO] New Job Starting in slot3
As explained, the VM treats pause/suspend time as elapsed time. A VM started at e.g. 06:00, paused at 07:00 and resumed at e.g. 18:30 will finish its currently saved jobs and will not request new ones, because the 12 hours are over; it then drains until the last job has finished (or until 18 hrs). Not a real problem, but nice to know. At least VMs now survive longer periods of suspension. Well done. |
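The 06:00 / 07:00 / 18:30 example can be worked through as a small sketch. It assumes, as described in the post, that the VM decides whether to request new jobs by comparing wall-clock time since start against a 12-hour limit. All names here (`JOB_REQUEST_LIMIT`, `requests_new_jobs`) are hypothetical, the date is arbitrary, and this is not the actual VM code.

```python
from datetime import datetime, timedelta

JOB_REQUEST_LIMIT = timedelta(hours=12)  # the "12 hours" mentioned in the post


def requests_new_jobs(started, now, suspended=timedelta(0), count_suspension=True):
    """Return True while the VM would still ask for new jobs.

    count_suspension=True models the observed behaviour (suspend time counts
    as elapsed time); False models counting only the time actually running.
    """
    elapsed = now - started
    if not count_suspension:
        elapsed -= suspended
    return elapsed < JOB_REQUEST_LIMIT


day = datetime(2016, 7, 22)                # arbitrary date for the example
started = day.replace(hour=6)              # VM started at 06:00
paused = day.replace(hour=7)               # paused at 07:00
resumed = day.replace(hour=18, minute=30)  # resumed at 18:30
suspended = resumed - paused               # 11h30m spent suspended

print(requests_new_jobs(started, resumed, suspended))                          # False
print(requests_new_jobs(started, resumed, suspended, count_suspension=False))  # True
```

With suspension counted, 12h30m have passed at resume time, so the VM only drains its saved jobs. If only run time counted, just 1 hour would have been used.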
Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
I will put the new image into production as soon as I can. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Please defragment the image before deployment (for conventional HDDs). Obviously, this only helps if the volunteers keep the image defragmented as well. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
<app_version> Has absolutely no effect. I guess the wrapper cannot handle this. |
Joined: 13 Feb 15 Posts: 1188 Credit: 857,561 RAC: 22 |
"I will put the new image into production as soon as I can." I repeated my longer suspend test with about 4 hours of suspension, but this time all 4 saved jobs were aborted and new ones were started.
07/22/16 11:01:17 (pid:123913) Create_Process succeeded, pid=123923
07/22/16 16:24:33 (pid:123913) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
07/22/16 16:24:33 (pid:123913) IO: Failed to read packet header
07/22/16 16:24:33 (pid:123913) Lost connection to shadow, waiting 86300 secs for reconnect
07/22/16 16:24:33 (pid:123913) CCBListener: no activity from CCB server in 12484s; assuming connection is dead.
07/22/16 16:24:33 (pid:123913) CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
07/22/16 16:25:34 (pid:123913) CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#1184889
07/22/16 16:29:47 (pid:123913) Accepted request to reconnect from <:0>
07/22/16 16:29:47 (pid:123913) Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_466837>
07/22/16 16:29:47 (pid:123913) Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_466837>
07/22/16 16:29:47 (pid:123913) Recovered connection to shadow after 314 seconds
07/22/16 16:29:47 (pid:123913) Got SIGTERM. Performing graceful shutdown.
07/22/16 16:29:47 (pid:123913) ShutdownGraceful all jobs.
07/22/16 16:29:47 (pid:123913) Process exited, pid=123923, signal=15 |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
It's a problem here, admittedly much worse with CMS than with Theory. If a VM starts but hasn't completed a job before the host shuts down, and the restart is >18h later, the task often fails with "Condor exited without running a job", error 206. Like this. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Again, no jobs available since about 17:00 UTC. |
©2024 CERN