Suspend/Resume

Author	Message
Laurence (Project administrator, Project developer, Project tester)
Message 2154 - Posted: 1 Mar 2016, 23:07:08 UTC - in response to Message 2152.
I am surprised. That error is related to the NOT_RESPONDING_TIMEOUT parameter, which is set to 2 h. Can you check the glidein-stderr log to see what it was actually set to?

Rasputin42 (Volunteer tester)
Message 2155 - Posted: 1 Mar 2016, 23:21:54 UTC - in response to Message 2154.
Even if it is set to 2 h, it depends on when those 2 h start. The same goes for the other timeouts.

m (Volunteer tester)
Message 2156 - Posted: 1 Mar 2016, 23:32:09 UTC - in response to Message 2154. Last modified: 1 Mar 2016, 23:53:58 UTC
Quote: "I am surprised. That error is related to the NOT_RESPONDING_TIMEOUT parameter which is set at 2h. Can you check the glidein-stderr log to see what it was set as?"
    CCB_HEARTBEAT_INTERVAL 3600
    NOT_RESPONDING_TIMEOUT 7200
    MAX_CLAIM_ALIVES_MISSED 6
The Startd log ends:
    RootDir = "/"
    CommittedTime = 0
    03/01/16 21:50:34 (pid:7953) ERROR: Child pid 8011 appears hung! Killing it hard.
    03/01/16 21:50:34 (pid:7953) State change: claim lease expired (condor_schedd gone?)
    03/01/16 21:50:34 (pid:7953) Changing state and activity: Claimed/Busy -> Preempting/Killing

Rasputin42 (Volunteer tester)
Message 2158 - Posted: 2 Mar 2016, 16:36:48 UTC
I suggest setting CCB_HEARTBEAT_INTERVAL to 0. Comments? This is a testing project, so why not give it a try?

Laurence (Project administrator, Project developer, Project tester)
Message 2159 - Posted: 2 Mar 2016, 16:41:23 UTC - in response to Message 2158.
I was hoping that the Condor developers would get back to us, but everyone is busy at the workshop.

Rasputin42 (Volunteer tester)
Message 2160 - Posted: 2 Mar 2016, 16:52:35 UTC - in response to Message 2159.
That's a "No" then?

Rasputin42 (Volunteer tester)
Message 2162 - Posted: 2 Mar 2016, 18:12:32 UTC
I found a setting (Windows 8.1 x64) that finally allowed me to get past the 20 min barrier I had been hitting. (It may work with other Windows versions.) At the administrator command prompt:
    net config server /autodisconnect:-1
This disables the automatic closure of network ports when they are not in use. To switch it back:
    net config server /autodisconnect:15
This helps, but does not solve all the issues. If lines like the following show up in the StarterLog after the suspend, you can be sure that full communication has been re-established with the server:
    CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#135485
and
    Communicating with shadow <130.246.180.120:9818?noUDP&sock=2939_634c_3448>
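
The StarterLog check described in the post above could be automated roughly as follows. This is only a sketch: the two marker strings are taken from the post, but the timestamps and pid in the sample lines are hypothetical, and the function names (parse_stamp, markers_seen_after, fully_reconnected) are made up for illustration. It assumes each log line starts with a "MM/DD/YY HH:MM:SS" timestamp, as in the other log excerpts in this thread.

```python
from datetime import datetime

# Marker substrings taken from the two StarterLog lines quoted in the post.
MARKERS = (
    "CCBListener: registered with CCB server",
    "Communicating with shadow",
)


def parse_stamp(line):
    """Return the leading 'MM/DD/YY HH:MM:SS' timestamp, or None if absent."""
    try:
        return datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S")
    except ValueError:
        return None


def markers_seen_after(lines, resumed_at):
    """Report which markers appear in lines stamped at or after resumed_at."""
    seen = {marker: False for marker in MARKERS}
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is None or stamp < resumed_at:
            continue  # ignore unstamped lines and anything before the resume
        for marker in MARKERS:
            if marker in line:
                seen[marker] = True
    return seen


def fully_reconnected(lines, resumed_at):
    """True only if every marker was logged after the resume time."""
    return all(markers_seen_after(lines, resumed_at).values())


# Hypothetical sample: timestamps and pid are invented for the example.
SAMPLE_LINES = [
    "03/02/16 17:40:00 (pid:1234) Communicating with shadow <130.246.180.120:9818?noUDP&sock=2939_634c_3448>",
    "03/02/16 18:05:10 (pid:1234) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#135485",
    "03/02/16 18:05:12 (pid:1234) Communicating with shadow <130.246.180.120:9818?noUDP&sock=2939_634c_3448>",
]

print(fully_reconnected(SAMPLE_LINES, datetime(2016, 3, 2, 18, 0, 0)))
```

Requiring both markers after the resume time mirrors the post's wording ("full communication"); seeing only one of them would be treated as not yet reconnected.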

Laurence (Project administrator, Project developer, Project tester)
Message 2166 - Posted: 2 Mar 2016, 19:07:54 UTC - in response to Message 2160.
It's taking a little longer than normal to make the change.

Laurence (Project administrator, Project developer, Project tester)
Message 2167 - Posted: 2 Mar 2016, 19:09:19 UTC - in response to Message 2162.
Condor should be able to handle this and reconnect. We just need to understand the secret. We will get there; it is just taking longer than first estimated.

Rasputin42 (Volunteer tester)
Message 2176 - Posted: 2 Mar 2016, 22:21:33 UTC. Last modified: 2 Mar 2016, 22:48:38 UTC
I found this:
https://lists.cs.wisc.edu/archive/htcondor-users/2013-December/msg00112.shtml
This might be related to our issue. The problem is that the connection to the shadow is not re-established when the suspend time is too long.
EDIT: It might be worth trying this (from the link):
"2) Set TCP keepalive on the shadow socket to be on the order of JobLeaseDuration / 2."
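
For readers unfamiliar with TCP keepalive, here is a minimal sketch of what "keepalive on the order of JobLeaseDuration / 2" means at the socket level. It uses Python's standard socket module on a plain, unconnected TCP socket; it is not how HTCondor itself configures the shadow socket. The 7200 s lease value is an assumption for illustration, and the helper name enable_keepalive is made up.

```python
import socket


def enable_keepalive(sock, idle_seconds):
    """Turn on TCP keepalive and, where available, set the idle time."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE is not exposed on every platform, so guard the call.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)


JOB_LEASE_DURATION = 7200  # seconds; assumed value for illustration
keepalive_idle = JOB_LEASE_DURATION // 2  # "on the order of" half the lease

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    enable_keepalive(sock, keepalive_idle)
    # Read the option back to confirm keepalive is active on this socket.
    keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0

print(keepalive_on, keepalive_idle)
```

With keepalive enabled, the operating system sends probes on an otherwise idle connection, so a peer that has silently gone away is noticed well before the lease runs out.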

Laurence (Project administrator, Project developer, Project tester)
Message 2225 - Posted: 4 Mar 2016, 15:12:52 UTC - in response to Message 2176.
The Condor configuration has been updated and pushed to CVMFS. It should be available with new glideins (runs) this evening. We are targeting suspensions of up to 2 hours. Here are the current settings:
    NOT_RESPONDING_TIMEOUT = 10800
    CCB_HEARTBEAT_INTERVAL 0
    ALIVE_INTERVAL = 1800
    MAX_CLAIM_ALIVES_MISSED = 6
    JobLeaseDuration = 7200
The settings are shown in the glidein-stderr file from the graphics/logs.
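
To put the settings above next to the 2-hour target, here is a small sketch that parses them and converts the values to hours. The parser and variable names are made up for illustration. One assumption is labelled in the code: that a claim is considered lost after MAX_CLAIM_ALIVES_MISSED keepalives, spaced ALIVE_INTERVAL seconds apart, have been missed. That is my reading of the HTCondor documentation, not something stated in this thread.

```python
SETTINGS_TEXT = """
NOT_RESPONDING_TIMEOUT = 10800
CCB_HEARTBEAT_INTERVAL 0
ALIVE_INTERVAL = 1800
MAX_CLAIM_ALIVES_MISSED = 6
JobLeaseDuration = 7200
"""


def parse_settings(text):
    """Parse 'NAME = value' or 'NAME value' lines into a dict of ints."""
    settings = {}
    for line in text.splitlines():
        parts = line.replace("=", " ").split()
        if len(parts) == 2:
            settings[parts[0]] = int(parts[1])
    return settings


settings = parse_settings(SETTINGS_TEXT)

# Assumption: the claim times out once this many keepalives have been missed.
claim_timeout = settings["ALIVE_INTERVAL"] * settings["MAX_CLAIM_ALIVES_MISSED"]
target_suspend = 2 * 3600  # the 2-hour target mentioned in the post

for name, seconds in [
    ("NOT_RESPONDING_TIMEOUT", settings["NOT_RESPONDING_TIMEOUT"]),
    ("claim timeout (assumed formula)", claim_timeout),
    ("JobLeaseDuration", settings["JobLeaseDuration"]),
    ("target suspend", target_suspend),
]:
    print(f"{name}: {seconds} s = {seconds / 3600:g} h")
```

Under that assumption, every timeout listed is at least as long as the 2-hour target, with JobLeaseDuration matching it exactly.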

Crystal Pellet (Volunteer tester)
Message 2244 - Posted: 5 Mar 2016, 8:19:54 UTC
Going far over the 2 hours. Stopped BOINC and powered off the host overnight. After resume this morning, the old job (4921) was killed:
    03/05/16 08:33:06 (pid:7562) State change: claim lease expired (condor_schedd gone?)
    03/05/16 08:33:06 (pid:7562) Changing state and activity: Claimed/Busy -> Preempting/Killing
    03/05/16 08:33:06 (pid:7562) ERROR: Child pid 12453 appears hung! Killing it hard.
    03/05/16 08:33:06 (pid:7562) Starter pid 12453 died on signal 9 (signal 9 (Killed))
and
    03/04/16 22:42:50 (pid:12453) Create_Process succeeded, pid=12457
    03/05/16 08:33:06 (pid:12453) Got SIGQUIT. Performing fast shutdown.
    03/05/16 08:33:06 (pid:12453) ShutdownFast all jobs.
The VM started a new glidein run and got a new job (5189).
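
The length of the overnight pause can be read off the log excerpt above by looking for the largest jump between consecutive timestamps. A rough sketch, using the three pid 12453 lines from the post as input; the function name largest_gap is made up, and the code assumes every relevant line starts with a "MM/DD/YY HH:MM:SS" timestamp.

```python
from datetime import datetime

LOG_LINES = [
    "03/04/16 22:42:50 (pid:12453) Create_Process succeeded, pid=12457",
    "03/05/16 08:33:06 (pid:12453) Got SIGQUIT. Performing fast shutdown.",
    "03/05/16 08:33:06 (pid:12453) ShutdownFast all jobs.",
]


def largest_gap(lines):
    """Return (gap, earlier_line, later_line) for the biggest timestamp jump.

    Returns None when fewer than two timestamped lines are present.
    """
    stamped = []
    for line in lines:
        try:
            stamp = datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S")
        except ValueError:
            continue  # skip lines without a leading timestamp
        stamped.append((stamp, line))
    best = None
    for (t0, line0), (t1, line1) in zip(stamped, stamped[1:]):
        if best is None or t1 - t0 > best[0]:
            best = (t1 - t0, line0, line1)
    return best


gap, before, after = largest_gap(LOG_LINES)
print(gap)
```

Here the gap between the last line before power-off and the first line after resume comes to several hours, far beyond the 2-hour target, which is consistent with the claim lease having expired.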

Rasputin42 (Volunteer tester)
Message 2256 - Posted: 5 Mar 2016, 23:50:29 UTC
I tested a 30 min suspend time at three points:
- at about event 1
- at about event 250
- about halfway through the upload
In all cases the results were valid and a new job started afterwards.

Rasputin42 (Volunteer tester)
Message 2302 - Posted: 10 Mar 2016, 15:35:36 UTC. Last modified: 10 Mar 2016, 15:48:26 UTC
After some more investigation, I think the "shadow_worklife" variable is responsible for preventing longer suspend times. The shadow has to establish contact with the host; the reverse is not possible. Failures occur because the shadow does not resume contact after a prolonged suspend. I therefore suggest, as a test, changing it from 1 h (the default) to 2 h and running some tests. If it doesn't have the desired effect (or has an adverse one), it can be set back to the default.
What is the longest suspend time anyone has seen with the job reported as a success and a new job starting without any significant delay? (My best is about 35 min.)
Or have efforts to extend suspend times been put on hold?

Crystal Pellet (Volunteer tester)
Message 2303 - Posted: 10 Mar 2016, 17:24:39 UTC - in response to Message 2302.
Quote: "What is the longest suspend time anyone has seen, with a job reported a success and new job starting without any significant delays? (My best is about 35min)"
My message from the 1st of March → 1 h 25 min

Rasputin42 (Volunteer tester)
Message 2304 - Posted: 10 Mar 2016, 17:46:29 UTC - in response to Message 2303.
Thanks, Crystal Pellet. Was that a one-off, or could you repeat it? It may depend on when during the job it was suspended.

Crystal Pellet (Volunteer tester)
Message 2305 - Posted: 10 Mar 2016, 18:27:40 UTC
That was my only test between 1 and 2 hours of suspend time. I'll repeat the test with a 90-minute suspend time; I've suspended the task/job during processing of the 8th record.

Rasputin42 (Volunteer tester)
Message 2306 - Posted: 10 Mar 2016, 18:59:03 UTC - in response to Message 2305.
Thanks, Crystal. I would like to find out whether there is a COMMON, reliably safe suspend time. I am trying 45 min, as I have not had any success past 35 min.

Rasputin42 (Volunteer tester)
Message 2307 - Posted: 10 Mar 2016, 21:11:24 UTC
I tried a 45 min suspend at event 11. The job finished after resuming and uploaded with status 0, but it was not reported to the shadow.
    03/10/16 21:21:16 (pid:11637) Buf::write(): condor_write() failed
    03/10/16 21:21:16 (pid:11637) Failed to send job exit status to shadow
    03/10/16 21:21:16 (pid:11637) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
    03/10/16 21:21:16 (pid:11637) Returning from CStarter::JobReaper()
    03/10/16 21:45:37 (pid:11637) Got SIGQUIT. Performing fast shutdown.
    03/10/16 21:45:37 (pid:11637) ShutdownFast all jobs.
    03/10/16 21:45:37 (pid:11637) condor_write(): Socket closed when trying to write 191 bytes to <130.246.180.120:9818>, fd is 11
    03/10/16 21:45:37 (pid:11637) Buf::write(): condor_write() failed
    03/10/16 21:45:37 (pid:11637) Failed to send job exit status to shadow
    03/10/16 21:45:37 (pid:11637) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
The job was continued by another host. My host started a new job 1 h 20 min after the old job was resumed.

Crystal Pellet (Volunteer tester)
Message 2308 - Posted: 10 Mar 2016, 21:56:59 UTC
All fine here:
    vLHCathome-dev 10 Mar 19:13:31 CET task wu_1457619778_27_0 suspended by user
    vLHCathome-dev 10 Mar 20:43:39 CET task wu_1457619778_27_0 resumed by user
    Begin processing the 8th record. Run 1, Event 2303508, LumiSection 27643 at 10-Mar-2016 19:11:52.744 CET
    Begin processing the 9th record. Run 1, Event 2303509, LumiSection 27643 at 10-Mar-2016 20:44:06.391 CET
Job ended fine and next job started normally:
    03/10/16 22:34:18 CET (pid:9599) Process exited, pid=9603, status=0
    03/10/16 22:34:21 CET (pid:12131) Running job as user (null)
    03/10/16 22:34:21 CET (pid:12131) Create_Process succeeded, pid=12135
From Dashboard:
    Job 9215 finished
    ExitCode: 0
    Site: T3_CH_Volunteer
    Retries: 1
    Submit: 26-02 15:08:45
    Start: 10-03 18:05:53
    Finish: 10-03 21:35:10 UTC
    WallTime: 03:29:17
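
The figures in this last post can be cross-checked with a few lines of arithmetic. The sketch below computes the suspend duration from the BOINC "suspended by user" and "resumed by user" lines and compares it with the Dashboard times. It assumes all four times fall on the same day and that the Dashboard WallTime includes the suspended interval; the second point is an assumption, not something the post states.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"

# BOINC event log (CET): task suspended and resumed by user.
suspended = datetime.strptime("19:13:31", FMT)
resumed = datetime.strptime("20:43:39", FMT)
suspend_time = resumed - suspended

# Dashboard (UTC): job start and finish.
start = datetime.strptime("18:05:53", FMT)
finish = datetime.strptime("21:35:10", FMT)
wall_time = finish - start

# Assumption: WallTime includes the suspended interval.
running_time = wall_time - suspend_time

print("suspended for:", suspend_time)
print("wall time:    ", wall_time)
print("running time: ", running_time)
```

The time zones differ between the two sources (CET and UTC), but only durations are compared here, so the offset cancels out.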

Development for LHC@home