Message boards : News : Suspend/Resume
Author | Message |
---|---|
Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 19 |
I am surprised. That error is related to the NOT_RESPONDING_TIMEOUT parameter which is set at 2h. Can you check the glidein-stderr log to see what it was set as? |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Even if it is set to 2 h, it depends on when those 2 h start counting. The same applies to the other timeouts. |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
Quote: "I am surprised. That error is related to the NOT_RESPONDING_TIMEOUT parameter which is set at 2h. Can you check the glidein-stderr log to see what it was set as?"<br>CCB_HEARTBEAT_INTERVAL 3600<br>NOT_RESPONDING_TIMEOUT 7200<br>MAX_CLAIM_ALIVES_MISSED 6<br>Startd log ends:<br>RootDir = "/"<br>CommittedTime = 0<br>03/01/16 21:50:34 (pid:7953) ERROR: Child pid 8011 appears hung! Killing it hard.<br>03/01/16 21:50:34 (pid:7953) State change: claim lease expired (condor_schedd gone?)<br>03/01/16 21:50:34 (pid:7953) Changing state and activity: Claimed/Busy -> Preempting/Killing |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I suggest setting CCB_HEARTBEAT_INTERVAL to 0. This is a testing project, so why not give it a try? |
Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 19 |
I was hoping that the Condor developers would get back to us but everyone is busy at the workshop. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
That's a "No" then? |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I found a setting (Windows 8.1 x64) that finally let me get past the 20 min barrier I had been hitting. (It may work on other Windows versions too.)<br>At an administrator command prompt:<br>net config server /autodisconnect:-1<br>This disables the automatic disconnection of idle network sessions. To switch it back:<br>net config server /autodisconnect:15<br>This helps, but does not solve all the issues.<br>If lines like<br>CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#135485<br>and<br>Communicating with shadow <130.246.180.120:9818?noUDP&sock=2939_634c_3448><br>show up in the StarterLog after the suspend, you can be sure that full communication has been re-established with the server. |
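The check described in the post above can be done mechanically. This is a minimal sketch, assuming the caller has already cut the StarterLog down to the lines written after the resume; the function name `missing_markers` is made up for illustration, and the two sample lines are the ones quoted in the post.

```python
# Substrings quoted in the post above, treated here as signs of a reconnect.
MARKERS = (
    "CCBListener: registered with CCB server",
    "Communicating with shadow",
)

def missing_markers(lines_after_resume):
    """Return the markers that do NOT appear in the given log lines."""
    text = "\n".join(lines_after_resume)
    return [marker for marker in MARKERS if marker not in text]

# Sample lines copied from the post (no timestamps were quoted there).
sample = [
    "CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 "
    "as ccbid 130.246.180.120:9620#135485",
    "Communicating with shadow <130.246.180.120:9818?noUDP&sock=2939_634c_3448>",
]
print(missing_markers(sample))  # an empty list means both lines were found
```

An empty result only says that both lines were logged; as noted in the post, that does not by itself solve every issue.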
Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 19 |
It's taking a little longer than normal to make the change. |
Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 19 |
Condor should be able to handle this and reconnect. We just need to understand the secret. We will get there, it is just taking longer than first estimated. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I found this: https://lists.cs.wisc.edu/archive/htcondor-users/2013-December/msg00112.shtml<br>This might be related to our issue. The problem is that the connection to the shadow is not re-established when the suspend time is too long.<br>EDIT: It might be worth trying this (from the link): "2) Set TCP keepalive on the shadow socket to be on the order of JobLeaseDuration / 2." |
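For readers unfamiliar with TCP keepalive, here is a generic socket-level sketch of what "keepalive on the order of JobLeaseDuration / 2" means. It uses Python's standard `socket` module, not HTCondor's own configuration mechanism, and the helper names and the 7200 s lease value are assumptions chosen for illustration.

```python
import socket

def keepalive_idle_seconds(job_lease_duration):
    """Idle time before the first keepalive probe: about half the job lease."""
    return max(1, job_lease_duration // 2)

def enable_keepalive(sock, idle_seconds):
    """Turn on TCP keepalive and set the idle time where the platform allows it."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE exists on Linux; other platforms expose different options,
    # so this sketch simply skips the idle-time tuning there.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)

if __name__ == "__main__":
    idle = keepalive_idle_seconds(7200)  # illustrative lease of 7200 s
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        enable_keepalive(s, idle)
        print("keepalive idle time:", idle, "s")
```

Whether keepalive probes help with a suspended VM is a separate question: while the guest is suspended it cannot answer probes either, so this only illustrates the setting named in the linked post.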
Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 19 |
The Condor configuration has been updated and pushed to CVMFS. It should be available with new glideins (runs) this evening. We are targeting suspends of up to 2 hours. Here are the current settings:<br>NOT_RESPONDING_TIMEOUT = 10800<br>CCB_HEARTBEAT_INTERVAL = 0<br>ALIVE_INTERVAL = 1800<br>MAX_CLAIM_ALIVES_MISSED = 6<br>JobLeaseDuration = 7200<br>The settings are shown in the glidein-stderr file from the graphics/logs. |
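To see how the settings above relate to the 2-hour target, the following sketch converts them into candidate time limits and picks the smallest. It assumes that the effective limit is the minimum of NOT_RESPONDING_TIMEOUT, ALIVE_INTERVAL × MAX_CLAIM_ALIVES_MISSED, and JobLeaseDuration; the thread does not confirm that this is the whole story, and the function names are made up for illustration.

```python
def parse_settings(text):
    """Parse 'NAME = value' or 'NAME value' lines into a dict of ints."""
    settings = {}
    for line in text.splitlines():
        parts = line.replace("=", " ").split()
        if len(parts) == 2:
            settings[parts[0].upper()] = int(parts[1])
    return settings

def suspend_budget(settings):
    """Return (name, seconds) of the smallest candidate limit (an assumption)."""
    limits = {
        "NOT_RESPONDING_TIMEOUT": settings["NOT_RESPONDING_TIMEOUT"],
        "ALIVE_INTERVAL * MAX_CLAIM_ALIVES_MISSED":
            settings["ALIVE_INTERVAL"] * settings["MAX_CLAIM_ALIVES_MISSED"],
        "JOBLEASEDURATION": settings["JOBLEASEDURATION"],
    }
    name = min(limits, key=limits.get)
    return name, limits[name]

# Values copied from the post above.
CONFIG = """\
NOT_RESPONDING_TIMEOUT = 10800
CCB_HEARTBEAT_INTERVAL 0
ALIVE_INTERVAL = 1800
MAX_CLAIM_ALIVES_MISSED = 6
JobLeaseDuration = 7200
"""

print(suspend_budget(parse_settings(CONFIG)))  # ('JOBLEASEDURATION', 7200)
```

Under that assumption the two 3-hour limits (10800 s each) leave JobLeaseDuration, at 7200 s, as the value that matches the stated 2-hour target.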
Joined: 13 Feb 15 Posts: 1217 Credit: 908,191 RAC: 1,488 |
This one went far over the 2 hours: I stopped BOINC and powered off the host overnight. After resuming this morning, the old job (4921) was killed:<br>03/05/16 08:33:06 (pid:7562) State change: claim lease expired (condor_schedd gone?)<br>03/05/16 08:33:06 (pid:7562) Changing state and activity: Claimed/Busy -> Preempting/Killing<br>03/05/16 08:33:06 (pid:7562) ERROR: Child pid 12453 appears hung! Killing it hard.<br>03/05/16 08:33:06 (pid:7562) Starter pid 12453 died on signal 9 (signal 9 (Killed))<br>and<br>03/04/16 22:42:50 (pid:12453) Create_Process succeeded, pid=12457<br>03/05/16 08:33:06 (pid:12453) Got SIGQUIT. Performing fast shutdown.<br>03/05/16 08:33:06 (pid:12453) ShutdownFast all jobs.<br>The VM started a new glidein run and got a new job (5189). |
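The length of the silent period in a log excerpt like the one above can be read off the timestamps. This is a minimal sketch, assuming every relevant line starts with the MM/DD/YY HH:MM:SS prefix seen in this thread and that the lines are in chronological order; the helper names are made up for illustration.

```python
from datetime import datetime

def parse_stamp(line):
    """Return the leading MM/DD/YY HH:MM:SS timestamp of a log line, or None."""
    try:
        return datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S")
    except ValueError:
        return None

def largest_gap(lines):
    """Return the longest interval between consecutive timestamped lines."""
    stamps = [s for s in map(parse_stamp, lines) if s is not None]
    return max((b - a for a, b in zip(stamps, stamps[1:])), default=None)

# Starter lines (pid 12453) copied from the post above.
starter_log = [
    "03/04/16 22:42:50 (pid:12453) Create_Process succeeded, pid=12457",
    "03/05/16 08:33:06 (pid:12453) Got SIGQUIT. Performing fast shutdown.",
    "03/05/16 08:33:06 (pid:12453) ShutdownFast all jobs.",
]
print(largest_gap(starter_log))  # 9:50:16
```

The gap is measured between the quoted lines only, so it is an upper bound on the suspend itself if the full log had further entries in between.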
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I tested a 30 min suspend at three points:<br>- at about event 1<br>- at about event 250<br>- about halfway through the upload<br>In all cases the results were valid and a new job followed. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
After some more investigation, I think the "shadow_worklife" variable is what prevents longer suspend times. The shadow has to establish contact with the host; the reverse is not possible. Failures occur because the shadow does not resume contact after a prolonged suspend. I therefore suggest, as a test, changing it from 1 h (the default) to 2 h and running some tests. If it does not have the desired effect (or has an adverse one), it can be set back to the default.<br>What is the longest suspend time anyone has seen, with the job reported as a success and a new job starting without any significant delay? (My best is about 35 min.)<br>Or have efforts to extend suspend times been put on hold? |
Joined: 13 Feb 15 Posts: 1217 Credit: 908,191 RAC: 1,488 |
Quote: "What is the longest suspend time anyone has seen, with the job reported as a success and a new job starting without any significant delay?"<br>My message from the 1st of March: 1 h 25 min. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks, Crystal Pellet. Was that a one-off, or could you repeat it? It may depend on when during the job it was suspended. |
Joined: 13 Feb 15 Posts: 1217 Credit: 908,191 RAC: 1,488 |
That was my only test between 1 and 2 hours of suspend time. I'm repeating the test with a 90-minute suspend; I suspended the task/job while it was processing the 8th record. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks, Crystal. I would like to find out whether there is a COMMON, reliably safe suspend time. I am trying 45 min, as I have not had any success past 35 min. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I tried a 45 min suspend at event 11. After resuming, the job finished and uploaded with status 0, but the result was not reported to the shadow:<br>03/10/16 21:21:16 (pid:11637) Buf::write(): condor_write() failed<br>03/10/16 21:21:16 (pid:11637) Failed to send job exit status to shadow<br>03/10/16 21:21:16 (pid:11637) JobExit() failed, waiting for job lease to expire or for a reconnect attempt<br>03/10/16 21:21:16 (pid:11637) Returning from CStarter::JobReaper()<br>03/10/16 21:45:37 (pid:11637) Got SIGQUIT. Performing fast shutdown.<br>03/10/16 21:45:37 (pid:11637) ShutdownFast all jobs.<br>03/10/16 21:45:37 (pid:11637) condor_write(): Socket closed when trying to write 191 bytes to <130.246.180.120:9818>, fd is 11<br>03/10/16 21:45:37 (pid:11637) Buf::write(): condor_write() failed<br>03/10/16 21:45:37 (pid:11637) Failed to send job exit status to shadow<br>03/10/16 21:45:37 (pid:11637) JobExit() failed, waiting for job lease to expire or for a reconnect attempt<br>The job was continued by another host. A new job started 1 h 20 min after the old job was resumed. |
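The failure patterns reported in this thread can be tallied with a simple substring scan. This is a sketch only: the substrings are taken from log excerpts posted here, while the short descriptions attached to them are an informal reading by way of illustration, not HTCondor's own terminology.

```python
# Log substrings seen in this thread, mapped to an informal description.
SIGNATURES = {
    "claim lease expired": "claim lease expired on the startd",
    "appears hung": "an unresponsive child process was killed",
    "Failed to send job exit status to shadow": "starter could not reach the shadow",
}

def classify(lines):
    """Count how many lines match each known failure substring."""
    counts = {}
    for line in lines:
        for needle, meaning in SIGNATURES.items():
            if needle in line:
                counts[meaning] = counts.get(meaning, 0) + 1
    return counts

# Two lines copied from the post above.
sample = [
    "03/10/16 21:21:16 (pid:11637) Failed to send job exit status to shadow",
    "03/10/16 21:45:37 (pid:11637) Failed to send job exit status to shadow",
]
print(classify(sample))  # {'starter could not reach the shadow': 2}
```

Comparing such tallies across test runs may help separate "job killed on resume" cases from "job finished but result not reported" cases like the one above.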
Joined: 13 Feb 15 Posts: 1217 Credit: 908,191 RAC: 1,488 |
All fine here:<br>vLHCathome-dev 10 Mar 19:13:31 CET task wu_1457619778_27_0 suspended by user<br>vLHCathome-dev 10 Mar 20:43:39 CET task wu_1457619778_27_0 resumed by user<br>Begin processing the 8th record. Run 1, Event 2303508, LumiSection 27643 at 10-Mar-2016 19:11:52.744 CET<br>Begin processing the 9th record. Run 1, Event 2303509, LumiSection 27643 at 10-Mar-2016 20:44:06.391 CET<br>The job ended fine and the next job started normally:<br>03/10/16 22:34:18 CET (pid:9599) Process exited, pid=9603, status=0<br>03/10/16 22:34:21 CET (pid:12131) Running job as user (null)<br>03/10/16 22:34:21 CET (pid:12131) Create_Process succeeded, pid=12135<br>From Dashboard:<br>Job 9215 finished, ExitCode 0, Site T3_CH_Volunteer, Retries 1, Submit 26-02 15:08:45, Start 10-03 18:05:53, Finish 10-03 21:35:10 UTC, WallTime 03:29:17 |
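The actual suspend duration in a test like this can be computed from the BOINC event lines. This is a minimal sketch, assuming the "suspended by user" / "resumed by user" line format shown in the post above, English month abbreviations, and a fixed CET time zone; the event lines carry no year, so it is passed in as a parameter.

```python
import re
from datetime import datetime

EVENT = re.compile(
    r"(\d{1,2} \w{3} \d{2}:\d{2}:\d{2}) CET task (\S+) (suspended|resumed) by user"
)

def suspend_durations(log_text, year=2016):
    """Pair each 'suspended' event with the next 'resumed' event of the same task."""
    pending, durations = {}, []
    for stamp, task, action in EVENT.findall(log_text):
        when = datetime.strptime(f"{year} {stamp}", "%Y %d %b %H:%M:%S")
        if action == "suspended":
            pending[task] = when
        elif task in pending:
            durations.append((task, when - pending.pop(task)))
    return durations

# Event lines copied from the post above.
boinc_log = (
    "vLHCathome-dev 10 Mar 19:13:31 CET task wu_1457619778_27_0 suspended by user\n"
    "vLHCathome-dev 10 Mar 20:43:39 CET task wu_1457619778_27_0 resumed by user\n"
)
for task, duration in suspend_durations(boinc_log):
    print(task, duration)  # wu_1457619778_27_0 1:30:08
```

Recording the measured duration alongside the job outcome makes results from different volunteers easier to compare than rounded figures such as "about 90 minutes".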