Message boards : News : Suspend/Resume

Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 327,073
RAC: 133
Message 2154 - Posted: 1 Mar 2016, 23:07:08 UTC - in response to Message 2152.  

I am surprised. That error is related to the NOT_RESPONDING_TIMEOUT parameter which is set at 2h. Can you check the glidein-stderr log to see what it was set as?
ID: 2154 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2155 - Posted: 1 Mar 2016, 23:21:54 UTC - in response to Message 2154.  

Even if it is set to 2 h, it depends on when those 2 h start counting. The same goes for the other timeouts.
ID: 2155 · Rating: 0
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 486
Message 2156 - Posted: 1 Mar 2016, 23:32:09 UTC - in response to Message 2154.  
Last modified: 1 Mar 2016, 23:53:58 UTC

I am surprised. That error is related to the NOT_RESPONDING_TIMEOUT parameter which is set at 2h. Can you check the glidein-stderr log to see what it was set as?

CCB_HEARTBEAT_INTERVAL 3600
NOT_RESPONDING_TIMEOUT 7200
MAX_CLAIM_ALIVES_MISSED 6

The Startd log ends with:

RootDir = "/"
CommittedTime = 0
03/01/16 21:50:34 (pid:7953) ERROR: Child pid 8011 appears hung! Killing it hard.
03/01/16 21:50:34 (pid:7953) State change: claim lease expired (condor_schedd gone?)
03/01/16 21:50:34 (pid:7953) Changing state and activity: Claimed/Busy -> Preempting/Killing
ID: 2156 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2158 - Posted: 2 Mar 2016, 16:36:48 UTC

I suggest setting CCB_HEARTBEAT_INTERVAL to 0.

Comments?


This is a testing project, so why not give it a try?
ID: 2158 · Rating: 0
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 327,073
RAC: 133
Message 2159 - Posted: 2 Mar 2016, 16:41:23 UTC - in response to Message 2158.  

I was hoping that the Condor developers would get back to us but everyone is busy at the workshop.
ID: 2159 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2160 - Posted: 2 Mar 2016, 16:52:35 UTC - in response to Message 2159.  

That's a "No" then?
ID: 2160 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2162 - Posted: 2 Mar 2016, 18:12:32 UTC

I found a setting (Windows 8.1 x64) that finally allowed me to get past the 20-minute barrier I had been hitting. (It may work on other Windows versions as well.)
At an administrator command prompt:

net config server /autodisconnect:-1

This disables the automatic disconnection of idle network sessions.

To switch it back:
net config server /autodisconnect:15

This helps, but does not solve all the issues.

If the lines:
CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#135485
and
Communicating with shadow <130.246.180.120:9818?noUDP&sock=2939_634c_3448>

or something similar show up in the StarterLog after the suspend, you can be sure that full communication has been re-established with the server.
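
The check described above can be scripted. This is only a rough sketch: the two marker strings are taken from the log lines quoted in this post, while the function name `reconnected` and the idea of passing in only the lines written after the resume are my own assumptions, not anything provided by Condor.

```python
# Substrings copied from the two StarterLog lines quoted above.
RECONNECT_MARKERS = (
    "CCBListener: registered with CCB server",
    "Communicating with shadow",
)

def reconnected(log_lines):
    """Return True if every marker appears in at least one of the lines.

    The caller is expected to pass only the lines logged after the resume.
    """
    return all(
        any(marker in line for line in log_lines)
        for marker in RECONNECT_MARKERS
    )

sample = [
    "CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 "
    "as ccbid 130.246.180.120:9620#135485",
    "Communicating with shadow <130.246.180.120:9818?noUDP&sock=2939_634c_3448>",
]
print(reconnected(sample))      # both markers present
print(reconnected(sample[:1]))  # shadow line missing
```

Matching on substrings rather than whole lines keeps the check independent of the server address and ccbid, which differ from run to run.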
ID: 2162 · Rating: 0
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 327,073
RAC: 133
Message 2166 - Posted: 2 Mar 2016, 19:07:54 UTC - in response to Message 2160.  

It's taking a little longer than normal to make the change.
ID: 2166 · Rating: 0
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 327,073
RAC: 133
Message 2167 - Posted: 2 Mar 2016, 19:09:19 UTC - in response to Message 2162.  

Condor should be able to handle this and reconnect. We just need to understand the secret. We will get there; it is just taking longer than first estimated.
ID: 2167 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2176 - Posted: 2 Mar 2016, 22:21:33 UTC
Last modified: 2 Mar 2016, 22:48:38 UTC

I found this:
https://lists.cs.wisc.edu/archive/htcondor-users/2013-December/msg00112.shtml


This might be related to our issue.

The problem is that the connection to the shadow is not re-established when the suspend time is too long.

EDIT: It might be worth trying this (as in the link)
2) Set TCP keepalive on the shadow socket to be on the order of JobLeaseDuration / 2.
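
For anyone unfamiliar with the idea, here is a minimal sketch of what "TCP keepalive at about half the lease" means at the socket level. It is not how Condor does it internally; the helper name `enable_keepalive` and the 7200 s lease are illustrative assumptions, and `TCP_KEEPIDLE` is not available on every platform, so the code checks for it first.

```python
import socket

def enable_keepalive(sock, lease_seconds):
    """Turn on TCP keepalive with an idle time of about lease_seconds / 2.

    Returns the idle time (in seconds) that was requested.
    """
    idle = max(1, lease_seconds // 2)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE (idle time before the first probe) is platform-specific,
    # so only set it where the constant exists.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    return idle

# Hypothetical lease of 7200 s, used purely as an example value.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
idle_seconds = enable_keepalive(sock, 7200)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
print(idle_seconds, keepalive_on)
```

With keepalive on, a dead connection is noticed by the TCP stack itself instead of only when the next application message is written.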
ID: 2176 · Rating: 0
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 327,073
RAC: 133
Message 2225 - Posted: 4 Mar 2016, 15:12:52 UTC - in response to Message 2176.  

The Condor configuration has been updated and pushed to CVMFS. It should be available with new glideins (runs) this evening. We are targeting suspends of up to 2 hours. Here are the current settings:

NOT_RESPONDING_TIMEOUT = 10800
CCB_HEARTBEAT_INTERVAL = 0
ALIVE_INTERVAL = 1800
MAX_CLAIM_ALIVES_MISSED = 6
JobLeaseDuration = 7200

The settings are shown in the glidein-stderr file from the graphics/logs.
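
To see how these numbers relate to the 2-hour target, the settings above can be converted to hours. This is a simplified sketch: treating ALIVE_INTERVAL times MAX_CLAIM_ALIVES_MISSED as the time before a claim is considered lost, and taking the smallest of the three windows as the limiting one, are assumptions made for illustration and not a full model of Condor's behaviour.

```python
# Values copied from the settings listed above (all in seconds).
settings = {
    "NOT_RESPONDING_TIMEOUT": 10800,
    "CCB_HEARTBEAT_INTERVAL": 0,
    "ALIVE_INTERVAL": 1800,
    "MAX_CLAIM_ALIVES_MISSED": 6,
    "JobLeaseDuration": 7200,
}

# Assumption: the claim is given up after this many seconds without keepalives.
claim_window = settings["ALIVE_INTERVAL"] * settings["MAX_CLAIM_ALIVES_MISSED"]

windows = {
    "NOT_RESPONDING_TIMEOUT": settings["NOT_RESPONDING_TIMEOUT"],
    "claim keepalive window": claim_window,
    "JobLeaseDuration": settings["JobLeaseDuration"],
}

for name, seconds in windows.items():
    print(f"{name}: {seconds} s = {seconds / 3600:.1f} h")

# Assumption: the shortest window is the one that limits the suspend time.
shortest = min(windows, key=windows.get)
print("shortest window:", shortest)
```

Under these assumptions the two 3-hour windows leave some margin, and the 2-hour JobLeaseDuration is the value that corresponds to the stated target.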
ID: 2225 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 431
Message 2244 - Posted: 5 Mar 2016, 8:19:54 UTC

I went far beyond the 2 hours: I stopped BOINC and powered off the host overnight.
After resuming this morning, the old job (4921) was killed:

03/05/16 08:33:06 (pid:7562) State change: claim lease expired (condor_schedd gone?)
03/05/16 08:33:06 (pid:7562) Changing state and activity: Claimed/Busy -> Preempting/Killing
03/05/16 08:33:06 (pid:7562) ERROR: Child pid 12453 appears hung! Killing it hard.
03/05/16 08:33:06 (pid:7562) Starter pid 12453 died on signal 9 (signal 9 (Killed))


and

03/04/16 22:42:50 (pid:12453) Create_Process succeeded, pid=12457
03/05/16 08:33:06 (pid:12453) Got SIGQUIT. Performing fast shutdown.
03/05/16 08:33:06 (pid:12453) ShutdownFast all jobs.


The VM started a new glidein run and got a new job (5189).
ID: 2244 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2256 - Posted: 5 Mar 2016, 23:50:29 UTC

I tested a 30 min suspend time:
at about event 1
at about event 250
at about halfway through the upload

In all cases the results were valid and a new job continued.
ID: 2256 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2302 - Posted: 10 Mar 2016, 15:35:36 UTC
Last modified: 10 Mar 2016, 15:48:26 UTC

After some more investigation, I think the "shadow_worklife" variable is responsible for preventing longer suspend times. The shadow has to establish contact with the host; the reverse is not possible.
Failures occur because the shadow does not resume contact after a prolonged suspend. I therefore suggest, as a test, changing it from 1 h (the default) to 2 h and running some tests. If it does not have the desired effect, or has an adverse one, it can be set back to the default.

What is the longest suspend time anyone has seen, with the job reported as a success and a new job starting without any significant delay?
(My best is about 35min)

Or have any efforts to extend suspend times been put on hold?
ID: 2302 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 431
Message 2303 - Posted: 10 Mar 2016, 17:24:39 UTC - in response to Message 2302.  

What is the longest suspend time anyone has seen, with the job reported as a success and a new job starting without any significant delay?
(My best is about 35min)

My message from the 1st of March → 1hr and 25 minutes
ID: 2303 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2304 - Posted: 10 Mar 2016, 17:46:29 UTC - in response to Message 2303.  

Thanks, Crystal Pellet.
Was that a "one-off" or could you repeat that?

It may depend on when during a job it was suspended.
ID: 2304 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 431
Message 2305 - Posted: 10 Mar 2016, 18:27:40 UTC

That was my only test between 1 and 2 hours of suspend time.
I'll repeat the test with a 90-minute suspend time; I've suspended the task/job during processing of the 8th record.
ID: 2305 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2306 - Posted: 10 Mar 2016, 18:59:03 UTC - in response to Message 2305.  

Thanks, Crystal.
I would like to find out whether there is a COMMON, reliable, safe suspend time.
I am trying 45 min, as I have not had any success past 35 min.
ID: 2306 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2307 - Posted: 10 Mar 2016, 21:11:24 UTC

I tried a 45 min suspend at event 11.

The job finished after resuming and uploaded with status 0, but the result was not reported to the shadow.

03/10/16 21:21:16 (pid:11637) Buf::write(): condor_write() failed
03/10/16 21:21:16 (pid:11637) Failed to send job exit status to shadow
03/10/16 21:21:16 (pid:11637) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
03/10/16 21:21:16 (pid:11637) Returning from CStarter::JobReaper()
03/10/16 21:45:37 (pid:11637) Got SIGQUIT. Performing fast shutdown.
03/10/16 21:45:37 (pid:11637) ShutdownFast all jobs.
03/10/16 21:45:37 (pid:11637) condor_write(): Socket closed when trying to write 191 bytes to <130.246.180.120:9818>, fd is 11
03/10/16 21:45:37 (pid:11637) Buf::write(): condor_write() failed
03/10/16 21:45:37 (pid:11637) Failed to send job exit status to shadow
03/10/16 21:45:37 (pid:11637) JobExit() failed, waiting for job lease to expire or for a reconnect attempt

The job was continued by another host.
A new job started 1 h 20 min after the old job was resumed.
ID: 2307 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 431
Message 2308 - Posted: 10 Mar 2016, 21:56:59 UTC

All fine here:

vLHCathome-dev 10 Mar 19:13:31 CET task wu_1457619778_27_0 suspended by user
vLHCathome-dev 10 Mar 20:43:39 CET task wu_1457619778_27_0 resumed by user

Begin processing the 8th record. Run 1, Event 2303508, LumiSection 27643 at 10-Mar-2016 19:11:52.744 CET
Begin processing the 9th record. Run 1, Event 2303509, LumiSection 27643 at 10-Mar-2016 20:44:06.391 CET

Job ended fine and next job started normally:

03/10/16 22:34:18 CET (pid:9599) Process exited, pid=9603, status=0

03/10/16 22:34:21 CET (pid:12131) Running job as user (null)
03/10/16 22:34:21 CET (pid:12131) Create_Process succeeded, pid=12135

From Dashboard:
Job 9215 finished ExitCode 0 Site T3_CH_Volunteer Retries 1 Submit 26-02 15:08:45 Start 10-03 18:05:53 Finish 10-03 21:35:10 UTC WallTime 03:29:17
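
As a quick cross-check of the times quoted above, the suspend length and the Dashboard wall time can be recomputed from the timestamps. This is just arithmetic on the values in this post; the helper name `elapsed_seconds` is made up for the example, and it assumes both timestamps of a pair fall on the same day.

```python
from datetime import datetime

def elapsed_seconds(start, end, fmt="%H:%M:%S"):
    """Seconds between two clock times on the same day."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

# BOINC log: suspended 19:13:31 CET, resumed 20:43:39 CET.
suspend = elapsed_seconds("19:13:31", "20:43:39")
# Dashboard: start 18:05:53 UTC, finish 21:35:10 UTC.
wall = elapsed_seconds("18:05:53", "21:35:10")

minutes, seconds = divmod(suspend, 60)
print(f"suspend: {minutes} min {seconds} s")
print(f"wall time: {wall} s")
```

The suspend comes out at just over 90 minutes, and the wall time of 12557 s matches the 03:29:17 shown by the Dashboard, so the wall time includes the suspended period.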
ID: 2308 · Rating: 0


©2024 CERN