Message boards : News : Suspend/Resume
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 2020 - Posted: 16 Feb 2016, 19:07:08 UTC
Last modified: 16 Feb 2016, 19:10:10 UTC

The suspend/resume issue should now be resolved, so it should be possible to pause/save a VM for up to 48 hours without losing the current job. This will only work with new tasks, so please start a new one if you would like to test this. As usual, please post in this thread if you find any problems.


EDIT: Just as a reminder, the job used to be evicted after the VM had been suspended for about 20 minutes. If you run a test and it is fine, please also post and say how long the VM was suspended.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 2040 - Posted: 17 Feb 2016, 13:55:18 UTC - in response to Message 2020.  

I suspended for 1 hour and it seemed to be fine.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 2042 - Posted: 17 Feb 2016, 14:49:58 UTC

After the first suspend (VM state paused for 15 minutes), the job resumed fine.

CMS-dev 17 Feb 14:31:35 task wu_1455226178_527_0 suspended by user
CMS-dev 17 Feb 14:46:34 task wu_1455226178_527_0 resumed by user

Second test: pausing for 22 minutes, past the former threshold of 20 minutes.

CMS-dev 17 Feb 15:07:45 task wu_1455226178_527_0 suspended by user
CMS-dev 17 Feb 15:29:47 task wu_1455226178_527_0 resumed by user

After the resume, the following lines were added before the job got SIGQUIT:
Begin processing the 120th record. Run 1, Event 166120, LumiSection 1994 at 17-Feb-2016 15:30:07.456 CET
Begin processing the 121st record. Run 1, Event 166121, LumiSection 1994 at 17-Feb-2016 15:30:21.684 CET
Begin processing the 122nd record. Run 1, Event 166122, LumiSection 1994 at 17-Feb-2016 15:30:22.933 CET
Begin processing the 123rd record. Run 1, Event 166123, LumiSection 1994 at 17-Feb-2016 15:31:29.103 CET
Begin processing the 124th record. Run 1, Event 166124, LumiSection 1994 at 17-Feb-2016 15:31:47.642 CET
Begin processing the 125th record. Run 1, Event 166125, LumiSection 1994 at 17-Feb-2016 15:31:49.570 CET


Job was killed:

02/17/16 13:51:25 (pid:19910) Using wrapper /home/boinc/CMSRun/glide_li1yhf/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_li1yhf/execute/dir_19910/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=665 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_665.txt --runAndLumis=job_lumis_665.json --lheInputFiles=False --firstEvent=166001 --firstLumi=1993 --lastEvent=166251 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {}
02/17/16 13:51:25 (pid:19910) Running job as user (null)
02/17/16 13:51:25 (pid:19910) Create_Process succeeded, pid=19916
02/17/16 15:31:57 (pid:19910) Got SIGQUIT. Performing fast shutdown.
02/17/16 15:31:57 (pid:19910) ShutdownFast all jobs.
02/17/16 15:31:57 (pid:19910) Process exited, pid=19916, signal=9
02/17/16 15:31:57 (pid:19910) Last process exited, now Starter is exiting
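
The two pause lengths can be checked directly from the BOINC event-log lines quoted in this post. What follows is only a minimal Python sketch, not part of any project tooling: the function name `pause_minutes`, the regular expression, and the assumption that every line has the shape `CMS-dev 17 Feb HH:MM:SS task <name> suspended|resumed by user` (with the year supplied separately) are all mine.

```python
import re
from datetime import datetime

# Assumed line shape, copied from the log excerpts in this post:
#   CMS-dev 17 Feb 14:31:35 task wu_1455226178_527_0 suspended by user
LINE_RE = re.compile(
    r"^\S+ (?P<day>\d{1,2}) (?P<mon>[A-Za-z]{3}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"task (?P<task>\S+) (?P<event>suspended|resumed) by user$"
)

FORMER_THRESHOLD_MIN = 20  # the "former threshold of 20 minutes" from the post


def pause_minutes(suspend_line: str, resume_line: str, year: int = 2016) -> float:
    """Return the length of one suspend/resume pause in minutes."""
    stamps = []
    for line, expected in ((suspend_line, "suspended"), (resume_line, "resumed")):
        m = LINE_RE.match(line.strip())
        if m is None or m["event"] != expected:
            raise ValueError(f"unexpected log line: {line!r}")
        stamps.append(
            datetime.strptime(
                f"{m['day']} {m['mon']} {year} {m['time']}", "%d %b %Y %H:%M:%S"
            )
        )
    return (stamps[1] - stamps[0]).total_seconds() / 60.0


first = pause_minutes(
    "CMS-dev 17 Feb 14:31:35 task wu_1455226178_527_0 suspended by user",
    "CMS-dev 17 Feb 14:46:34 task wu_1455226178_527_0 resumed by user",
)
second = pause_minutes(
    "CMS-dev 17 Feb 15:07:45 task wu_1455226178_527_0 suspended by user",
    "CMS-dev 17 Feb 15:29:47 task wu_1455226178_527_0 resumed by user",
)
for label, minutes in (("first", first), ("second", second)):
    over = minutes > FORMER_THRESHOLD_MIN
    print(f"{label} pause: {minutes:.1f} min, over former threshold: {over}")
```

Only the second pause is longer than the former 20-minute threshold, which matches the job being killed shortly after the second resume.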
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 2043 - Posted: 17 Feb 2016, 17:13:09 UTC

Repeated the test with a 30-minute pause. The screenshot below shows the output of console Alt+F5 after the resume:

[Screenshot not preserved.]

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2044 - Posted: 17 Feb 2016, 18:35:26 UTC
Last modified: 17 Feb 2016, 19:06:46 UTC

Suspend/Resume is not working reliably.

Out of 4 suspended tasks, only one resumed correctly.
The interruption was less than 15 minutes.

EDIT: The "successfully stopped VM" message is missing from the stderr.

I am sometimes suspending 2 VMs at the same time.
I guess that takes longer than the current timeout value allows.
You might want to increase it.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 2045 - Posted: 17 Feb 2016, 20:56:44 UTC - in response to Message 2042.  

--jobNumber=665 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --

OK, that's one of "mine". There's no Condor log, so it's been "lost", tho' (I think) Condor thinks it's still running (and will do so for another six days if my understanding of JobLeaseDuration is right):
Type = "NodeStatus";
Node = "Job665";
NodeStatus = 3; /* "STATUS_SUBMITTED" */
StatusDetails = "not_idle";
RetryCount = 0;
JobProcsQueued = 1;
JobProcsHeld = 0;

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2046 - Posted: 17 Feb 2016, 21:28:16 UTC

Job lease duration 6 days


I understand that if a job is lost due to exceeding the 20-minute timeout, a new job will be assigned and there is no way to resume the old one, as the server thinks it is still being processed.
If that is correct, the whole 6-day job lease duration is totally pointless.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 2047 - Posted: 17 Feb 2016, 22:05:58 UTC - in response to Message 2046.  

Job lease duration 6 days

I understand that if a job is lost due to exceeding the 20-minute timeout, a new job will be assigned and there is no way to resume the old one, as the server thinks it is still being processed.
If that is correct, the whole 6-day job lease duration is totally pointless.

Laurence knows more about this than I do, but yes, we had the start of an interesting conversation with an HTCondor expert about timeouts at today's meeting. I think an interesting remark that didn't get the attention it deserved was, "If it's part of a separate pool," which of course our current set-up is. I'm not sure of the vision for a future, more integrated project.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 2049 - Posted: 17 Feb 2016, 22:14:09 UTC - in response to Message 2045.  

--jobNumber=665 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --

OK, that's one of "mine".

Yeah, one job out of your batch, running on my machine and still assigned to it, was killed after a 22-minute pause when a new job was sent to my machine.
No wonder that the number of 'running jobs' is going sky high.
Wallclock consumption of failed jobs is higher than that of successful jobs.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2050 - Posted: 17 Feb 2016, 22:18:06 UTC - in response to Message 2047.  
Last modified: 17 Feb 2016, 22:24:30 UTC

The timeouts are problematic, as they are interlinked. They should not be changed by much, otherwise the whole system might trip.
I would not change them by more than a factor of 2 or 3 (or 1/2, 1/3) without knowing the details, and any change should be followed by careful monitoring.

edit: I have four or five "lost" jobs, which will be inaccessible for 6 days.
That batch would be finished by then, oh no, it won't.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 2051 - Posted: 17 Feb 2016, 22:23:05 UTC - in response to Message 2046.  

It looks like there is something else killing the job. I will investigate tomorrow. For those who are interested, we have been playing with these attributes:

CCB_HEARTBEAT_INTERVAL
ALIVE_INTERVAL
JobLeaseDuration

http://research.cs.wisc.edu/htcondor/manual/v8.5/3_3Configuration.html

It worked in our test instance but not with the CMS glideinWMS. Suspending jobs was discussed at a CMS meeting today and will also hopefully be discussed with the Condor developers at the workshop in Barcelona.

https://indico.cern.ch/event/467075/
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2052 - Posted: 17 Feb 2016, 23:42:05 UTC - in response to Message 2051.  

You might want to consider increasing this:

MAX_CLAIM_ALIVES_MISSED
The condor_schedd sends periodic updates to each condor_startd as a keep alive (see the description of ALIVE_INTERVAL). If the condor_startd does not receive any keep alive messages, it assumes that something has gone wrong with the condor_schedd and that the resource is not being effectively used. Once this happens, the condor_startd considers the claim to have timed out, it releases the claim, and starts advertising itself as available for other jobs. Because these keep alive messages are sent via UDP, they are sometimes dropped by the network. Therefore, the condor_startd has some tolerance for missed keep alive messages, so that in case a few keep alives are lost, the condor_startd will not immediately release the claim. This setting controls how many keep alive messages can be missed before the condor_startd considers the claim no longer valid. The default is 6.
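
Read together with ALIVE_INTERVAL, the description above suggests a rough tolerance window before a claim is released. The Python sketch below only illustrates that arithmetic. It assumes the claim times out after about ALIVE_INTERVAL × MAX_CLAIM_ALIVES_MISSED seconds without a keep alive. The function names and the 300-second interval are illustrative assumptions, since the thread does not say what the project has configured. Only the default of 6 comes from the quoted text.

```python
ASSUMED_ALIVE_INTERVAL_S = 300  # assumption for illustration; not stated in the thread
DEFAULT_MAX_ALIVES_MISSED = 6   # "The default is 6." from the quoted description


def claim_timeout_seconds(alive_interval_s: int,
                          max_alives_missed: int = DEFAULT_MAX_ALIVES_MISSED) -> int:
    """Approximate time without keep alives before the claim is released."""
    return alive_interval_s * max_alives_missed


def pause_survives(pause_s: int, alive_interval_s: int,
                   max_alives_missed: int = DEFAULT_MAX_ALIVES_MISSED) -> bool:
    """True if a pause of pause_s seconds fits inside the tolerance window."""
    return pause_s < claim_timeout_seconds(alive_interval_s, max_alives_missed)


window = claim_timeout_seconds(ASSUMED_ALIVE_INTERVAL_S)
print(f"tolerance window: {window} s ({window / 60:.0f} min)")
print("48-hour pause fits:", pause_survives(48 * 3600, ASSUMED_ALIVE_INTERVAL_S))
```

Under this simple model the window scales linearly with the setting, so doubling MAX_CLAIM_ALIVES_MISSED doubles the time a suspended VM could stay silent before its claim is released.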
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 2053 - Posted: 17 Feb 2016, 23:59:23 UTC - in response to Message 2052.  

I sometimes think that no man alive can understand all the intricacies and interdependencies of HTCondor configuration.
Nor, more worryingly, can any woman!
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2054 - Posted: 18 Feb 2016, 0:21:01 UTC - in response to Message 2053.  

I agree, but this setting has the least potential of causing additional trouble.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 2056 - Posted: 18 Feb 2016, 16:26:58 UTC - in response to Message 2054.  

Now investigating this parameter

NOT_RESPONDING_TIMEOUT
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 2058 - Posted: 18 Feb 2016, 16:33:37 UTC - in response to Message 2020.  

I had to stop my Linux box this afternoon for a software and a memory upgrade. I stopped BOINC at 14:33 and eventually restarted (why do memory DIMMs need so many attempts to seat properly?) at 16:15. The VM booted and console 5 showed the tail of a CMS job that had started but not yet got to the first event. The "graphics" web-site showed just boot.log. Eventually a new CMS job started running and outputting to console 5.
I still haven't had a successful suspend/resume on the machine.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 2059 - Posted: 19 Feb 2016, 8:26:09 UTC - in response to Message 2058.  
Last modified: 19 Feb 2016, 8:26:26 UTC

I suspended my VM overnight and it recovered.

Begin processing the 20th record. Run 1, Event 907270, LumiSection 10888 at 18-Feb-2016 17:05:56.251 CET
Begin processing the 21st record. Run 1, Event 907271, LumiSection 10888 at 18-Feb-2016 17:06:12.386 CET
Begin processing the 22nd record. Run 1, Event 907272, LumiSection 10888 at 18-Feb-2016 17:06:12.395 CET
Begin processing the 23rd record. Run 1, Event 907273, LumiSection 10888 at 19-Feb-2016 09:22:18.384 CET
Begin processing the 24th record. Run 1, Event 907274, LumiSection 10888 at 19-Feb-2016 09:22:26.007 CET
Begin processing the 25th record. Run 1, Event 907275, LumiSection 10888 at 19-Feb-2016 09:22:59.852 CET
Begin processing the 26th record. Run 1, Event 907276, LumiSection 10888 at 19-Feb-2016 09:23:10.238 CET

Please test and let me know how it goes.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 2061 - Posted: 19 Feb 2016, 9:03:46 UTC - in response to Message 2059.  
Last modified: 19 Feb 2016, 9:17:04 UTC

Ignore that last message, it was killed.

EDIT: 02/19/16 09:21:29 (pid:16551) The system clocked jumped 58472 seconds unexpectedly. Restarting all daemons
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 2062 - Posted: 19 Feb 2016, 9:38:58 UTC - in response to Message 2061.  
Last modified: 19 Feb 2016, 9:42:34 UTC

Ignore that last message, it was killed.

You have been teased ;)
After resume it goes on with the next events in the saved job, but suddenly a new cmsRun is started.
I had up to 6 events before the running job was killed.
Maybe you have to revise your second message in this thread, although you wrote 'seemed to be fine'.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 2064 - Posted: 19 Feb 2016, 13:23:55 UTC - in response to Message 2062.  

static const int MAX_TIME_SKIP = (60*20); //20 minutes

:(
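
The constant above works out to 1200 seconds. As a rough illustration, written in Python rather than taken from the HTCondor source, the sketch below compares the figures reported earlier in this thread against that limit. The comparison logic and the helper names `exceeds_time_skip` and `as_hms` are assumptions about how such a limit might be applied, not the actual daemon code.

```python
MAX_TIME_SKIP = 60 * 20  # seconds; value copied from the quoted constant


def exceeds_time_skip(jump_seconds: int, limit: int = MAX_TIME_SKIP) -> bool:
    """True if an observed clock jump is larger than the limit (assumed check)."""
    return abs(jump_seconds) > limit


def as_hms(seconds: int) -> str:
    """Format a number of seconds as hours, minutes and seconds."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h{minutes:02d}m{secs:02d}s"


# Figures from this thread: a 15-minute pause, a 22-minute pause, and the
# 58472-second jump reported in the log after the overnight suspend.
observed = {"15 min pause": 899, "22 min pause": 1322, "overnight jump": 58472}
for label, jump in observed.items():
    print(f"{label}: {as_hms(jump)}, exceeds limit: {exceeds_time_skip(jump)}")
```

Under this assumption the 15-minute pause stays below the limit while the 22-minute pause and the overnight suspend both exceed it, which is consistent with the behaviour reported in the tests above.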
©2024 CERN