Message boards :
News :
Suspend/Resume
Author | Message |
---|---|
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
The suspend/resume issue should now be resolved, so it should be possible to pause/save a VM for up to 48 hours without losing the current job. This will only work with new tasks, so please start a new one if you would like to test this. As usual, please post to this thread if you find any problems. EDIT: Just as a reminder, the job would previously be evicted after the VM was suspended for about 20 minutes. If you do a test and it is fine, please also post and say for how long the VM was suspended. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
I suspended for 1 hour and it seemed to be fine. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 25 |
After the first suspend (VM state paused 15 minutes) the job resumed fine.
CMS-dev 17 Feb 14:31:35 task wu_1455226178_527_0 suspended by user
CMS-dev 17 Feb 14:46:34 task wu_1455226178_527_0 resumed by user
Passing the former threshold of 20 minutes by pausing 22 minutes:
CMS-dev 17 Feb 15:07:45 task wu_1455226178_527_0 suspended by user
CMS-dev 17 Feb 15:29:47 task wu_1455226178_527_0 resumed by user
After the resume the following lines were added before the job got SIGQUIT:
Begin processing the 120th record. Run 1, Event 166120, LumiSection 1994 at 17-Feb-2016 15:30:07.456 CET
Begin processing the 121st record. Run 1, Event 166121, LumiSection 1994 at 17-Feb-2016 15:30:21.684 CET
Begin processing the 122nd record. Run 1, Event 166122, LumiSection 1994 at 17-Feb-2016 15:30:22.933 CET
Begin processing the 123rd record. Run 1, Event 166123, LumiSection 1994 at 17-Feb-2016 15:31:29.103 CET
Begin processing the 124th record. Run 1, Event 166124, LumiSection 1994 at 17-Feb-2016 15:31:47.642 CET
Begin processing the 125th record. Run 1, Event 166125, LumiSection 1994 at 17-Feb-2016 15:31:49.570 CET
Job was killed:
02/17/16 13:51:25 (pid:19910) Using wrapper /home/boinc/CMSRun/glide_li1yhf/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_li1yhf/execute/dir_19910/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=665 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_665.txt --runAndLumis=job_lumis_665.json --lheInputFiles=False --firstEvent=166001 --firstLumi=1993 --lastEvent=166251 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {}
02/17/16 13:51:25 (pid:19910) Running job as user (null)
02/17/16 13:51:25 (pid:19910) Create_Process succeeded, pid=19916
02/17/16 15:31:57 (pid:19910) Got SIGQUIT. Performing fast shutdown.
02/17/16 15:31:57 (pid:19910) ShutdownFast all jobs.
02/17/16 15:31:57 (pid:19910) Process exited, pid=19916, signal=9
02/17/16 15:31:57 (pid:19910) Last process exited, now Starter is exiting |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 25 |
Repeated the test with a 30-minute pause. The screen below shows the output of the console (Alt+F5) after the resume: |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Suspend/Resume is not working reliably. Out of 4 suspended tasks only one resumed correctly. The interruption was less than 15 min. EDIT: the "successfully stopped VM" message is missing in the stderr. I am sometimes suspending 2 VMs at the same time; I guess that takes longer than the current timeout value. You might want to increase that. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 123 |
--jobNumber=665 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 -- OK, that's one of "mine". There's no Condor log, so it's been "lost", tho' (I think) Condor thinks it's still running (and will do so for another six days if my understanding of JobLeaseDuration is right):
Type = "NodeStatus";
Node = "Job665";
NodeStatus = 3; /* "STATUS_SUBMITTED" */
StatusDetails = "not_idle";
RetryCount = 0;
JobProcsQueued = 1;
JobProcsHeld = 0; |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Job lease duration 6 days I understand that if a job is lost due to exceeding the 20-minute timeout, a new job will be assigned and there is no way to resume the old one, as the server thinks it is still being processed. If that is correct, the whole 6-day job lease duration is pointless. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 123 |
Job lease duration 6 days Laurence knows more about this than I, but yes we had the start of an interesting conversation with an HTCondor expert about timeouts at today's meeting. I think an interesting remark that didn't get the attention it deserved was, "If it's part of a separate pool," which of course our current set-up is. Not sure of the vision for a future, more integrated project. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 25 |
--jobNumber=665 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 -- Yeah, one job out of your batch, running on my machine and still assigned to it, was killed after pausing for 22 minutes by sending a new job to my machine. No wonder the number of 'running jobs' is going sky-high. Wallclock consumption of failed jobs is higher than that of successful jobs. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
The timeouts are problematic, as they are interlinked. They should not be changed by much, otherwise the whole system might trip. I would not change them by more than a factor of 2 or 3 (or 1/2, 1/3) without knowing the details, followed by careful monitoring. edit: I have four or five "lost" jobs, which will be inaccessible for 6 days. That batch would be finished by then, oh no, it won't. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
It looks like there is something else killing the job. I will investigate tomorrow. For those who are interested, we have been playing with these attributes:
CCB_HEARTBEAT_INTERVAL
ALIVE_INTERVAL
JobLeaseDuration
http://research.cs.wisc.edu/htcondor/manual/v8.5/3_3Configuration.html
It worked in our test instance but not with the CMS glideinWMS. Suspending jobs was discussed at a CMS meeting today and will hopefully also be discussed with the Condor developers at the workshop in Barcelona. https://indico.cern.ch/event/467075/ |
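For reference, a hypothetical configuration fragment raising these knobs might look like the one below. The values are purely illustrative, not what the project actually deployed; note that JobLeaseDuration is set per job (submit file or glidein config), not in condor_config:

```
# Illustrative values only -- not the project's actual settings.
# Seconds between schedd keep-alive messages for a claim.
ALIVE_INTERVAL = 300
# Seconds between CCB client heartbeats to the broker.
CCB_HEARTBEAT_INTERVAL = 1200
# Per-job attribute: how long (seconds) the job lease survives
# without schedd contact; 518400 s = 6 days, as discussed above.
JobLeaseDuration = 518400
```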
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
You might want to consider increasing this: MAX_CLAIM_ALIVES_MISSED The condor_schedd sends periodic updates to each condor_startd as a keep alive (see the description of ALIVE_INTERVAL on page [*]). If the condor_startd does not receive any keep alive messages, it assumes that something has gone wrong with the condor_schedd and that the resource is not being effectively used. Once this happens, the condor_startd considers the claim to have timed out, it releases the claim, and starts advertising itself as available for other jobs. Because these keep alive messages are sent via UDP, they are sometimes dropped by the network. Therefore, the condor_startd has some tolerance for missed keep alive messages, so that in case a few keep alives are lost, the condor_startd will not immediately release the claim. This setting controls how many keep alive messages can be missed before the condor_startd considers the claim no longer valid. The default is 6. |
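Since the startd only gives up the claim after several consecutive keep-alives go missing, the tolerated outage is roughly ALIVE_INTERVAL multiplied by MAX_CLAIM_ALIVES_MISSED. A hypothetical fragment stretching that window (values illustrative only):

```
# With ALIVE_INTERVAL = 300 s, the default of 6 missed keep-alives
# tolerates ~30 minutes; raising it to 20 tolerates ~100 minutes
# (300 s x 20 = 6000 s) before the startd releases the claim.
MAX_CLAIM_ALIVES_MISSED = 20
```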
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 123 |
I sometimes think that no man alive can understand all the intricacies and interdependencies of HTCondor configuration. Nor, more worryingly, can any woman! |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I agree, but this setting has the least potential of causing additional trouble. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Now investigating this parameter: NOT_RESPONDING_TIMEOUT |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 123 |
I had to stop my Linux box this afternoon for a software and a memory upgrade. I stopped BOINC at 1433 and eventually restarted (why do memory DIMMs need so many attempts to seat properly?) at 1615. The VM booted and console 5 showed the tail of a CMS job that had started but not yet got to the first event. The "graphics" web-site showed just boot.log. Eventually a new CMS job started running and outputting to console 5. I still haven't had a successful suspend/resume on the machine. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
I suspended my VM overnight and it recovered.
Begin processing the 20th record. Run 1, Event 907270, LumiSection 10888 at 18-Feb-2016 17:05:56.251 CET
Begin processing the 21st record. Run 1, Event 907271, LumiSection 10888 at 18-Feb-2016 17:06:12.386 CET
Begin processing the 22nd record. Run 1, Event 907272, LumiSection 10888 at 18-Feb-2016 17:06:12.395 CET
Begin processing the 23rd record. Run 1, Event 907273, LumiSection 10888 at 19-Feb-2016 09:22:18.384 CET
Begin processing the 24th record. Run 1, Event 907274, LumiSection 10888 at 19-Feb-2016 09:22:26.007 CET
Begin processing the 25th record. Run 1, Event 907275, LumiSection 10888 at 19-Feb-2016 09:22:59.852 CET
Begin processing the 26th record. Run 1, Event 907276, LumiSection 10888 at 19-Feb-2016 09:23:10.238 CET
Please test and let me know how it goes. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Ignore that last message, it was killed. EDIT:
02/19/16 09:21:29 (pid:16551) The system clocked jumped 58472 seconds unexpectedly. Restarting all daemons |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 25 |
Ignore that last message, it was killed. You have been teased ;) After a resume it continues with the next events in the saved job, but then suddenly a new cmsRun is started. I had up to 6 events before the running job was killed. Maybe you have to revise your second message in this thread, although you wrote 'seemed to be fine'. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
static const int MAX_TIME_SKIP = (60*20); //20 minutes :( |
©2024 CERN