Message boards :
News :
Suspend/Resume
Author | Message |
---|---|
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
The suspend/resume issue should now be resolved, so it should be possible to pause/save a VM for up to 48 hours without losing the current job. This will only work with new tasks, so please start a new one if you would like to test this. As usual, please post to this thread if you find any problems. EDIT: Just as a reminder, the job would previously be evicted after the VM was suspended for about 20 minutes. If you do a test and it is fine, please also post and say for how long the VM was suspended. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
I suspended for 1 hour and it seemed to be fine. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 25 |
After the first suspend (VM state paused 15 minutes) the job resumed fine.
CMS-dev 17 Feb 14:31:35 task wu_1455226178_527_0 suspended by user
CMS-dev 17 Feb 14:46:34 task wu_1455226178_527_0 resumed by user
Passing the former threshold of 20 minutes by pausing 22 minutes:
CMS-dev 17 Feb 15:07:45 task wu_1455226178_527_0 suspended by user
CMS-dev 17 Feb 15:29:47 task wu_1455226178_527_0 resumed by user
After the resume the following lines were added before the job got SIGQUIT:
Begin processing the 120th record. Run 1, Event 166120, LumiSection 1994 at 17-Feb-2016 15:30:07.456 CET
Begin processing the 121st record. Run 1, Event 166121, LumiSection 1994 at 17-Feb-2016 15:30:21.684 CET
Begin processing the 122nd record. Run 1, Event 166122, LumiSection 1994 at 17-Feb-2016 15:30:22.933 CET
Begin processing the 123rd record. Run 1, Event 166123, LumiSection 1994 at 17-Feb-2016 15:31:29.103 CET
Begin processing the 124th record. Run 1, Event 166124, LumiSection 1994 at 17-Feb-2016 15:31:47.642 CET
Begin processing the 125th record. Run 1, Event 166125, LumiSection 1994 at 17-Feb-2016 15:31:49.570 CET
Job was killed:
02/17/16 13:51:25 (pid:19910) Using wrapper /home/boinc/CMSRun/glide_li1yhf/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_li1yhf/execute/dir_19910/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=665 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_665.txt --runAndLumis=job_lumis_665.json --lheInputFiles=False --firstEvent=166001 --firstLumi=1993 --lastEvent=166251 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {}
02/17/16 13:51:25 (pid:19910) Running job as user (null)
02/17/16 13:51:25 (pid:19910) Create_Process succeeded, pid=19916
02/17/16 15:31:57 (pid:19910) Got SIGQUIT. Performing fast shutdown.
02/17/16 15:31:57 (pid:19910) ShutdownFast all jobs.
02/17/16 15:31:57 (pid:19910) Process exited, pid=19916, signal=9
02/17/16 15:31:57 (pid:19910) Last process exited, now Starter is exiting |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 25 |
Repeated the test with a 30-minute pause. The screen below shows the output of the console (Alt+F5) after the resume: |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Suspend/Resume is not working reliably. Out of 4 suspended tasks only one resumed correctly. The interruption was less than 15 min. EDIT: the "successfully stopped VM" message is missing in the stderr. I am sometimes suspending 2 VMs at the same time; I guess that takes longer than the current timeout value. You might want to increase that. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 123 |
--jobNumber=665 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 -- OK, that's one of "mine". There's no Condor log, so it's been "lost", tho' (I think) Condor thinks it's still running (and will do so for another six days if my understanding of JobLeaseDuration is right):
Type = "NodeStatus";
Node = "Job665";
NodeStatus = 3; /* "STATUS_SUBMITTED" */
StatusDetails = "not_idle";
RetryCount = 0;
JobProcsQueued = 1;
JobProcsHeld = 0; |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Job lease duration 6 days I understand that if a job is lost due to exceeding the 20-minute timeout, a new job will be assigned and there is no way to resume the old one, as the server thinks it is still being processed. If that is correct, the whole 6-day job lease duration is pointless. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 123 |
Job lease duration 6 days Laurence knows more about this than I, but yes we had the start of an interesting conversation with an HTCondor expert about timeouts at today's meeting. I think an interesting remark that didn't get the attention it deserved was, "If it's part of a separate pool," which of course our current set-up is. Not sure of the vision for a future, more integrated project. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 25 |
--jobNumber=665 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 -- Yeah, one job out of your batch, running on my machine and still assigned to it, was killed after pausing for 22 minutes by sending a new job to my machine. No wonder the number of 'running jobs' is going sky-high. Wallclock consumption of failed jobs is higher than that of successful jobs. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
The timeouts are problematic, as they are interlinked. They should not be changed by much, otherwise the whole system might trip. I would not change them by more than a factor of 2 or 3 (or 1/2, 1/3) without knowing the details, followed by careful monitoring. edit: I have four or five "lost" jobs, which will be inaccessible for 6 days. That batch would be finished by then, oh no, it won't. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
It looks like there is something else killing the job. I will investigate tomorrow. For those who are interested, we have been playing with these attributes:
CCB_HEARTBEAT_INTERVAL
ALIVE_INTERVAL
JobLeaseDuration
http://research.cs.wisc.edu/htcondor/manual/v8.5/3_3Configuration.html
It worked in our test instance but not with the CMS glideinWMS. Suspending jobs was discussed at a CMS meeting today and will hopefully also be discussed with the Condor developers at the workshop in Barcelona. https://indico.cern.ch/event/467075/ |
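For reference, a hypothetical configuration fragment raising these knobs might look like the one below. The values are purely illustrative, not what the project actually deployed; note that JobLeaseDuration is set per job (submit file or glidein config), not in condor_config:

```
# Illustrative values only -- not the project's actual settings.
# Seconds between schedd keep-alive messages for a claim.
ALIVE_INTERVAL = 300
# Seconds between CCB client heartbeats to the broker.
CCB_HEARTBEAT_INTERVAL = 1200
# Per-job attribute: how long (seconds) the job lease survives
# without schedd contact; 518400 s = 6 days, as discussed above.
JobLeaseDuration = 518400
```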
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
You might want to consider increasing this: MAX_CLAIM_ALIVES_MISSED The condor_schedd sends periodic updates to each condor_startd as a keep alive (see the description of ALIVE_INTERVAL on page [*]). If the condor_startd does not receive any keep alive messages, it assumes that something has gone wrong with the condor_schedd and that the resource is not being effectively used. Once this happens, the condor_startd considers the claim to have timed out, it releases the claim, and starts advertising itself as available for other jobs. Because these keep alive messages are sent via UDP, they are sometimes dropped by the network. Therefore, the condor_startd has some tolerance for missed keep alive messages, so that in case a few keep alives are lost, the condor_startd will not immediately release the claim. This setting controls how many keep alive messages can be missed before the condor_startd considers the claim no longer valid. The default is 6. |
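Since the startd only gives up the claim after several consecutive keep-alives go missing, the tolerated outage is roughly ALIVE_INTERVAL multiplied by MAX_CLAIM_ALIVES_MISSED. A hypothetical fragment stretching that window (values illustrative only):

```
# With ALIVE_INTERVAL = 300 s, the default of 6 missed keep-alives
# tolerates ~30 minutes; raising it to 20 tolerates ~100 minutes
# (300 s x 20 = 6000 s) before the startd releases the claim.
MAX_CLAIM_ALIVES_MISSED = 20
```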
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 123 |
I sometimes think that no man alive can understand all the intricacies and interdependencies of HTCondor configuration. Nor, more worryingly, can any woman! |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I agree, but this setting has the least potential of causing additional trouble. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Now investigating this parameter: NOT_RESPONDING_TIMEOUT |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 123 |
I had to stop my Linux box this afternoon for a software and a memory upgrade. I stopped BOINC at 1433 and eventually restarted (why do memory DIMMs need so many attempts to seat properly?) at 1615. The VM booted and console 5 showed the tail of a CMS job that had started but not yet got to the first event. The "graphics" web-site showed just boot.log. Eventually a new CMS job started running and outputting to console 5. I still haven't had a successful suspend/resume on the machine. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
I suspended my VM overnight and it recovered.
Begin processing the 20th record. Run 1, Event 907270, LumiSection 10888 at 18-Feb-2016 17:05:56.251 CET
Begin processing the 21st record. Run 1, Event 907271, LumiSection 10888 at 18-Feb-2016 17:06:12.386 CET
Begin processing the 22nd record. Run 1, Event 907272, LumiSection 10888 at 18-Feb-2016 17:06:12.395 CET
Begin processing the 23rd record. Run 1, Event 907273, LumiSection 10888 at 19-Feb-2016 09:22:18.384 CET
Begin processing the 24th record. Run 1, Event 907274, LumiSection 10888 at 19-Feb-2016 09:22:26.007 CET
Begin processing the 25th record. Run 1, Event 907275, LumiSection 10888 at 19-Feb-2016 09:22:59.852 CET
Begin processing the 26th record. Run 1, Event 907276, LumiSection 10888 at 19-Feb-2016 09:23:10.238 CET
Please test and let me know how it goes. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Ignore that last message, it was killed. EDIT:
02/19/16 09:21:29 (pid:16551) The system clocked jumped 58472 seconds unexpectedly. Restarting all daemons |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 25 |
Ignore that last message, it was killed. You have been teased ;) After a resume it continues with the next events in the saved job, but then suddenly a new cmsRun is started. I had up to 6 events before the running job was killed. Maybe you have to revise your second message in this thread, although you wrote 'seemed to be fine'. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
static const int MAX_TIME_SKIP = (60*20); //20 minutes :( |
©2024 CERN