Thread 'Suspend/Resume'

Author	Message
Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2309 - Posted: 10 Mar 2016, 22:43:21 UTC - in response to Message 2308. Thanks Crystal for testing that. Interesting. Could you post the first few lines from starter_log(after the task was resumed)? Is it an OS thing? Is it the ISP? I could try 1h30min.Why that should work, when 45min don't,it should not. ID: 2309 · Rating: 0 · rate: / Reply Quote

Laurence CERN Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 2310 - Posted: 10 Mar 2016, 22:48:33 UTC - in response to Message 2302. The effort hasn't been put on hold, just moved from the fast track to the medium speed track. It is turning out to be a little more complicated than just adjusting one parameter and isn't exactly quick to test. I really appreciate all your efforts on this one! As Condor will be central to most of what we are doing, getting this issue nailed has a big impact. ID: 2310 · Rating: 0 · rate: / Reply Quote

Laurence CERN Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 2311 - Posted: 10 Mar 2016, 22:50:36 UTC - in response to Message 2308. Crystal, please could you try just over 2 hours? ID: 2311 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,772 RAC: 141	Message 2313 - Posted: 11 Mar 2016, 6:29:19 UTC - in response to Message 2309. Could you post the first few lines from starter_log(after the task was resumed)? Good morning (here). Logs meanwhile gone while task finished. Result with the jobs done: http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=117217 ID: 2313 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,772 RAC: 141	Message 2314 - Posted: 11 Mar 2016, 6:36:11 UTC - in response to Message 2311. Crystal, please could you try just over 2 hours? Current task is over 6 hours and will be killed after current job, so I'll test after a new task and job has started. TTYL. CP ID: 2314 · Rating: 0 · rate: / Reply Quote

Laurence CERN Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 2316 - Posted: 11 Mar 2016, 8:55:15 UTC - in response to Message 2313. I have put the log for wu_1457619778_27_0 here: http://lfield.web.cern.ch/lfield/StarterLog To me it looks like it is not resuming correctly. 03/10/16 17:52:48 (pid:7588) Create_Process succeeded, pid=7592 03/10/16 19:04:42 (pid:7588) Process exited, pid=7592, status=0 03/10/16 19:04:43 (pid:7588) Got SIGQUIT. Performing fast shutdown. 03/10/16 19:04:43 (pid:7588) ShutdownFast all jobs. 03/10/16 19:04:46 (pid:9599) Create_Process succeeded, pid=9603 03/10/16 22:34:18 (pid:9599) Process exited, pid=9603, status=0 03/10/16 22:34:18 (pid:9599) Got SIGQUIT. Performing fast shutdown. 03/10/16 22:34:18 (pid:9599) ShutdownFast all jobs. 03/10/16 22:34:21 (pid:12131) Running job as user (null) 03/10/16 22:34:21 (pid:12131) Create_Process succeeded, pid=12135 03/11/16 00:50:47 (pid:12131) Process exited, pid=12135, status=0 03/11/16 00:50:47 (pid:12131) Got SIGQUIT. Performing fast shutdown. 03/11/16 00:50:47 (pid:12131) ShutdownFast all jobs. ID: 2316 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,772 RAC: 141	Message 2320 - Posted: 11 Mar 2016, 10:03:13 UTC - in response to Message 2316. 03/10/16 19:04:46 (pid:9599) Create_Process succeeded, pid=9603 03/10/16 22:34:18 (pid:9599) Process exited, pid=9603, status=0 03/10/16 22:34:18 (pid:9599) Got SIGQUIT. Performing fast shutdown. 03/10/16 22:34:18 (pid:9599) ShutdownFast all jobs. It resumes correctly and returns a valid job. Above is the one with the 90 minutes suspend and resume. From the result: 2016-03-10 19:13:38 (8660): VM state change detected. (old = 'running', new = 'paused') 2016-03-10 20:43:41 (8660): VM state change detected. (old = 'paused', new = 'running') (mind that the host and guest clocks are a bit out of sync) and 2016-03-11 00:55:48 (8660): Guest Log: ======== gWMS-CMSRunAnalysis.sh STARTING at Thu Mar 10 18:04:46 GMT 2016 on 38-37-6640 ======== 2016-03-11 00:55:48 (8660): Guest Log: jobNumber: 9215 2016-03-11 00:55:48 (8660): Guest Log: ..... 2016-03-11 00:55:48 (8660): Guest Log: ====== Thu Mar 10 21:34:18 2016: Finished remote stageout of user output files (status 0). 2016-03-11 00:55:48 (8660): Guest Log: Will not inject transfer requests to ASO for the user output files, because they were staged out directly to the permanent storage. 2016-03-11 00:55:48 (8660): Guest Log: ====== Thu Mar 10 21:34:18 2016: cmscp.py FINISHING (status 0). 2016-03-11 00:55:48 (8660): Guest Log: ======== Stageout at Thu Mar 10 21:34:18 GMT 2016 FINISHING (short status 0) ======== 2016-03-11 00:55:48 (8660): Guest Log: ======== gWMS-CMSRunAnalysis.sh FINISHING at Thu Mar 10 21:34:18 GMT 2016 on 38-37-6640 with (short) status 0 ======== 2016-03-11 00:55:48 (8660): Guest Log: Local time: Thu Mar 10 22:34:18 CET 2016 2016-03-11 00:55:48 (8660): Guest Log: Short exit status: 0 2016-03-11 00:55:48 (8660): Guest Log: Job Running time in seconds: 12572 The test with a suspend of 2 hours and 4 minutes kills the job: 03/11/16 10:41:16 (pid:7583) State change: claim lease expired (condor_schedd gone?) 03/11/16 10:41:16 (pid:7583) Changing state and activity: Claimed/Busy -> Preempting/Killing and a new cmsRun is started. ID: 2320 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2321 - Posted: 11 Mar 2016, 10:18:37 UTC - in response to Message 2320. Last modified: 11 Mar 2016, 10:21:55 UTC When i suspend a boinc tasks, the VM is stopped. 2016-03-10 18:30:23 (2416): Status Report: Job Duration: '129600.000000' 2016-03-10 18:30:23 (2416): Status Report: Elapsed Time: '6008.000799' 2016-03-10 18:30:23 (2416): Status Report: CPU Time: '5444.062500' 2016-03-10 19:40:20 (2416): Stopping VM. 2016-03-10 19:40:35 (2416): Successfully stopped VM. 2016-03-10 20:25:13 (3224): vboxwrapper (7.7.26184): starting 2016-03-10 20:25:13 (3224): Feature: Checkpoint interval offset (185 seconds) 2016-03-10 20:25:13 (3224): Detected: VirtualBox COM Interface (Version: 5.0.16) 2016-03-10 20:25:13 (3224): Detected: Anonymous Platform Enabled 2016-03-10 20:25:13 (3224): Detected: Minimum checkpoint interval (600.000000 seconds) 2016-03-10 20:25:13 (3224): Detected: Heatbeat check (file: '$s' every 0.000000 seconds) 2016-03-10 20:25:13 (3224): Starting VM. (boinc_4917bfd85daa2da7, slot#5) Why is yours just paused? ( Do you have : Leave app in memory enabled?) From the result: 2016-03-10 19:13:38 (8660): VM state change detected. (old = 'running', new = 'paused') 2016-03-10 20:43:41 (8660): VM state change detected. (old = 'paused', new = 'running') (mind that the host and guest clocks are a bit out of sync) ID: 2321 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,772 RAC: 141	Message 2322 - Posted: 11 Mar 2016, 10:27:15 UTC - in response to Message 2321. Why is yours just paused? ( Do you have : Leave app in memory enabled?) That's true. Normally I would test twice (LAIM on/off), but for the moment I tested with LAIM on, because for the timeout's we're testing here, it should not make a difference, cause guest and server can't reach each other when the VM is 'only' paused. ID: 2322 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2323 - Posted: 11 Mar 2016, 10:31:09 UTC - in response to Message 2322. At least, that is different to mine. Let's see, if that causes the difference in behavior. ID: 2323 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2330 - Posted: 11 Mar 2016, 11:46:15 UTC - in response to Message 2323. Last modified: 11 Mar 2016, 11:48:04 UTC It seems LAIM (Leave application in memory while suspended) Boinc setting has a significant impact on the suspend duration,after which it correctly resumes. I enabled it, and it looks like i can suspend for longer than before. I believe, that a paused VM leaves the network ports open, while a saved VM does not. That is the difference between LAIM on/off. I still have to wait for the job to be confirmed valid,but i am certain, it will. ID: 2330 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,772 RAC: 141	Message 2347 - Posted: 11 Mar 2016, 18:28:28 UTC Indeed, with LAIM off and a suspend (at 18:37:19), I see much more info after the resume (at 19:18:41) about lost connection. 03/11/16 18:12:17 (pid:10545) Running job as user (null) 03/11/16 18:12:17 (pid:10545) Create_Process succeeded, pid=10550 03/11/16 19:19:45 (pid:10545) condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>. 03/11/16 19:19:45 (pid:10545) IO: Failed to read packet header 03/11/16 19:19:45 (pid:10545) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector lcggwms02.gridpp.rl.ac.uk:9620. 03/11/16 19:19:45 (pid:10545) IO: Failed to read packet header 03/11/16 19:19:45 (pid:10545) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9620 03/11/16 19:19:45 (pid:10545) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9620 failed; will try to reconnect in 60 seconds. 03/11/16 19:20:45 (pid:10545) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#139143 But the job runs on. Could repeat this with a suspend > 60 minutes. ID: 2347 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2348 - Posted: 11 Mar 2016, 18:41:51 UTC - in response to Message 2347. Last modified: 11 Mar 2016, 18:47:22 UTC I have tried (with LAIM on) 45min---perfectly OK. 1h55min--failed- claim expired. EDIT:I am going to try 1h45min LAIM ON ID: 2348 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,772 RAC: 141	Message 2350 - Posted: 11 Mar 2016, 21:23:07 UTC Suspending 68min with LAIM off results in premature ending the job and BOINC-task: http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=117566 ID: 2350 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2351 - Posted: 11 Mar 2016, 21:32:11 UTC Last modified: 11 Mar 2016, 21:36:31 UTC 1h45min suspend with LAIM on Failed. 03/11/16 22:15:14 (pid:7588) condor_read() failed: recv(fd=12) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <10.0.2.15:49685>. 03/11/16 22:15:14 (pid:7588) IO: Failed to read packet header 03/11/16 22:15:14 (pid:7588) AUTHENTICATE: handshake failed! 03/11/16 22:15:14 (pid:7588) DC_AUTHENTICATE: required authentication of 10.0.2.15 failed: AUTHENTICATE:1002:Failure performing handshake 03/11/16 22:15:14 (pid:7588) Starter pid 7629 exited with status 0 03/11/16 22:15:14 (pid:7588) State change: starter exited 03/11/16 22:15:14 (pid:7588) State change: No preempting claim, returning to owner 03/11/16 22:15:14 (pid:7588) Changing state and activity: Preempting/Killing -> Owner/Idle Boinc-task was ended: 1 job failed out of 1. http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=117664 It was the first job in that task. NOTE: IT did not run a single event after resume. ID: 2351 · Rating: 0 · rate: / Reply Quote

Laurence CERN Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 2394 - Posted: 15 Mar 2016, 15:54:54 UTC - in response to Message 2351. Last modified: 15 Mar 2016, 15:58:13 UTC The condor configuration has been updated and has been pushed to CVMFS. Should be available with new glideins (runs) this evening. We are targeting suspending up to 2 hours with LAIM on. Here are the current settings. CLAIM_WORKLIFE = -1 was ifThenElse(DynamicSlot =?= true,3600,-1) NOT_RESPONDING_TIMEOUT = 10800 CCB_HEARTBEAT_INTERVAL 0 ALIVE_INTERVAL = 1800 MAX_CLAIM_ALIVES_MISSED = 6 JobLeaseDuration = 7200 The settings are shown in the glidein-stderr file from the graphics/logs. ID: 2394 · Rating: 0 · rate: / Reply Quote

Laurence CERN Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 2472 - Posted: 21 Mar 2016, 12:41:01 UTC - in response to Message 2394. CLAIM_WORKLIFE set back to 3600 ID: 2472 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2476 - Posted: 21 Mar 2016, 13:35:24 UTC - in response to Message 2472. Thanks Laurence, CLAIM_WORKLIFE default is actually 1200sec (20min). Will see, how it goes. ID: 2476 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,772 RAC: 141	Message 4579 - Posted: 22 Dec 2016, 7:08:43 UTC The overnight suspending of a CMS-job did not work anymore. The yesterday evening's job (~40% done) run after the resume 20 minutes and was killed then with error code 2. StartLog: 12/22/16 07:29:26 condor_write(): Socket closed when trying to write 4006 bytes to collector vccondor01.cern.ch, fd is 9, errno=104 Connection reset by peer 12/22/16 07:29:26 Buf::write(): condor_write() failed 12/22/16 07:29:26 CCBListener: no activity from CCB server in 32440s; assuming connection is dead. 12/22/16 07:29:26 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds. 12/22/16 07:30:26 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#1601902 12/22/16 07:50:06 Starter pid 4134 exited with status 2 12/22/16 07:50:06 State change: starter exited From StarterLog: 12/22/16 07:30:06 (pid:4134) CCBListener: no activity from CCB server in 32480s; assuming connection is dead. 12/22/16 07:30:06 (pid:4134) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds. 12/22/16 07:30:06 (pid:4134) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.82.11:4080>. 12/22/16 07:30:06 (pid:4134) IO: Failed to read packet header 12/22/16 07:30:06 (pid:4134) Lost connection to shadow, waiting 1200 secs for reconnect 12/22/16 07:31:06 (pid:4134) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#1601927 12/22/16 07:50:06 (pid:4134) No reconnect from shadow for 1200 seconds, aborting job execution! 12/22/16 07:50:06 (pid:4134) **** condor_starter (condor_STARTER) pid 4134 EXITING WITH STATUS 2 ID: 4579 · Rating: 0 · rate: / Reply Quote

Laurence CERN Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 4592 - Posted: 29 Dec 2016, 23:13:14 UTC - in response to Message 4579. The overnight suspending of a CMS-job did not work anymore. Here is the long answer ... HTCondor is a distributed system composed of a number of services that communicate using ClassAds (modelled after classified adverts in newspapers). The Collector is the central information repository to which all other services register with a ClassAd. The Sched is essentially the job queue and job ClassAds are sent to the Collector. The Startd runs in the VM and sends a machine ClassAd to the Collector. The Negotiator periodically looks at the information in the Collector and matches jobs with slots on machines. When a match is made, a Shadow process if forked off from the Sched to handle interactions with the job slot. So for suspend/resume to work, settings need to be adjusted in a few places. For the new WMAgent workloads,the Sched is new and managed by CMS so these settings have not been adjusted. Initially we planned to add this to the dev project in January but as Rasputin seemed keen, we pushed it out before the annual shut down. The short answer is ... It should be working for CMS in LHC@home and is not working here as this is experimental. Thanks for reporting this and we will address it soon. ID: 4592 · Rating: 0 · rate: / Reply Quote

Development for LHC@home