Suspend/Resume Theory

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9
Message 3287 - Posted: 6 May 2016, 16:02:45 UTC - in response to Message 3286.
"Suspend/Resume should now work up to 60mins."
Testing without Theory jobs running inside the VM makes no sense IMO. We are out of jobs: http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=209
Is it not possible to create a steady flow without manual intervention?

rbpeake Joined: 15 Apr 15 Posts: 38 Credit: 227,251 RAC: 0
Message 3288 - Posted: 6 May 2016, 17:28:20 UTC - in response to Message 3287.
For CMS, there are no new jobs following an exit and resume.

Ray Murray Joined: 13 Apr 15 Posts: 138 Credit: 2,969,210 RAC: 0
Message 3289 - Posted: 6 May 2016, 20:42:06 UTC Last modified: 6 May 2016, 21:06:20 UTC
Theory Task running a little over an hour. 4 jobs completed. Job in progress 37900 events. LAIM (leave applications in memory) set. Suspended in Boinc for 13 mins. VM showed Paused in VBox. Resumed Task. Picked up where it left off without any delay 8¬)
5 mins later, job at 59700, Exited Boinc. VM saved in VBox. Restarted Boinc. A minute or so to Restore and reconnect Console. Job continuing unaffected. Woo Hoo. Good job, guys.
The standard swap-out time for Boinc to run another Project is 60 mins so it would be good if Tasks could be Suspended a little beyond that just to allow for any changeover delays.
--------------
Hmm, maybe spoke too soon. Job I interfered with above has finished. Looks OK in Console (Run finished successfully) but no new job has arrived yet and the log still shows it as a running.log.
stderr shows:
20:02:37 +0100 2016-05-06 [INFO] New Job Starting
20:02:37 +0100 2016-05-06 [INFO] Condor JobID: 297910
20:02:42 +0100 2016-05-06 [INFO] MCPlots JobID: 30109085
20:39:32 +0100 2016-05-06 [INFO] Job finished with .
(Which looks OK)
StarterLog shows:
05/06/16 20:02:37 Running job as user nobody
05/06/16 20:02:37 Create_Process succeeded, pid=17651
No mention of the Suspend/Resume (and no other entries) in this time period 20:02 - 20:31 but after the Boinc exit/restart, even though the job APPEARED to be unaffected, it doesn't look like it enjoyed the Restart. The log continues:
05/06/16 20:31:18 condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
05/06/16 20:31:18 IO: Failed to read packet header
05/06/16 20:36:19 condor_write(): Socket closed when trying to write 409 bytes to <188.184.187.167:9618>, fd is 11
05/06/16 20:36:19 Buf::write(): condor_write() failed
05/06/16 20:39:32 Process exited, pid=17651, status=0
05/06/16 20:39:32 About to exec Post script: /var/lib/condor/execute/dir_17647/tarOutput.sh 2016-542355-247
05/06/16 20:39:32 Create_Process succeeded, pid=22628
05/06/16 20:39:33 Process exited, pid=22628, status=0
05/06/16 20:39:33 condor_write(): Socket closed when trying to write 583 bytes to <188.184.187.167:9618>, fd is 11
05/06/16 20:39:33 Buf::write(): condor_write() failed
05/06/16 20:39:35 condor_write(): Socket closed when trying to write 365 bytes to <188.184.187.167:9618>, fd is 11
05/06/16 20:39:35 Buf::write(): condor_write() failed
05/06/16 20:39:35 Failed to send job exit status to shadow
05/06/16 20:39:35 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
Still no connection 20 mins later so I attempted another Boinc restart to see if that would force a retry. Console opens to the same "Run finished successfully" message, but now I can't access the logs. Didn't know how long it would sit like that so I've Reset the VM in VBox. Started up OK but no joy and task timed-out while I wasn't looking.
So.... Suspend/Resume OK but Exit/Restart not.
New replacement Task didn't get any jobs either and timed-out so maybe they've all gone but it's too late in the day for any more fiddling anyway. Hope the above is understandable and useful.
---------
5 failed jobs from that host today so I'll maybe check that figure before doing any further interference 2moro. Other hosts are full of CMS Tasks just now.
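
For anyone who wants to check their own StarterLog for the same symptoms, here is a minimal Python sketch. It is not part of the project tooling: the function name `find_connection_errors`, the `ERROR_MARKERS` list and the `SAMPLE` text are made up for illustration, and the marker strings are simply copied from the excerpt above, so the list is probably not exhaustive. It assumes every log line starts with a `MM/DD/YY HH:MM:SS` timestamp, as in the post.

```python
import re
from datetime import datetime

# Substrings copied from the StarterLog excerpt in the post above.
# Illustrative only; other failure messages may exist.
ERROR_MARKERS = (
    "condor_read() failed",
    "condor_write(): Socket closed",
    "Failed to send job exit status to shadow",
    "JobExit() failed",
)

# Assumed line layout: "MM/DD/YY HH:MM:SS message"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def find_connection_errors(log_text):
    """Return (timestamp, message) pairs for lines containing a known marker."""
    hits = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines without a leading timestamp
        stamp, message = match.groups()
        if any(marker in message for marker in ERROR_MARKERS):
            when = datetime.strptime(stamp, "%m/%d/%y %H:%M:%S")
            hits.append((when, message))
    return hits


SAMPLE = """\
05/06/16 20:02:37 Create_Process succeeded, pid=17651
05/06/16 20:31:18 condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
05/06/16 20:39:32 Process exited, pid=17651, status=0
05/06/16 20:39:35 Failed to send job exit status to shadow
"""

if __name__ == "__main__":
    for when, message in find_connection_errors(SAMPLE):
        print(when.isoformat(sep=" "), message)
```

Comparing the first hit's timestamp with the time of the Boinc exit/restart should show whether the connection to the server was lost at the restart, as it appears to have been here (20:31:18).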

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9
Message 3291 - Posted: 7 May 2016, 18:49:34 UTC - in response to Message 3286.
"Suspend/Resume should now work up to 60mins."
Tested with LAIM on -> 55 minutes OK
The resumed job runs on and after the upload a new job was received.

Ray Murray Joined: 13 Apr 15 Posts: 138 Credit: 2,969,210 RAC: 0
Message 3292 - Posted: 8 May 2016, 6:57:09 UTC - in response to Message 3291. Last modified: 8 May 2016, 6:59:54 UTC
74mins suspended and all's well after Resume so hopefully that's that sorted although testing with LAIM off might be useful too, just for completeness.
Not tried Exit/Restart again. Might do that later 2day when I've some time to watch closely.

Laurence Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
Message 3294 - Posted: 9 May 2016, 11:50:17 UTC - in response to Message 3292.
Should be good up to 3 hours now.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9
Message 3295 - Posted: 9 May 2016, 18:06:34 UTC - in response to Message 3294. Last modified: 9 May 2016, 18:07:24 UTC
"Should be good up to 3 hours now."
Should be, but isn't: http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=172162
I resumed the VM after a suspend with LAIM on for 2.5 hours. It processed a few hundred events and then:
2016-05-09 20:02:31 (5912): Guest Log: [INFO] Condor exited with return value 99.
2016-05-09 20:02:31 (5912): Guest Log: [INFO] Shutting Down.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9
Message 3309 - Posted: 11 May 2016, 9:11:34 UTC Last modified: 11 May 2016, 9:34:54 UTC
Retested with 65 minutes pausing and LAIM on. Task runs on after resume and after the job finished it seemingly failed to upload the result:
05/11/16 10:27:39 Process exited, pid=4334, status=0
05/11/16 10:27:39 About to exec Post script: /var/lib/condor/execute/dir_4330/tarOutput.sh /var/spool/job-staging/2016-563005-252.run
05/11/16 10:27:39 Create_Process succeeded, pid=10097
05/11/16 10:27:39 Process exited, pid=10097, status=2
05/11/16 10:27:39 ReliSock::put_file_with_permissions(): Failed to stat file '/var/lib/condor/execute/dir_4330/2016-563005-252.tgz': No such file or directory (errno: 2, si_error: 1)
05/11/16 10:27:39 DoUpload: (Condor error code 13, subcode 2) STARTER at 10.0.2.15 failed to send file(s) to <188.184.187.167:9618>: error reading from /var/lib/condor/execute/dir_4330/2016-563005-252.tgz: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <XXX.XXX.XXX.XXX:50862>
05/11/16 10:27:39 JICShadow::notifyJobTermination(): Sending mock terminate event.
05/11/16 10:27:40 JIC::transferOutput() failed, waiting for job lease to expire or for a reconnect attempt
05/11/16 10:27:40 Got SIGQUIT. Performing fast shutdown.
05/11/16 10:27:40 ShutdownFast all jobs.
Condor was restarted and a new job ran, but that result also failed to upload. Logs available.
Edit: A task with a shorter pause (22 min) also has a problem after the job finished.
05/11/16 11:18:52 Process exited, pid=26782, status=0
05/11/16 11:18:52 About to exec Post script: /var/lib/condor/execute/dir_26778/tarOutput.sh /var/spool/job-staging/2016-570169-252.run
05/11/16 11:18:52 Create_Process succeeded, pid=28285
05/11/16 11:18:52 Process exited, pid=28285, status=2
05/11/16 11:18:52 condor_write(): Socket closed when trying to write 582 bytes to <188.184.187.167:9618>, fd is 11
05/11/16 11:18:52 Buf::write(): condor_write() failed
05/11/16 11:18:53 ReliSock::put_file_with_permissions(): Failed to stat file '/var/lib/condor/execute/dir_26778/2016-570169-252.tgz': No such file or directory (errno: 2, si_error: 1)
05/11/16 11:18:53 DoUpload: (Condor error code 13, subcode 2) STARTER at 10.0.2.15 failed to send file(s) to <188.184.187.167:9618>: error reading from /var/lib/condor/execute/dir_26778/2016-570169-252.tgz: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <XXX.XXX.XXX.XXX:51321>
05/11/16 11:18:53 JICShadow::notifyJobTermination(): Sending mock terminate event.
05/11/16 11:18:53 condor_write(): Socket closed when trying to write 316 bytes to <188.184.187.167:9618>, fd is 11
05/11/16 11:18:53 Buf::write(): condor_write() failed
05/11/16 11:18:53 ERROR "Assertion ERROR on (result)" at line 342 in file /slots/01/dir_57518/userdir/src/condor_starter.V6.1/NTsenders.cpp
05/11/16 11:18:53 condor_write(): Socket closed when trying to write 198 bytes to <188.184.187.167:9618>, fd is 11
05/11/16 11:18:53 Buf::write(): condor_write() failed
05/11/16 11:18:53 ERROR "Assertion ERROR on (result)" at line 902 in file /slots/01/dir_57518/userdir/src/condor_starter.V6.1/NTsenders.cpp
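
In both excerpts above the post script (tarOutput.sh) exits with status=2 just before the missing .tgz error, whereas in the earlier StarterLog in this thread it exited with status=0. Whether that is the actual cause of the failed upload is an assumption, but it is easy to check for. A minimal Python sketch follows; `post_script_results` and `SAMPLE` are hypothetical names, and the sketch assumes the three-line sequence seen above: "About to exec Post script", then "Create_Process succeeded, pid=N", then "Process exited, pid=N, status=S".

```python
import re

CREATE_RE = re.compile(r"Create_Process succeeded, pid=(\d+)")
EXIT_RE = re.compile(r"Process exited, pid=(\d+), status=(\d+)")


def post_script_results(log_text):
    """Return (pid, exit_status) for each post script found in the log."""
    results = []
    awaiting_pid = False   # True right after an "About to exec Post script" line
    post_pids = set()      # pids known to belong to post scripts
    for line in log_text.splitlines():
        if "About to exec Post script" in line:
            awaiting_pid = True
            continue
        created = CREATE_RE.search(line)
        if created and awaiting_pid:
            post_pids.add(created.group(1))
            awaiting_pid = False
            continue
        exited = EXIT_RE.search(line)
        if exited and exited.group(1) in post_pids:
            results.append((int(exited.group(1)), int(exited.group(2))))
            post_pids.discard(exited.group(1))
    return results


SAMPLE = """\
05/11/16 10:27:39 Process exited, pid=4334, status=0
05/11/16 10:27:39 About to exec Post script: /var/lib/condor/execute/dir_4330/tarOutput.sh /var/spool/job-staging/2016-563005-252.run
05/11/16 10:27:39 Create_Process succeeded, pid=10097
05/11/16 10:27:39 Process exited, pid=10097, status=2
"""

if __name__ == "__main__":
    for pid, status in post_script_results(SAMPLE):
        verdict = "ok" if status == 0 else "FAILED"
        print(f"post script pid={pid} status={status} {verdict}")
```

The job's own exit line (pid=4334) is ignored on purpose: only pids created right after a "Post script" line are tracked, so a clean job exit is not mistaken for a clean post script.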

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9
Message 3310 - Posted: 11 May 2016, 10:38:16 UTC
The VM mentioned in the previous post sat idle for over an hour after its last job. No new job and also no forced shutdown. To get rid of this laziness, I rebooted the VM. After the reboot a new job started.

Ben Segal Volunteer moderator Volunteer developer Volunteer tester Joined: 12 Sep 14 Posts: 65 Credit: 544 RAC: 0
Message 3311 - Posted: 11 May 2016, 11:09:13 UTC - in response to Message 3309.
We are working on the current problem with uploading results.

Ben Segal Volunteer moderator Volunteer developer Volunteer tester Joined: 12 Sep 14 Posts: 65 Credit: 544 RAC: 0
Message 3316 - Posted: 12 May 2016, 7:14:23 UTC - in response to Message 3311.
"We are working on the current problem with uploading results."
Problem now solved - system should be working well again.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9
Message 3319 - Posted: 12 May 2016, 10:29:48 UTC - in response to Message 3316.
"We are working on the current problem with uploading results."
"Problem now solved - system should be working well again."
Confirmed! Thanks, Ben.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9
Message 3321 - Posted: 12 May 2016, 19:29:26 UTC - in response to Message 3294.
"Should be good up to 3 hours now."
A short test with a 61-minute pause. The job runs on after the resume and the next jobs are also running fine. The only info found was in MasterLog:
05/12/16 15:18:50 Preen pid is 4752
05/12/16 15:18:50 DefaultReaper unexpectedly called on pid 4752, status 0.

Laurence Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
Message 3333 - Posted: 13 May 2016, 9:54:37 UTC - in response to Message 3321.
I successfully did a suspend/resume overnight (8h). What is the acceptable target we should aim for?

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9
Message 3336 - Posted: 13 May 2016, 10:32:34 UTC - in response to Message 3333. Last modified: 13 May 2016, 10:41:56 UTC
"I successfully did a suspend/resume overnight (8h). What is the acceptable target we should aim for?"
Not everyone is in the habit of starting BOINC before breakfast, so job/task survival of up to 12 hours of pausing would be fine in my opinion.

Ray Murray Joined: 13 Apr 15 Posts: 138 Credit: 2,969,210 RAC: 0
Message 3338 - Posted: 13 May 2016, 12:47:01 UTC
Are we still just talking about Suspend/Resume or do jobs now survive a Boinc Exit? I've not been able to do any testing in that regard this week and it'll be evening before I'll be able to try.

Laurence Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
Message 3341 - Posted: 13 May 2016, 12:51:26 UTC - in response to Message 3338.
It depends on what happens on a BOINC exit. If it saves to disk then it might work but I haven't tested it. My tests have been related to just pausing the VM. The StarterLog is the most informative log if it doesn't work.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9
Message 3345 - Posted: 13 May 2016, 14:21:36 UTC
If the BOINC client is exited normally, a snapshot of the VM will be saved in the slots/X/boinc_xxxxxxxxxxxxxxxx/Snapshots directory. The same happens when you suspend a task with LAIM off.
However, when you have several VMs running under BOINC's control, all of those VMs have to be saved to disk within 60 seconds; otherwise tasks that are saved too late will be treated as a computation error.
Resuming the task, and thus restoring the VM from disk, works fine, but I haven't tested it recently for longer periods of inactivity. I'm testing it now. Which period of pausing shall I test first before doing the overnight test?

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Message 3348 - Posted: 13 May 2016, 17:12:51 UTC Last modified: 13 May 2016, 17:16:12 UTC
Disconnecting the internet for 30min still causes:
- Job to finish and upload
- no new job is started
05/13/16 18:29:04 ERROR: SECMAN:2003:TCP connection to collector alicondor01.cern.ch failed.
05/13/16 18:29:04 Failed to start non-blocking update to <188.184.129.127:9618>.
05/13/16 18:29:34 attempt to connect to <188.184.129.127:9618> failed: No route to host (connect errno = 113).
05/13/16 18:29:34 ERROR: SECMAN:2003:TCP connection to collector alicondor01.cern.ch failed.
05/13/16 18:29:34 Failed to start non-blocking update to <188.184.129.127:9618>.
This is repeated over and over. The first message of this kind appeared 30 sec after the disconnect in start.log.
This indicates that even a very short internet disconnect causes the task to hang (forever?)
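
To see how long the VM has been unable to reach the collector, the repeated errors can be reduced to a first timestamp, a last timestamp and a count. The Python sketch below is only an illustration: `outage_span`, `MARKER` and `SAMPLE` are hypothetical names, the marker text is taken from the log lines in the post above, and the code assumes each line begins with a 17-character `MM/DD/YY HH:MM:SS` timestamp.

```python
from datetime import datetime

# Text taken from the "SECMAN:2003" lines quoted in the post above.
MARKER = "TCP connection to collector"


def outage_span(log_text):
    """Return (first, last, count) for collector connection failures, or None."""
    stamps = []
    for raw in log_text.splitlines():
        line = raw.strip()
        if MARKER not in line or "failed" not in line:
            continue
        try:
            # The first 17 characters hold the "MM/DD/YY HH:MM:SS" timestamp.
            stamps.append(datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S"))
        except ValueError:
            continue  # line does not start with a timestamp; ignore it
    if not stamps:
        return None
    return stamps[0], stamps[-1], len(stamps)


SAMPLE = """\
05/13/16 18:29:04 ERROR: SECMAN:2003:TCP connection to collector alicondor01.cern.ch failed.
05/13/16 18:29:04 Failed to start non-blocking update to <188.184.129.127:9618>.
05/13/16 18:29:34 attempt to connect to <188.184.129.127:9618> failed: No route to host (connect errno = 113).
05/13/16 18:29:34 ERROR: SECMAN:2003:TCP connection to collector alicondor01.cern.ch failed.
"""

if __name__ == "__main__":
    span = outage_span(SAMPLE)
    if span is not None:
        first, last, count = span
        print(f"{count} failures between {first} and {last} ({last - first})")
```

If the last failure is well after the internet connection came back, that would support the observation that the task does not recover on its own.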

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9
Message 3350 - Posted: 13 May 2016, 17:17:48 UTC
I suspended the task with LAIM off for 3 hours and 5 minutes. After the resume the VM was shut down within one minute. http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=173962
2016-05-13 19:12:01 (7368): VM Completion Message: Condor exited with return value 99.

Development for LHC@home