Suspend/Resume Theory
Author | Message |
---|---|
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18 |
"Suspend/Resume should now work up to 60 mins." Testing without Theory jobs running inside the VM makes no sense IMO. We are out of jobs: http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=209 Is it not possible to create a steady flow without manual intervention? |
Send message Joined: 15 Apr 15 Posts: 38 Credit: 227,251 RAC: 0 |
For CMS, there are no new jobs following an exit and resume. |
Send message Joined: 13 Apr 15 Posts: 138 Credit: 2,969,210 RAC: 9 |
Theory Task running a little over an hour. 4 jobs completed. Job in progress at 37900 events. LAIM set.

Suspended in Boinc for 13 mins. VM showed Paused in VBox. Resumed Task. Picked up where it left off without any delay 8¬)

5 mins later, job at 59700, exited Boinc. VM saved in VBox. Restarted Boinc. A minute or so to restore and reconnect the Console. Job continuing unaffected.

Woo Hoo. Good job, guys.

The standard swap-out time for Boinc to run another Project is 60 mins, so it would be good if Tasks could be Suspended a little beyond that just to allow for any changeover delays.

--------------

Hmm, maybe spoke too soon. The job I interfered with above has finished. It looks OK in the Console (Run finished successfully) but no new job has arrived yet and the log still shows it as a running.log.

stderr shows:
20:02:37 +0100 2016-05-06 [INFO] New Job Starting
20:02:37 +0100 2016-05-06 [INFO] Condor JobID: 297910
20:02:42 +0100 2016-05-06 [INFO] MCPlots JobID: 30109085
20:39:32 +0100 2016-05-06 [INFO] Job finished with .
(Which looks OK)

StarterLog shows:
05/06/16 20:02:37 Running job as user nobody
05/06/16 20:02:37 Create_Process succeeded, pid=17651

There is no mention of the Suspend/Resume (and no other entries) in the period 20:02 - 20:31, but after the Boinc exit/restart, even though the job APPEARED to be unaffected, it doesn't look like it enjoyed the Restart. The log continues:
05/06/16 20:31:18 condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
05/06/16 20:31:18 IO: Failed to read packet header
05/06/16 20:36:19 condor_write(): Socket closed when trying to write 409 bytes to <188.184.187.167:9618>, fd is 11
05/06/16 20:36:19 Buf::write(): condor_write() failed
05/06/16 20:39:32 Process exited, pid=17651, status=0
05/06/16 20:39:32 About to exec Post script: /var/lib/condor/execute/dir_17647/tarOutput.sh 2016-542355-247
05/06/16 20:39:32 Create_Process succeeded, pid=22628
05/06/16 20:39:33 Process exited, pid=22628, status=0
05/06/16 20:39:33 condor_write(): Socket closed when trying to write 583 bytes to <188.184.187.167:9618>, fd is 11
05/06/16 20:39:33 Buf::write(): condor_write() failed
05/06/16 20:39:35 condor_write(): Socket closed when trying to write 365 bytes to <188.184.187.167:9618>, fd is 11
05/06/16 20:39:35 Buf::write(): condor_write() failed
05/06/16 20:39:35 Failed to send job exit status to shadow
05/06/16 20:39:35 JobExit() failed, waiting for job lease to expire or for a reconnect attempt

Still no connection 20 mins later, so I attempted another Boinc restart to see if that would force a retry. The Console opens to the same "Run finished successfully" message, but now I can't access the logs. I didn't know how long it would sit like that, so I Reset the VM in VBox. It started up OK but no joy, and the task timed out while I wasn't looking.

So.... Suspend/Resume OK but Exit/Restart not.

The new replacement Task didn't get any jobs either and timed out, so maybe they've all gone, but it's too late in the day for any more fiddling anyway. Hope the above is understandable and useful.

---------

5 failed jobs from that host today, so I'll maybe check that figure before doing any further interference tomorrow. Other hosts are full of CMS Tasks just now. |
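For anyone wanting to pull the connection problems out of a long StarterLog, here is a minimal Python sketch. It assumes the `MM/DD/YY HH:MM:SS` timestamp prefix seen in the excerpt above; the list of "failure" substrings is simply copied from that excerpt and is an assumption, not an official list of Condor error messages. The function name `connection_failures` is made up for illustration.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the StarterLog excerpt, e.g. "05/06/16 20:31:18 ".
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")

# Substrings copied from the quoted log; treating them as signs of a lost
# connection to the shadow is an assumption made for this sketch.
FAILURE_MARKERS = (
    "condor_read() failed",
    "condor_write() failed",
    "Socket closed",
    "Failed to send job exit status",
)

def connection_failures(log_text):
    """Return (timestamp, message) pairs for lines that hint at a lost connection."""
    hits = []
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip lines without the expected timestamp prefix
        stamp = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
        message = match.group(2)
        if any(marker in message for marker in FAILURE_MARKERS):
            hits.append((stamp, message))
    return hits

# Four lines taken from the StarterLog excerpt in the post above.
SAMPLE = """\
05/06/16 20:02:37 Create_Process succeeded, pid=17651
05/06/16 20:31:18 condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
05/06/16 20:36:19 Buf::write(): condor_write() failed
05/06/16 20:39:35 Failed to send job exit status to shadow
"""

for stamp, message in connection_failures(SAMPLE):
    print(stamp.isoformat(sep=" "), "|", message)
```

The first flagged timestamp gives a rough idea of when the connection dropped, which can then be compared with the time of the Boinc exit/restart.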
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18 |
"Suspend/Resume should now work up to 60 mins." Tested with LAIM on -> 55 minutes OK. The resumed job ran on, and after the upload a new job was received. |
Send message Joined: 13 Apr 15 Posts: 138 Credit: 2,969,210 RAC: 9 |
74 mins suspended and all's well after Resume, so hopefully that's that sorted, although testing with LAIM off might be useful too, just for completeness. Not tried Exit/Restart again. Might do that later today when I've some time to watch closely. |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
Should be good up to 3 hours now. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18 |
"Should be good up to 3 hours now." Should be, but isn't: http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=172162 I resumed the VM after a 2.5-hour suspend with LAIM on. It processed a few hundred events and then:
2016-05-09 20:02:31 (5912): Guest Log: [INFO] Condor exited with return value 99.
2016-05-09 20:02:31 (5912): Guest Log: [INFO] Shutting Down. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18 |
Retested with a 65-minute pause and LAIM on. The task ran on after the resume, and after the job finished it seemingly failed to upload the result:
05/11/16 10:27:39 Process exited, pid=4334, status=0
05/11/16 10:27:39 About to exec Post script: /var/lib/condor/execute/dir_4330/tarOutput.sh /var/spool/job-staging/2016-563005-252.run
05/11/16 10:27:39 Create_Process succeeded, pid=10097
05/11/16 10:27:39 Process exited, pid=10097, status=2
05/11/16 10:27:39 ReliSock::put_file_with_permissions(): Failed to stat file '/var/lib/condor/execute/dir_4330/2016-563005-252.tgz': No such file or directory (errno: 2, si_error: 1)
05/11/16 10:27:39 DoUpload: (Condor error code 13, subcode 2) STARTER at 10.0.2.15 failed to send file(s) to <188.184.187.167:9618>: error reading from /var/lib/condor/execute/dir_4330/2016-563005-252.tgz: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <XXX.XXX.XXX.XXX:50862>
05/11/16 10:27:39 JICShadow::notifyJobTermination(): Sending mock terminate event.
05/11/16 10:27:40 JIC::transferOutput() failed, waiting for job lease to expire or for a reconnect attempt
05/11/16 10:27:40 Got SIGQUIT. Performing fast shutdown.
05/11/16 10:27:40 ShutdownFast all jobs.

Condor was restarted and a new job ran, but that result also failed to upload. Logs available.

Edit: A task with a shorter pause (22 min) also has a problem after the job finished.
05/11/16 11:18:52 Process exited, pid=26782, status=0
05/11/16 11:18:52 About to exec Post script: /var/lib/condor/execute/dir_26778/tarOutput.sh /var/spool/job-staging/2016-570169-252.run
05/11/16 11:18:52 Create_Process succeeded, pid=28285
05/11/16 11:18:52 Process exited, pid=28285, status=2
05/11/16 11:18:52 condor_write(): Socket closed when trying to write 582 bytes to <188.184.187.167:9618>, fd is 11
05/11/16 11:18:52 Buf::write(): condor_write() failed
05/11/16 11:18:53 ReliSock::put_file_with_permissions(): Failed to stat file '/var/lib/condor/execute/dir_26778/2016-570169-252.tgz': No such file or directory (errno: 2, si_error: 1)
05/11/16 11:18:53 DoUpload: (Condor error code 13, subcode 2) STARTER at 10.0.2.15 failed to send file(s) to <188.184.187.167:9618>: error reading from /var/lib/condor/execute/dir_26778/2016-570169-252.tgz: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <XXX.XXX.XXX.XXX:51321>
05/11/16 11:18:53 JICShadow::notifyJobTermination(): Sending mock terminate event.
05/11/16 11:18:53 condor_write(): Socket closed when trying to write 316 bytes to <188.184.187.167:9618>, fd is 11
05/11/16 11:18:53 Buf::write(): condor_write() failed
05/11/16 11:18:53 ERROR "Assertion ERROR on (result)" at line 342 in file /slots/01/dir_57518/userdir/src/condor_starter.V6.1/NTsenders.cpp
05/11/16 11:18:53 condor_write(): Socket closed when trying to write 198 bytes to <188.184.187.167:9618>, fd is 11
05/11/16 11:18:53 Buf::write(): condor_write() failed
05/11/16 11:18:53 ERROR "Assertion ERROR on (result)" at line 902 in file /slots/01/dir_57518/userdir/src/condor_starter.V6.1/NTsenders.cpp |
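In both excerpts above the job itself exits with status=0, while the process started right after the "About to exec Post script" line exits with status=2, just before the missing .tgz is reported. A small Python sketch for spotting such non-zero exits in a StarterLog could look like this. It relies only on the "Process exited, pid=..., status=..." wording quoted above; the function name `nonzero_exits` is hypothetical.

```python
import re

# Wording copied from the StarterLog excerpts quoted in the post above.
EXIT_RE = re.compile(r"Process exited, pid=(\d+), status=(\d+)")

def nonzero_exits(log_text):
    """Return {pid: status} for every 'Process exited' entry with a non-zero status."""
    return {
        int(pid): int(status)
        for pid, status in EXIT_RE.findall(log_text)
        if int(status) != 0
    }

# Lines taken from the first excerpt (65-minute pause test).
SAMPLE = (
    "05/11/16 10:27:39 Process exited, pid=4334, status=0\n"
    "05/11/16 10:27:39 About to exec Post script: /var/lib/condor/execute/dir_4330/tarOutput.sh "
    "/var/spool/job-staging/2016-563005-252.run\n"
    "05/11/16 10:27:39 Create_Process succeeded, pid=10097\n"
    "05/11/16 10:27:39 Process exited, pid=10097, status=2\n"
)

print(nonzero_exits(SAMPLE))
```

Because the pattern is searched across the whole text, it also works on logs that were pasted as one long line.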
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18 |
The VM mentioned in my previous post idled for over an hour after its last job: no new job, and no forced shutdown either. To get rid of this laziness, I rebooted the VM. After the reboot a new job started. |
Send message Joined: 12 Sep 14 Posts: 65 Credit: 544 RAC: 0 |
We are working on the current problem with uploading results. |
Send message Joined: 12 Sep 14 Posts: 65 Credit: 544 RAC: 0 |
"We are working on the current problem with uploading results." Problem now solved - system should be working well again. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18 |
"We are working on the current problem with uploading results." Confirmed! Thanks, Ben. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18 |
"Should be good up to 3 hours now." A short test with a 61-minute pause. The job ran on after the resume and the next jobs are also running fine. The only info found was in MasterLog:
05/12/16 15:18:50 Preen pid is 4752
05/12/16 15:18:50 DefaultReaper unexpectedly called on pid 4752, status 0. |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
I successfully did a suspend/resume overnight (8h). What is the acceptable target we should aim for? |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18 |
"I successfully did a suspend/resume overnight (8h). What is the acceptable target we should aim for?" Not everyone is in the habit of starting BOINC before breakfast, so job/task survival of pauses of up to 12 hours would be fine in my opinion. |
Send message Joined: 13 Apr 15 Posts: 138 Credit: 2,969,210 RAC: 9 |
Are we still just talking about Suspend/Resume or do jobs now survive a Boinc Exit? I've not been able to do any testing in that regard this week and it'll be evening before I'll be able to try. |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
It depends what happens on a BOINC exit. If it saves to disk then it might work but I haven't tested it. My tests have been related to just pausing the VM. The StarterLog is most informative if it doesn't work. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18 |
If the BOINC client is exited normally, a snapshot of the VM will be saved in the slots/X/boinc_xxxxxxxxxxxxxxxx/Snapshots directory. The same happens when you suspend a task with LAIM off. However, when you have several VMs running under BOINC's control, all those VMs should be saved to disk within 60 seconds; otherwise tasks saved too late will be treated as a computation error. Resuming the task, and thus restoring the VM from disk, is working fine, but I haven't tested it recently for longer periods of inactivity. I'm testing it now. Which pause length shall I test first before doing the overnight test? |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Disconnecting the internet for 30 min still causes:
- the job to finish and upload
- no new job to be started

05/13/16 18:29:04 ERROR: SECMAN:2003:TCP connection to collector alicondor01.cern.ch failed.

is repeated over and over. The first message of this kind appeared in start.log 30 sec after the disconnect. This indicates that even a very short internet disconnect causes the task to hang (forever?) |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18 |
I suspended the task with LAIM off for 3 hours and 5 minutes. After the resume the VM was shut down within one minute. http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=173962
2016-05-13 19:12:01 (7368): VM Completion Message: Condor exited with return value 99. |
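Both long-pause failures reported in this thread (2.5 hours with LAIM on, 3 hours 5 minutes with LAIM off) end with the same "Condor exited with return value 99" message in the task's stderr. A minimal Python sketch for pulling that value out of a result page's stderr text is shown below. It assumes only the message wording quoted in the posts; the function name `condor_return_values` is made up, and the sketch says nothing about what a particular return value means.

```python
import re

# Message wording as quoted in the two failure reports in this thread.
RETURN_RE = re.compile(r"Condor exited with return value (\d+)")

def condor_return_values(stderr_text):
    """Return every Condor return value mentioned in the given stderr text."""
    return [int(value) for value in RETURN_RE.findall(stderr_text)]

# The two lines come from two different tasks; they are combined here only
# to show that both the "Guest Log" and "VM Completion Message" forms match.
SAMPLE = (
    "2016-05-09 20:02:31 (5912): Guest Log: [INFO] Condor exited with return value 99.\n"
    "2016-05-13 19:12:01 (7368): VM Completion Message: Condor exited with return value 99.\n"
)

print(condor_return_values(SAMPLE))
```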