Message boards : Theory Application : Suspend/Resume Theory

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 9
Message 3287 - Posted: 6 May 2016, 16:02:45 UTC - in response to Message 3286.  

Suspend/Resume should now work up to 60 mins.

Testing without Theory jobs running inside the VM makes no sense IMO.

We are out of jobs: http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=209

Is it not possible to create a steady flow without manual intervention?
ID: 3287 · Rating: 0
rbpeake
Joined: 15 Apr 15
Posts: 38
Credit: 227,251
RAC: 0
Message 3288 - Posted: 6 May 2016, 17:28:20 UTC - in response to Message 3287.  

For CMS, there are no new jobs following an exit and resume.
ID: 3288 · Rating: 0
Ray Murray
Joined: 13 Apr 15
Posts: 138
Credit: 2,969,210
RAC: 0
Message 3289 - Posted: 6 May 2016, 20:42:06 UTC
Last modified: 6 May 2016, 21:06:20 UTC

Theory Task running a little over an hour. 4 jobs completed.
Job in progress 37900 events. LAIM set.
Suspended in Boinc for 13 mins. VM showed Paused in VBox.
Resumed Task.
Picked up where it left off without any delay 8¬)

5 mins later, job at 59700, Exited Boinc.
VM saved in VBox.
Restarted Boinc. A minute or so to Restore and reconnect Console. Job continuing unaffected.

Woo Hoo. Good job, guys.

The standard swap-out time for Boinc to run another Project is 60 mins so it would be good if Tasks could be Suspended a little beyond that just to allow for any changeover delays.
--------------
Hmm, maybe spoke too soon.
Job I interfered with above has finished. Looks OK in Console (Run finished successfully) but no new job has arrived yet and the log still shows it as a running.log.
stderr shows:
20:02:37 +0100 2016-05-06 [INFO] New Job Starting
20:02:37 +0100 2016-05-06 [INFO] Condor JobID: 297910
20:02:42 +0100 2016-05-06 [INFO] MCPlots JobID: 30109085
20:39:32 +0100 2016-05-06 [INFO] Job finished with .
(Which looks OK)

StarterLog shows:
05/06/16 20:02:37 Running job as user nobody
05/06/16 20:02:37 Create_Process succeeded, pid=17651

No mention of the Suspend/Resume (and no other entries) in this time period 20:02 - 20:31 but after the Boinc exit/restart, even though the job APPEARED to be unaffected, it doesn't look like it enjoyed the Restart. The log continues:

05/06/16 20:31:18 condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
05/06/16 20:31:18 IO: Failed to read packet header
05/06/16 20:36:19 condor_write(): Socket closed when trying to write 409 bytes to <188.184.187.167:9618>, fd is 11
05/06/16 20:36:19 Buf::write(): condor_write() failed
05/06/16 20:39:32 Process exited, pid=17651, status=0
05/06/16 20:39:32 About to exec Post script: /var/lib/condor/execute/dir_17647/tarOutput.sh 2016-542355-247
05/06/16 20:39:32 Create_Process succeeded, pid=22628
05/06/16 20:39:33 Process exited, pid=22628, status=0
05/06/16 20:39:33 condor_write(): Socket closed when trying to write 583 bytes to <188.184.187.167:9618>, fd is 11
05/06/16 20:39:33 Buf::write(): condor_write() failed
05/06/16 20:39:35 condor_write(): Socket closed when trying to write 365 bytes to <188.184.187.167:9618>, fd is 11
05/06/16 20:39:35 Buf::write(): condor_write() failed
05/06/16 20:39:35 Failed to send job exit status to shadow
05/06/16 20:39:35 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
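The StarterLog excerpt above can be scanned mechanically for the moment the starter lost contact with the shadow. What follows is a minimal Python sketch, assuming the `MM/DD/YY HH:MM:SS message` layout seen in this post. The function name `find_shadow_errors` and the marker list are illustrative choices copied from the quoted lines; they are not part of HTCondor itself.

```python
import re
from datetime import datetime

# Marker substrings copied from the StarterLog lines quoted in this post.
# Assumption: these are enough to spot a lost starter/shadow connection.
ERROR_MARKERS = (
    "condor_read() failed",
    "condor_write() failed",
    "Failed to send job exit status to shadow",
    "JobExit() failed",
)

# A timestamp of the form "05/06/16 20:31:18" followed by the message text.
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def find_shadow_errors(lines):
    """Return (timestamp, message) pairs for lines matching an error marker."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not start with a timestamp
        stamp = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
        message = match.group(2)
        if any(marker in message for marker in ERROR_MARKERS):
            hits.append((stamp, message))
    return hits


# A few lines taken from the excerpt above.
SAMPLE = """\
05/06/16 20:02:37 Create_Process succeeded, pid=17651
05/06/16 20:31:18 condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
05/06/16 20:39:32 Process exited, pid=17651, status=0
05/06/16 20:39:35 Failed to send job exit status to shadow
""".splitlines()

if __name__ == "__main__":
    for stamp, message in find_shadow_errors(SAMPLE):
        print(stamp.isoformat(), message)
```

On the sample lines this flags the 20:31:18 read failure and the final failure to report the exit status, which brackets the period in which the job kept running without a connection.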

Still no connection 20 mins later so I attempted another Boinc restart to see if that would force a retry. Console opens to the same "Run finished successfully" message, but now I can't access the logs.
Didn't know how long it would sit like that so I've Reset the VM in VBox.
Started up OK but no joy and task timed-out while I wasn't looking.

So....
Suspend/Resume OK but Exit/Restart not.

New replacement Task didn't get any jobs either and timed-out so maybe they've all gone but it's too late in the day for any more fiddling anyway.
Hope the above is understandable and useful.
---------
5 failed jobs from that host today so I'll maybe check that figure before doing any further interference 2moro. Other hosts are full of CMS Tasks just now.
ID: 3289 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 9
Message 3291 - Posted: 7 May 2016, 18:49:34 UTC - in response to Message 3286.  

Suspend/Resume should now work up to 60 mins.

Tested with LAIM on -> 55 minutes OK

The resumed job runs on and after the upload a new job was received.
ID: 3291 · Rating: 0
Ray Murray
Joined: 13 Apr 15
Posts: 138
Credit: 2,969,210
RAC: 0
Message 3292 - Posted: 8 May 2016, 6:57:09 UTC - in response to Message 3291.  
Last modified: 8 May 2016, 6:59:54 UTC

74 mins suspended and all's well after Resume, so hopefully that's that sorted, although testing with LAIM off might be useful too, just for completeness.
Not tried Exit/Restart again. Might do that later 2day when I've some time to watch closely.
ID: 3292 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 3294 - Posted: 9 May 2016, 11:50:17 UTC - in response to Message 3292.  

Should be good up to 3 hours now.
ID: 3294 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 9
Message 3295 - Posted: 9 May 2016, 18:06:34 UTC - in response to Message 3294.  
Last modified: 9 May 2016, 18:07:24 UTC

Should be good up to 3 hours now.

Should be, but isn't:

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=172162

I resumed the VM after a suspend with LAIM on for 2.5 hours.
It processed a few hundred events and then:

2016-05-09 20:02:31 (5912): Guest Log: [INFO] Condor exited with return value 99.
2016-05-09 20:02:31 (5912): Guest Log: [INFO] Shutting Down.
ID: 3295 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 9
Message 3309 - Posted: 11 May 2016, 9:11:34 UTC
Last modified: 11 May 2016, 9:34:54 UTC

Retested with 65 minutes pausing and LAIM on.

The task runs on after resume, and after the job finished it seemingly failed to upload the result:

05/11/16 10:27:39 Process exited, pid=4334, status=0
05/11/16 10:27:39 About to exec Post script: /var/lib/condor/execute/dir_4330/tarOutput.sh /var/spool/job-staging/2016-563005-252.run
05/11/16 10:27:39 Create_Process succeeded, pid=10097
05/11/16 10:27:39 Process exited, pid=10097, status=2
05/11/16 10:27:39 ReliSock::put_file_with_permissions(): Failed to stat file '/var/lib/condor/execute/dir_4330/2016-563005-252.tgz': No such file or directory (errno: 2, si_error: 1)
05/11/16 10:27:39 DoUpload: (Condor error code 13, subcode 2) STARTER at 10.0.2.15 failed to send file(s) to <188.184.187.167:9618>: error reading from /var/lib/condor/execute/dir_4330/2016-563005-252.tgz: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <XXX.XXX.XXX.XXX:50862>
05/11/16 10:27:39 JICShadow::notifyJobTermination(): Sending mock terminate event.
05/11/16 10:27:40 JIC::transferOutput() failed, waiting for job lease to expire or for a reconnect attempt
05/11/16 10:27:40 Got SIGQUIT. Performing fast shutdown.
05/11/16 10:27:40 ShutdownFast all jobs.


Condor was restarted and a new job ran, but that result also failed to upload.

Logs available.

Edit: A task paused for a shorter time (22 min) also has a problem after the job finishes.

05/11/16 11:18:52 Process exited, pid=26782, status=0
05/11/16 11:18:52 About to exec Post script: /var/lib/condor/execute/dir_26778/tarOutput.sh /var/spool/job-staging/2016-570169-252.run
05/11/16 11:18:52 Create_Process succeeded, pid=28285
05/11/16 11:18:52 Process exited, pid=28285, status=2
05/11/16 11:18:52 condor_write(): Socket closed when trying to write 582 bytes to <188.184.187.167:9618>, fd is 11
05/11/16 11:18:52 Buf::write(): condor_write() failed
05/11/16 11:18:53 ReliSock::put_file_with_permissions(): Failed to stat file '/var/lib/condor/execute/dir_26778/2016-570169-252.tgz': No such file or directory (errno: 2, si_error: 1)
05/11/16 11:18:53 DoUpload: (Condor error code 13, subcode 2) STARTER at 10.0.2.15 failed to send file(s) to <188.184.187.167:9618>: error reading from /var/lib/condor/execute/dir_26778/2016-570169-252.tgz: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <XXX.XXX.XXX.XXX:51321>
05/11/16 11:18:53 JICShadow::notifyJobTermination(): Sending mock terminate event.
05/11/16 11:18:53 condor_write(): Socket closed when trying to write 316 bytes to <188.184.187.167:9618>, fd is 11
05/11/16 11:18:53 Buf::write(): condor_write() failed
05/11/16 11:18:53 ERROR "Assertion ERROR on (result)" at line 342 in file /slots/01/dir_57518/userdir/src/condor_starter.V6.1/NTsenders.cpp
05/11/16 11:18:53 condor_write(): Socket closed when trying to write 198 bytes to <188.184.187.167:9618>, fd is 11
05/11/16 11:18:53 Buf::write(): condor_write() failed
05/11/16 11:18:53 ERROR "Assertion ERROR on (result)" at line 902 in file /slots/01/dir_57518/userdir/src/condor_starter.V6.1/NTsenders.cpp
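In both excerpts above, the Post script `tarOutput.sh` exits with `status=2` just before the starter fails to find the `.tgz` file. That would be consistent with the archive never having been created, though nothing in this thread confirms the cause. Below is a minimal Python sketch for pulling that exit status out of a StarterLog. It assumes the line wording shown above, and the function name `post_script_status` is hypothetical.

```python
import re

# Patterns based on the StarterLog wording quoted in this post.
EXEC_RE = re.compile(r"About to exec Post script: (\S+)")
CREATE_RE = re.compile(r"Create_Process succeeded, pid=(\d+)")
EXIT_RE = re.compile(r"Process exited, pid=(\d+), status=(\d+)")


def post_script_status(lines):
    """Return (script_path, exit_status) for the first Post script, else None."""
    script = None  # path of the Post script once it has been announced
    pid = None     # pid of the process created right after the announcement
    for line in lines:
        if script is None:
            found = EXEC_RE.search(line)
            if found:
                script = found.group(1)
            continue
        if pid is None:
            found = CREATE_RE.search(line)
            if found:
                pid = found.group(1)
            continue
        found = EXIT_RE.search(line)
        if found and found.group(1) == pid:
            return script, int(found.group(2))
    return None


# Lines taken from the 22-minute test above.
SAMPLE = """\
05/11/16 11:18:52 Process exited, pid=26782, status=0
05/11/16 11:18:52 About to exec Post script: /var/lib/condor/execute/dir_26778/tarOutput.sh /var/spool/job-staging/2016-570169-252.run
05/11/16 11:18:52 Create_Process succeeded, pid=28285
05/11/16 11:18:52 Process exited, pid=28285, status=2
""".splitlines()

if __name__ == "__main__":
    print(post_script_status(SAMPLE))
```

Matching on the pid keeps the job's own exit (pid 26782, status 0) from being mistaken for the Post script's exit.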
ID: 3309 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 9
Message 3310 - Posted: 11 May 2016, 10:38:16 UTC

The VM mentioned in my previous post sat idle for over an hour after its last job.
No new job and also no forced shutdown.

To get rid of this laziness, I rebooted the VM.
After the reboot a new job started.
ID: 3310 · Rating: 0
Ben Segal
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 12 Sep 14
Posts: 65
Credit: 544
RAC: 0
Message 3311 - Posted: 11 May 2016, 11:09:13 UTC - in response to Message 3309.  

We are working on the current problem with uploading results.
ID: 3311 · Rating: 0
Ben Segal
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 12 Sep 14
Posts: 65
Credit: 544
RAC: 0
Message 3316 - Posted: 12 May 2016, 7:14:23 UTC - in response to Message 3311.  

We are working on the current problem with uploading results.

Problem now solved - system should be working well again.
ID: 3316 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 9
Message 3319 - Posted: 12 May 2016, 10:29:48 UTC - in response to Message 3316.  

We are working on the current problem with uploading results.

Problem now solved - system should be working well again.

Confirmed! Thanks, Ben.
ID: 3319 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 9
Message 3321 - Posted: 12 May 2016, 19:29:26 UTC - in response to Message 3294.  

Should be good up to 3 hours now.

A short test with 61 minutes pause.

The job runs on after the resume, and the following jobs are also running fine.
The only info found was in MasterLog:

05/12/16 15:18:50 Preen pid is 4752
05/12/16 15:18:50 DefaultReaper unexpectedly called on pid 4752, status 0.
ID: 3321 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 3333 - Posted: 13 May 2016, 9:54:37 UTC - in response to Message 3321.  

I successfully did a suspend/resume overnight (8h). What is the acceptable target we should aim for?
ID: 3333 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 9
Message 3336 - Posted: 13 May 2016, 10:32:34 UTC - in response to Message 3333.  
Last modified: 13 May 2016, 10:41:56 UTC

I successfully did a suspend/resume over night (8h). What is the acceptable target we should aim for?

Not everyone is in the habit of starting BOINC before breakfast, so job/task survival of up to 12 hours of pausing would be fine in my opinion.
ID: 3336 · Rating: 0
Ray Murray
Joined: 13 Apr 15
Posts: 138
Credit: 2,969,210
RAC: 0
Message 3338 - Posted: 13 May 2016, 12:47:01 UTC

Are we still just talking about Suspend/Resume or do jobs now survive a Boinc Exit? I've not been able to do any testing in that regard this week and it'll be evening before I'll be able to try.
ID: 3338 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 3341 - Posted: 13 May 2016, 12:51:26 UTC - in response to Message 3338.  

It depends what happens on a BOINC exit. If it saves to disk then it might work but I haven't tested it. My tests have been related to just pausing the VM. The StarterLog is most informative if it doesn't work.
ID: 3341 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 9
Message 3345 - Posted: 13 May 2016, 14:21:36 UTC

If the BOINC client is exited normally, a snapshot of the VM is saved in the slots/X/boinc_xxxxxxxxxxxxxxxx/Snapshots directory.
The same happens when you suspend a task with LAIM off.
However, when several VMs are running under BOINC's control, all of them must be saved to disk within 60 seconds; any task whose VM is saved too late is treated as a computation error.

Resuming the task, and thus restoring the VM from disk, works fine, but I haven't tested it recently for longer periods of inactivity.

I'm testing it now. Which period of pausing shall I test first before doing the overnight test?
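To check whether a snapshot was actually written after an exit or a suspend with LAIM off, the slot directories can be listed with a short script. This is a minimal Python sketch that assumes only the `slots/X/boinc_.../Snapshots` layout described above. The function name `find_snapshot_dirs` is hypothetical, and the BOINC data directory has to be supplied by the caller because its location differs between installations.

```python
import tempfile
from pathlib import Path


def find_snapshot_dirs(boinc_data_dir):
    """Return the sorted Snapshots directories under slots/<n>/boinc_*/."""
    root = Path(boinc_data_dir)
    return sorted(
        path for path in root.glob("slots/*/boinc_*/Snapshots") if path.is_dir()
    )


if __name__ == "__main__":
    # Demo on a throwaway tree; the directory names below are made up.
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        (base / "slots" / "0" / "boinc_0123456789abcdef" / "Snapshots").mkdir(parents=True)
        (base / "slots" / "1" / "boinc_fedcba9876543210").mkdir(parents=True)
        for found in find_snapshot_dirs(base):
            print(found.relative_to(base))
```

In the demo only slot 0 is reported, since slot 1 has a VM directory but no Snapshots folder.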
ID: 3345 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3348 - Posted: 13 May 2016, 17:12:51 UTC
Last modified: 13 May 2016, 17:16:12 UTC

Disconnecting the internet for 30 min still causes:
- the running job to finish and upload
- no new job to be started

05/13/16 18:29:04 ERROR: SECMAN:2003:TCP connection to collector alicondor01.cern.ch failed.
05/13/16 18:29:04 Failed to start non-blocking update to <188.184.129.127:9618>.
05/13/16 18:29:34 attempt to connect to <188.184.129.127:9618> failed: No route to host (connect errno = 113).
05/13/16 18:29:34 ERROR: SECMAN:2003:TCP connection to collector alicondor01.cern.ch failed.
05/13/16 18:29:34 Failed to start non-blocking update to <188.184.129.127:9618>.


This is repeated over and over.

The first of these messages appeared in start.log 30 sec after the disconnect.

This indicates that even a very short internet disconnect causes the task to hang (forever?).
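The two `SECMAN:2003` errors quoted above are 30 seconds apart. A minimal Python sketch for measuring that interval over a longer log follows, assuming every line starts with a 17-character `MM/DD/YY HH:MM:SS` timestamp as in the excerpt. The function name `error_gaps_seconds` is hypothetical.

```python
from datetime import datetime

# Marker copied from the log lines quoted in this post.
MARKER = "ERROR: SECMAN:2003"


def error_gaps_seconds(lines, marker=MARKER):
    """Return the gaps in seconds between consecutive lines containing marker."""
    stamps = [
        datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S")
        for line in lines
        if marker in line
    ]
    return [
        (later - earlier).total_seconds()
        for earlier, later in zip(stamps, stamps[1:])
    ]


# Lines taken from the excerpt above.
SAMPLE = """\
05/13/16 18:29:04 ERROR: SECMAN:2003:TCP connection to collector alicondor01.cern.ch failed.
05/13/16 18:29:04 Failed to start non-blocking update to <188.184.129.127:9618>.
05/13/16 18:29:34 attempt to connect to <188.184.129.127:9618> failed: No route to host (connect errno = 113).
05/13/16 18:29:34 ERROR: SECMAN:2003:TCP connection to collector alicondor01.cern.ch failed.
""".splitlines()

if __name__ == "__main__":
    print(error_gaps_seconds(SAMPLE))
```

Run over a full log, a list of steady gaps that never ends would support the "repeated over and over" observation, while a final gap followed by normal entries would show when the connection came back.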
ID: 3348 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 9
Message 3350 - Posted: 13 May 2016, 17:17:48 UTC

I suspended the task with LAIM off for 3 hours and 5 minutes.
After the resume the VM was shut down within one minute.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=173962

2016-05-13 19:12:01 (7368): VM Completion Message: Condor exited with return value 99.
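The same "Condor exited with return value 99" message appears here and in the 2.5-hour LAIM-on test earlier in the thread, once as a Guest Log line and once as a VM Completion Message. Below is a minimal Python sketch for picking the value out of either form in a task's output. The function name `condor_return_values` is hypothetical, and the thread does not say what the value 99 means.

```python
import re

# Wording copied from the two task outputs quoted in this thread.
RETURN_RE = re.compile(r"Condor exited with return value (\d+)")


def condor_return_values(lines):
    """Collect every Condor return value reported in the given log lines."""
    values = []
    for line in lines:
        found = RETURN_RE.search(line)
        if found:
            values.append(int(found.group(1)))
    return values


SAMPLE = [
    "2016-05-09 20:02:31 (5912): Guest Log: [INFO] Condor exited with return value 99.",
    "2016-05-09 20:02:31 (5912): Guest Log: [INFO] Shutting Down.",
    "2016-05-13 19:12:01 (7368): VM Completion Message: Condor exited with return value 99.",
]

if __name__ == "__main__":
    print(condor_return_values(SAMPLE))
```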
ID: 3350 · Rating: 0

©2024 CERN