Message boards : Theory Application : Suspend/Resume Theory
Message board moderation

To post messages, you must log in.

Previous · 1 · 2 · 3 · 4 · Next

AuthorMessage
Profile Ray Murray
Avatar

Send message
Joined: 13 Apr 15
Posts: 138
Credit: 2,945,852
RAC: 0
Message 3354 - Posted: 13 May 2016, 22:29:14 UTC

Exited Boinc to allow VMs to save before applying some Windows updates.
On host restart, one host gracefully finished the Task it had been running for 14hrs (lost connection and over 12hrs so expected behaviour)
On another host I watched more closely,
10hrs or so, running job 9.
Continued with that job to completion but from StarterLog at the time of Exit/Reboot/Restart:
05/13/16 20:00:43 condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
05/13/16 20:00:43 IO: Failed to read packet header
05/13/16 20:05:43 condor_write(): Socket closed when trying to write 409 bytes to <188.184.187.167:9618>, fd is 11
05/13/16 20:05:43 Buf::write(): condor_write() failed

every 5mins until the job finished then:

05/13/16 21:15:05 Process exited, pid=3005, status=0
05/13/16 21:15:05 About to exec Post script: /var/lib/condor/execute/dir_3001/tarOutput.sh 2016-562437-256
05/13/16 21:15:05 Create_Process succeeded, pid=13146
05/13/16 21:15:05 Process exited, pid=13146, status=0
05/13/16 21:15:05 condor_write(): Socket closed when trying to write 583 bytes to <188.184.187.167:9618>, fd is 11
05/13/16 21:15:05 Buf::write(): condor_write() failed
05/13/16 21:15:06 condor_write(): Socket closed when trying to write 364 bytes to <188.184.187.167:9618>, fd is 11
05/13/16 21:15:06 Buf::write(): condor_write() failed
05/13/16 21:15:06 Failed to send job exit status to shadow
05/13/16 21:15:06 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
(don't know if it has indeed been marked as a failure as it's not been updated in McPlots yet.

then nothing for 30mins (no messages, no activity) until:

05/13/16 21:45:52 condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector alicondor01.cern.ch.
05/13/16 21:45:52 IO: Failed to read packet header
05/13/16 21:45:52 CCBListener: failed to receive message from CCB server alicondor01.cern.ch
05/13/16 21:45:52 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
05/13/16 21:46:53 CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#45652
05/13/16 21:50:30 Accepted request to reconnect from <188.184.187.167:9618>
05/13/16 21:50:30 Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_2333>
05/13/16 21:50:30 Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_2333>
05/13/16 21:50:30 Job cleanup finished, now Starter is exiting
05/13/16 21:50:30 **** condor_starter (condor_STARTER) pid 3001 EXITING WITH STATUS 0
05/13/16 21:50:32 ******************************************************
05/13/16 21:50:32 ** condor_starter (CONDOR_STARTER) STARTING UP

and we're back in business running a new job. I watched that job finish successfully but there looked like there was going to be another delay in starting the next job but as the Task was, by then older than 12hrs, it terminated, so it maybe didn't fully recover from the earlier interruption.

At least there's no loss of work.

Replacement Task purring along nicely so I'll leave it in peace.
Off to bed. Up early for golf.
ID: 3354 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1185
Credit: 850,198
RAC: 343
Message 3355 - Posted: 14 May 2016, 5:31:53 UTC

2 tasks and with them 2 inside jobs did not survive the overnight pausing.
Both were ended by the gracefully shutdown.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=173846
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=174391
ID: 3355 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1185
Credit: 850,198
RAC: 343
Message 3359 - Posted: 14 May 2016, 10:02:35 UTC
Last modified: 14 May 2016, 10:04:34 UTC

Short pauses with LAIM off doesn't upload the result successful.
Although the job runs to its end after the resume, it will not end properly.

05/14/16 11:12:44 JobExit() failed, waiting for job lease to expire or for a reconnect attempt

Sometimes a new CONDOR is started directly after that fail, but a new CONDOR start can also be long in coming.

05/14/16 11:12:44 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
05/14/16 11:56:18 condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector alicondor01.cern.ch.
ID: 3359 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1067
Credit: 330,928
RAC: 105
Message 3360 - Posted: 14 May 2016, 19:41:28 UTC - in response to Message 3359.  

Yes, when suspending the VM rather than pausing the VM I am seeing different a behaviour. We already knew this but now I have a test VM ready to investigate. This issues seems quite important and we must be able to survive both restarting boinc and restarting the computing.
ID: 3360 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3361 - Posted: 14 May 2016, 20:22:25 UTC

I think, it would be better to do damage limitation rather than spending too much time in trying to recover the job in progress.
It is far more important to continue running without too much of a delay.

If that means, to dismiss the current job, so be it.
(best, of course, if it can be finished)
ID: 3361 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1067
Credit: 330,928
RAC: 105
Message 3393 - Posted: 19 May 2016, 12:31:25 UTC - in response to Message 3359.  

With new tasks, suspend/resume should work for at least 1 hour with LAIM off or on. Please try everything, eg closing boinc and rebooting the machine. If there are any problems, please let us know. The StarterLog contains the most useful information on this.
ID: 3393 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3394 - Posted: 19 May 2016, 13:29:58 UTC

I tried starting up 3 tasks.
The first two failed, third succeeded.
Then it started the fourth(actually the second, as the first two failed)and did not download another one as QUOTA kicked in.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=177362
ID: 3394 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1185
Credit: 850,198
RAC: 343
Message 3405 - Posted: 19 May 2016, 16:27:19 UTC - in response to Message 3393.  

First short suspend including LAIM off, BOINC exit, shutdown and host cold restart - Job resumed fine.
Contents of StarterLog after the boot:

05/19/16 18:18:51 CCBListener: no activity from CCB server in 1923s; assuming connection is dead.
05/19/16 18:18:51 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
05/19/16 18:18:51 condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
05/19/16 18:18:51 IO: Failed to read packet header
05/19/16 18:19:51 CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#51378
05/19/16 18:19:54 Accepted request to reconnect from <188.184.187.167:9618>
05/19/16 18:19:54 Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_4489>
05/19/16 18:19:54 Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_4489>
ID: 3405 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1185
Credit: 850,198
RAC: 343
Message 3407 - Posted: 19 May 2016, 18:53:10 UTC

Second test with a suspend (LAIM off) period just over 1 hour.

05/19/16 18:40:37 Create_Process succeeded, pid=16261
05/19/16 19:45:34 CCBListener: no activity from CCB server in 3898s; assuming connection is dead.
05/19/16 19:45:34 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
05/19/16 19:45:34 condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
05/19/16 19:45:34 IO: Failed to read packet header
05/19/16 19:46:35 CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#51426
05/19/16 19:50:35 condor_write(): Socket closed when trying to write 410 bytes to <188.184.187.167:9618>, fd is 11
05/19/16 19:50:35 Buf::write(): condor_write() failed
05/19/16 19:51:08 Accepted request to reconnect from <188.184.187.167:9618>
05/19/16 19:51:08 Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_4489>
05/19/16 19:51:08 Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_4489>


Job went on from the position where it was saved to disk before.
ID: 3407 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1067
Credit: 330,928
RAC: 105
Message 3413 - Posted: 20 May 2016, 7:37:38 UTC - in response to Message 3407.  

That output is what we expect to see when it notices that contact has been lost and then reconnects.

Overnight I successfully tested a suspend/resume of 14 hours. The job carried on from where it left off, ended successfully and a new job runs straight afterwards.
ID: 3413 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1185
Credit: 850,198
RAC: 343
Message 3416 - Posted: 20 May 2016, 9:12:56 UTC

I didn't dare to test for such a long period, but from next pausing the job survived well not surprising you now:

05/20/16 11:02:32 CCBListener: no activity from CCB server in 5527s; assuming connection is dead.
05/20/16 11:02:32 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
05/20/16 11:02:32 condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
05/20/16 11:02:32 IO: Failed to read packet header
05/20/16 11:03:33 CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#52217
05/20/16 11:05:40 Accepted request to reconnect from <188.184.187.167:9618>
05/20/16 11:05:40 Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_4905>
ID: 3416 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1067
Credit: 330,928
RAC: 105
Message 3425 - Posted: 20 May 2016, 14:54:54 UTC - in response to Message 3416.  
Last modified: 20 May 2016, 14:55:13 UTC

I wasn't planning on testing for that long but it failed to automatically resume after 5 hours and I only found out when I woke up :)
ID: 3425 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1185
Credit: 850,198
RAC: 343
Message 3432 - Posted: 20 May 2016, 20:32:10 UTC - in response to Message 3425.  

Can't read the StarterLog, cause the VM is shutdown and gone 1 minute after the resume:

vLHCathome-dev 20 May 15:52:26 task Theory_26070_1463692400.735061_0 suspended by user
vLHCathome-dev 20 May 22:22:04 task Theory_26070_1463692400.735061_0 resumed by user
vLHCathome-dev 20 May 22:23:38 Computation for task Theory_26070_1463692400.735061_0 finished


Task: http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=177705 saying:

2016-05-20 22:23:03 (6936): Guest Log: [INFO] Condor exited with return value 99.
ID: 3432 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1067
Credit: 330,928
RAC: 105
Message 3435 - Posted: 20 May 2016, 21:02:20 UTC - in response to Message 3432.  

I have just made a change so that the log files will be copied to the shared directory before the machine shuts down so that they will be available on the host when the VM terminates.

Give it a few hours for the update to propagate and see if you can reproduce this.
ID: 3435 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3436 - Posted: 20 May 2016, 21:08:19 UTC - in response to Message 3435.  
Last modified: 20 May 2016, 21:10:56 UTC

I have just made a change so that the log files will be copied to the shared directory before the machine shuts down so that they will be available on the host when the VM terminates.


If it gets a new task,would it not write over it?It usually reuses the same slot, when a Theory-task finishes and a new one starts.
Therfore, one would have to suspend waiting Theory-tasks and disable work fetch temporarily.


EDIT: attempted shutting down boinc for 100min(LAIM OFF), seemed to resume,but did not manage the reconnect to the shadow.
ID: 3436 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1185
Credit: 850,198
RAC: 343
Message 3437 - Posted: 21 May 2016, 5:57:14 UTC - in response to Message 3435.  

I have just made a change so that the log files will be copied to the shared directory before the machine shuts down so that they will be available on the host when the VM terminates.

I'll have to use my 'dawn'-method ;)
What Rasputin said: the whole slot will be cleared including the shared folder when BOINC's task finish.
ID: 3437 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1185
Credit: 850,198
RAC: 343
Message 3446 - Posted: 21 May 2016, 12:49:42 UTC - in response to Message 3435.  

Give it a few hours for the update to propagate and see if you can reproduce this.

Reproducible!

No new info added after the resume in Starterlog, but lines after the resume in StartLog:

05/21/16 14:33:22 CCBListener: no activity from CCB server in 19928s; assuming connection is dead.
05/21/16 14:33:22 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.


and lines after the resume in MasterLog:

05/21/16 14:33:22 condor_write(): Socket closed when trying to write 1368 bytes to collector alicondor01.cern.ch, fd is 8, errno=104 Connection reset by peer
05/21/16 14:33:22 Buf::write(): condor_write() failed
05/21/16 14:33:22 CCBListener: no activity from CCB server in 19929s; assuming connection is dead.
05/21/16 14:33:22 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
05/21/16 14:33:22 ERROR: Child pid 4276 appears hung! Killing it hard.
05/21/16 14:33:22 DaemonCore: Can't receive command request from 10.0.2.15 (perhaps a timeout?)
05/21/16 14:33:22 DefaultReaper unexpectedly called on pid 4276, status 9.
05/21/16 14:33:22 The STARTD (pid 4276) was killed because it was no longer responding
05/21/16 14:33:22 Sending obituary for "/usr/sbin/condor_startd"
05/21/16 14:33:22 restarting /usr/sbin/condor_startd in 10 seconds
05/21/16 14:33:22 The DaemonShutdown expression "(STARTD_StartTime =?= 0)" evaluated to TRUE: starting graceful shutdown
05/21/16 14:33:22 Got SIGTERM. Performing graceful shutdown.
05/21/16 14:33:43 SECMAN: required authentication with collector alicondor01.cern.ch failed, so aborting command INVALIDATE_MASTER_ADS.
05/21/16 14:33:43 ERROR: AUTHENTICATE:1005:Failed to securely exchange session key|AUTHENTICATE:1004:Failed to authenticate using FS
05/21/16 14:33:43 Failed to send update to collector alicondor01.cern.ch.
05/21/16 14:33:43 All daemons are gone. Exiting.
05/21/16 14:33:43 **** condor_master (condor_MASTER) pid 4267 EXITING WITH STATUS 99


Shutdown should be forced by vboxwrapper, but I avoided that by expecting a 'shutdawn' file (no typo).
So the logs were saved in the shared folder. All logs complete available.
I returned the task to the server for examination: http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=177691
ID: 3446 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1067
Credit: 330,928
RAC: 105
Message 3448 - Posted: 21 May 2016, 13:57:20 UTC - in response to Message 3446.  

Thanks for the information, I think that I understand the issue. The master is killing the children as it thinks they have hung due to the time skip caused by the suspend. The NOT_RESPONDING_TIMEOUT has been increased from 1 to 24 hours. Please try again.
ID: 3448 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3449 - Posted: 21 May 2016, 17:28:37 UTC

Does this mean, that the upload is completed successfully?

Disk usage: 2000 Kb

CPU usage: 2853 s

Clean tmp ...

Run finished successfully



Or how can i tell?
ID: 3449 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3450 - Posted: 21 May 2016, 17:36:16 UTC

I shut down boinc (LAIM OFF) and then the computer.
Total suspend time from 17.33 to 18.43 local time for the task(1hour10min).
Jobs appears to have have resumed and finished. Next job started.

05/21/16 17:26:23 Setting job's virtual memory rlimit to 0 megabytes
05/21/16 17:26:23 Running job as user nobody
05/21/16 17:26:23 Create_Process succeeded, pid=10768
05/21/16 18:44:27 CCBListener: no activity from CCB server in 4377s; assuming connection is dead.
05/21/16 18:44:27 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
05/21/16 18:44:27 condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
05/21/16 18:44:27 IO: Failed to read packet header
05/21/16 18:45:28 CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#53558
05/21/16 18:48:20 Accepted request to reconnect from <188.184.187.167:9618>
05/21/16 18:48:20 Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_5596>
05/21/16 18:48:20 Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_5596>
05/21/16 19:11:44 Process exited, pid=10768, status=0
05/21/16 19:11:44 About to exec Post script: /var/lib/condor/execute/dir_10764/tarOutput.sh 2016-547956-268
05/21/16 19:11:44 Create_Process succeeded, pid=14966
05/21/16 19:11:44 Process exited, pid=14966, status=0
05/21/16 19:11:45 Got SIGQUIT. Performing fast shutdown.
05/21/16 19:11:45 ShutdownFast all jobs.
05/21/16 19:11:45 **** condor_starter (condor_STARTER) pid 10764 EXITING WITH STATUS 0
05/21/16 19:11:46 ******************************************************
05/21/16 19:11:46 ** condor_starter (CONDOR_STARTER) STARTING UP
05/21/16 19:11:46 ** /usr/sbin/condor_starter
ID: 3450 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Previous · 1 · 2 · 3 · 4 · Next

Message boards : Theory Application : Suspend/Resume Theory


©2024 CERN