Suspend/Resume Theory

Author	Message
Ray Murray Send message Joined: 13 Apr 15 Posts: 138 Credit: 2,952,515 RAC: 374	Message 3354 - Posted: 13 May 2016, 22:29:14 UTC Exited Boinc to allow VMs to save before applying some Windows updates. On host restart, one host gracefully finished the Task it had been running for 14hrs (lost connection and over 12hrs so expected behaviour) On another host I watched more closely, 10hrs or so, running job 9. Continued with that job to completion but from StarterLog at the time of Exit/Reboot/Restart: 05/13/16 20:00:43 condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>. 05/13/16 20:00:43 IO: Failed to read packet header 05/13/16 20:05:43 condor_write(): Socket closed when trying to write 409 bytes to <188.184.187.167:9618>, fd is 11 05/13/16 20:05:43 Buf::write(): condor_write() failed every 5mins until the job finished then: 05/13/16 21:15:05 Process exited, pid=3005, status=0 05/13/16 21:15:05 About to exec Post script: /var/lib/condor/execute/dir_3001/tarOutput.sh 2016-562437-256 05/13/16 21:15:05 Create_Process succeeded, pid=13146 05/13/16 21:15:05 Process exited, pid=13146, status=0 05/13/16 21:15:05 condor_write(): Socket closed when trying to write 583 bytes to <188.184.187.167:9618>, fd is 11 05/13/16 21:15:05 Buf::write(): condor_write() failed 05/13/16 21:15:06 condor_write(): Socket closed when trying to write 364 bytes to <188.184.187.167:9618>, fd is 11 05/13/16 21:15:06 Buf::write(): condor_write() failed 05/13/16 21:15:06 Failed to send job exit status to shadow 05/13/16 21:15:06 JobExit() failed, waiting for job lease to expire or for a reconnect attempt (don't know if it has indeed been marked as a failure as it's not been updated in M^cPlots yet. then nothing for 30mins (no messages, no activity) until: 05/13/16 21:45:52 condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector alicondor01.cern.ch. 05/13/16 21:45:52 IO: Failed to read packet header 05/13/16 21:45:52 CCBListener: failed to receive message from CCB server alicondor01.cern.ch 05/13/16 21:45:52 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds. 05/13/16 21:46:53 CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#45652 05/13/16 21:50:30 Accepted request to reconnect from <188.184.187.167:9618> 05/13/16 21:50:30 Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_2333> 05/13/16 21:50:30 Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_2333> 05/13/16 21:50:30 Job cleanup finished, now Starter is exiting 05/13/16 21:50:30 ** condor_starter (condor_STARTER) pid 3001 EXITING WITH STATUS 0 05/13/16 21:50:32 ************************************************** 05/13/16 21:50:32 condor_starter (CONDOR_STARTER) STARTING UP and we're back in business running a new job. I watched that job finish successfully but there looked like there was going to be another delay in starting the next job but as the Task was, by then older than 12hrs, it terminated, so it maybe didn't fully recover from the earlier interruption. At least there's no loss of work. Replacement Task purring along nicely so I'll leave it in peace. Off to bed. Up early for golf. ID: 3354 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1188 Credit: 854,809 RAC: 17	Message 3355 - Posted: 14 May 2016, 5:31:53 UTC 2 tasks and with them 2 inside jobs did not survive the overnight pausing. Both were ended by the gracefully shutdown. http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=173846 http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=174391 ID: 3355 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1188 Credit: 854,809 RAC: 17	Message 3359 - Posted: 14 May 2016, 10:02:35 UTC Last modified: 14 May 2016, 10:04:34 UTC Short pauses with LAIM off doesn't upload the result successful. Although the job runs to its end after the resume, it will not end properly. 05/14/16 11:12:44 JobExit() failed, waiting for job lease to expire or for a reconnect attempt Sometimes a new CONDOR is started directly after that fail, but a new CONDOR start can also be long in coming. 05/14/16 11:12:44 JobExit() failed, waiting for job lease to expire or for a reconnect attempt 05/14/16 11:56:18 condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector alicondor01.cern.ch. ID: 3359 · Rating: 0 · rate: / Reply Quote

Laurence Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 2	Message 3360 - Posted: 14 May 2016, 19:41:28 UTC - in response to Message 3359. Yes, when suspending the VM rather than pausing the VM I am seeing different a behaviour. We already knew this but now I have a test VM ready to investigate. This issues seems quite important and we must be able to survive both restarting boinc and restarting the computing. ID: 3360 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 3361 - Posted: 14 May 2016, 20:22:25 UTC I think, it would be better to do damage limitation rather than spending too much time in trying to recover the job in progress. It is far more important to continue running without too much of a delay. If that means, to dismiss the current job, so be it. (best, of course, if it can be finished) ID: 3361 · Rating: 0 · rate: / Reply Quote

Laurence Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 2	Message 3393 - Posted: 19 May 2016, 12:31:25 UTC - in response to Message 3359. With new tasks, suspend/resume should work for at least 1 hour with LAIM off or on. Please try everything, eg closing boinc and rebooting the machine. If there are any problems, please let us know. The StarterLog contains the most useful information on this. ID: 3393 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 3394 - Posted: 19 May 2016, 13:29:58 UTC I tried starting up 3 tasks. The first two failed, third succeeded. Then it started the fourth(actually the second, as the first two failed)and did not download another one as QUOTA kicked in. http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=177362 ID: 3394 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1188 Credit: 854,809 RAC: 17	Message 3405 - Posted: 19 May 2016, 16:27:19 UTC - in response to Message 3393. First short suspend including LAIM off, BOINC exit, shutdown and host cold restart - Job resumed fine. Contents of StarterLog after the boot: 05/19/16 18:18:51 CCBListener: no activity from CCB server in 1923s; assuming connection is dead. 05/19/16 18:18:51 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds. 05/19/16 18:18:51 condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>. 05/19/16 18:18:51 IO: Failed to read packet header 05/19/16 18:19:51 CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#51378 05/19/16 18:19:54 Accepted request to reconnect from <188.184.187.167:9618> 05/19/16 18:19:54 Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_4489> 05/19/16 18:19:54 Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_4489> ID: 3405 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1188 Credit: 854,809 RAC: 17	Message 3407 - Posted: 19 May 2016, 18:53:10 UTC Second test with a suspend (LAIM off) period just over 1 hour. 05/19/16 18:40:37 Create_Process succeeded, pid=16261 05/19/16 19:45:34 CCBListener: no activity from CCB server in 3898s; assuming connection is dead. 05/19/16 19:45:34 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds. 05/19/16 19:45:34 condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>. 05/19/16 19:45:34 IO: Failed to read packet header 05/19/16 19:46:35 CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#51426 05/19/16 19:50:35 condor_write(): Socket closed when trying to write 410 bytes to <188.184.187.167:9618>, fd is 11 05/19/16 19:50:35 Buf::write(): condor_write() failed 05/19/16 19:51:08 Accepted request to reconnect from <188.184.187.167:9618> 05/19/16 19:51:08 Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_4489> 05/19/16 19:51:08 Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_4489> Job went on from the position where it was saved to disk before. ID: 3407 · Rating: 0 · rate: / Reply Quote

Laurence Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 2	Message 3413 - Posted: 20 May 2016, 7:37:38 UTC - in response to Message 3407. That output is what we expect to see when it notices that contact has been lost and then reconnects. Overnight I successfully tested a suspend/resume of 14 hours. The job carried on from where it left off, ended successfully and a new job runs straight afterwards. ID: 3413 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1188 Credit: 854,809 RAC: 17	Message 3416 - Posted: 20 May 2016, 9:12:56 UTC I didn't dare to test for such a long period, but from next pausing the job survived well not surprising you now: 05/20/16 11:02:32 CCBListener: no activity from CCB server in 5527s; assuming connection is dead. 05/20/16 11:02:32 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds. 05/20/16 11:02:32 condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>. 05/20/16 11:02:32 IO: Failed to read packet header 05/20/16 11:03:33 CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#52217 05/20/16 11:05:40 Accepted request to reconnect from <188.184.187.167:9618> 05/20/16 11:05:40 Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_4905> ID: 3416 · Rating: 0 · rate: / Reply Quote

Laurence Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 2	Message 3425 - Posted: 20 May 2016, 14:54:54 UTC - in response to Message 3416. Last modified: 20 May 2016, 14:55:13 UTC I wasn't planning on testing for that long but it failed to automatically resume after 5 hours and I only found out when I woke up :) ID: 3425 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1188 Credit: 854,809 RAC: 17	Message 3432 - Posted: 20 May 2016, 20:32:10 UTC - in response to Message 3425. Can't read the StarterLog, cause the VM is shutdown and gone 1 minute after the resume: vLHCathome-dev 20 May 15:52:26 task Theory_26070_1463692400.735061_0 suspended by user vLHCathome-dev 20 May 22:22:04 task Theory_26070_1463692400.735061_0 resumed by user vLHCathome-dev 20 May 22:23:38 Computation for task Theory_26070_1463692400.735061_0 finished Task: http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=177705 saying: 2016-05-20 22:23:03 (6936): Guest Log: [INFO] Condor exited with return value 99. ID: 3432 · Rating: 0 · rate: / Reply Quote

Laurence Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 2	Message 3435 - Posted: 20 May 2016, 21:02:20 UTC - in response to Message 3432. I have just made a change so that the log files will be copied to the shared directory before the machine shuts down so that they will be available on the host when the VM terminates. Give it a few hours for the update to propagate and see if you can reproduce this. ID: 3435 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 3436 - Posted: 20 May 2016, 21:08:19 UTC - in response to Message 3435. Last modified: 20 May 2016, 21:10:56 UTC I have just made a change so that the log files will be copied to the shared directory before the machine shuts down so that they will be available on the host when the VM terminates. If it gets a new task,would it not write over it?It usually reuses the same slot, when a Theory-task finishes and a new one starts. Therfore, one would have to suspend waiting Theory-tasks and disable work fetch temporarily. EDIT: attempted shutting down boinc for 100min(LAIM OFF), seemed to resume,but did not manage the reconnect to the shadow. ID: 3436 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1188 Credit: 854,809 RAC: 17	Message 3437 - Posted: 21 May 2016, 5:57:14 UTC - in response to Message 3435. I have just made a change so that the log files will be copied to the shared directory before the machine shuts down so that they will be available on the host when the VM terminates. I'll have to use my 'dawn'-method ;) What Rasputin said: the whole slot will be cleared including the shared folder when BOINC's task finish. ID: 3437 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1188 Credit: 854,809 RAC: 17	Message 3446 - Posted: 21 May 2016, 12:49:42 UTC - in response to Message 3435. Give it a few hours for the update to propagate and see if you can reproduce this. Reproducible! No new info added after the resume in Starterlog, but lines after the resume in StartLog: 05/21/16 14:33:22 CCBListener: no activity from CCB server in 19928s; assuming connection is dead. 05/21/16 14:33:22 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds. and lines after the resume in MasterLog: 05/21/16 14:33:22 condor_write(): Socket closed when trying to write 1368 bytes to collector alicondor01.cern.ch, fd is 8, errno=104 Connection reset by peer 05/21/16 14:33:22 Buf::write(): condor_write() failed 05/21/16 14:33:22 CCBListener: no activity from CCB server in 19929s; assuming connection is dead. 05/21/16 14:33:22 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds. 05/21/16 14:33:22 ERROR: Child pid 4276 appears hung! Killing it hard. 05/21/16 14:33:22 DaemonCore: Can't receive command request from 10.0.2.15 (perhaps a timeout?) 05/21/16 14:33:22 DefaultReaper unexpectedly called on pid 4276, status 9. 05/21/16 14:33:22 The STARTD (pid 4276) was killed because it was no longer responding 05/21/16 14:33:22 Sending obituary for "/usr/sbin/condor_startd" 05/21/16 14:33:22 restarting /usr/sbin/condor_startd in 10 seconds 05/21/16 14:33:22 The DaemonShutdown expression "(STARTD_StartTime =?= 0)" evaluated to TRUE: starting graceful shutdown 05/21/16 14:33:22 Got SIGTERM. Performing graceful shutdown. 05/21/16 14:33:43 SECMAN: required authentication with collector alicondor01.cern.ch failed, so aborting command INVALIDATE_MASTER_ADS. 05/21/16 14:33:43 ERROR: AUTHENTICATE:1005:Failed to securely exchange session key\|AUTHENTICATE:1004:Failed to authenticate using FS 05/21/16 14:33:43 Failed to send update to collector alicondor01.cern.ch. 05/21/16 14:33:43 All daemons are gone. Exiting. 05/21/16 14:33:43 **** condor_master (condor_MASTER) pid 4267 EXITING WITH STATUS 99 Shutdown should be forced by vboxwrapper, but I avoided that by expecting a 'shutdawn' file (no typo). So the logs were saved in the shared folder. All logs complete available. I returned the task to the server for examination: http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=177691 ID: 3446 · Rating: 0 · rate: / Reply Quote

Laurence Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 2	Message 3448 - Posted: 21 May 2016, 13:57:20 UTC - in response to Message 3446. Thanks for the information, I think that I understand the issue. The master is killing the children as it thinks they have hung due to the time skip caused by the suspend. The NOT_RESPONDING_TIMEOUT has been increased from 1 to 24 hours. Please try again. ID: 3448 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 3449 - Posted: 21 May 2016, 17:28:37 UTC Does this mean, that the upload is completed successfully? Disk usage: 2000 Kb CPU usage: 2853 s Clean tmp ... Run finished successfully Or how can i tell? ID: 3449 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 3450 - Posted: 21 May 2016, 17:36:16 UTC I shut down boinc (LAIM OFF) and then the computer. Total suspend time from 17.33 to 18.43 local time for the task(1hour10min). Jobs appears to have have resumed and finished. Next job started. 05/21/16 17:26:23 Setting job's virtual memory rlimit to 0 megabytes 05/21/16 17:26:23 Running job as user nobody 05/21/16 17:26:23 Create_Process succeeded, pid=10768 05/21/16 18:44:27 CCBListener: no activity from CCB server in 4377s; assuming connection is dead. 05/21/16 18:44:27 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds. 05/21/16 18:44:27 condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>. 05/21/16 18:44:27 IO: Failed to read packet header 05/21/16 18:45:28 CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#53558 05/21/16 18:48:20 Accepted request to reconnect from <188.184.187.167:9618> 05/21/16 18:48:20 Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_5596> 05/21/16 18:48:20 Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_5596> 05/21/16 19:11:44 Process exited, pid=10768, status=0 05/21/16 19:11:44 About to exec Post script: /var/lib/condor/execute/dir_10764/tarOutput.sh 2016-547956-268 05/21/16 19:11:44 Create_Process succeeded, pid=14966 05/21/16 19:11:44 Process exited, pid=14966, status=0 05/21/16 19:11:45 Got SIGQUIT. Performing fast shutdown. 05/21/16 19:11:45 ShutdownFast all jobs. 05/21/16 19:11:45 ** condor_starter (condor_STARTER) pid 10764 EXITING WITH STATUS 0 05/21/16 19:11:46 ************************************************** 05/21/16 19:11:46 condor_starter (CONDOR_STARTER) STARTING UP 05/21/16 19:11:46 ** /usr/sbin/condor_starter ID: 3450 · Rating: 0 · rate: / Reply Quote

Development for LHC@home