Message boards : Theory Application : Suspend/Resume Theory
Message board moderation
Previous · 1 · 2 · 3 · 4 · Next
Author | Message |
---|---|
![]() ![]() Send message Joined: 13 Apr 15 Posts: 138 Credit: 2,969,210 RAC: 0 ![]() |
Exited Boinc to allow VMs to save before applying some Windows updates. On host restart, one host gracefully finished the Task it had been running for 14hrs (lost connection and over 12hrs so expected behaviour) On another host I watched more closely, 10hrs or so, running job 9. Continued with that job to completion but from StarterLog at the time of Exit/Reboot/Restart: 05/13/16 20:00:43 condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>. 05/13/16 20:00:43 IO: Failed to read packet header 05/13/16 20:05:43 condor_write(): Socket closed when trying to write 409 bytes to <188.184.187.167:9618>, fd is 11 05/13/16 20:05:43 Buf::write(): condor_write() failed every 5mins until the job finished then: 05/13/16 21:15:05 Process exited, pid=3005, status=0 05/13/16 21:15:05 About to exec Post script: /var/lib/condor/execute/dir_3001/tarOutput.sh 2016-562437-256 05/13/16 21:15:05 Create_Process succeeded, pid=13146 05/13/16 21:15:05 Process exited, pid=13146, status=0 05/13/16 21:15:05 condor_write(): Socket closed when trying to write 583 bytes to <188.184.187.167:9618>, fd is 11 05/13/16 21:15:05 Buf::write(): condor_write() failed 05/13/16 21:15:06 condor_write(): Socket closed when trying to write 364 bytes to <188.184.187.167:9618>, fd is 11 05/13/16 21:15:06 Buf::write(): condor_write() failed 05/13/16 21:15:06 Failed to send job exit status to shadow 05/13/16 21:15:06 JobExit() failed, waiting for job lease to expire or for a reconnect attempt (don't know if it has indeed been marked as a failure as it's not been updated in McPlots yet. then nothing for 30mins (no messages, no activity) until: 05/13/16 21:45:52 condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector alicondor01.cern.ch. 05/13/16 21:45:52 IO: Failed to read packet header 05/13/16 21:45:52 CCBListener: failed to receive message from CCB server alicondor01.cern.ch 05/13/16 21:45:52 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds. 05/13/16 21:46:53 CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#45652 05/13/16 21:50:30 Accepted request to reconnect from <188.184.187.167:9618> 05/13/16 21:50:30 Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_2333> 05/13/16 21:50:30 Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_2333> 05/13/16 21:50:30 Job cleanup finished, now Starter is exiting 05/13/16 21:50:30 **** condor_starter (condor_STARTER) pid 3001 EXITING WITH STATUS 0 05/13/16 21:50:32 ****************************************************** 05/13/16 21:50:32 ** condor_starter (CONDOR_STARTER) STARTING UP and we're back in business running a new job. I watched that job finish successfully but there looked like there was going to be another delay in starting the next job but as the Task was, by then older than 12hrs, it terminated, so it maybe didn't fully recover from the earlier interruption. At least there's no loss of work. Replacement Task purring along nicely so I'll leave it in peace. Off to bed. Up early for golf. |
Send message Joined: 13 Feb 15 Posts: 1217 Credit: 906,186 RAC: 1,479 ![]() ![]() ![]() |
2 tasks and with them 2 inside jobs did not survive the overnight pausing. Both were ended by the gracefully shutdown. http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=173846 http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=174391 |
Send message Joined: 13 Feb 15 Posts: 1217 Credit: 906,186 RAC: 1,479 ![]() ![]() ![]() |
Short pauses with LAIM off doesn't upload the result successful. Although the job runs to its end after the resume, it will not end properly. 05/14/16 11:12:44 JobExit() failed, waiting for job lease to expire or for a reconnect attempt Sometimes a new CONDOR is started directly after that fail, but a new CONDOR start can also be long in coming. 05/14/16 11:12:44 JobExit() failed, waiting for job lease to expire or for a reconnect attempt 05/14/16 11:56:18 condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector alicondor01.cern.ch. |
![]() ![]() Send message Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 23 ![]() |
Yes, when suspending the VM rather than pausing the VM I am seeing different a behaviour. We already knew this but now I have a test VM ready to investigate. This issues seems quite important and we must be able to survive both restarting boinc and restarting the computing. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 ![]() |
I think, it would be better to do damage limitation rather than spending too much time in trying to recover the job in progress. It is far more important to continue running without too much of a delay. If that means, to dismiss the current job, so be it. (best, of course, if it can be finished) |
![]() ![]() Send message Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 23 ![]() |
With new tasks, suspend/resume should work for at least 1 hour with LAIM off or on. Please try everything, eg closing boinc and rebooting the machine. If there are any problems, please let us know. The StarterLog contains the most useful information on this. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 ![]() |
I tried starting up 3 tasks. The first two failed, third succeeded. Then it started the fourth(actually the second, as the first two failed)and did not download another one as QUOTA kicked in. http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=177362 |
Send message Joined: 13 Feb 15 Posts: 1217 Credit: 906,186 RAC: 1,479 ![]() ![]() ![]() |
First short suspend including LAIM off, BOINC exit, shutdown and host cold restart - Job resumed fine. Contents of StarterLog after the boot: 05/19/16 18:18:51 CCBListener: no activity from CCB server in 1923s; assuming connection is dead. 05/19/16 18:18:51 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds. 05/19/16 18:18:51 condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>. 05/19/16 18:18:51 IO: Failed to read packet header 05/19/16 18:19:51 CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#51378 05/19/16 18:19:54 Accepted request to reconnect from <188.184.187.167:9618> 05/19/16 18:19:54 Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_4489> 05/19/16 18:19:54 Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_4489> |
Send message Joined: 13 Feb 15 Posts: 1217 Credit: 906,186 RAC: 1,479 ![]() ![]() ![]() |
Second test with a suspend (LAIM off) period just over 1 hour. 05/19/16 18:40:37 Create_Process succeeded, pid=16261 05/19/16 19:45:34 CCBListener: no activity from CCB server in 3898s; assuming connection is dead. 05/19/16 19:45:34 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds. 05/19/16 19:45:34 condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>. 05/19/16 19:45:34 IO: Failed to read packet header 05/19/16 19:46:35 CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#51426 05/19/16 19:50:35 condor_write(): Socket closed when trying to write 410 bytes to <188.184.187.167:9618>, fd is 11 05/19/16 19:50:35 Buf::write(): condor_write() failed 05/19/16 19:51:08 Accepted request to reconnect from <188.184.187.167:9618> 05/19/16 19:51:08 Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_4489> 05/19/16 19:51:08 Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_4489> Job went on from the position where it was saved to disk before. |
![]() ![]() Send message Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 23 ![]() |
That output is what we expect to see when it notices that contact has been lost and then reconnects. Overnight I successfully tested a suspend/resume of 14 hours. The job carried on from where it left off, ended successfully and a new job runs straight afterwards. |
Send message Joined: 13 Feb 15 Posts: 1217 Credit: 906,186 RAC: 1,479 ![]() ![]() ![]() |
I didn't dare to test for such a long period, but from next pausing the job survived well not surprising you now: 05/20/16 11:02:32 CCBListener: no activity from CCB server in 5527s; assuming connection is dead. 05/20/16 11:02:32 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds. 05/20/16 11:02:32 condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>. 05/20/16 11:02:32 IO: Failed to read packet header 05/20/16 11:03:33 CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#52217 05/20/16 11:05:40 Accepted request to reconnect from <188.184.187.167:9618> 05/20/16 11:05:40 Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_4905> |
![]() ![]() Send message Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 23 ![]() |
I wasn't planning on testing for that long but it failed to automatically resume after 5 hours and I only found out when I woke up :) |
Send message Joined: 13 Feb 15 Posts: 1217 Credit: 906,186 RAC: 1,479 ![]() ![]() ![]() |
Can't read the StarterLog, cause the VM is shutdown and gone 1 minute after the resume: vLHCathome-dev 20 May 15:52:26 task Theory_26070_1463692400.735061_0 suspended by user vLHCathome-dev 20 May 22:22:04 task Theory_26070_1463692400.735061_0 resumed by user vLHCathome-dev 20 May 22:23:38 Computation for task Theory_26070_1463692400.735061_0 finished Task: http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=177705 saying: 2016-05-20 22:23:03 (6936): Guest Log: [INFO] Condor exited with return value 99. |
![]() ![]() Send message Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 23 ![]() |
I have just made a change so that the log files will be copied to the shared directory before the machine shuts down so that they will be available on the host when the VM terminates. Give it a few hours for the update to propagate and see if you can reproduce this. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 ![]() |
I have just made a change so that the log files will be copied to the shared directory before the machine shuts down so that they will be available on the host when the VM terminates. If it gets a new task,would it not write over it?It usually reuses the same slot, when a Theory-task finishes and a new one starts. Therfore, one would have to suspend waiting Theory-tasks and disable work fetch temporarily. EDIT: attempted shutting down boinc for 100min(LAIM OFF), seemed to resume,but did not manage the reconnect to the shadow. |
Send message Joined: 13 Feb 15 Posts: 1217 Credit: 906,186 RAC: 1,479 ![]() ![]() ![]() |
I have just made a change so that the log files will be copied to the shared directory before the machine shuts down so that they will be available on the host when the VM terminates. I'll have to use my 'dawn'-method ;) What Rasputin said: the whole slot will be cleared including the shared folder when BOINC's task finish. |
Send message Joined: 13 Feb 15 Posts: 1217 Credit: 906,186 RAC: 1,479 ![]() ![]() ![]() |
Give it a few hours for the update to propagate and see if you can reproduce this. Reproducible! No new info added after the resume in Starterlog, but lines after the resume in StartLog: 05/21/16 14:33:22 CCBListener: no activity from CCB server in 19928s; assuming connection is dead. 05/21/16 14:33:22 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds. and lines after the resume in MasterLog: 05/21/16 14:33:22 condor_write(): Socket closed when trying to write 1368 bytes to collector alicondor01.cern.ch, fd is 8, errno=104 Connection reset by peer 05/21/16 14:33:22 Buf::write(): condor_write() failed 05/21/16 14:33:22 CCBListener: no activity from CCB server in 19929s; assuming connection is dead. 05/21/16 14:33:22 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds. 05/21/16 14:33:22 ERROR: Child pid 4276 appears hung! Killing it hard. 05/21/16 14:33:22 DaemonCore: Can't receive command request from 10.0.2.15 (perhaps a timeout?) 05/21/16 14:33:22 DefaultReaper unexpectedly called on pid 4276, status 9. 05/21/16 14:33:22 The STARTD (pid 4276) was killed because it was no longer responding 05/21/16 14:33:22 Sending obituary for "/usr/sbin/condor_startd" 05/21/16 14:33:22 restarting /usr/sbin/condor_startd in 10 seconds 05/21/16 14:33:22 The DaemonShutdown expression "(STARTD_StartTime =?= 0)" evaluated to TRUE: starting graceful shutdown 05/21/16 14:33:22 Got SIGTERM. Performing graceful shutdown. 05/21/16 14:33:43 SECMAN: required authentication with collector alicondor01.cern.ch failed, so aborting command INVALIDATE_MASTER_ADS. 05/21/16 14:33:43 ERROR: AUTHENTICATE:1005:Failed to securely exchange session key|AUTHENTICATE:1004:Failed to authenticate using FS 05/21/16 14:33:43 Failed to send update to collector alicondor01.cern.ch. 05/21/16 14:33:43 All daemons are gone. Exiting. 05/21/16 14:33:43 **** condor_master (condor_MASTER) pid 4267 EXITING WITH STATUS 99 Shutdown should be forced by vboxwrapper, but I avoided that by expecting a 'shutdawn' file (no typo). So the logs were saved in the shared folder. All logs complete available. I returned the task to the server for examination: http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=177691 |
![]() ![]() Send message Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 23 ![]() |
Thanks for the information, I think that I understand the issue. The master is killing the children as it thinks they have hung due to the time skip caused by the suspend. The NOT_RESPONDING_TIMEOUT has been increased from 1 to 24 hours. Please try again. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 ![]() |
Does this mean, that the upload is completed successfully?
Or how can i tell? |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 ![]() |
I shut down boinc (LAIM OFF) and then the computer. Total suspend time from 17.33 to 18.43 local time for the task(1hour10min). Jobs appears to have have resumed and finished. Next job started. 05/21/16 17:26:23 Setting job's virtual memory rlimit to 0 megabytes 05/21/16 17:26:23 Running job as user nobody 05/21/16 17:26:23 Create_Process succeeded, pid=10768 05/21/16 18:44:27 CCBListener: no activity from CCB server in 4377s; assuming connection is dead. 05/21/16 18:44:27 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds. 05/21/16 18:44:27 condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>. 05/21/16 18:44:27 IO: Failed to read packet header 05/21/16 18:45:28 CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#53558 05/21/16 18:48:20 Accepted request to reconnect from <188.184.187.167:9618> 05/21/16 18:48:20 Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_5596> 05/21/16 18:48:20 Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_5596> 05/21/16 19:11:44 Process exited, pid=10768, status=0 05/21/16 19:11:44 About to exec Post script: /var/lib/condor/execute/dir_10764/tarOutput.sh 2016-547956-268 05/21/16 19:11:44 Create_Process succeeded, pid=14966 05/21/16 19:11:44 Process exited, pid=14966, status=0 05/21/16 19:11:45 Got SIGQUIT. Performing fast shutdown. 05/21/16 19:11:45 ShutdownFast all jobs. 05/21/16 19:11:45 **** condor_starter (condor_STARTER) pid 10764 EXITING WITH STATUS 0 05/21/16 19:11:46 ****************************************************** 05/21/16 19:11:46 ** condor_starter (CONDOR_STARTER) STARTING UP 05/21/16 19:11:46 ** /usr/sbin/condor_starter |
©2025 CERN