Message boards : Theory Application : Suspend/Resume Theory
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18
Retested it, but no success yet. Maybe I'll have to test it with a fresh task/VM tomorrow.

05/21/16 23:22:41 CCBListener: no activity from CCB server in 20183s; assuming connection is dead.
05/21/16 23:22:41 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
05/21/16 23:22:41 ERROR: Child pid 4287 appears hung! Killing it hard.
05/21/16 23:22:41 condor_write(): Socket closed when trying to write 23 bytes to <10.0.2.15:53982>, fd is 13, errno=104 Connection reset by peer
05/21/16 23:22:41 Buf::write(): condor_write() failed
05/21/16 23:22:41 AUTHENTICATE: handshake failed!
05/21/16 23:22:41 DC_AUTHENTICATE: required authentication of 10.0.2.15 failed: AUTHENTICATE:1002:Failure performing handshake|AUTHENTICATE:1004:Failed to authenticate using FS
05/21/16 23:22:41 DefaultReaper unexpectedly called on pid 4287, status 9.
05/21/16 23:22:41 The STARTD (pid 4287) was killed because it was no longer responding
05/21/16 23:22:41 Sending obituary for "/usr/sbin/condor_startd"
05/21/16 23:22:41 restarting /usr/sbin/condor_startd in 10 seconds
05/21/16 23:22:41 The DaemonShutdown expression "(STARTD_StartTime =?= 0)" evaluated to TRUE: starting graceful shutdown
05/21/16 23:22:41 condor_write(): Socket closed when trying to write 1362 bytes to collector alicondor01.cern.ch, fd is 10, errno=104 Connection reset by peer
05/21/16 23:22:41 Buf::write(): condor_write() failed
05/21/16 23:22:41 Got SIGTERM. Performing graceful shutdown.
05/21/16 23:22:42 All daemons are gone. Exiting.
05/21/16 23:22:42 **** condor_master (condor_MASTER) pid 4268 EXITING WITH STATUS 99
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18
A second task I started yesterday; I also tried to let that one survive overnight, without success. No lines were added to the StarterLog. The last lines just before the suspend and after the resume this morning, from the StartLog:

05/21/16 22:49:50 Changing activity: Idle -> Busy
05/21/16 23:20:04 condor_write(): Socket closed when trying to write 2849 bytes to collector alicondor01.cern.ch, fd is 9, errno=104 Connection reset by peer
05/21/16 23:20:04 Buf::write(): condor_write() failed
05/22/16 07:06:58 CCBListener: no activity from CCB server in 28028s; assuming connection is dead.
05/22/16 07:06:58 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
05/22/16 07:06:58 ERROR: Child pid 24802 appears hung! Killing it hard.
05/22/16 07:06:58 Starter pid 24802 died on signal 9 (signal 9 (Killed))

Lines from the MasterLog after the resume:

05/22/16 07:06:58 CCBListener: no activity from CCB server in 28290s; assuming connection is dead.
05/22/16 07:06:58 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
05/22/16 07:06:58 condor_write(): Socket closed when trying to write 1371 bytes to collector alicondor01.cern.ch, fd is 10, errno=104 Connection reset by peer
05/22/16 07:06:58 Buf::write(): condor_write() failed
05/22/16 07:06:58 ERROR: Child pid 4286 appears hung! Killing it hard.
05/22/16 07:06:58 condor_read() failed: recv(fd=13) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <10.0.2.15:37612>.
05/22/16 07:06:58 IO: Failed to read packet header
05/22/16 07:06:58 AUTHENTICATE: handshake failed!
05/22/16 07:06:58 DC_AUTHENTICATE: required authentication of 10.0.2.15 failed: AUTHENTICATE:1002:Failure performing handshake
05/22/16 07:06:58 DefaultReaper unexpectedly called on pid 4286, status 9.
05/22/16 07:06:58 The STARTD (pid 4286) was killed because it was no longer responding
05/22/16 07:06:58 Sending obituary for "/usr/sbin/condor_startd"
05/22/16 07:06:58 restarting /usr/sbin/condor_startd in 10 seconds
05/22/16 07:06:58 The DaemonShutdown expression "(STARTD_StartTime =?= 0)" evaluated to TRUE: starting graceful shutdown
05/22/16 07:06:59 Got SIGTERM. Performing graceful shutdown.
05/22/16 07:06:59 All daemons are gone. Exiting.
05/22/16 07:06:59 **** condor_master (condor_MASTER) pid 4268 EXITING WITH STATUS 99

I suspended a new task started this morning.
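For anyone who wants to dig out the same lines, a minimal sketch of how the Condor daemon logs can be inspected from the VM console follows; it assumes the logs sit in the usual /var/log/condor directory, which may differ in this VM image:

# Assumed log location inside the VM; check with 'condor_config_val LOG' if unsure
LOGDIR=/var/log/condor
# Show the last lines written by each daemon around the suspend/resume
for f in MasterLog StartLog StarterLog; do
    echo "=== $f ==="
    tail -n 20 "$LOGDIR/$f"
done
# Pick out the CCB reconnect messages that show up after a resume
grep -h "CCBListener" "$LOGDIR"/MasterLog "$LOGDIR"/StartLog | tail -n 10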
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18
I suspended a new task started this morning.

I resumed the task after 5.5 hours of pausing (LAIM off). This time the job restarted processing the events (thumbs up), and hopefully a new job will also arrive after the current one is finished. From the StarterLog:

05/22/16 07:15:13 Create_Process succeeded, pid=4327
05/22/16 12:52:05 CCBListener: no activity from CCB server in 20212s; assuming connection is dead.
05/22/16 12:52:05 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
05/22/16 12:52:05 condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
05/22/16 12:52:05 IO: Failed to read packet header
05/22/16 12:53:06 CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#53910
05/22/16 12:57:06 condor_write(): Socket closed when trying to write 404 bytes to <188.184.187.167:9618>, fd is 8
05/22/16 12:57:06 Buf::write(): condor_write() failed
05/22/16 12:57:19 Accepted request to reconnect from <188.184.187.167:9618>
05/22/16 12:57:19 Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_5780>
05/22/16 12:57:19 Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_5780>

What's next? Surviving the overnight host shutdown . . .
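As an aside, the same pause/resume test can also be driven from the command line with boinccmd; a minimal sketch, where the task name is a placeholder to be taken from the --get_tasks output and the project URL may need adjusting to the installation:

# Find the name of the running Theory task
boinccmd --get_tasks | grep "name:"
# Suspend just that task (task name below is a placeholder)
boinccmd --task http://lhcathomedev.cern.ch/vLHCathome-dev/ <task_name> suspend
# ...later on, resume it again
boinccmd --task http://lhcathomedev.cern.ch/vLHCathome-dev/ <task_name> resume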
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18
I resumed the task after 5.5 hours of pausing (LAIM off).

An interesting observation, though maybe just coincidence: the task mentioned above and another task that ran for more than 12 hours without pausing were both gracefully shut down after more than 12 hours of wall-clock time. The same happened to the task that was paused for 5.5 hours, although it had a total run time of only 6 hours and 54 minutes.
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=177660
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=177449
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18
What's next? Surviving the overnight host shutdown . . .

The jobs survived this long suspension period:
2016-05-22 22:34:00 (6548): Successfully stopped VM.
2016-05-23 07:21:58 (3692): vboxwrapper (7.7.26184): starting

The saved job resumed, ended successfully, and a new task started. After that 2nd task had uploaded, the 12 hours of wall-clock time (including the paused time) were over and a graceful shutdown of the VM was performed. That the paused time counts towards the elapsed time is no surprise; I already mentioned it in the previous post. The 2nd task only had time to complete the saved job before the 12 hours of wall-clock time were over.
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=178718
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=178690
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18
A longer suspend period of about 8 hours did not resume 100% successfully. At least 1 job (out of 4) did not recover. From the logs:

07/30/16 07:49:42 Error: can't find resource with ClaimId (<10.0.2.15:45490>#1469796145#3#...) -- perhaps this claim was already removed?
07/30/16 07:49:42 Error: problem finding resource for 403 (DEACTIVATE_CLAIM)
07/30/16 07:49:43 Starter pid 202391 exited with status 0
07/30/16 07:49:43 slot4: State change: starter exited
07/30/16 07:49:43 slot4: Changing activity: Busy -> Idle
07/30/16 07:49:43 slot4: State change: START is false
07/30/16 07:49:43 slot4: Changing state and activity: Claimed/Idle -> Preempting/Vacating
07/30/16 07:49:43 slot4: State change: No preempting claim, returning to owner
07/30/16 07:49:43 slot4: Changing state and activity: Preempting/Vacating -> Owner/Idle
07/30/16 07:49:43 slot4: State change: IS_OWNER is false
07/30/16 07:49:43 slot4: Changing state: Owner -> Unclaimed
07/30/16 07:49:43 Error: can't find resource with ClaimId (<10.0.2.15:45490>#1469796145#4#...) -- perhaps this claim was already removed?
07/30/16 07:49:43 Error: problem finding resource for 404 (DEACTIVATE_CLAIM_FORCIBLY)

07/30/16 07:44:29 (pid:81562) CCBListener: no activity from CCB server in 33101s; assuming connection is dead.
07/30/16 07:44:29 (pid:81562) CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
07/30/16 07:44:29 (pid:81562) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
07/30/16 07:44:29 (pid:81562) IO: Failed to read packet header
07/30/16 07:44:29 (pid:81562) Lost connection to shadow, waiting 86300 secs for reconnect
07/30/16 07:45:30 (pid:81562) CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#1265804
07/30/16 07:49:03 (pid:81562) Got SIGTERM. Performing graceful shutdown.
07/30/16 07:49:03 (pid:81562) ShutdownGraceful all jobs.
07/30/16 07:49:03 (pid:81562) Process exited, pid=81569, signal=15
07/30/16 07:49:05 (pid:81562) About to exec Post script: /var/lib/condor/execute/dir_81562/tarOutput.sh 2016-562372-386
07/30/16 07:49:05 (pid:81562) Create_Process succeeded, pid=240651
07/30/16 07:49:06 (pid:81562) Process exited, pid=240651, status=0
07/30/16 07:49:08 (pid:81562) Failed to send job exit status to shadow
07/30/16 07:49:08 (pid:81562) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
07/30/16 07:49:41 (pid:81562) Accepted request to reconnect from <:0>
07/30/16 07:49:41 (pid:81562) Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_499017>
07/30/16 07:49:41 (pid:81562) Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_499017>
07/30/16 07:49:41 (pid:81562) Recovered connection to shadow after 312 seconds
07/30/16 07:49:42 (pid:81562) Job cleanup finished, now Starter is exiting
07/30/16 07:49:42 (pid:81562) **** condor_starter (condor_STARTER) pid 81562 EXITING WITH STATUS 0

Display update finished (0 histograms, 0 events).
32.1826 pb +- ( 0.0927421 pb = 0.288175 % ) 260000 ( 974374 -> 28.1 % ) integration time: ( 1h 48m 54s elapsed / 20m 57s left ) [22:23:07]
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
32.1662 pb +- ( 0.0891388 pb = 0.277119 % ) 280000 ( 1045060 -> 28.2 % ) integration time: ( 1h 57m 30s elapsed / 12m 36s left ) [22:33:21]
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).

The latter part is from the job in slot 3, which did not carry on running. It was a Sherpa that had already been running for 280 minutes in its initialization/optimization phase before the suspend. The job in slot 4 seems not to have been recovered, but it looks like that job had meanwhile finished by the time the reconnect happened.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,185,506 RAC: 2,094
Just to say, I've finally got a stable 1-core job running on my home Linux system. Today I've twice stopped BOINC with './boinccmd --quit' to watch a 90-minute downloaded episode of "Beck", and in one case also the BBC News -- the sound gets out of synch with the video otherwise. In both cases, when I restarted, the Theory app seems to have taken up where it left off, as far as I can tell from the top console.
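For reference, a minimal sketch of that stop/watch/restart cycle; the restart command is an assumption and depends on how the client was installed (a systemd service versus a client started by hand):

# Tell the running client to shut down cleanly; vboxwrapper should save the VM state
boinccmd --quit
# ...watch the programme...
# Restart the client afterwards, using whichever variant matches the installation
sudo systemctl start boinc-client      # packaged install with systemd
# or, for a client started by hand from its data directory:
# boinc --daemon --dir /var/lib/boinc-client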
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0
I really need Theory (and CMS) to survive an all-day shutdown (meaning > 18 h). At present not only is the in-progress job lost, but for CMS, if the task hasn't run a job when it is shut down, it fails with error 206 (Condor exited after 73714s without running a job). I can see the reason for the overall timeout and for counting shutdown time towards it, but the combination of both makes CMS and Condor/Theory too inefficient if hosts may be shut down for longer periods. Please consider fixing this somehow, perhaps by doubling the timeouts or changing the way shutdown time is counted towards them.
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18
I resumed a multi-core (8-core) VM with 8 Theory jobs in it after about 30,000 seconds of shutdown. All connections recovered well. Btw: it is interesting how well this VM runs the 8 Theory jobs with only 630 MB base memory and a 1024 MB swap file. The host CPU is about 85% busy with VBoxHeadless.exe.
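A minimal sketch of how those base-memory and CPU settings can be checked from the host with VBoxManage; the VM name is a placeholder taken from the list command, and on Windows the command is VBoxManage.exe from the VirtualBox install directory (with findstr instead of grep on plain cmd):

# List the VMs VirtualBox is currently running (BOINC gives them generated names)
VBoxManage list runningvms
# Show the memory and CPU settings of one of them (replace the placeholder name)
VBoxManage showvminfo "<vm_name>" | grep -Ei "memory size|number of cpus"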
Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0
So just to confirm, the error message in the Condor log has gone? Some tuning may be required to get the CPU usage in the high 90s.
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18
So just to confirm, the error message in the Condor log has gone? Some tuning may be required to get the CPU usage in the high 90s.

I only found 'Error' messages in the StartLog (was that the 'old' Condor log?), and those messages only appear after the final job in a slot, when no new jobs are requested because the VM lifetime is already > 12 hours and the VM is waiting for the last jobs in the other slots to finish. I would rather call them 'Warning' messages, because they have no influence on the progress. E.g. from slots 8 and 7:

08/09/16 09:50:52 slot8: Called deactivate_claim_forcibly()
08/09/16 09:50:52 slot8: Got activate claim while starter is still alive.
08/09/16 09:50:52 slot8: Telling shadow to try again later.
08/09/16 09:50:52 Starter pid 180384 exited with status 0
08/09/16 09:50:52 slot8: State change: starter exited
08/09/16 09:50:52 slot8: Changing activity: Busy -> Idle
08/09/16 09:50:52 slot8: State change: START is false
08/09/16 09:50:52 slot8: Changing state and activity: Claimed/Idle -> Preempting/Vacating
08/09/16 09:50:52 slot8: State change: No preempting claim, returning to owner
08/09/16 09:50:52 slot8: Changing state and activity: Preempting/Vacating -> Owner/Idle
08/09/16 09:50:52 slot8: State change: IS_OWNER is false
08/09/16 09:50:52 slot8: Changing state: Owner -> Unclaimed
08/09/16 09:50:53 Error: can't find resource with ClaimId (<10.0.2.15:38741>#1470684109#8#...) for 444 (ACTIVATE_CLAIM)
08/09/16 09:50:53 Error: can't find resource with ClaimId (<10.0.2.15:38741>#1470684109#8#...) -- perhaps this claim was already removed?
08/09/16 09:50:53 Error: problem finding resource for 403 (DEACTIVATE_CLAIM)
08/09/16 09:52:05 slot7: Called deactivate_claim_forcibly()
08/09/16 09:52:05 Starter pid 53107 exited with status 0
08/09/16 09:52:05 slot7: State change: starter exited
08/09/16 09:52:05 slot7: Changing activity: Busy -> Idle
08/09/16 09:52:05 slot7: State change: START is false
08/09/16 09:52:05 slot7: Changing state and activity: Claimed/Idle -> Preempting/Vacating
08/09/16 09:52:05 slot7: State change: No preempting claim, returning to owner
08/09/16 09:52:05 slot7: Changing state and activity: Preempting/Vacating -> Owner/Idle
08/09/16 09:52:05 slot7: State change: IS_OWNER is false
08/09/16 09:52:05 slot7: Changing state: Owner -> Unclaimed
08/09/16 09:52:05 Error: can't find resource with ClaimId (<10.0.2.15:38741>#1470684109#7#...) for 444 (ACTIVATE_CLAIM)
08/09/16 09:52:05 Error: can't find resource with ClaimId (<10.0.2.15:38741>#1470684109#7#...) -- perhaps this claim was already removed?
08/09/16 09:52:05 Error: problem finding resource for 403 (DEACTIVATE_CLAIM)