Message boards : Theory Application : Suspend/Resume Theory
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18
Retested it, but no success yet. Maybe I'll have to test it with a fresh task/VM tomorrow.

05/21/16 23:22:41 CCBListener: no activity from CCB server in 20183s; assuming connection is dead.
05/21/16 23:22:41 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
05/21/16 23:22:41 ERROR: Child pid 4287 appears hung! Killing it hard.
05/21/16 23:22:41 condor_write(): Socket closed when trying to write 23 bytes to <10.0.2.15:53982>, fd is 13, errno=104 Connection reset by peer
05/21/16 23:22:41 Buf::write(): condor_write() failed
05/21/16 23:22:41 AUTHENTICATE: handshake failed!
05/21/16 23:22:41 DC_AUTHENTICATE: required authentication of 10.0.2.15 failed: AUTHENTICATE:1002:Failure performing handshake|AUTHENTICATE:1004:Failed to authenticate using FS
05/21/16 23:22:41 DefaultReaper unexpectedly called on pid 4287, status 9.
05/21/16 23:22:41 The STARTD (pid 4287) was killed because it was no longer responding
05/21/16 23:22:41 Sending obituary for "/usr/sbin/condor_startd"
05/21/16 23:22:41 restarting /usr/sbin/condor_startd in 10 seconds
05/21/16 23:22:41 The DaemonShutdown expression "(STARTD_StartTime =?= 0)" evaluated to TRUE: starting graceful shutdown
05/21/16 23:22:41 condor_write(): Socket closed when trying to write 1362 bytes to collector alicondor01.cern.ch, fd is 10, errno=104 Connection reset by peer
05/21/16 23:22:41 Buf::write(): condor_write() failed
05/21/16 23:22:41 Got SIGTERM. Performing graceful shutdown.
05/21/16 23:22:42 All daemons are gone. Exiting.
05/21/16 23:22:42 **** condor_master (condor_MASTER) pid 4268 EXITING WITH STATUS 99
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18
A second task I started yesterday; I also tried to let that one survive overnight, without success. No lines were added to the StarterLog. The last lines just before the suspend and after the resume this morning, from the StartLog:

05/21/16 22:49:50 Changing activity: Idle -> Busy
05/21/16 23:20:04 condor_write(): Socket closed when trying to write 2849 bytes to collector alicondor01.cern.ch, fd is 9, errno=104 Connection reset by peer
05/21/16 23:20:04 Buf::write(): condor_write() failed
05/22/16 07:06:58 CCBListener: no activity from CCB server in 28028s; assuming connection is dead.
05/22/16 07:06:58 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
05/22/16 07:06:58 ERROR: Child pid 24802 appears hung! Killing it hard.
05/22/16 07:06:58 Starter pid 24802 died on signal 9 (signal 9 (Killed))

Lines from the MasterLog after the resume:

05/22/16 07:06:58 CCBListener: no activity from CCB server in 28290s; assuming connection is dead.
05/22/16 07:06:58 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
05/22/16 07:06:58 condor_write(): Socket closed when trying to write 1371 bytes to collector alicondor01.cern.ch, fd is 10, errno=104 Connection reset by peer
05/22/16 07:06:58 Buf::write(): condor_write() failed
05/22/16 07:06:58 ERROR: Child pid 4286 appears hung! Killing it hard.
05/22/16 07:06:58 condor_read() failed: recv(fd=13) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <10.0.2.15:37612>.
05/22/16 07:06:58 IO: Failed to read packet header
05/22/16 07:06:58 AUTHENTICATE: handshake failed!
05/22/16 07:06:58 DC_AUTHENTICATE: required authentication of 10.0.2.15 failed: AUTHENTICATE:1002:Failure performing handshake
05/22/16 07:06:58 DefaultReaper unexpectedly called on pid 4286, status 9.
05/22/16 07:06:58 The STARTD (pid 4286) was killed because it was no longer responding
05/22/16 07:06:58 Sending obituary for "/usr/sbin/condor_startd"
05/22/16 07:06:58 restarting /usr/sbin/condor_startd in 10 seconds
05/22/16 07:06:58 The DaemonShutdown expression "(STARTD_StartTime =?= 0)" evaluated to TRUE: starting graceful shutdown
05/22/16 07:06:59 Got SIGTERM. Performing graceful shutdown.
05/22/16 07:06:59 All daemons are gone. Exiting.
05/22/16 07:06:59 **** condor_master (condor_MASTER) pid 4268 EXITING WITH STATUS 99

I suspended a new task started this morning.
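For anyone who wants to dig out the same lines, a minimal sketch of how the Condor daemon logs can be inspected from the VM console follows; it assumes the logs sit in the usual /var/log/condor directory, which may differ in this VM image:

# Assumed log location inside the VM; check with 'condor_config_val LOG' if unsure
LOGDIR=/var/log/condor
# Show the last lines written by each daemon around the suspend/resume
for f in MasterLog StartLog StarterLog; do
    echo "=== $f ==="
    tail -n 20 "$LOGDIR/$f"
done
# Pick out the CCB reconnect messages that show up after a resume
grep -h "CCBListener" "$LOGDIR"/MasterLog "$LOGDIR"/StartLog | tail -n 10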
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18
I suspended a new task started this morning.

I resumed the task after 5.5 hours of pausing (LAIM off). This time the job restarted processing the events (thumbs up), and hopefully a new job will also arrive after the current one is finished. From the StarterLog:

05/22/16 07:15:13 Create_Process succeeded, pid=4327
05/22/16 12:52:05 CCBListener: no activity from CCB server in 20212s; assuming connection is dead.
05/22/16 12:52:05 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
05/22/16 12:52:05 condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
05/22/16 12:52:05 IO: Failed to read packet header
05/22/16 12:53:06 CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#53910
05/22/16 12:57:06 condor_write(): Socket closed when trying to write 404 bytes to <188.184.187.167:9618>, fd is 8
05/22/16 12:57:06 Buf::write(): condor_write() failed
05/22/16 12:57:19 Accepted request to reconnect from <188.184.187.167:9618>
05/22/16 12:57:19 Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_5780>
05/22/16 12:57:19 Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_5780>

What's next? Surviving the overnight host shutdown . . .
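As an aside, the same pause/resume test can also be driven from the command line with boinccmd; a minimal sketch, where the task name is a placeholder to be taken from the --get_tasks output and the project URL may need adjusting to the installation:

# Find the name of the running Theory task
boinccmd --get_tasks | grep "name:"
# Suspend just that task (task name below is a placeholder)
boinccmd --task http://lhcathomedev.cern.ch/vLHCathome-dev/ <task_name> suspend
# ...later on, resume it again
boinccmd --task http://lhcathomedev.cern.ch/vLHCathome-dev/ <task_name> resume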
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18
I resumed the task after 5.5 hours of pausing (LAIM off).

An interesting observation, though maybe just coincidence: the task mentioned above and another task that ran for more than 12 hours without pausing were both gracefully shut down after more than 12 hours of wall-clock time. The same happened to the task that was paused for 5.5 hours, although it had a total run time of only 6 hours and 54 minutes.
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=177660
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=177449
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18
What's next? Surviving the overnight host shutdown . . .

The jobs survived this long suspension period:
2016-05-22 22:34:00 (6548): Successfully stopped VM.
2016-05-23 07:21:58 (3692): vboxwrapper (7.7.26184): starting

The saved job resumed, ended successfully, and a new task started. After that 2nd task had uploaded, the 12 hours of wall-clock time (including the paused time) were over and a graceful shutdown of the VM was performed. That the paused time counts towards the elapsed time is no surprise; I already mentioned it in the previous post. The 2nd task only had time to complete the saved job before the 12 hours of wall-clock time were over.
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=178718
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=178690
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18
A longer suspend period of about 8 hours did not resume 100% successfully. At least 1 job (out of 4) did not recover. From the logs:

07/30/16 07:49:42 Error: can't find resource with ClaimId (<10.0.2.15:45490>#1469796145#3#...) -- perhaps this claim was already removed?
07/30/16 07:49:42 Error: problem finding resource for 403 (DEACTIVATE_CLAIM)
07/30/16 07:49:43 Starter pid 202391 exited with status 0
07/30/16 07:49:43 slot4: State change: starter exited
07/30/16 07:49:43 slot4: Changing activity: Busy -> Idle
07/30/16 07:49:43 slot4: State change: START is false
07/30/16 07:49:43 slot4: Changing state and activity: Claimed/Idle -> Preempting/Vacating
07/30/16 07:49:43 slot4: State change: No preempting claim, returning to owner
07/30/16 07:49:43 slot4: Changing state and activity: Preempting/Vacating -> Owner/Idle
07/30/16 07:49:43 slot4: State change: IS_OWNER is false
07/30/16 07:49:43 slot4: Changing state: Owner -> Unclaimed
07/30/16 07:49:43 Error: can't find resource with ClaimId (<10.0.2.15:45490>#1469796145#4#...) -- perhaps this claim was already removed?
07/30/16 07:49:43 Error: problem finding resource for 404 (DEACTIVATE_CLAIM_FORCIBLY)

07/30/16 07:44:29 (pid:81562) CCBListener: no activity from CCB server in 33101s; assuming connection is dead.
07/30/16 07:44:29 (pid:81562) CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
07/30/16 07:44:29 (pid:81562) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
07/30/16 07:44:29 (pid:81562) IO: Failed to read packet header
07/30/16 07:44:29 (pid:81562) Lost connection to shadow, waiting 86300 secs for reconnect
07/30/16 07:45:30 (pid:81562) CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#1265804
07/30/16 07:49:03 (pid:81562) Got SIGTERM. Performing graceful shutdown.
07/30/16 07:49:03 (pid:81562) ShutdownGraceful all jobs.
07/30/16 07:49:03 (pid:81562) Process exited, pid=81569, signal=15
07/30/16 07:49:05 (pid:81562) About to exec Post script: /var/lib/condor/execute/dir_81562/tarOutput.sh 2016-562372-386
07/30/16 07:49:05 (pid:81562) Create_Process succeeded, pid=240651
07/30/16 07:49:06 (pid:81562) Process exited, pid=240651, status=0
07/30/16 07:49:08 (pid:81562) Failed to send job exit status to shadow
07/30/16 07:49:08 (pid:81562) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
07/30/16 07:49:41 (pid:81562) Accepted request to reconnect from <:0>
07/30/16 07:49:41 (pid:81562) Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_499017>
07/30/16 07:49:41 (pid:81562) Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_499017>
07/30/16 07:49:41 (pid:81562) Recovered connection to shadow after 312 seconds
07/30/16 07:49:42 (pid:81562) Job cleanup finished, now Starter is exiting
07/30/16 07:49:42 (pid:81562) **** condor_starter (condor_STARTER) pid 81562 EXITING WITH STATUS 0

Display update finished (0 histograms, 0 events).
32.1826 pb +- ( 0.0927421 pb = 0.288175 % ) 260000 ( 974374 -> 28.1 % ) integration time: ( 1h 48m 54s elapsed / 20m 57s left ) [22:23:07]
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
32.1662 pb +- ( 0.0891388 pb = 0.277119 % ) 280000 ( 1045060 -> 28.2 % ) integration time: ( 1h 57m 30s elapsed / 12m 36s left ) [22:33:21]
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).

The latter part is from the job in slot 3, which did not carry on running. It was a Sherpa that had already been running for 280 minutes in its initialization/optimization phase before the suspend. The job in slot 4 seems not to have been recovered, but it looks like that job had meanwhile finished by the time the reconnect happened.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,185,506 RAC: 2,094
Just to say, I've finally got a stable 1-core job running on my home Linux system. Today I've twice stopped BOINC with './boinccmd --quit' to watch a 90-minute downloaded episode of "Beck", and in one case also the BBC News -- the sound gets out of synch with the video otherwise. In both cases, when I restarted, the Theory app seems to have taken up where it left off, as far as I can tell from the top console.
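For reference, a minimal sketch of that stop/watch/restart cycle; the restart command is an assumption and depends on how the client was installed (a systemd service versus a client started by hand):

# Tell the running client to shut down cleanly; vboxwrapper should save the VM state
boinccmd --quit
# ...watch the programme...
# Restart the client afterwards, using whichever variant matches the installation
sudo systemctl start boinc-client      # packaged install with systemd
# or, for a client started by hand from its data directory:
# boinc --daemon --dir /var/lib/boinc-client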
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0
I really need Theory (and CMS) to survive an all-day shutdown (meaning > 18 h). At present not only is the in-progress job lost, but for CMS, if the task hasn't run a job when it is shut down, it fails with error 206 (Condor exited after 73714s without running a job). I can see the reason for the overall timeout and for counting shutdown time towards it, but the combination of both makes CMS and Condor/Theory too inefficient if hosts may be shut down for longer periods. Please consider fixing this somehow, perhaps by doubling the timeouts or changing the way shutdown time is counted towards them.
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18
I resumed a multi-core (8-core) VM with 8 Theory jobs in it after about 30,000 seconds of shutdown. All connections recovered well. Btw: it is interesting how well this VM runs the 8 Theory jobs with only 630 MB base memory and a 1024 MB swap file. The host CPU is about 85% busy with VBoxHeadless.exe.
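A minimal sketch of how those base-memory and CPU settings can be checked from the host with VBoxManage; the VM name is a placeholder taken from the list command, and on Windows the command is VBoxManage.exe from the VirtualBox install directory (with findstr instead of grep on plain cmd):

# List the VMs VirtualBox is currently running (BOINC gives them generated names)
VBoxManage list runningvms
# Show the memory and CPU settings of one of them (replace the placeholder name)
VBoxManage showvminfo "<vm_name>" | grep -Ei "memory size|number of cpus"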
Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0
So just to confirm, the error message in the Condor log has gone? Some tuning may be required to get the CPU usage in the high 90s.
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18
So just to confirm, the error message in the Condor log has gone? Some tuning may be required to get the CPU usage in the high 90s.

I only found 'Error' messages in the StartLog (was that the 'old' Condor log?), and those messages only appear after the final job in a slot, when no new jobs are requested because the VM lifetime is already > 12 hours and the VM is waiting for the last jobs in the other slots to finish. I would rather call them 'Warning' messages, because they have no influence on the progress. E.g. from slots 8 and 7:

08/09/16 09:50:52 slot8: Called deactivate_claim_forcibly()
08/09/16 09:50:52 slot8: Got activate claim while starter is still alive.
08/09/16 09:50:52 slot8: Telling shadow to try again later.
08/09/16 09:50:52 Starter pid 180384 exited with status 0
08/09/16 09:50:52 slot8: State change: starter exited
08/09/16 09:50:52 slot8: Changing activity: Busy -> Idle
08/09/16 09:50:52 slot8: State change: START is false
08/09/16 09:50:52 slot8: Changing state and activity: Claimed/Idle -> Preempting/Vacating
08/09/16 09:50:52 slot8: State change: No preempting claim, returning to owner
08/09/16 09:50:52 slot8: Changing state and activity: Preempting/Vacating -> Owner/Idle
08/09/16 09:50:52 slot8: State change: IS_OWNER is false
08/09/16 09:50:52 slot8: Changing state: Owner -> Unclaimed
08/09/16 09:50:53 Error: can't find resource with ClaimId (<10.0.2.15:38741>#1470684109#8#...) for 444 (ACTIVATE_CLAIM)
08/09/16 09:50:53 Error: can't find resource with ClaimId (<10.0.2.15:38741>#1470684109#8#...) -- perhaps this claim was already removed?
08/09/16 09:50:53 Error: problem finding resource for 403 (DEACTIVATE_CLAIM)
08/09/16 09:52:05 slot7: Called deactivate_claim_forcibly()
08/09/16 09:52:05 Starter pid 53107 exited with status 0
08/09/16 09:52:05 slot7: State change: starter exited
08/09/16 09:52:05 slot7: Changing activity: Busy -> Idle
08/09/16 09:52:05 slot7: State change: START is false
08/09/16 09:52:05 slot7: Changing state and activity: Claimed/Idle -> Preempting/Vacating
08/09/16 09:52:05 slot7: State change: No preempting claim, returning to owner
08/09/16 09:52:05 slot7: Changing state and activity: Preempting/Vacating -> Owner/Idle
08/09/16 09:52:05 slot7: State change: IS_OWNER is false
08/09/16 09:52:05 slot7: Changing state: Owner -> Unclaimed
08/09/16 09:52:05 Error: can't find resource with ClaimId (<10.0.2.15:38741>#1470684109#7#...) for 444 (ACTIVATE_CLAIM)
08/09/16 09:52:05 Error: can't find resource with ClaimId (<10.0.2.15:38741>#1470684109#7#...) -- perhaps this claim was already removed?
08/09/16 09:52:05 Error: problem finding resource for 403 (DEACTIVATE_CLAIM)