Message boards : Theory Application : Suspend/Resume Theory

Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 3461 - Posted: 21 May 2016, 21:25:40 UTC
Last modified: 21 May 2016, 21:25:58 UTC

Retested it, but no success yet. Maybe I'll have to test it with a fresh task/VM tomorrow.

05/21/16 23:22:41 CCBListener: no activity from CCB server in 20183s; assuming connection is dead.
05/21/16 23:22:41 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
05/21/16 23:22:41 ERROR: Child pid 4287 appears hung! Killing it hard.
05/21/16 23:22:41 condor_write(): Socket closed when trying to write 23 bytes to <10.0.2.15:53982>, fd is 13, errno=104 Connection reset by peer
05/21/16 23:22:41 Buf::write(): condor_write() failed
05/21/16 23:22:41 AUTHENTICATE: handshake failed!
05/21/16 23:22:41 DC_AUTHENTICATE: required authentication of 10.0.2.15 failed: AUTHENTICATE:1002:Failure performing handshake|AUTHENTICATE:1004:Failed to authenticate using FS
05/21/16 23:22:41 DefaultReaper unexpectedly called on pid 4287, status 9.
05/21/16 23:22:41 The STARTD (pid 4287) was killed because it was no longer responding
05/21/16 23:22:41 Sending obituary for "/usr/sbin/condor_startd"
05/21/16 23:22:41 restarting /usr/sbin/condor_startd in 10 seconds
05/21/16 23:22:41 The DaemonShutdown expression "(STARTD_StartTime =?= 0)" evaluated to TRUE: starting graceful shutdown
05/21/16 23:22:41 condor_write(): Socket closed when trying to write 1362 bytes to collector alicondor01.cern.ch, fd is 10, errno=104 Connection reset by peer
05/21/16 23:22:41 Buf::write(): condor_write() failed
05/21/16 23:22:41 Got SIGTERM. Performing graceful shutdown.
05/21/16 23:22:42 All daemons are gone. Exiting.
05/21/16 23:22:42 **** condor_master (condor_MASTER) pid 4268 EXITING WITH STATUS 99
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 3465 - Posted: 22 May 2016, 5:30:28 UTC

A second task, which I started yesterday, also did not survive the overnight suspend.

No lines were added to the StarterLog.
The last lines from the StartLog, just before the suspend and after this morning's resume:

05/21/16 22:49:50 Changing activity: Idle -> Busy
05/21/16 23:20:04 condor_write(): Socket closed when trying to write 2849 bytes to collector alicondor01.cern.ch, fd is 9, errno=104 Connection reset by peer
05/21/16 23:20:04 Buf::write(): condor_write() failed
05/22/16 07:06:58 CCBListener: no activity from CCB server in 28028s; assuming connection is dead.
05/22/16 07:06:58 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
05/22/16 07:06:58 ERROR: Child pid 24802 appears hung! Killing it hard.
05/22/16 07:06:58 Starter pid 24802 died on signal 9 (signal 9 (Killed))


Lines from MasterLog after the resume:

05/22/16 07:06:58 CCBListener: no activity from CCB server in 28290s; assuming connection is dead.
05/22/16 07:06:58 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
05/22/16 07:06:58 condor_write(): Socket closed when trying to write 1371 bytes to collector alicondor01.cern.ch, fd is 10, errno=104 Connection reset by peer
05/22/16 07:06:58 Buf::write(): condor_write() failed
05/22/16 07:06:58 ERROR: Child pid 4286 appears hung! Killing it hard.
05/22/16 07:06:58 condor_read() failed: recv(fd=13) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <10.0.2.15:37612>.
05/22/16 07:06:58 IO: Failed to read packet header
05/22/16 07:06:58 AUTHENTICATE: handshake failed!
05/22/16 07:06:58 DC_AUTHENTICATE: required authentication of 10.0.2.15 failed: AUTHENTICATE:1002:Failure performing handshake
05/22/16 07:06:58 DefaultReaper unexpectedly called on pid 4286, status 9.
05/22/16 07:06:58 The STARTD (pid 4286) was killed because it was no longer responding
05/22/16 07:06:58 Sending obituary for "/usr/sbin/condor_startd"
05/22/16 07:06:58 restarting /usr/sbin/condor_startd in 10 seconds
05/22/16 07:06:58 The DaemonShutdown expression "(STARTD_StartTime =?= 0)" evaluated to TRUE: starting graceful shutdown
05/22/16 07:06:59 Got SIGTERM. Performing graceful shutdown.
05/22/16 07:06:59 All daemons are gone. Exiting.
05/22/16 07:06:59 **** condor_master (condor_MASTER) pid 4268 EXITING WITH STATUS 99


I suspended a new task started this morning.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 3467 - Posted: 22 May 2016, 11:04:45 UTC - in response to Message 3465.  

I suspended a new task started this morning.

I resumed the task after 5.5 hours of pausing (LAIM off).

This time the job resumed processing events (thumbs up), and hopefully a new job will also arrive after the current one is finished.

From the StarterLog:

05/22/16 07:15:13 Create_Process succeeded, pid=4327
05/22/16 12:52:05 CCBListener: no activity from CCB server in 20212s; assuming connection is dead.
05/22/16 12:52:05 CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
05/22/16 12:52:05 condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
05/22/16 12:52:05 IO: Failed to read packet header
05/22/16 12:53:06 CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#53910
05/22/16 12:57:06 condor_write(): Socket closed when trying to write 404 bytes to <188.184.187.167:9618>, fd is 8
05/22/16 12:57:06 Buf::write(): condor_write() failed
05/22/16 12:57:19 Accepted request to reconnect from <188.184.187.167:9618>
05/22/16 12:57:19 Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_5780>
05/22/16 12:57:19 Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_5780>


What's next? Surviving the overnight host shutdown . . .
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 3468 - Posted: 22 May 2016, 20:32:23 UTC - in response to Message 3467.  

I resumed the task after 5.5 hours of pausing (LAIM off).

An interesting observation, though maybe just a coincidence:

The task mentioned above and another task that ran for more than 12 hours without pausing were both gracefully shut down after more than 12 hours of wall-clock time.
So was the task that had been paused for 5.5 hours, even though it had a total run time of only 6 hours and 54 minutes (see the arithmetic sketch below).

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=177660
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=177449
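If the 12-hour limit is evaluated on elapsed wall-clock time, which is my assumption based on the behaviour above, the numbers add up. A minimal Python sketch, for illustration only (not project code):

from datetime import timedelta

run_time = timedelta(hours=6, minutes=54)   # actual run time of the paused task
paused   = timedelta(hours=5, minutes=30)   # time the task spent suspended (LAIM off)
lifetime = timedelta(hours=12)              # assumed VM wall-clock lifetime limit

elapsed = run_time + paused                 # wall clock counts the pause as well
print(elapsed, elapsed > lifetime)          # 12:24:00 True -> graceful shutdown expected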
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 3470 - Posted: 23 May 2016, 6:56:45 UTC - in response to Message 3467.  

What's next? Surviving the overnight host shutdown . . .

The jobs survived this long suspension period:

2016-05-22 22:34:00 (6548): Successfully stopped VM.
2016-05-23 07:21:58 (3692): vboxwrapper (7.7.26184): starting


The saved job resumed, ended successfully, and a new task started.
After that 2nd task uploaded, the 12 hours of wall-clock time (including the paused time) were over and a graceful shutdown of the VM was performed.
That the paused time counts towards the elapsed time is no surprise; I already mentioned it in the previous post.
The 2nd task only had time to complete the saved job before the 12 hours of wall clock were over.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=178718
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=178690
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 3874 - Posted: 30 Jul 2016, 7:36:05 UTC

A longer suspend period of about 8 hours did not resume 100% successfully.
At least 1 job (out of 4) did not recover.

From the logs:
07/30/16 07:49:42 Error: can't find resource with ClaimId (<10.0.2.15:45490>#1469796145#3#...) -- perhaps this claim was already removed?
07/30/16 07:49:42 Error: problem finding resource for 403 (DEACTIVATE_CLAIM)
07/30/16 07:49:43 Starter pid 202391 exited with status 0
07/30/16 07:49:43 slot4: State change: starter exited
07/30/16 07:49:43 slot4: Changing activity: Busy -> Idle
07/30/16 07:49:43 slot4: State change: START is false
07/30/16 07:49:43 slot4: Changing state and activity: Claimed/Idle -> Preempting/Vacating
07/30/16 07:49:43 slot4: State change: No preempting claim, returning to owner
07/30/16 07:49:43 slot4: Changing state and activity: Preempting/Vacating -> Owner/Idle
07/30/16 07:49:43 slot4: State change: IS_OWNER is false
07/30/16 07:49:43 slot4: Changing state: Owner -> Unclaimed
07/30/16 07:49:43 Error: can't find resource with ClaimId (<10.0.2.15:45490>#1469796145#4#...) -- perhaps this claim was already removed?
07/30/16 07:49:43 Error: problem finding resource for 404 (DEACTIVATE_CLAIM_FORCIBLY)


07/30/16 07:44:29 (pid:81562) CCBListener: no activity from CCB server in 33101s; assuming connection is dead.
07/30/16 07:44:29 (pid:81562) CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
07/30/16 07:44:29 (pid:81562) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
07/30/16 07:44:29 (pid:81562) IO: Failed to read packet header
07/30/16 07:44:29 (pid:81562) Lost connection to shadow, waiting 86300 secs for reconnect
07/30/16 07:45:30 (pid:81562) CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#1265804
07/30/16 07:49:03 (pid:81562) Got SIGTERM. Performing graceful shutdown.
07/30/16 07:49:03 (pid:81562) ShutdownGraceful all jobs.
07/30/16 07:49:03 (pid:81562) Process exited, pid=81569, signal=15
07/30/16 07:49:05 (pid:81562) About to exec Post script: /var/lib/condor/execute/dir_81562/tarOutput.sh 2016-562372-386
07/30/16 07:49:05 (pid:81562) Create_Process succeeded, pid=240651
07/30/16 07:49:06 (pid:81562) Process exited, pid=240651, status=0
07/30/16 07:49:08 (pid:81562) Failed to send job exit status to shadow
07/30/16 07:49:08 (pid:81562) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
07/30/16 07:49:41 (pid:81562) Accepted request to reconnect from <:0>
07/30/16 07:49:41 (pid:81562) Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_499017>
07/30/16 07:49:41 (pid:81562) Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_499017>
07/30/16 07:49:41 (pid:81562) Recovered connection to shadow after 312 seconds
07/30/16 07:49:42 (pid:81562) Job cleanup finished, now Starter is exiting
07/30/16 07:49:42 (pid:81562) **** condor_starter (condor_STARTER) pid 81562 EXITING WITH STATUS 0


Display update finished (0 histograms, 0 events).
32.1826 pb +- ( 0.0927421 pb = 0.288175 % ) 260000 ( 974374 -> 28.1 % )
integration time: ( 1h 48m 54s elapsed / 20m 57s left ) [22:23:07]
[... "Updating display... / Display update finished (0 histograms, 0 events)." repeated ...]
32.1662 pb +- ( 0.0891388 pb = 0.277119 % ) 280000 ( 1045060 -> 28.2 % )
integration time: ( 1h 57m 30s elapsed / 12m 36s left ) [22:33:21]
[... "Updating display... / Display update finished (0 histograms, 0 events)." repeated ...]

The latter part is from the job in slot 3, the one that did not continue running. It was a Sherpa job that had already been running for 280 minutes in its initialization/optimization phase before the suspend.
The job in slot 4 seems not to have recovered, but it looks like that job had meanwhile finished by the time the connection was re-established.
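For what it's worth, the reconnect window in the StarterLog looks wide enough to cover a suspend of this length; this is my reading of the lease mechanism, not an authoritative description. A small Python sketch using the numbers quoted above:

# Numbers copied from the (pid:81562) StarterLog excerpt above.
ccb_silence_s = 33101   # "no activity from CCB server in 33101s" (covers the ~8 h suspend)
lease_wait_s  = 86300   # "Lost connection to shadow, waiting 86300 secs for reconnect"
reconnect_s   = 312     # "Recovered connection to shadow after 312 seconds"

print(ccb_silence_s / 3600.0)         # ~9.2 h of CCB silence around the suspend
print(ccb_silence_s < lease_wait_s)   # True: the suspend fits well inside the reconnect window
print(reconnect_s)                    # the actual reconnect happened long before the window closed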
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 3915 - Posted: 31 Jul 2016, 21:24:21 UTC

Just to say, I've finally got a stable 1-core job running on my home Linux system. Today I've twice stopped BOINC with './boinccmd --quit' to watch a 90-minute downloaded episode of "Beck", and in one case also the BBC News -- the sound gets out of synch with the video otherwise. In both cases, when I restarted, the Theory app seems to have taken up where it left off, as far as I can tell from the top console.
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 101
Message 3924 - Posted: 2 Aug 2016, 9:32:08 UTC

I really need Theory (and CMS) to survive an all-day shutdown (which means > 18 hours).
At present, not only is the in-progress job lost, but for CMS, if the task hasn't run a job when it is shut down, it fails with error 206 (Condor exited after 73714s without running a job). Like this.
I can see the reason for the overall timeout and for counting shutdown time towards it, but the combination of the two makes CMS and Condor/Theory too inefficient when hosts may be shut down for longer periods.
Please consider fixing this somehow, perhaps by increasing the timeouts (x2) or by changing the way shutdown time is counted towards them.
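Rough arithmetic behind that request; the exact timeout policy is my guess from the quoted error message, not documented behaviour:

timeout_s  = 73714                         # "Condor exited after 73714s without running a job"
shutdown_s = 18 * 3600                     # an all-day host shutdown of 18 hours
print(timeout_s / 3600.0)                  # ~20.5 h budget before the task errors out
print((timeout_s - shutdown_s) / 3600.0)   # ~2.5 h left to fetch and run a job after resume
print(2 * timeout_s / 3600.0)              # ~41 h budget if the timeout were doubled (x2)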
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 4021 - Posted: 9 Aug 2016, 6:04:03 UTC

I resumed a multi-core (8-core) VM with 8 Theory jobs in it after about 30,000 seconds of shutdown. All connections recovered well.

Btw: it is interesting how well this VM runs the 8 Theory jobs with only 630 MB of base memory and a 1024 MB swap file.
The host CPU is busy at ~85% with VBoxHeadless.exe.
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 129
Message 4024 - Posted: 9 Aug 2016, 8:24:40 UTC - in response to Message 4021.  

So just to confirm, the error message in the Condor log has gone? Some tuning may be required to get the CPU usage in the high 90s.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 4028 - Posted: 9 Aug 2016, 9:35:03 UTC - in response to Message 4024.  

So just to confirm, the error message in the Condor log has gone? Some tuning may be required to get the CPU usage in the high 90s.

I only found 'Error' messages in the StartLog (was that the 'old' Condor log?),
and those messages only appear after the final job in a slot, when no new jobs are requested because the VM lifetime is > 12 hours and the VM is waiting for the last jobs in the other slots to finish.
I would rather call them 'Warning' messages, because they have no influence on the progress.

E.g. from slots 8 and 7:

08/09/16 09:50:52 slot8: Called deactivate_claim_forcibly()
08/09/16 09:50:52 slot8: Got activate claim while starter is still alive.
08/09/16 09:50:52 slot8: Telling shadow to try again later.
08/09/16 09:50:52 Starter pid 180384 exited with status 0
08/09/16 09:50:52 slot8: State change: starter exited
08/09/16 09:50:52 slot8: Changing activity: Busy -> Idle
08/09/16 09:50:52 slot8: State change: START is false
08/09/16 09:50:52 slot8: Changing state and activity: Claimed/Idle -> Preempting/Vacating
08/09/16 09:50:52 slot8: State change: No preempting claim, returning to owner
08/09/16 09:50:52 slot8: Changing state and activity: Preempting/Vacating -> Owner/Idle
08/09/16 09:50:52 slot8: State change: IS_OWNER is false
08/09/16 09:50:52 slot8: Changing state: Owner -> Unclaimed
08/09/16 09:50:53 Error: can't find resource with ClaimId (<10.0.2.15:38741>#1470684109#8#...) for 444 (ACTIVATE_CLAIM)
08/09/16 09:50:53 Error: can't find resource with ClaimId (<10.0.2.15:38741>#1470684109#8#...) -- perhaps this claim was already removed?
08/09/16 09:50:53 Error: problem finding resource for 403 (DEACTIVATE_CLAIM)
08/09/16 09:52:05 slot7: Called deactivate_claim_forcibly()
08/09/16 09:52:05 Starter pid 53107 exited with status 0
08/09/16 09:52:05 slot7: State change: starter exited
08/09/16 09:52:05 slot7: Changing activity: Busy -> Idle
08/09/16 09:52:05 slot7: State change: START is false
08/09/16 09:52:05 slot7: Changing state and activity: Claimed/Idle -> Preempting/Vacating
08/09/16 09:52:05 slot7: State change: No preempting claim, returning to owner
08/09/16 09:52:05 slot7: Changing state and activity: Preempting/Vacating -> Owner/Idle
08/09/16 09:52:05 slot7: State change: IS_OWNER is false
08/09/16 09:52:05 slot7: Changing state: Owner -> Unclaimed
08/09/16 09:52:05 Error: can't find resource with ClaimId (<10.0.2.15:38741>#1470684109#7#...) for 444 (ACTIVATE_CLAIM)
08/09/16 09:52:05 Error: can't find resource with ClaimId (<10.0.2.15:38741>#1470684109#7#...) -- perhaps this claim was already removed?
08/09/16 09:52:05 Error: problem finding resource for 403 (DEACTIVATE_CLAIM)