Message boards : CMS Application : Suspend/Resume/Disconnect
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 482
Message 4040 - Posted: 10 Aug 2016, 5:46:57 UTC

I suspended a dual-core CMS task, shut down the host for >30,000 seconds, and resumed the VM this morning.
The job in slot1 reconnected, but the job in slot2 didn't.

08/09/16 20:55:11 (pid:7590) Running job as user nobody
08/09/16 20:55:11 (pid:7590) Create_Process succeeded, pid=7676
08/10/16 07:33:45 (pid:7590) CCBListener: no activity from CCB server in 30815s; assuming connection is dead.
08/10/16 07:33:45 (pid:7590) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
08/10/16 07:33:45 (pid:7590) condor_read() failed: recv(fd=10) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.
08/10/16 07:33:45 (pid:7590) IO: Failed to read packet header
08/10/16 07:33:45 (pid:7590) Lost connection to shadow, waiting 7200 secs for reconnect
08/10/16 07:34:46 (pid:7590) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#526801


08/09/16 20:55:10 (pid:7588) Running job as user nobody
08/09/16 20:55:10 (pid:7588) Create_Process succeeded, pid=7597
08/09/16 23:00:40 (pid:7588) condor_read() failed: recv(fd=10) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.
08/09/16 23:00:40 (pid:7588) IO: Failed to read packet header
08/09/16 23:00:40 (pid:7588) Lost connection to shadow, waiting 7200 secs for reconnect
08/10/16 07:33:44 (pid:7588) CCBListener: no activity from CCB server in 30814s; assuming connection is dead.
08/10/16 07:33:44 (pid:7588) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
08/10/16 07:33:44 (pid:7588) No reconnect from shadow for 30784 seconds, aborting job execution!
08/10/16 07:33:44 (pid:7588) **** condor_starter (condor_STARTER) pid 7588 EXITING WITH STATUS 2
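As a quick check on the figures in the second excerpt (pid 7588), the sketch below recomputes the time between the "Lost connection to shadow" line and the abort line and compares it with the 7200-second wait. The timestamps are copied from the log lines above; the variable and function names are illustrative only and are not part of HTCondor, and the guest clock is assumed to be accurate enough for this comparison.

```python
from datetime import datetime

# Timestamp layout used in the log excerpts above (month/day/2-digit year).
LOG_FORMAT = "%m/%d/%y %H:%M:%S"

# Copied from the pid 7588 excerpt: connection lost, then job aborted.
lost = datetime.strptime("08/09/16 23:00:40", LOG_FORMAT)
aborted = datetime.strptime("08/10/16 07:33:44", LOG_FORMAT)

# Wait reported in "waiting 7200 secs for reconnect".
RECONNECT_WINDOW_S = 7200


def elapsed_seconds(start, end):
    """Whole seconds between two log timestamps."""
    return int((end - start).total_seconds())


gap = elapsed_seconds(lost, aborted)
window_exceeded = gap > RECONNECT_WINDOW_S

print(gap)              # 30784, matching "No reconnect from shadow for 30784 seconds"
print(window_exceeded)  # True
```

For pid 7590 the "Lost connection to shadow" line only appears at 07:33:45, after the resume, so by the same reading its 7200-second wait had only just started.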

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 326,570
RAC: 95
Message 4042 - Posted: 10 Aug 2016, 7:36:31 UTC - in response to Message 4040.  

08/09/16 23:00:40 (pid:7588) condor_read() failed: recv(fd=10) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.
08/09/16 23:00:40 (pid:7588) IO: Failed to read packet header
08/09/16 23:00:40 (pid:7588) Lost connection to shadow, waiting 7200 secs for reconnect



It seems strange that we have three log entries at 23:00:40.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 482
Message 4043 - Posted: 10 Aug 2016, 10:45:13 UTC - in response to Message 4042.  

08/09/16 23:00:40 (pid:7588) condor_read() failed: recv(fd=10) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.
08/09/16 23:00:40 (pid:7588) IO: Failed to read packet header
08/09/16 23:00:40 (pid:7588) Lost connection to shadow, waiting 7200 secs for reconnect


It seems strange that we have three log entries at 23:00:40.

I'm not sure about the timestamps, because the guest clock slowly falls behind over the VM's lifetime, but on the host I suspended the BOINC task (LAIM off) at

09-Aug-2016 23:00:29 [vLHCathome-dev] task CMS_23407_1470701574.163651_0 suspended by user

and exited the BOINC client at

09-Aug-2016 23:01:02 [---] Exiting

If you like, I could repeat the dual-core CMS test with a longer suspension (LAIM off) without taking BOINC down, to see whether this is reproducible.

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 326,570
RAC: 95
Message 4046 - Posted: 10 Aug 2016, 11:36:41 UTC - in response to Message 4043.  

If you like, I could repeat the dual-core CMS test with a longer suspension (LAIM off) without taking BOINC down, to see whether this is reproducible.


No, for now I am mainly concerned with this working with single-core VMs. These kinds of errors do need chasing up, but only once the bigger problems have been addressed.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 482
Message 4051 - Posted: 11 Aug 2016, 7:44:23 UTC - in response to Message 4046.  

... for now I am mainly concerned with this working with single-core VMs.

After resuming a single-core CMS VM, I get this in the StarterLog:

Lost connection to shadow, waiting 7200 secs for reconnect

With the Theory VM this value is 86300, so I'm afraid that suspending for longer than 2 hours (7200 seconds) will prevent the saved running job from recovering properly.
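To compare the two values, here is a minimal sketch that pulls the wait out of a StarterLog line and converts it to hours. The helper name `reconnect_window` and the regular expression are my own for illustration; they assume the line has exactly the wording quoted above. The Theory value is simply the number reported in this post, not read from a log here.

```python
import re

# Matches the wording of the StarterLog line quoted above.
RECONNECT_RE = re.compile(r"waiting (\d+) secs for reconnect")


def reconnect_window(line):
    """Return the reconnect wait in seconds, or None if the line does not match."""
    match = RECONNECT_RE.search(line)
    return int(match.group(1)) if match else None


cms_line = "Lost connection to shadow, waiting 7200 secs for reconnect"
cms_window = reconnect_window(cms_line)
theory_window = 86300  # value reported for the Theory VM in this post

print(cms_window / 3600)               # 2.0 hours
print(round(theory_window / 3600, 2))  # 23.97 hours
```

So the Theory VM allows almost a full day before giving up, while the CMS VM allows two hours.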

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 482
Message 4053 - Posted: 11 Aug 2016, 11:14:31 UTC - in response to Message 4051.  

After pausing the VM (LAIM off) for 8000 seconds, the job and connection recovered fine after the resume.

From StartLog:

08/11/16 12:49:14 CCBListener: no activity from CCB server in 8194s; assuming connection is dead.
08/11/16 12:49:14 CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
08/11/16 12:49:48 condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9623, fd is 6, errno=104 Connection reset by peer
08/11/16 12:49:48 Buf::write(): condor_write() failed
08/11/16 12:50:15 CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#534331

©2024 CERN