Thread 'Suspend/Resume/Disconnect'

Crystal Pellet, Volunteer tester (Joined: 13 Feb 15, Posts: 1279, Credit: 1,045,863, RAC: 78)
Message 4040 - Posted: 10 Aug 2016, 5:46:57 UTC

I suspended a dual-core CMS task, shut down the host for more than 30,000 seconds, and resumed the VM this morning. The job in slot1 reconnected, but the job in slot2 didn't.

    08/09/16 20:55:11 (pid:7590) Running job as user nobody
    08/09/16 20:55:11 (pid:7590) Create_Process succeeded, pid=7676
    08/10/16 07:33:45 (pid:7590) CCBListener: no activity from CCB server in 30815s; assuming connection is dead.
    08/10/16 07:33:45 (pid:7590) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
    08/10/16 07:33:45 (pid:7590) condor_read() failed: recv(fd=10) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.
    08/10/16 07:33:45 (pid:7590) IO: Failed to read packet header
    08/10/16 07:33:45 (pid:7590) Lost connection to shadow, waiting 7200 secs for reconnect
    08/10/16 07:34:46 (pid:7590) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#526801

    08/09/16 20:55:10 (pid:7588) Running job as user nobody
    08/09/16 20:55:10 (pid:7588) Create_Process succeeded, pid=7597
    08/09/16 23:00:40 (pid:7588) condor_read() failed: recv(fd=10) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.
    08/09/16 23:00:40 (pid:7588) IO: Failed to read packet header
    08/09/16 23:00:40 (pid:7588) Lost connection to shadow, waiting 7200 secs for reconnect
    08/10/16 07:33:44 (pid:7588) CCBListener: no activity from CCB server in 30814s; assuming connection is dead.
    08/10/16 07:33:44 (pid:7588) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
    08/10/16 07:33:44 (pid:7588) No reconnect from shadow for 30784 seconds, aborting job execution!
    08/10/16 07:33:44 (pid:7588) **** condor_starter (condor_STARTER) pid 7588 EXITING WITH STATUS 2
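
The figure in the abort message can be checked against the timestamps in the second log excerpt (pid 7588). The following is a minimal Python sketch; the function and variable names are illustrative and not part of HTCondor or BOINC, and it assumes the log timestamps use the MM/DD/YY HH:MM:SS layout seen above.

```python
from datetime import datetime

# Assumed timestamp layout of the quoted StarterLog lines: MM/DD/YY HH:MM:SS.
LOG_TIME_FORMAT = "%m/%d/%y %H:%M:%S"


def seconds_between(start: str, end: str) -> int:
    """Return the whole number of seconds from `start` to `end`."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return int((t1 - t0).total_seconds())


# Values copied from the pid 7588 excerpt.
lost_connection = "08/09/16 23:00:40"   # "Lost connection to shadow"
abort_logged = "08/10/16 07:33:44"      # "No reconnect from shadow ..."
reconnect_window = 7200                 # "waiting 7200 secs for reconnect"

gap = seconds_between(lost_connection, abort_logged)
print(gap)                     # 30784
print(gap > reconnect_window)  # True
```

The computed gap matches the 30784 seconds reported in the abort line, and it is well beyond the 7200-second window. In the first excerpt (pid 7590) the lost-connection message is only logged after the resume, at 07:33:45, which may be why that job still had time to reconnect.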

Laurence, CERN, Project administrator, Project developer, Project tester (Joined: 12 Sep 14, Posts: 1159, Credit: 342,328, RAC: 0)
Message 4042 - Posted: 10 Aug 2016, 7:36:31 UTC - in response to Message 4040.

Quote:
    08/09/16 23:00:40 (pid:7588) condor_read() failed: recv(fd=10) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.
    08/09/16 23:00:40 (pid:7588) IO: Failed to read packet header
    08/09/16 23:00:40 (pid:7588) Lost connection to shadow, waiting 7200 secs for reconnect

It seems strange that we have three log entries at 23:00:40.

Crystal Pellet, Volunteer tester (Joined: 13 Feb 15, Posts: 1279, Credit: 1,045,863, RAC: 78)
Message 4043 - Posted: 10 Aug 2016, 10:45:13 UTC - in response to Message 4042.

Quote:
    08/09/16 23:00:40 (pid:7588) condor_read() failed: recv(fd=10) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.
    08/09/16 23:00:40 (pid:7588) IO: Failed to read packet header
    08/09/16 23:00:40 (pid:7588) Lost connection to shadow, waiting 7200 secs for reconnect

    It seems strange that we have three log entries at 23:00:40.

I'm not sure about the timestamps, because the guest clock slowly falls behind during the VM's lifetime, but on the host I suspended the BOINC task (LAIM off) at

    09-Aug-2016 23:00:29 [vLHCathome-dev] task CMS_23407_1470701574.163651_0 suspended by user

and shut down the BOINC client:

    09-Aug-2016 23:01:02 [---] Exiting

If you like, I could repeat the dual-core CMS test with a longer suspension (LAIM off) without taking BOINC down, to see whether this is reproducible.

Laurence, CERN, Project administrator, Project developer, Project tester (Joined: 12 Sep 14, Posts: 1159, Credit: 342,328, RAC: 0)
Message 4046 - Posted: 10 Aug 2016, 11:36:41 UTC - in response to Message 4043.

Quote:
    If you like, I could repeat the dual-core CMS test with a longer suspension (LAIM off) without taking BOINC down, to see whether this is reproducible.

No, for now I am mainly concerned with this working with single-core VMs. These kinds of errors do need chasing up, but only once the bigger problems have been addressed.

Crystal Pellet, Volunteer tester (Joined: 13 Feb 15, Posts: 1279, Credit: 1,045,863, RAC: 78)
Message 4051 - Posted: 11 Aug 2016, 7:44:23 UTC - in response to Message 4046.

Quote:
    ... for now I am mainly concerned with this working with single-core VMs.

After resuming a single-core CMS VM, I get this in the StarterLog:

    Lost connection to shadow, waiting 7200 secs for reconnect

With the Theory VM this value is 86300, so I'm afraid that suspending for longer than 2 hours will prevent proper recovery of the saved running job.
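
The concern above comes down to comparing a planned suspension against the wait value in the starter message. A minimal Python sketch of that comparison follows; the helper names are hypothetical, and the Theory line is reconstructed from the value reported in the post (86300) on the assumption that the Theory VM logs the same message text.

```python
import re
from typing import Optional

# Pattern for the starter message quoted in the post.
WAIT_PATTERN = re.compile(
    r"Lost connection to shadow, waiting (\d+) secs for reconnect"
)


def reconnect_window(log_line: str) -> Optional[int]:
    """Extract the reconnect wait in seconds, or None if the line does not match."""
    match = WAIT_PATTERN.search(log_line)
    return int(match.group(1)) if match else None


def outlasts_window(suspend_seconds: int, log_line: str) -> bool:
    """Return True if the suspension is longer than the logged reconnect window."""
    window = reconnect_window(log_line)
    if window is None:
        raise ValueError("no reconnect window found in log line")
    return suspend_seconds > window


cms_line = "Lost connection to shadow, waiting 7200 secs for reconnect"
# Assumption: same message text for the Theory VM; only the value is from the post.
theory_line = "Lost connection to shadow, waiting 86300 secs for reconnect"

print(outlasts_window(30000, cms_line))     # True
print(outlasts_window(30000, theory_line))  # False
```

This only compares numbers. Whether a job is actually lost also depends on when the starter notices the broken connection, so a suspension longer than the window is a warning sign rather than a guaranteed failure.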

Crystal Pellet, Volunteer tester (Joined: 13 Feb 15, Posts: 1279, Credit: 1,045,863, RAC: 78)
Message 4053 - Posted: 11 Aug 2016, 11:14:31 UTC - in response to Message 4051.

After pausing the VM (LAIM off) for 8000 seconds, the job and connection recovered fine after resume. From StartLog:

    08/11/16 12:49:14 CCBListener: no activity from CCB server in 8194s; assuming connection is dead.
    08/11/16 12:49:14 CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
    08/11/16 12:49:48 condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9623, fd is 6, errno=104 Connection reset by peer
    08/11/16 12:49:48 Buf::write(): condor_write() failed
    08/11/16 12:50:15 CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#534331
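
To see how quickly the CCB connection came back after the resume, the delay between the "assuming connection is dead" line and the next "registered with CCB server" line can be pulled out of the log. The sketch below is illustrative Python, not an HTCondor tool; the function names are hypothetical, and the line layout (timestamp, optional pid field, message) is assumed from the excerpts in this thread.

```python
import re
from datetime import datetime

# Assumed layout: "MM/DD/YY HH:MM:SS [(pid:N)] message".
LINE_PATTERN = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (?:\(pid:\d+\) )?(.*)$"
)


def parse_line(line):
    """Split a log line into (timestamp, message), or return None if it does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
    return stamp, match.group(2)


def ccb_reconnect_delay(lines):
    """Seconds from 'connection is dead' to the next CCB registration, or None."""
    dead_at = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, message = parsed
        if "assuming connection is dead" in message:
            dead_at = stamp
        elif "registered with CCB server" in message and dead_at is not None:
            return int((stamp - dead_at).total_seconds())
    return None


# CCBListener lines copied from the StartLog excerpt above.
start_log = [
    "08/11/16 12:49:14 CCBListener: no activity from CCB server in 8194s; assuming connection is dead.",
    "08/11/16 12:49:14 CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.",
    "08/11/16 12:50:15 CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#534331",
]

print(ccb_reconnect_delay(start_log))  # 61
```

The 61-second result is consistent with the "will try to reconnect in 60 seconds" message in the same excerpt.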

Development for LHC@home