Message boards : CMS Application : Suspend/Resume/Disconnect
Author | Message |
---|---|
Joined: 13 Feb 15 Posts: 1217 Credit: 908,011 RAC: 1,487 |
I suspended a dual-core CMS task, shut down the host for >30,000 seconds, and resumed the VM this morning. The job in slot1 reconnected, but the job in slot2 didn't.

08/09/16 20:55:11 (pid:7590) Running job as user nobody
08/09/16 20:55:11 (pid:7590) Create_Process succeeded, pid=7676
08/10/16 07:33:45 (pid:7590) CCBListener: no activity from CCB server in 30815s; assuming connection is dead.
08/10/16 07:33:45 (pid:7590) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
08/10/16 07:33:45 (pid:7590) condor_read() failed: recv(fd=10) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.
08/10/16 07:33:45 (pid:7590) IO: Failed to read packet header
08/10/16 07:33:45 (pid:7590) Lost connection to shadow, waiting 7200 secs for reconnect
08/10/16 07:34:46 (pid:7590) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#526801

08/09/16 20:55:10 (pid:7588) Running job as user nobody
08/09/16 20:55:10 (pid:7588) Create_Process succeeded, pid=7597
08/09/16 23:00:40 (pid:7588) condor_read() failed: recv(fd=10) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.
08/09/16 23:00:40 (pid:7588) IO: Failed to read packet header
08/09/16 23:00:40 (pid:7588) Lost connection to shadow, waiting 7200 secs for reconnect
08/10/16 07:33:44 (pid:7588) CCBListener: no activity from CCB server in 30814s; assuming connection is dead.
08/10/16 07:33:44 (pid:7588) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
08/10/16 07:33:44 (pid:7588) No reconnect from shadow for 30784 seconds, aborting job execution!
08/10/16 07:33:44 (pid:7588) **** condor_starter (condor_STARTER) pid 7588 EXITING WITH STATUS 2 |
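The timestamps in the excerpts above can be cross-checked mechanically. The following is a minimal sketch, not a tool from the project: it assumes the line layout seen in this post (`MM/DD/YY HH:MM:SS (pid:N) message`), and the names `parse_line`, `disconnect_gaps` and `SAMPLE` are hypothetical. It pairs each "Lost connection to shadow" entry with a later "No reconnect from shadow" entry from the same pid and computes the elapsed time from the timestamps alone.

```python
import re
from datetime import datetime

# Line layout assumed from the excerpts in this thread:
#   "MM/DD/YY HH:MM:SS (pid:N) message"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\) (.*)$")
LOST_RE = re.compile(r"Lost connection to shadow, waiting (\d+) secs")


def parse_line(line):
    """Split one log line into (timestamp, pid, message), or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp = datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
    return stamp, int(m.group(2)), m.group(3)


def disconnect_gaps(lines):
    """Per pid: the announced reconnect window, whether the starter gave up,
    and the seconds between 'Lost connection' and the abort (None if no abort)."""
    result = {}
    lost_at = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, pid, msg = parsed
        m = LOST_RE.search(msg)
        if m:
            lost_at[pid] = stamp
            result[pid] = {"window": int(m.group(1)), "aborted": False, "elapsed": None}
        elif "No reconnect from shadow" in msg and pid in lost_at:
            result[pid]["aborted"] = True
            result[pid]["elapsed"] = int((stamp - lost_at[pid]).total_seconds())
    return result


# Lines copied from the post above (both starters).
SAMPLE = [
    "08/10/16 07:33:45 (pid:7590) Lost connection to shadow, waiting 7200 secs for reconnect",
    "08/09/16 23:00:40 (pid:7588) Lost connection to shadow, waiting 7200 secs for reconnect",
    "08/10/16 07:33:44 (pid:7588) No reconnect from shadow for 30784 seconds, aborting job execution!",
]

gaps = disconnect_gaps(SAMPLE)
for pid, info in sorted(gaps.items()):
    print(pid, info)
```

For pid 7588 the gap computed from the two timestamps agrees with the 30784 seconds that the starter itself reports, while pid 7590 only logged the lost connection after the resume and has no abort entry in the excerpt.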
Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 19 |
08/09/16 23:00:40 (pid:7588) condor_read() failed: recv(fd=10) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.

It seems strange that we have three log entries at 23:00:40. |
Joined: 13 Feb 15 Posts: 1217 Credit: 908,011 RAC: 1,487 |
08/09/16 23:00:40 (pid:7588) condor_read() failed: recv(fd=10) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.

I'm not sure about the timestamps, because the guest clock slowly falls behind during the VM's lifetime, but on the host I suspended the BOINC task (LAIM off) at
09-Aug-2016 23:00:29 [vLHCathome-dev] task CMS_23407_1470701574.163651_0 suspended by user
and shut down the BOINC client at
09-Aug-2016 23:01:02 [---] Exiting
If you like, I could repeat the dual-core CMS test with a longer suspension (LAIM off) without taking BOINC down, to see whether this is reproducible. |
Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 19 |
If you like, I could repeat the dual-core CMS test with a longer suspension (LAIM off) without taking BOINC down, to see whether this is reproducible.

No, for now I am mainly concerned with getting this working with single-core VMs. These kinds of errors do need chasing up, but only once the bigger problems have been addressed. |
Joined: 13 Feb 15 Posts: 1217 Credit: 908,011 RAC: 1,487 |
... for now I am mainly concerned with getting this working with single-core VMs.

After resuming a single-core CMS VM, I get this in the StarterLog:
Lost connection to shadow, waiting 7200 secs for reconnect
With the Theory VM this value is 86300, so I'm afraid that suspending for longer than 2 hours will prevent the saved running job from recovering properly. |
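As a rough illustration of the concern above, the two reconnect windows quoted in this thread can be compared against the 30784-second gap from the earlier dual-core log. This is only a sketch under one assumption: that the starter measures wall-clock time from the moment it logs the lost connection. Whether time spent suspended counts in a given case depends on when the loss is detected, which this thread does not settle. The function and constant names are hypothetical.

```python
# Values quoted in the thread: the CMS VM starter waits 7200 s, the Theory VM 86300 s.
CMS_WINDOW_SECS = 7200
THEORY_WINDOW_SECS = 86300


def reconnect_window_expired(disconnected_secs, window_secs):
    """Assumption: the starter aborts once the time since it logged
    'Lost connection to shadow' exceeds the announced window."""
    return disconnected_secs > window_secs


# Gap reported by the starter in the dual-core test (pid 7588).
observed_gap = 30784

print("CMS window (hours):", CMS_WINDOW_SECS / 3600)
print("expired with CMS window:", reconnect_window_expired(observed_gap, CMS_WINDOW_SECS))
print("expired with Theory window:", reconnect_window_expired(observed_gap, THEORY_WINDOW_SECS))
```

Under that assumption the same gap would have been tolerated with the Theory value but not with the CMS one.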
Joined: 13 Feb 15 Posts: 1217 Credit: 908,011 RAC: 1,487 |
After pausing the VM (LAIM off) for 8000 seconds, the job and connection recovered fine after resume. From StartLog:
08/11/16 12:49:14 CCBListener: no activity from CCB server in 8194s; assuming connection is dead.
08/11/16 12:49:14 CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
08/11/16 12:49:48 condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9623, fd is 6, errno=104 Connection reset by peer
08/11/16 12:49:48 Buf::write(): condor_write() failed
08/11/16 12:50:15 CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#534331 |
©2025 CERN