Message boards : News : Suspend/Resume

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2066 - Posted: 19 Feb 2016, 19:24:59 UTC - in response to Message 2064.  
Last modified: 19 Feb 2016, 19:33:57 UTC

Still does not work.
Suspend for 17 min is OK.
Suspend for 23 min: restarted from run-1 (it was at run-4).

Condor not starting:

Masterlog

02/19/16 20:11:36 (pid:7718) ******************************************************
02/19/16 20:11:36 (pid:7718) ** condor_master (CONDOR_MASTER) STARTING UP
02/19/16 20:11:36 (pid:7718) ** /home/boinc/CMSRun/glide_s1xJ7X/main/condor/sbin/condor_master



EDIT: run-2 started correctly.
02/19/16 20:11:36 (pid:7718) ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
02/19/16 20:11:36 (pid:7718) ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
02/19/16 20:11:36 (pid:7718) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
02/19/16 20:11:36 (pid:7718) ** $CondorPlatform: x86_64_RedHat5 $
02/19/16 20:11:36 (pid:7718) ** PID = 7718
02/19/16 20:11:36 (pid:7718) ** Log last touched time unavailable (No such file or directory)
02/19/16 20:11:36 (pid:7718) ******************************************************
02/19/16 20:11:36 (pid:7718) Using config source: /home/boinc/CMSRun/glide_s1xJ7X/condor_config
02/19/16 20:11:36 (pid:7718) config Macros = 215, Sorted = 215, StringBytes = 10629, TablesBytes = 7780
02/19/16 20:11:36 (pid:7718) CLASSAD_CACHING is OFF
02/19/16 20:11:36 (pid:7718) Daemon Log is logging: D_ALWAYS D_ERROR
02/19/16 20:11:36 (pid:7718) DaemonCore: command socket at <10.0.2.15:41670?noUDP>
02/19/16 20:11:36 (pid:7718) DaemonCore: private command socket at <10.0.2.15:41670>
02/19/16 20:11:36 (pid:7718) CCBListener: heartbeat disabled because interval is configured to be 0
02/19/16 20:11:36 (pid:7718) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#119419
02/19/16 20:11:36 (pid:7718) Master restart (GRACEFUL) is watching /home/boinc/CMSRun/glide_s1xJ7X/main/condor/sbin/condor_master (mtime:1455909077)
02/19/16 20:11:36 (pid:7718) Started DaemonCore process "/home/boinc/CMSRun/glide_s1xJ7X/main/condor/sbin/condor_startd", pid and pgroup = 7721
02/19/16 20:12:38 (pid:7718) DefaultReaper unexpectedly called on pid 7721, status 25344.
02/19/16 20:12:38 (pid:7718) The STARTD (pid 7721) exited with status 99 (daemon will not restart automatically)
02/19/16 20:17:01 (pid:7718) The DaemonShutdown expression "(STARTD_StartTime =?= 0)" evaluated to TRUE: starting graceful shutdown
02/19/16 20:17:01 (pid:7718) Got SIGTERM. Performing graceful shutdown.
02/19/16 20:17:01 (pid:7718) All daemons are gone. Exiting.
02/19/16 20:17:01 (pid:7718) **** condor_master (condor_MASTER) pid 7718 EXITING WITH STATUS 99
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 328,405
RAC: 184
Message 2070 - Posted: 20 Feb 2016, 19:58:55 UTC - in response to Message 2066.  

Suspend should now work up to 1 hour.

After that in the MasterLog you should see the error:

ERROR: Child pid xxxx appears hung! Killing it hard.

This is most probably due to the Condor NOT_RESPONDING_TIMEOUT parameter, which is set by default to 3600 s.
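If that diagnosis is right, the knob is a glidein-side Condor configuration override. A minimal sketch, assuming the glidein's condor_config is the file being extended (the value shown is illustrative, not necessarily what the project deployed):

```text
# Illustrative Condor config override: let a daemon be unresponsive
# for up to 2 hours (e.g. across a VM suspend) before the master
# declares it hung and kills it hard.
NOT_RESPONDING_TIMEOUT = 7200
```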
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 266
Message 2071 - Posted: 21 Feb 2016, 11:54:34 UTC - in response to Message 2070.  
Last modified: 21 Feb 2016, 11:55:11 UTC

Suspend should now work up to 1 hour.

I tested twice with a suspend period of 30 minutes. After resume the job ran on and wasn't killed.

However, it was also not killed with a suspend period of more than an hour.

Begin processing the 1st record. Run 1, Event 177501, LumiSection 2131 at 21-Feb-2016 11:31:52.808 CET
Begin processing the 2nd record. Run 1, Event 177502, LumiSection 2131 at 21-Feb-2016 11:31:56.513 CET
Begin processing the 3rd record. Run 1, Event 177503, LumiSection 2131 at 21-Feb-2016 11:31:58.713 CET
Begin processing the 4th record. Run 1, Event 177504, LumiSection 2131 at 21-Feb-2016 11:32:52.199 CET
Begin processing the 5th record. Run 1, Event 177505, LumiSection 2131 at 21-Feb-2016 11:33:06.981 CET
Begin processing the 6th record. Run 1, Event 177506, LumiSection 2131 at 21-Feb-2016 11:33:19.291 CET


21-Feb-2016 11:34:34 [CMS-dev] task wu_1455632064_138_0 suspended by user
21-Feb-2016 12:35:53 [CMS-dev] task wu_1455632064_138_0 resumed by user

Begin processing the 7th record. Run 1, Event 177507, LumiSection 2131 at 21-Feb-2016 12:36:17.789 CET
Begin processing the 8th record. Run 1, Event 177508, LumiSection 2131 at 21-Feb-2016 12:36:31.528 CET
Begin processing the 9th record. Run 1, Event 177509, LumiSection 2131 at 21-Feb-2016 12:37:11.735 CET
Begin processing the 10th record. Run 1, Event 177510, LumiSection 2131 at 21-Feb-2016 12:37:37.691 CET
Begin processing the 11th record. Run 1, Event 177511, LumiSection 2131 at 21-Feb-2016 12:37:47.472 CET
Begin processing the 12th record. Run 1, Event 177512, LumiSection 2131 at 21-Feb-2016 12:37:50.508 CET
Begin processing the 13th record. Run 1, Event 177513, LumiSection 2131 at 21-Feb-2016 12:37:50.521 CET
G4Fragment::CalculateExcitationEnergy(): WARNING
Fragment: A = 61, Z = 28, U = -5.432e-02 MeV IsStable= 1
P = (2.128e+02,1.243e+02,-1.869e+02) MeV E = 5.674e+04 MeV

Begin processing the 14th record. Run 1, Event 177514, LumiSection 2131 at 21-Feb-2016 12:38:54.855 CET
Begin processing the 15th record. Run 1, Event 177515, LumiSection 2131 at 21-Feb-2016 12:38:54.887 CET
Begin processing the 16th record. Run 1, Event 177516, LumiSection 2131 at 21-Feb-2016 12:39:18.576 CET
Begin processing the 17th record. Run 1, Event 177517, LumiSection 2131 at 21-Feb-2016 12:40:03.898 CET
etc etc etc


Maybe my suspend period was just a bit too short?
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 328,405
RAC: 184
Message 2072 - Posted: 21 Feb 2016, 14:27:02 UTC - in response to Message 2071.  

Maybe not. The not-responding timeout was increased this morning, so it may be working now.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 266
Message 2073 - Posted: 21 Feb 2016, 15:06:27 UTC - in response to Message 2072.  

Maybe not. The not-responding timeout was increased this morning, so it may be working now.

Maybe you can enlighten us as to what the timeout is now.
I just did a suspend on another (standalone, i.e. outside BOINC) VM with almost 2 hours of suspend time, and that job also continued where it left off.

Begin processing the 1st record. Run 1, Event 193001, LumiSection 2317 at 21-Feb-2016 13:54:47.168 CET
Begin processing the 2nd record. Run 1, Event 193002, LumiSection 2317 at 21-Feb-2016 13:55:52.829 CET
Begin processing the 3rd record. Run 1, Event 193003, LumiSection 2317 at 21-Feb-2016 13:57:40.627 CET
Begin processing the 4th record. Run 1, Event 193004, LumiSection 2317 at 21-Feb-2016 13:57:57.934 CET
Begin processing the 5th record. Run 1, Event 193005, LumiSection 2317 at 21-Feb-2016 15:55:28.727 CET
Begin processing the 6th record. Run 1, Event 193006, LumiSection 2317 at 21-Feb-2016 15:55:29.936 CET
Begin processing the 7th record. Run 1, Event 193007, LumiSection 2317 at 21-Feb-2016 15:55:33.620 CET
Begin processing the 8th record. Run 1, Event 193008, LumiSection 2317 at 21-Feb-2016 15:55:51.843 CET
Begin processing the 9th record. Run 1, Event 193009, LumiSection 2317 at 21-Feb-2016 15:56:11.055 CET
Begin processing the 10th record. Run 1, Event 193010, LumiSection 2317 at 21-Feb-2016 15:56:27.088 CET
G4Fragment::CalculateExcitationEnergy(): WARNING
Fragment: A = 26, Z = 12, U = -1.082e+00 MeV IsStable= 1
P = (1.370e+02,1.520e+02,-1.169e+02) MeV E = 2.420e+04 MeV

Begin processing the 11th record. Run 1, Event 193011, LumiSection 2317 at 21-Feb-2016 15:56:47.451 CET
Begin processing the 12th record. Run 1, Event 193012, LumiSection 2317 at 21-Feb-2016 15:56:49.414 CET
Begin processing the 13th record. Run 1, Event 193013, LumiSection 2317 at 21-Feb-2016 15:57:11.801 CET
Begin processing the 14th record. Run 1, Event 193014, LumiSection 2317 at 21-Feb-2016 15:57:44.081 CET
etc etc etc
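The actual suspend window can be read straight off those "Begin processing" lines. A small sketch (plain Python; `largest_gap_minutes` is a made-up helper, and the timestamp format is taken from the log above) that measures the biggest gap between consecutive events:

```python
from datetime import datetime

def largest_gap_minutes(lines):
    """Return the largest gap (in minutes) between consecutive
    'Begin processing' timestamps in a CMS job log."""
    fmt = "%d-%b-%Y %H:%M:%S.%f"
    times = []
    for line in lines:
        if "Begin processing" in line:
            # the timestamp sits between ' at ' and the trailing timezone
            stamp = line.split(" at ")[1].rsplit(" ", 1)[0]
            times.append(datetime.strptime(stamp, fmt))
    gaps = [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]
    return max(gaps)

# Two lines taken from the log above (events 4 and 5):
log = [
    "Begin processing the 4th record. Run 1, Event 193004, LumiSection 2317 at 21-Feb-2016 13:57:57.934 CET",
    "Begin processing the 5th record. Run 1, Event 193005, LumiSection 2317 at 21-Feb-2016 15:55:28.727 CET",
]
print(round(largest_gap_minutes(log)))  # the ~2 h suspend shows up as a ~118 min gap
```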
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 266
Message 2074 - Posted: 21 Feb 2016, 15:24:10 UTC

I don't know whether this is a side effect of the timeout increase.
The first suspended job mentioned above ran to completion and uploaded, but the run doesn't get a new job.
Job number 711 returned and, after almost 2 hours, is still in the wnpostproc state.

02/21/16 11:28:04 (pid:12644) Running job as user (null)
02/21/16 11:28:04 (pid:12644) Create_Process succeeded, pid=12648
02/21/16 12:36:54 (pid:12644) condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.
02/21/16 12:36:54 (pid:12644) IO: Failed to read packet header
02/21/16 12:36:54 (pid:12644) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector lcggwms02.gridpp.rl.ac.uk:9620.
02/21/16 12:36:54 (pid:12644) IO: Failed to read packet header
02/21/16 12:36:54 (pid:12644) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9620
02/21/16 12:36:54 (pid:12644) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9620 failed; will try to reconnect in 60 seconds.
02/21/16 12:37:54 (pid:12644) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#125489
02/21/16 12:41:54 (pid:12644) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 12:41:54 (pid:12644) Buf::write(): condor_write() failed
02/21/16 12:46:55 (pid:12644) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 12:46:55 (pid:12644) Buf::write(): condor_write() failed
02/21/16 12:51:55 (pid:12644) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 12:51:55 (pid:12644) Buf::write(): condor_write() failed
02/21/16 12:56:56 (pid:12644) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 12:56:56 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:01:57 (pid:12644) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:01:57 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:06:57 (pid:12644) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:06:57 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:11:59 (pid:12644) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:11:59 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:16:59 (pid:12644) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:16:59 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:21:59 (pid:12644) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:21:59 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:27:00 (pid:12644) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:27:00 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:32:00 (pid:12644) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:32:00 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:37:02 (pid:12644) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:37:02 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:42:02 (pid:12644) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:42:02 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:47:02 (pid:12644) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:47:02 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:52:03 (pid:12644) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:52:03 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:57:03 (pid:12644) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:57:03 (pid:12644) Buf::write(): condor_write() failed
02/21/16 14:02:03 (pid:12644) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 14:02:03 (pid:12644) Buf::write(): condor_write() failed
02/21/16 14:07:04 (pid:12644) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 14:07:04 (pid:12644) Buf::write(): condor_write() failed
02/21/16 14:12:04 (pid:12644) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 14:12:04 (pid:12644) Buf::write(): condor_write() failed
02/21/16 14:17:06 (pid:12644) condor_write(): Socket closed when trying to write 565 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 14:17:06 (pid:12644) Buf::write(): condor_write() failed
02/21/16 14:22:06 (pid:12644) condor_write(): Socket closed when trying to write 565 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 14:22:06 (pid:12644) Buf::write(): condor_write() failed
02/21/16 14:27:06 (pid:12644) condor_write(): Socket closed when trying to write 565 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 14:27:06 (pid:12644) Buf::write(): condor_write() failed
02/21/16 14:27:55 (pid:12644) Process exited, pid=12648, status=0
02/21/16 14:27:55 (pid:12644) condor_write(): Socket closed when trying to write 620 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 14:27:55 (pid:12644) Buf::write(): condor_write() failed
02/21/16 14:27:55 (pid:12644) condor_write(): Socket closed when trying to write 192 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 14:27:55 (pid:12644) Buf::write(): condor_write() failed
02/21/16 14:27:55 (pid:12644) Failed to send job exit status to shadow
02/21/16 14:27:55 (pid:12644) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
02/21/16 14:27:55 (pid:12644) Returning from CStarter::JobReaper()
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 328,405
RAC: 184
Message 2075 - Posted: 21 Feb 2016, 15:30:29 UTC - in response to Message 2073.  

48 hours
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 328,405
RAC: 184
Message 2076 - Posted: 21 Feb 2016, 15:32:15 UTC - in response to Message 2074.  

Yes, I think there is a consequence of these changes that we need to follow up on Monday.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 266
Message 2077 - Posted: 21 Feb 2016, 15:46:43 UTC

OK. Thanks Laurence!

I rebooted the VM to flush the inter(dis)actions and got a new job now.
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 300
Message 2078 - Posted: 21 Feb 2016, 18:34:16 UTC - in response to Message 2073.  
Last modified: 21 Feb 2016, 19:26:55 UTC


Just did a suspend on another (standalone, i.e. outside BOINC) VM with almost 2 hours of suspend time, and that job also continued where it left off.


I've just started a Linux host (1031) running BOINC that had been stopped for about twelve hours. The job in progress, 629, resumed properly as far as I can tell. It has got to event 238 as I write, so I will wait and see how (or whether) it completes on Dashboard.

Edit: After about 45 minutes there is no new job on the host, and the first job is still showing "wnpostproc".

The walltime shown on Dashboard includes the time for which the host was switched off, which will do wonders for any statistics based on it.
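The walltime inflation m points out is inherent to wall-clock accounting: the wall clock keeps running while a host is suspended or powered off, while CPU time does not. A minimal, project-independent illustration of the two clocks:

```python
import time

wall_start = time.time()          # wall clock: advances even while the process sleeps
cpu_start = time.process_time()   # CPU clock: advances only while we actually compute

time.sleep(0.5)                   # stand-in for a suspended/idle period

wall_elapsed = time.time() - wall_start
cpu_elapsed = time.process_time() - cpu_start

# Wall time includes the idle period; CPU time is close to zero.
print(f"wall={wall_elapsed:.2f}s cpu={cpu_elapsed:.2f}s")
```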
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2110 - Posted: 26 Feb 2016, 22:58:52 UTC
Last modified: 26 Feb 2016, 23:10:11 UTC

I suspended a task for 1 h 08 min.
It seemed to resume, and cmsRun finished with status 0.
However, the StarterLog shows:

02/26/16 21:18:06 (pid:858) Running job as user (null)
02/26/16 21:18:07 (pid:858) Create_Process succeeded, pid=865
02/26/16 23:14:54 (pid:858) condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.
02/26/16 23:14:54 (pid:858) IO: Failed to read packet header
02/26/16 23:14:55 (pid:858) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector lcggwms02.gridpp.rl.ac.uk:9621.
02/26/16 23:14:55 (pid:858) IO: Failed to read packet header
02/26/16 23:14:55 (pid:858) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9621
02/26/16 23:14:55 (pid:858) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9621 failed; will try to reconnect in 60 seconds.
02/26/16 23:15:55 (pid:858) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9621 as ccbid 130.246.180.120:9621#128105
02/26/16 23:19:54 (pid:858) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11
02/26/16 23:19:54 (pid:858) Buf::write(): condor_write() failed
02/26/16 23:24:54 (pid:858) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11
02/26/16 23:24:54 (pid:858) Buf::write(): condor_write() failed
02/26/16 23:29:55 (pid:858) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
02/26/16 23:29:55 (pid:858) Buf::write(): condor_write() failed
02/26/16 23:34:56 (pid:858) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11
02/26/16 23:34:56 (pid:858) Buf::write(): condor_write() failed
02/26/16 23:39:57 (pid:858) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11
02/26/16 23:39:57 (pid:858) Buf::write(): condor_write() failed
02/26/16 23:44:46 (pid:858) Process exited, pid=865, status=0
02/26/16 23:44:46 (pid:858) condor_write(): Socket closed when trying to write 616 bytes to <130.246.180.120:9818>, fd is 11
02/26/16 23:44:46 (pid:858) Buf::write(): condor_write() failed
02/26/16 23:44:47 (pid:858) condor_write(): Socket closed when trying to write 192 bytes to <130.246.180.120:9818>, fd is 11
02/26/16 23:44:47 (pid:858) Buf::write(): condor_write() failed
02/26/16 23:44:47 (pid:858) Failed to send job exit status to shadow
02/26/16 23:44:47 (pid:858) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
02/26/16 23:44:47 (pid:858) Returning from CStarter::JobReaper()




The job has been in the wnpostproc state ever since. It is not starting a new job; it seems to be waiting for the (glidein?) to time out.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2116 - Posted: 28 Feb 2016, 15:03:11 UTC

Here is a comparison of two suspended jobs.

StarterLog:

TASK SUSPENDED FOR <20 min (17 min)


02/28/16 13:42:05 (pid:7017) Running job as user (null)
02/28/16 13:42:06 (pid:7017) Create_Process succeeded, pid=7035
02/28/16 14:13:46 (pid:7017) condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.
02/28/16 14:13:46 (pid:7017) IO: Failed to read packet header
02/28/16 14:17:33 (pid:7017) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector lcggwms02.gridpp.rl.ac.uk:9622.
02/28/16 14:17:33 (pid:7017) IO: Failed to read packet header
02/28/16 14:17:33 (pid:7017) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9622
02/28/16 14:17:33 (pid:7017) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9622 failed; will try to reconnect in 60 seconds.
02/28/16 14:18:33 (pid:7017) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9622 as ccbid 130.246.180.120:9622#126924
02/28/16 14:18:46 (pid:7017) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11
02/28/16 14:18:46 (pid:7017) Buf::write(): condor_write() failed
02/28/16 14:19:58 (pid:7017) Accepted request to reconnect from <130.246.180.120:9818>
02/28/16 14:19:58 (pid:7017) Ignoring old shadow <130.246.180.120:9818?noUDP&sock=2939_634c_1309>
02/28/16 14:19:58 (pid:7017) Communicating with shadow <130.246.180.120:9818?noUDP&sock=2939_634c_1309>
02/28/16 15:05:23 (pid:7017) Process exited, pid=7035, status=0


TASK SUSPENDED FOR >20 min (22 min)

02/28/16 15:05:30 (pid:10166) Running job as user (null)
02/28/16 15:05:30 (pid:10166) Create_Process succeeded, pid=10172
02/28/16 15:37:54 (pid:10166) condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.
02/28/16 15:37:54 (pid:10166) IO: Failed to read packet header
02/28/16 15:39:54 (pid:10166) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector lcggwms02.gridpp.rl.ac.uk:9622.
02/28/16 15:39:54 (pid:10166) IO: Failed to read packet header
02/28/16 15:39:54 (pid:10166) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9622
02/28/16 15:39:54 (pid:10166) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9622 failed; will try to reconnect in 60 seconds.
02/28/16 15:40:54 (pid:10166) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9622 as ccbid 130.246.180.120:9622#126962
02/28/16 15:42:55 (pid:10166) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11
02/28/16 15:42:55 (pid:10166) Buf::write(): condor_write() failed
02/28/16 15:47:56 (pid:10166) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
02/28/16 15:47:56 (pid:10166) Buf::write(): condor_write() failed


As you can see, if the suspend is shorter than 20 min, the server re-establishes contact.

Otherwise, it does not.

MASTER_GCB_RECONNECT_TIMEOUT is a variable that is set to 1200 s (20 min) and seems to be responsible.
However, I would try increasing it a little (to maybe 1800 s) and see if that makes a difference.
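If that hypothesis is right, the experiment is a one-line glidein config change (hedged: whether this GCB-era macro is actually honoured by this glidein's Condor version is unverified here):

```text
# Hypothetical experiment: widen the suspected 20-min reconnect window
# to 30 min and check whether the suspend/resume cutoff moves with it.
MASTER_GCB_RECONNECT_TIMEOUT = 1800
```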
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2117 - Posted: 28 Feb 2016, 15:41:20 UTC

I have noticed that all "unknown" jobs have a wall-time of exactly 24 h.

This holds even for the older batches (with JobLeaseDuration still set to 6 days).
Maybe it is worth changing the JobLeaseDuration to slightly less than 24 h.
It seems to be a cutoff point (a timeout?).
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,875,421
RAC: 220
Message 2118 - Posted: 28 Feb 2016, 17:11:41 UTC - in response to Message 2117.  

I have noticed that all "unknown" jobs have a wall-time of exactly 24 h.

This holds even for the older batches (with JobLeaseDuration still set to 6 days).
Maybe it is worth changing the JobLeaseDuration to slightly less than 24 h.
It seems to be a cutoff point (a timeout?).

That's an astute observation. I wonder if there's Yet Another Time-out that Laurence hasn't noticed yet? I set the current batch to 23:45:12 (85512 s), but note that the small test batch is unchanged even though I set its JLD to 300 secs yesterday.
I'll try to look into this further tomorrow (the 100-event test batch is just about manageable in extent) but I do have other work with higher priority.
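For the record, the 23:45:12 → 85512 s conversion checks out (a throwaway helper, not project code):

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

print(hms_to_seconds("23:45:12"))  # → 85512
```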
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2119 - Posted: 28 Feb 2016, 17:31:19 UTC

the small test batch is unchanged even though I set its JLD to 300 secs yesterday

One needs to know: once changed, how long does it take to take effect?
Straight away, at the next "run", or with the next CMS task?
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2120 - Posted: 28 Feb 2016, 18:31:16 UTC
Last modified: 28 Feb 2016, 19:04:53 UTC

OBSERVATIONS:

If a task is suspended for <20 min, everything is fine.
For a task suspended >20 min, the following happens:

After resume, the task is finished and uploaded (it sits in the WNPostProc state).
No new job is started.
No new glidein (run) is started.

It looks like NO WORK IS DONE until the 24-hour CMS task ends.

I did not wait for the CMS task to finish (which it may not do, after 24 h),
but no new "run" has started, and one is more than 3 h overdue.

EDIT:
02/28/16 16:32:31 (pid:10166) condor_write(): Socket closed when trying to write 618 bytes to <130.246.180.120:9818>, fd is 11
02/28/16 16:32:31 (pid:10166) Buf::write(): condor_write() failed
02/28/16 16:32:32 (pid:10166) condor_write(): Socket closed when trying to write 193 bytes to <130.246.180.120:9818>, fd is 11
02/28/16 16:32:32 (pid:10166) Buf::write(): condor_write() failed
02/28/16 16:32:32 (pid:10166) Failed to send job exit status to shadow
02/28/16 16:32:32 (pid:10166) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
02/28/16 16:32:32 (pid:10166) Returning from CStarter::JobReaper()

I guess that explains it.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2121 - Posted: 28 Feb 2016, 21:07:44 UTC
Last modified: 28 Feb 2016, 21:17:11 UTC

Jobs are now coming out of the WNPostProc state and changing to the "unknown" state.
I'm not sure if that is good or bad, but it seems related to changing the JLD to 1 day.

We will have to see if they validate.


EDIT: never mind, false alarm.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 328,405
RAC: 184
Message 2123 - Posted: 29 Feb 2016, 9:24:33 UTC - in response to Message 2116.  

I think that the condor_write() message is due to the shadow process disappearing.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 328,405
RAC: 184
Message 2129 - Posted: 29 Feb 2016, 16:29:15 UTC - in response to Message 2123.  

OK. We are targeting suspends of up to 1 hour. The following settings have been pushed to CVMFS and should be available with new glideins (runs) this evening.

NOT_RESPONDING_TIMEOUT = 7200
CCB_HEARTBEAT_INTERVAL = 3600
ALIVE_INTERVAL = 1800
MAX_CLAIM_ALIVES_MISSED = 6
JobLeaseDuration = 7200

You can check the glidein-stderr file from the graphics/logs for the settings.
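Those values should then show up in the glidein-stderr dump. A sketch for pulling them out of such text (the exact dump format is an assumption; `extract_settings` is a made-up helper, fed here with the settings listed above):

```python
import re

# The five settings announced above.
WANTED = {"NOT_RESPONDING_TIMEOUT", "CCB_HEARTBEAT_INTERVAL",
          "ALIVE_INTERVAL", "MAX_CLAIM_ALIVES_MISSED", "JobLeaseDuration"}

def extract_settings(text):
    """Return {name: value} for the wanted settings found in a
    condor-config-style dump (lines like 'NAME = VALUE')."""
    found = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\w+)\s*=\s*(\S+)", line)
        if m and m.group(1) in WANTED:
            found[m.group(1)] = int(m.group(2))
    return found

# Sample text mirroring the announced settings:
sample = """\
NOT_RESPONDING_TIMEOUT = 7200
CCB_HEARTBEAT_INTERVAL = 3600
ALIVE_INTERVAL = 1800
MAX_CLAIM_ALIVES_MISSED = 6
JobLeaseDuration = 7200
"""
print(extract_settings(sample))
```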
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2130 - Posted: 29 Feb 2016, 16:41:09 UTC - in response to Message 2129.  
Last modified: 29 Feb 2016, 16:51:37 UTC

Thanks, Laurence.

We will still have to address the timeout value that sits at 20 min.
That is the critical time.
I think we need to go through all the timeout variables that have a value of 20 min (1200 s) and decide which one is the likely candidate.

Then we change it slightly, to maybe 1800 s, and see whether the critical point moves with it.

Agreed?
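One way to enumerate the 1200 s candidates: `condor_config_val -dump` prints every configured macro, and filtering on the value narrows the list. A sketch that filters such output (the sample dump below is made up; run the real command on a glidein node):

```python
def macros_with_value(dump_text, value="1200"):
    """From 'condor_config_val -dump' style output ('NAME = VALUE'),
    return the names of the macros set exactly to the given value."""
    names = []
    for line in dump_text.splitlines():
        if "=" in line:
            name, _, val = line.partition("=")
            if val.strip() == value:
                names.append(name.strip())
    return names

# Made-up sample dump; on a real node: condor_config_val -dump
sample = """\
ALIVE_INTERVAL = 300
SHADOW_TIMEOUT_MULTIPLIER = 0
MASTER_GCB_RECONNECT_TIMEOUT = 1200
SEC_DEFAULT_SESSION_DURATION = 1200
"""
print(macros_with_value(sample))  # → ['MASTER_GCB_RECONNECT_TIMEOUT', 'SEC_DEFAULT_SESSION_DURATION']
```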


©2024 CERN