Message boards : News : Suspend/Resume

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2066 - Posted: 19 Feb 2016, 19:24:59 UTC - in response to Message 2064.  
Last modified: 19 Feb 2016, 19:33:57 UTC

Still does not work.
Suspend for 17 min is OK.
Suspend for 23 min: restarted from run-1 (it was at run-4).

Condor not starting:

Masterlog

02/19/16 20:11:36 (pid:7718) ******************************************************
02/19/16 20:11:36 (pid:7718) ** condor_master (CONDOR_MASTER) STARTING UP
02/19/16 20:11:36 (pid:7718) ** /home/boinc/CMSRun/glide_s1xJ7X/main/condor/sbin/condor_master



EDIT: run-2 started correctly.
02/19/16 20:11:36 (pid:7718) ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
02/19/16 20:11:36 (pid:7718) ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
02/19/16 20:11:36 (pid:7718) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
02/19/16 20:11:36 (pid:7718) ** $CondorPlatform: x86_64_RedHat5 $
02/19/16 20:11:36 (pid:7718) ** PID = 7718
02/19/16 20:11:36 (pid:7718) ** Log last touched time unavailable (No such file or directory)
02/19/16 20:11:36 (pid:7718) ******************************************************
02/19/16 20:11:36 (pid:7718) Using config source: /home/boinc/CMSRun/glide_s1xJ7X/condor_config
02/19/16 20:11:36 (pid:7718) config Macros = 215, Sorted = 215, StringBytes = 10629, TablesBytes = 7780
02/19/16 20:11:36 (pid:7718) CLASSAD_CACHING is OFF
02/19/16 20:11:36 (pid:7718) Daemon Log is logging: D_ALWAYS D_ERROR
02/19/16 20:11:36 (pid:7718) DaemonCore: command socket at <10.0.2.15:41670?noUDP>
02/19/16 20:11:36 (pid:7718) DaemonCore: private command socket at <10.0.2.15:41670>
02/19/16 20:11:36 (pid:7718) CCBListener: heartbeat disabled because interval is configured to be 0
02/19/16 20:11:36 (pid:7718) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#119419
02/19/16 20:11:36 (pid:7718) Master restart (GRACEFUL) is watching /home/boinc/CMSRun/glide_s1xJ7X/main/condor/sbin/condor_master (mtime:1455909077)
02/19/16 20:11:36 (pid:7718) Started DaemonCore process "/home/boinc/CMSRun/glide_s1xJ7X/main/condor/sbin/condor_startd", pid and pgroup = 7721
02/19/16 20:12:38 (pid:7718) DefaultReaper unexpectedly called on pid 7721, status 25344.
02/19/16 20:12:38 (pid:7718) The STARTD (pid 7721) exited with status 99 (daemon will not restart automatically)
02/19/16 20:17:01 (pid:7718) The DaemonShutdown expression "(STARTD_StartTime =?= 0)" evaluated to TRUE: starting graceful shutdown
02/19/16 20:17:01 (pid:7718) Got SIGTERM. Performing graceful shutdown.
02/19/16 20:17:01 (pid:7718) All daemons are gone. Exiting.
02/19/16 20:17:01 (pid:7718) **** condor_master (condor_MASTER) pid 7718 EXITING WITH STATUS 99
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 328,405
RAC: 184
Message 2070 - Posted: 20 Feb 2016, 19:58:55 UTC - in response to Message 2066.  

Suspend should now work up to 1 hour.

After that in the MasterLog you should see the error:

ERROR: Child pid xxxx appears hung! Killing it hard.

This is most probably due to the Condor NOT_RESPONDING_TIMEOUT parameter, which is set by default to 3600 s.
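If that diagnosis is right, the knob is a glidein-side Condor configuration override. A minimal sketch, assuming the glidein's condor_config is the file being extended (the value shown is illustrative, not necessarily what the project deployed):

```text
# Illustrative Condor config override: let a daemon be unresponsive
# for up to 2 hours (e.g. across a VM suspend) before the master
# declares it hung and kills it hard.
NOT_RESPONDING_TIMEOUT = 7200
```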
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 266
Message 2071 - Posted: 21 Feb 2016, 11:54:34 UTC - in response to Message 2070.  
Last modified: 21 Feb 2016, 11:55:11 UTC

Suspend should now work up to 1 hour.

I tested twice with a suspend period of 30 minutes. After resume the job ran on and wasn't killed.

However, it was also not killed with a suspend period of more than an hour.

Begin processing the 1st record. Run 1, Event 177501, LumiSection 2131 at 21-Feb-2016 11:31:52.808 CET
Begin processing the 2nd record. Run 1, Event 177502, LumiSection 2131 at 21-Feb-2016 11:31:56.513 CET
Begin processing the 3rd record. Run 1, Event 177503, LumiSection 2131 at 21-Feb-2016 11:31:58.713 CET
Begin processing the 4th record. Run 1, Event 177504, LumiSection 2131 at 21-Feb-2016 11:32:52.199 CET
Begin processing the 5th record. Run 1, Event 177505, LumiSection 2131 at 21-Feb-2016 11:33:06.981 CET
Begin processing the 6th record. Run 1, Event 177506, LumiSection 2131 at 21-Feb-2016 11:33:19.291 CET


21-Feb-2016 11:34:34 [CMS-dev] task wu_1455632064_138_0 suspended by user
21-Feb-2016 12:35:53 [CMS-dev] task wu_1455632064_138_0 resumed by user

Begin processing the 7th record. Run 1, Event 177507, LumiSection 2131 at 21-Feb-2016 12:36:17.789 CET
Begin processing the 8th record. Run 1, Event 177508, LumiSection 2131 at 21-Feb-2016 12:36:31.528 CET
Begin processing the 9th record. Run 1, Event 177509, LumiSection 2131 at 21-Feb-2016 12:37:11.735 CET
Begin processing the 10th record. Run 1, Event 177510, LumiSection 2131 at 21-Feb-2016 12:37:37.691 CET
Begin processing the 11th record. Run 1, Event 177511, LumiSection 2131 at 21-Feb-2016 12:37:47.472 CET
Begin processing the 12th record. Run 1, Event 177512, LumiSection 2131 at 21-Feb-2016 12:37:50.508 CET
Begin processing the 13th record. Run 1, Event 177513, LumiSection 2131 at 21-Feb-2016 12:37:50.521 CET
G4Fragment::CalculateExcitationEnergy(): WARNING
Fragment: A = 61, Z = 28, U = -5.432e-02 MeV IsStable= 1
P = (2.128e+02,1.243e+02,-1.869e+02) MeV E = 5.674e+04 MeV

Begin processing the 14th record. Run 1, Event 177514, LumiSection 2131 at 21-Feb-2016 12:38:54.855 CET
Begin processing the 15th record. Run 1, Event 177515, LumiSection 2131 at 21-Feb-2016 12:38:54.887 CET
Begin processing the 16th record. Run 1, Event 177516, LumiSection 2131 at 21-Feb-2016 12:39:18.576 CET
Begin processing the 17th record. Run 1, Event 177517, LumiSection 2131 at 21-Feb-2016 12:40:03.898 CET
etc etc etc


Maybe my suspend period was just a bit too short?
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 328,405
RAC: 184
Message 2072 - Posted: 21 Feb 2016, 14:27:02 UTC - in response to Message 2071.  

Maybe not. The not-responding timeout was increased this morning, so it may be working now.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 266
Message 2073 - Posted: 21 Feb 2016, 15:06:27 UTC - in response to Message 2072.  

Maybe not. The not-responding timeout was increased this morning, so it may be working now.

Maybe you can enlighten us as to what the timeout is now.
I just did a suspend on another (standalone, i.e. outside BOINC) VM with almost 2 hours of suspend time, and that job also continued where it left off.

Begin processing the 1st record. Run 1, Event 193001, LumiSection 2317 at 21-Feb-2016 13:54:47.168 CET
Begin processing the 2nd record. Run 1, Event 193002, LumiSection 2317 at 21-Feb-2016 13:55:52.829 CET
Begin processing the 3rd record. Run 1, Event 193003, LumiSection 2317 at 21-Feb-2016 13:57:40.627 CET
Begin processing the 4th record. Run 1, Event 193004, LumiSection 2317 at 21-Feb-2016 13:57:57.934 CET
Begin processing the 5th record. Run 1, Event 193005, LumiSection 2317 at 21-Feb-2016 15:55:28.727 CET
Begin processing the 6th record. Run 1, Event 193006, LumiSection 2317 at 21-Feb-2016 15:55:29.936 CET
Begin processing the 7th record. Run 1, Event 193007, LumiSection 2317 at 21-Feb-2016 15:55:33.620 CET
Begin processing the 8th record. Run 1, Event 193008, LumiSection 2317 at 21-Feb-2016 15:55:51.843 CET
Begin processing the 9th record. Run 1, Event 193009, LumiSection 2317 at 21-Feb-2016 15:56:11.055 CET
Begin processing the 10th record. Run 1, Event 193010, LumiSection 2317 at 21-Feb-2016 15:56:27.088 CET
G4Fragment::CalculateExcitationEnergy(): WARNING
Fragment: A = 26, Z = 12, U = -1.082e+00 MeV IsStable= 1
P = (1.370e+02,1.520e+02,-1.169e+02) MeV E = 2.420e+04 MeV

Begin processing the 11th record. Run 1, Event 193011, LumiSection 2317 at 21-Feb-2016 15:56:47.451 CET
Begin processing the 12th record. Run 1, Event 193012, LumiSection 2317 at 21-Feb-2016 15:56:49.414 CET
Begin processing the 13th record. Run 1, Event 193013, LumiSection 2317 at 21-Feb-2016 15:57:11.801 CET
Begin processing the 14th record. Run 1, Event 193014, LumiSection 2317 at 21-Feb-2016 15:57:44.081 CET
etc etc etc
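The actual suspend window can be read straight off those "Begin processing" lines. A small sketch (plain Python; `largest_gap_minutes` is a made-up helper, and the timestamp format is taken from the log above) that measures the biggest gap between consecutive events:

```python
from datetime import datetime

def largest_gap_minutes(lines):
    """Return the largest gap (in minutes) between consecutive
    'Begin processing' timestamps in a CMS job log."""
    fmt = "%d-%b-%Y %H:%M:%S.%f"
    times = []
    for line in lines:
        if "Begin processing" in line:
            # the timestamp sits between ' at ' and the trailing timezone
            stamp = line.split(" at ")[1].rsplit(" ", 1)[0]
            times.append(datetime.strptime(stamp, fmt))
    gaps = [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]
    return max(gaps)

# Two lines taken from the log above (events 4 and 5):
log = [
    "Begin processing the 4th record. Run 1, Event 193004, LumiSection 2317 at 21-Feb-2016 13:57:57.934 CET",
    "Begin processing the 5th record. Run 1, Event 193005, LumiSection 2317 at 21-Feb-2016 15:55:28.727 CET",
]
print(round(largest_gap_minutes(log)))  # the ~2 h suspend shows up as a ~118 min gap
```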
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 266
Message 2074 - Posted: 21 Feb 2016, 15:24:10 UTC

I don't know whether this is a side effect of the timeout increase.
The first suspended job mentioned above ran to completion and uploaded, but the run doesn't get a new job.
Job number 711 returned and, after almost 2 hours, is still in the wnpostproc state.

02/21/16 11:28:04 (pid:12644) Running job as user (null)
02/21/16 11:28:04 (pid:12644) Create_Process succeeded, pid=12648
02/21/16 12:36:54 (pid:12644) condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.
02/21/16 12:36:54 (pid:12644) IO: Failed to read packet header
02/21/16 12:36:54 (pid:12644) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector lcggwms02.gridpp.rl.ac.uk:9620.
02/21/16 12:36:54 (pid:12644) IO: Failed to read packet header
02/21/16 12:36:54 (pid:12644) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9620
02/21/16 12:36:54 (pid:12644) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9620 failed; will try to reconnect in 60 seconds.
02/21/16 12:37:54 (pid:12644) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#125489
02/21/16 12:41:54 (pid:12644) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 12:41:54 (pid:12644) Buf::write(): condor_write() failed
02/21/16 12:46:55 (pid:12644) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 12:46:55 (pid:12644) Buf::write(): condor_write() failed
02/21/16 12:51:55 (pid:12644) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 12:51:55 (pid:12644) Buf::write(): condor_write() failed
02/21/16 12:56:56 (pid:12644) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 12:56:56 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:01:57 (pid:12644) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:01:57 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:06:57 (pid:12644) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:06:57 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:11:59 (pid:12644) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:11:59 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:16:59 (pid:12644) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:16:59 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:21:59 (pid:12644) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:21:59 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:27:00 (pid:12644) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:27:00 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:32:00 (pid:12644) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:32:00 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:37:02 (pid:12644) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:37:02 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:42:02 (pid:12644) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:42:02 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:47:02 (pid:12644) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:47:02 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:52:03 (pid:12644) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:52:03 (pid:12644) Buf::write(): condor_write() failed
02/21/16 13:57:03 (pid:12644) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 13:57:03 (pid:12644) Buf::write(): condor_write() failed
02/21/16 14:02:03 (pid:12644) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 14:02:03 (pid:12644) Buf::write(): condor_write() failed
02/21/16 14:07:04 (pid:12644) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 14:07:04 (pid:12644) Buf::write(): condor_write() failed
02/21/16 14:12:04 (pid:12644) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 14:12:04 (pid:12644) Buf::write(): condor_write() failed
02/21/16 14:17:06 (pid:12644) condor_write(): Socket closed when trying to write 565 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 14:17:06 (pid:12644) Buf::write(): condor_write() failed
02/21/16 14:22:06 (pid:12644) condor_write(): Socket closed when trying to write 565 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 14:22:06 (pid:12644) Buf::write(): condor_write() failed
02/21/16 14:27:06 (pid:12644) condor_write(): Socket closed when trying to write 565 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 14:27:06 (pid:12644) Buf::write(): condor_write() failed
02/21/16 14:27:55 (pid:12644) Process exited, pid=12648, status=0
02/21/16 14:27:55 (pid:12644) condor_write(): Socket closed when trying to write 620 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 14:27:55 (pid:12644) Buf::write(): condor_write() failed
02/21/16 14:27:55 (pid:12644) condor_write(): Socket closed when trying to write 192 bytes to <130.246.180.120:9818>, fd is 11
02/21/16 14:27:55 (pid:12644) Buf::write(): condor_write() failed
02/21/16 14:27:55 (pid:12644) Failed to send job exit status to shadow
02/21/16 14:27:55 (pid:12644) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
02/21/16 14:27:55 (pid:12644) Returning from CStarter::JobReaper()
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 328,405
RAC: 184
Message 2075 - Posted: 21 Feb 2016, 15:30:29 UTC - in response to Message 2073.  

48 hours
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 328,405
RAC: 184
Message 2076 - Posted: 21 Feb 2016, 15:32:15 UTC - in response to Message 2074.  

Yes, I think there is a consequence of these changes that we need to follow up on Monday.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 266
Message 2077 - Posted: 21 Feb 2016, 15:46:43 UTC

OK. Thanks Laurence!

I rebooted the VM to flush the inter(dis)actions and got a new job now.
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 300
Message 2078 - Posted: 21 Feb 2016, 18:34:16 UTC - in response to Message 2073.  
Last modified: 21 Feb 2016, 19:26:55 UTC


Just did a suspend on another (standalone, i.e. outside BOINC) VM with almost 2 hours of suspend time, and that job also continued where it left off.


I've just started a Linux host (1031) running BOINC that had been stopped for about twelve hours. The job in progress, 629, resumed properly as far as I can tell. It has got to event 238 as I write, so I will wait and see how (or whether) it completes on Dashboard.

Edit: After about 45 minutes there is no new job on the host, and the first job is still showing "wnpostproc".

The walltime shown on Dashboard includes the time for which the host was switched off, which will do wonders for any statistics based on it.
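The walltime inflation m points out is inherent to wall-clock accounting: the wall clock keeps running while a host is suspended or powered off, while CPU time does not. A minimal, project-independent illustration of the two clocks:

```python
import time

wall_start = time.time()          # wall clock: advances even while the process sleeps
cpu_start = time.process_time()   # CPU clock: advances only while we actually compute

time.sleep(0.5)                   # stand-in for a suspended/idle period

wall_elapsed = time.time() - wall_start
cpu_elapsed = time.process_time() - cpu_start

# Wall time includes the idle period; CPU time is close to zero.
print(f"wall={wall_elapsed:.2f}s cpu={cpu_elapsed:.2f}s")
```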
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2110 - Posted: 26 Feb 2016, 22:58:52 UTC
Last modified: 26 Feb 2016, 23:10:11 UTC

I suspended a task for 1 h 08 min.
It seemed to resume, and cmsRun finished with status 0.
However, the StarterLog shows:

02/26/16 21:18:06 (pid:858) Running job as user (null)
02/26/16 21:18:07 (pid:858) Create_Process succeeded, pid=865
02/26/16 23:14:54 (pid:858) condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.
02/26/16 23:14:54 (pid:858) IO: Failed to read packet header
02/26/16 23:14:55 (pid:858) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector lcggwms02.gridpp.rl.ac.uk:9621.
02/26/16 23:14:55 (pid:858) IO: Failed to read packet header
02/26/16 23:14:55 (pid:858) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9621
02/26/16 23:14:55 (pid:858) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9621 failed; will try to reconnect in 60 seconds.
02/26/16 23:15:55 (pid:858) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9621 as ccbid 130.246.180.120:9621#128105
02/26/16 23:19:54 (pid:858) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11
02/26/16 23:19:54 (pid:858) Buf::write(): condor_write() failed
02/26/16 23:24:54 (pid:858) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11
02/26/16 23:24:54 (pid:858) Buf::write(): condor_write() failed
02/26/16 23:29:55 (pid:858) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
02/26/16 23:29:55 (pid:858) Buf::write(): condor_write() failed
02/26/16 23:34:56 (pid:858) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11
02/26/16 23:34:56 (pid:858) Buf::write(): condor_write() failed
02/26/16 23:39:57 (pid:858) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11
02/26/16 23:39:57 (pid:858) Buf::write(): condor_write() failed
02/26/16 23:44:46 (pid:858) Process exited, pid=865, status=0
02/26/16 23:44:46 (pid:858) condor_write(): Socket closed when trying to write 616 bytes to <130.246.180.120:9818>, fd is 11
02/26/16 23:44:46 (pid:858) Buf::write(): condor_write() failed
02/26/16 23:44:47 (pid:858) condor_write(): Socket closed when trying to write 192 bytes to <130.246.180.120:9818>, fd is 11
02/26/16 23:44:47 (pid:858) Buf::write(): condor_write() failed
02/26/16 23:44:47 (pid:858) Failed to send job exit status to shadow
02/26/16 23:44:47 (pid:858) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
02/26/16 23:44:47 (pid:858) Returning from CStarter::JobReaper()




The job has been in the wnpostproc state ever since. It is not starting a new job; it seems to be waiting for the (glidein?) to time out.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2116 - Posted: 28 Feb 2016, 15:03:11 UTC

Here is a comparison of two suspended jobs.

StarterLog:

TASK SUSPENDED FOR <20 min (17 min)


02/28/16 13:42:05 (pid:7017) Running job as user (null)
02/28/16 13:42:06 (pid:7017) Create_Process succeeded, pid=7035
02/28/16 14:13:46 (pid:7017) condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.
02/28/16 14:13:46 (pid:7017) IO: Failed to read packet header
02/28/16 14:17:33 (pid:7017) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector lcggwms02.gridpp.rl.ac.uk:9622.
02/28/16 14:17:33 (pid:7017) IO: Failed to read packet header
02/28/16 14:17:33 (pid:7017) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9622
02/28/16 14:17:33 (pid:7017) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9622 failed; will try to reconnect in 60 seconds.
02/28/16 14:18:33 (pid:7017) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9622 as ccbid 130.246.180.120:9622#126924
02/28/16 14:18:46 (pid:7017) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11
02/28/16 14:18:46 (pid:7017) Buf::write(): condor_write() failed
02/28/16 14:19:58 (pid:7017) Accepted request to reconnect from <130.246.180.120:9818>
02/28/16 14:19:58 (pid:7017) Ignoring old shadow <130.246.180.120:9818?noUDP&sock=2939_634c_1309>
02/28/16 14:19:58 (pid:7017) Communicating with shadow <130.246.180.120:9818?noUDP&sock=2939_634c_1309>
02/28/16 15:05:23 (pid:7017) Process exited, pid=7035, status=0


TASK SUSPENDED FOR >20 min (22 min)

02/28/16 15:05:30 (pid:10166) Running job as user (null)
02/28/16 15:05:30 (pid:10166) Create_Process succeeded, pid=10172
02/28/16 15:37:54 (pid:10166) condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.
02/28/16 15:37:54 (pid:10166) IO: Failed to read packet header
02/28/16 15:39:54 (pid:10166) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector lcggwms02.gridpp.rl.ac.uk:9622.
02/28/16 15:39:54 (pid:10166) IO: Failed to read packet header
02/28/16 15:39:54 (pid:10166) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9622
02/28/16 15:39:54 (pid:10166) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9622 failed; will try to reconnect in 60 seconds.
02/28/16 15:40:54 (pid:10166) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9622 as ccbid 130.246.180.120:9622#126962
02/28/16 15:42:55 (pid:10166) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11
02/28/16 15:42:55 (pid:10166) Buf::write(): condor_write() failed
02/28/16 15:47:56 (pid:10166) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
02/28/16 15:47:56 (pid:10166) Buf::write(): condor_write() failed


As you can see, if the suspend is shorter than 20 min, the server re-establishes contact.

Otherwise, it does not.

MASTER_GCB_RECONNECT_TIMEOUT is a variable that is set to 1200 s (20 min) and seems to be responsible.
However, I would try increasing it a little (to maybe 1800 s) and see if that makes a difference.
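If that hypothesis is right, the experiment is a one-line glidein config change (hedged: whether this GCB-era macro is actually honoured by this glidein's Condor version is unverified here):

```text
# Hypothetical experiment: widen the suspected 20-min reconnect window
# to 30 min and check whether the suspend/resume cutoff moves with it.
MASTER_GCB_RECONNECT_TIMEOUT = 1800
```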
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2117 - Posted: 28 Feb 2016, 15:41:20 UTC

I have noticed that all "unknown" jobs have a wall-time of exactly 24 h.

This holds even for the older batches (with JobLeaseDuration still set to 6 days).
Maybe it is worth changing the JobLeaseDuration to slightly less than 24 h.
It seems to be a cutoff point (a timeout?).
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,875,421
RAC: 220
Message 2118 - Posted: 28 Feb 2016, 17:11:41 UTC - in response to Message 2117.  

I have noticed that all "unknown" jobs have a wall-time of exactly 24 h.

This holds even for the older batches (with JobLeaseDuration still set to 6 days).
Maybe it is worth changing the JobLeaseDuration to slightly less than 24 h.
It seems to be a cutoff point (a timeout?).

That's an astute observation. I wonder if there's Yet Another Time-out that Laurence hasn't noticed yet? I set the current batch to 23:45:12 (85512 s), but note that the small test batch is unchanged even though I set its JLD to 300 secs yesterday.
I'll try to look into this further tomorrow (the 100-event test batch is just about manageable in extent) but I do have other work with higher priority.
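For the record, the 23:45:12 → 85512 s conversion checks out (a throwaway helper, not project code):

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

print(hms_to_seconds("23:45:12"))  # → 85512
```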
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2119 - Posted: 28 Feb 2016, 17:31:19 UTC

the small test batch is unchanged even though I set its JLD to 300 secs yesterday

One needs to know: once changed, how long does it take to take effect?
Straight away, at the next "run", or with the next CMS task?
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2120 - Posted: 28 Feb 2016, 18:31:16 UTC
Last modified: 28 Feb 2016, 19:04:53 UTC

OBSERVATIONS:

If a task is suspended for <20 min, everything is fine.
For a task suspended >20 min, the following happens:

After resume, the task is finished and uploaded (it sits in the WNPostProc state).
No new job is started.
No new glidein (run) is started.

It looks like NO WORK IS DONE until the 24-hour CMS task ends.

I did not wait for the CMS task to finish (which it may not do, after 24 h),
but no new "run" has started, and one is more than 3 h overdue.

EDIT:
02/28/16 16:32:31 (pid:10166) condor_write(): Socket closed when trying to write 618 bytes to <130.246.180.120:9818>, fd is 11
02/28/16 16:32:31 (pid:10166) Buf::write(): condor_write() failed
02/28/16 16:32:32 (pid:10166) condor_write(): Socket closed when trying to write 193 bytes to <130.246.180.120:9818>, fd is 11
02/28/16 16:32:32 (pid:10166) Buf::write(): condor_write() failed
02/28/16 16:32:32 (pid:10166) Failed to send job exit status to shadow
02/28/16 16:32:32 (pid:10166) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
02/28/16 16:32:32 (pid:10166) Returning from CStarter::JobReaper()

I guess that explains it.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2121 - Posted: 28 Feb 2016, 21:07:44 UTC
Last modified: 28 Feb 2016, 21:17:11 UTC

Jobs are now coming out of the WNPostProc state and changing to the "unknown" state.
I'm not sure if that is good or bad, but it seems related to changing the JLD to 1 day.

We will have to see if they validate.


EDIT: never mind, false alarm.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 328,405
RAC: 184
Message 2123 - Posted: 29 Feb 2016, 9:24:33 UTC - in response to Message 2116.  

I think that the condor_write() message is due to the shadow process disappearing.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 328,405
RAC: 184
Message 2129 - Posted: 29 Feb 2016, 16:29:15 UTC - in response to Message 2123.  

OK. We are targeting suspends of up to 1 hour. The following settings have been pushed to CVMFS and should be available with new glideins (runs) this evening.

NOT_RESPONDING_TIMEOUT = 7200
CCB_HEARTBEAT_INTERVAL = 3600
ALIVE_INTERVAL = 1800
MAX_CLAIM_ALIVES_MISSED = 6
JobLeaseDuration = 7200

You can check the glidein-stderr file from the graphics/logs for the settings.
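Those values should then show up in the glidein-stderr dump. A sketch for pulling them out of such text (the exact dump format is an assumption; `extract_settings` is a made-up helper, fed here with the settings listed above):

```python
import re

# The five settings announced above.
WANTED = {"NOT_RESPONDING_TIMEOUT", "CCB_HEARTBEAT_INTERVAL",
          "ALIVE_INTERVAL", "MAX_CLAIM_ALIVES_MISSED", "JobLeaseDuration"}

def extract_settings(text):
    """Return {name: value} for the wanted settings found in a
    condor-config-style dump (lines like 'NAME = VALUE')."""
    found = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\w+)\s*=\s*(\S+)", line)
        if m and m.group(1) in WANTED:
            found[m.group(1)] = int(m.group(2))
    return found

# Sample text mirroring the announced settings:
sample = """\
NOT_RESPONDING_TIMEOUT = 7200
CCB_HEARTBEAT_INTERVAL = 3600
ALIVE_INTERVAL = 1800
MAX_CLAIM_ALIVES_MISSED = 6
JobLeaseDuration = 7200
"""
print(extract_settings(sample))
```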
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2130 - Posted: 29 Feb 2016, 16:41:09 UTC - in response to Message 2129.  
Last modified: 29 Feb 2016, 16:51:37 UTC

Thanks, Laurence.

We will still have to address the timeout value that sits at 20 min.
That is the critical time.
I think we need to go through all the timeout variables that have a value of 20 min (1200 s) and decide which one is the likely candidate.

Then we change it slightly, to maybe 1800 s, and see whether the critical point moves with it.

Agreed?
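One way to enumerate the 1200 s candidates: `condor_config_val -dump` prints every configured macro, and filtering on the value narrows the list. A sketch that filters such output (the sample dump below is made up; run the real command on a glidein node):

```python
def macros_with_value(dump_text, value="1200"):
    """From 'condor_config_val -dump' style output ('NAME = VALUE'),
    return the names of the macros set exactly to the given value."""
    names = []
    for line in dump_text.splitlines():
        if "=" in line:
            name, _, val = line.partition("=")
            if val.strip() == value:
                names.append(name.strip())
    return names

# Made-up sample dump; on a real node: condor_config_val -dump
sample = """\
ALIVE_INTERVAL = 300
SHADOW_TIMEOUT_MULTIPLIER = 0
MASTER_GCB_RECONNECT_TIMEOUT = 1200
SEC_DEFAULT_SESSION_DURATION = 1200
"""
print(macros_with_value(sample))  # → ['MASTER_GCB_RECONNECT_TIMEOUT', 'SEC_DEFAULT_SESSION_DURATION']
```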


©2024 CERN