Message boards : News : Suspend/Resume
Message board moderation

To post messages, you must log in.

Previous · 1 · 2 · 3 · 4 · 5 · Next

AuthorMessage
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 482
Message 2131 - Posted: 29 Feb 2016, 21:32:49 UTC - in response to Message 2129.  

OK. We are targeting suspending up to 1 hour.

After the 1st job started with a new run, I paused the VM for 30 minutes.

Results after resume:

- Events went on processing (job 1610)
- Job result uploaded OK
- New Job (1671) started without long interruption in the same glidein run
- Postprocessing lasted a bit longer (or was I impatient) but finally
- Job finished successful on Dashboard.
ID: 2131 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2132 - Posted: 29 Feb 2016, 21:47:06 UTC - in response to Message 2131.  
Last modified: 29 Feb 2016, 21:48:17 UTC

Good news!
What are you trying next 40min?
I will try the next one up(50min).

I am trying 21 min, but not finished, yet.
ID: 2132 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 482
Message 2133 - Posted: 29 Feb 2016, 22:01:51 UTC - in response to Message 2132.  

What are you trying next 40min?

I try 360min of suspending myself ;)
In 1 hour it's finally March.
New month, new day, new chances!
ID: 2133 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2134 - Posted: 29 Feb 2016, 22:05:43 UTC - in response to Message 2133.  

I try 360min of suspending myself ;)


6 hours, that is ambitious.
ID: 2134 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2135 - Posted: 29 Feb 2016, 23:05:40 UTC

After the 1st job started with a new run, I paused the VM for 30 minutes.


I did a test with 21min suspend and it failed; same as before.

Are you sure, it worked with 30min?
ID: 2135 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2136 - Posted: 1 Mar 2016, 0:30:05 UTC

I found, that the CCB_Heartbeat also serves to keep the network connection to the server open.However, if there is no traffic on this line for a while, the ISP server considers the line dead and disconnects. This is independent of any cern-server or host settings.

One solution could be to set the CCB_Heartbeat_Interval to 0.
According to the manual, that is possible.
This should also cause a proper reconnect to the server instead of trying to send information over a line, that is presumed open(but closed).

The other option would be to investigate other CCB related parameters, which i do not want to get into, at this point.

This "line dead" time could be different between ISPs, which would explain, why different volunteers have different timeouts.
ID: 2136 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 482
Message 2137 - Posted: 1 Mar 2016, 7:44:28 UTC - in response to Message 2135.  
Last modified: 1 Mar 2016, 7:59:28 UTC

Are you sure, it worked with 30min?

Yeah, it was absolutely error free and job 1610 has finished state on the dashboard as next job 1671 has.

Begin processing the 2nd record. Run 1, Event 402252, LumiSection 4828 at 29-Feb-2016 19:08:56.158 CET
Begin processing the 3rd record. Run 1, Event 402253, LumiSection 4828 at 29-Feb-2016 19:46:40.065 CET


Now testing with a suspend period of 55 minutes.
ID: 2137 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1064
Credit: 326,570
RAC: 95
Message 2138 - Posted: 1 Mar 2016, 8:54:27 UTC - in response to Message 2136.  

It should try to reconnect. Do you have the StarterLog available?
ID: 2138 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 482
Message 2139 - Posted: 1 Mar 2016, 9:10:03 UTC - in response to Message 2138.  
Last modified: 1 Mar 2016, 9:15:52 UTC

It should try to reconnect. Do you have the StarterLog available?

I suppose you mean Rasputin, but in the StarterLog is only

02/29/16 19:04:17 (pid:3363) Create_Process succeeded, pid=3367
02/29/16 22:03:59 (pid:3363) Process exited, pid=3367, status=0


No sign of a reconnect. The 30 minutes suspend period is in that time frame. See former post.

Off topic: Could it be that the 77MB root file is uploaded twice?
ID: 2139 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
m
Volunteer tester

Send message
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 544
Message 2141 - Posted: 1 Mar 2016, 10:07:02 UTC - in response to Message 2136.  
Last modified: 1 Mar 2016, 10:37:01 UTC

I found, that the CCB_Heartbeat also serves to keep the network connection to the server open.However, if there is no traffic on this line for a while, the ISP server considers the line dead and disconnects. This is independent of any cern-server or host settings.

One solution could be to set the CCB_Heartbeat_Interval to 0.
According to the manual, that is possible.
This should also cause a proper reconnect to the server instead of trying to send information over a line, that is presumed open(but closed).

The other option would be to investigate other CCB related parameters, which i do not want to get into, at this point.

This "line dead" time could be different between ISPs, which would explain, why different volunteers have different timeouts.


It might not be your ISP. My router (the PPP part) has this facility, this is from the setup help:-

"Idle Timeout

An idle Internet connection will be terminated after this time period.

If this value is zero (0), then the connection will be "kept alive" by re-connecting immediately whenever the connection is lost."


It implies that the connection is always allowed to die at some point.

Edit: suppose that the ISP could do it as well but I don't think much
of my chances of getting them to tell me if they do and what the timeout is.
ID: 2141 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 482
Message 2142 - Posted: 1 Mar 2016, 10:31:32 UTC - in response to Message 2137.  

Now testing with a suspend period of 55 minutes.

The job survived the 55 minutes suspend and is on Dashboard recorded as finished properly.

1 strange thing: Start 01-03 06:45:40 Finish 01-03 10:06:21 WallTime 24:00:01
This is maybe caused while I was not the initial receiver of the job. The WNIp is from someone else.

Next Job started OK, but is was in a new Glide run, cause the former run had run more than 6 hours.
ID: 2142 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2143 - Posted: 1 Mar 2016, 11:27:42 UTC
Last modified: 1 Mar 2016, 12:01:13 UTC

I have the Starter log.
First, i had a task suspended for 18min-all fine(job1653). (start20.48local)
Then i suspended for 21min- fail(job1683).(start 22.10local)

I left it and after about 1hour, it started a new task and finished normally (job1634).(start00.51local).
Then it created a new run and continued. EDIT:see bottom for timeout reason.

STARTERLOG:
02/29/16 19:20:33 (pid:7637) FILETRANSFER: "/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
02/29/16 19:20:33 (pid:7637) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
02/29/16 19:20:33 (pid:7637) WARNING: Initializing plugins returned: FILETRANSFER:1:"/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
02/29/16 19:20:42 (pid:7647) ******************************************************
02/29/16 19:20:42 (pid:7647) ** condor_starter (CONDOR_STARTER) STARTING UP
02/29/16 19:20:42 (pid:7647) ** /home/boinc/CMSRun/glide_q9Cf9U/main/condor/sbin/condor_starter
02/29/16 19:20:42 (pid:7647) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
02/29/16 19:20:42 (pid:7647) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
02/29/16 19:20:42 (pid:7647) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
02/29/16 19:20:42 (pid:7647) ** $CondorPlatform: x86_64_RedHat5 $
02/29/16 19:20:42 (pid:7647) ** PID = 7647
02/29/16 19:20:42 (pid:7647) ** Log last touched 2/29 19:20:33
02/29/16 19:20:42 (pid:7647) ******************************************************
02/29/16 19:20:42 (pid:7647) Using config source: /home/boinc/CMSRun/glide_q9Cf9U/condor_config
02/29/16 19:20:42 (pid:7647) config Macros = 217, Sorted = 217, StringBytes = 10777, TablesBytes = 7852
02/29/16 19:20:42 (pid:7647) CLASSAD_CACHING is OFF
02/29/16 19:20:42 (pid:7647) Daemon Log is logging: D_ALWAYS D_ERROR
02/29/16 19:20:42 (pid:7647) DaemonCore: command socket at <10.0.2.15:37517?noUDP>
02/29/16 19:20:42 (pid:7647) DaemonCore: private command socket at <10.0.2.15:37517>
02/29/16 19:20:42 (pid:7647) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#133741
02/29/16 19:20:42 (pid:7647) Communicating with shadow <130.246.180.120:9818?noUDP&sock=2939_634c_1979>
02/29/16 19:20:42 (pid:7647) Submitting machine is "lcggwms02.gridpp.rl.ac.uk"
02/29/16 19:20:42 (pid:7647) setting the orig job name in starter
02/29/16 19:20:42 (pid:7647) setting the orig job iwd in starter
02/29/16 19:20:42 (pid:7647) Chirp config summary: IO false, Updates false, Delayed updates true.
02/29/16 19:20:42 (pid:7647) Initialized IO Proxy.
02/29/16 19:20:42 (pid:7647) Done setting resource limits
02/29/16 19:20:42 (pid:7647) FILETRANSFER: "/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
02/29/16 19:20:42 (pid:7647) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
02/29/16 19:20:46 (pid:7647) File transfer completed successfully.
02/29/16 19:20:46 (pid:7647) Job 1029118.0 set to execute immediately
02/29/16 19:20:46 (pid:7647) Starting a VANILLA universe job with ID: 1029118.0
02/29/16 19:20:46 (pid:7647) IWD: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_7647
02/29/16 19:20:46 (pid:7647) Output file: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_7647/_condor_stdout
02/29/16 19:20:46 (pid:7647) Error file: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_7647/_condor_stderr
02/29/16 19:20:46 (pid:7647) Renice expr "0" evaluated to 0
02/29/16 19:20:46 (pid:7647) Using wrapper /home/boinc/CMSRun/glide_q9Cf9U/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_7647/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=1616 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_1616.txt --runAndLumis=job_lumis_1616.json --lheInputFiles=False --firstEvent=403751 --firstLumi=4846 --lastEvent=404001 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {}
02/29/16 19:20:46 (pid:7647) Running job as user (null)
02/29/16 19:20:46 (pid:7647) Create_Process succeeded, pid=7654
02/29/16 20:48:07 (pid:7647) Process exited, pid=7654, status=0
02/29/16 20:48:08 (pid:7647) Got SIGQUIT. Performing fast shutdown.
02/29/16 20:48:08 (pid:7647) ShutdownFast all jobs.
02/29/16 20:48:08 (pid:7647) **** condor_starter (condor_STARTER) pid 7647 EXITING WITH STATUS 0
02/29/16 20:48:09 (pid:11892) ******************************************************
02/29/16 20:48:09 (pid:11892) ** condor_starter (CONDOR_STARTER) STARTING UP
02/29/16 20:48:09 (pid:11892) ** /home/boinc/CMSRun/glide_q9Cf9U/main/condor/sbin/condor_starter
02/29/16 20:48:09 (pid:11892) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
02/29/16 20:48:09 (pid:11892) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
02/29/16 20:48:09 (pid:11892) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
02/29/16 20:48:09 (pid:11892) ** $CondorPlatform: x86_64_RedHat5 $
02/29/16 20:48:09 (pid:11892) ** PID = 11892
02/29/16 20:48:09 (pid:11892) ** Log last touched 2/29 20:48:08
02/29/16 20:48:09 (pid:11892) ******************************************************
02/29/16 20:48:09 (pid:11892) Using config source: /home/boinc/CMSRun/glide_q9Cf9U/condor_config
02/29/16 20:48:09 (pid:11892) config Macros = 217, Sorted = 217, StringBytes = 10779, TablesBytes = 7852
02/29/16 20:48:09 (pid:11892) CLASSAD_CACHING is OFF
02/29/16 20:48:09 (pid:11892) Daemon Log is logging: D_ALWAYS D_ERROR
02/29/16 20:48:09 (pid:11892) DaemonCore: command socket at <10.0.2.15:49153?noUDP>
02/29/16 20:48:09 (pid:11892) DaemonCore: private command socket at <10.0.2.15:49153>
02/29/16 20:48:11 (pid:11892) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#133789
02/29/16 20:48:11 (pid:11892) Communicating with shadow <130.246.180.120:9818?noUDP&sock=2939_634c_2021>
02/29/16 20:48:11 (pid:11892) Submitting machine is "lcggwms02.gridpp.rl.ac.uk"
02/29/16 20:48:11 (pid:11892) setting the orig job name in starter
02/29/16 20:48:11 (pid:11892) setting the orig job iwd in starter
02/29/16 20:48:11 (pid:11892) Chirp config summary: IO false, Updates false, Delayed updates true.
02/29/16 20:48:11 (pid:11892) Initialized IO Proxy.
02/29/16 20:48:11 (pid:11892) Done setting resource limits
02/29/16 20:48:11 (pid:11892) FILETRANSFER: "/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
02/29/16 20:48:11 (pid:11892) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
02/29/16 20:48:15 (pid:11892) File transfer completed successfully.
02/29/16 20:48:15 (pid:11892) Job 1029154.0 set to execute immediately
02/29/16 20:48:15 (pid:11892) Starting a VANILLA universe job with ID: 1029154.0
02/29/16 20:48:15 (pid:11892) IWD: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_11892
02/29/16 20:48:15 (pid:11892) Output file: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_11892/_condor_stdout
02/29/16 20:48:15 (pid:11892) Error file: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_11892/_condor_stderr
02/29/16 20:48:15 (pid:11892) Renice expr "0" evaluated to 0
02/29/16 20:48:15 (pid:11892) Using wrapper /home/boinc/CMSRun/glide_q9Cf9U/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_11892/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=1653 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_1653.txt --runAndLumis=job_lumis_1653.json --lheInputFiles=False --firstEvent=413001 --firstLumi=4957 --lastEvent=413251 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {}
02/29/16 20:48:15 (pid:11892) Running job as user (null)
02/29/16 20:48:15 (pid:11892) Create_Process succeeded, pid=11899
02/29/16 21:11:33 (pid:11892) condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.
02/29/16 21:11:33 (pid:11892) IO: Failed to read packet header
02/29/16 21:12:20 (pid:11892) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector lcggwms02.gridpp.rl.ac.uk:9620.
02/29/16 21:12:20 (pid:11892) IO: Failed to read packet header
02/29/16 21:12:20 (pid:11892) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9620
02/29/16 21:12:20 (pid:11892) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9620 failed; will try to reconnect in 60 seconds.
02/29/16 21:13:21 (pid:11892) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#133789
02/29/16 21:16:33 (pid:11892) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11
02/29/16 21:16:33 (pid:11892) Buf::write(): condor_write() failed
02/29/16 21:16:53 (pid:11892) Accepted request to reconnect from <130.246.180.120:9818>
02/29/16 21:16:53 (pid:11892) Ignoring old shadow <130.246.180.120:9818?noUDP&sock=2939_634c_2021>
02/29/16 21:16:53 (pid:11892) Communicating with shadow <130.246.180.120:9818?noUDP&sock=2939_634c_2021>
02/29/16 22:10:16 (pid:11892) Process exited, pid=11899, status=0
02/29/16 22:10:18 (pid:11892) Got SIGQUIT. Performing fast shutdown.
02/29/16 22:10:18 (pid:11892) ShutdownFast all jobs.
02/29/16 22:10:18 (pid:11892) **** condor_starter (condor_STARTER) pid 11892 EXITING WITH STATUS 0
02/29/16 22:10:19 (pid:14925) ******************************************************
02/29/16 22:10:19 (pid:14925) ** condor_starter (CONDOR_STARTER) STARTING UP
02/29/16 22:10:19 (pid:14925) ** /home/boinc/CMSRun/glide_q9Cf9U/main/condor/sbin/condor_starter
02/29/16 22:10:19 (pid:14925) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
02/29/16 22:10:19 (pid:14925) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
02/29/16 22:10:19 (pid:14925) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
02/29/16 22:10:19 (pid:14925) ** $CondorPlatform: x86_64_RedHat5 $
02/29/16 22:10:19 (pid:14925) ** PID = 14925
02/29/16 22:10:19 (pid:14925) ** Log last touched 2/29 22:10:18
02/29/16 22:10:19 (pid:14925) ******************************************************
02/29/16 22:10:19 (pid:14925) Using config source: /home/boinc/CMSRun/glide_q9Cf9U/condor_config
02/29/16 22:10:19 (pid:14925) config Macros = 217, Sorted = 217, StringBytes = 10779, TablesBytes = 7852
02/29/16 22:10:19 (pid:14925) CLASSAD_CACHING is OFF
02/29/16 22:10:19 (pid:14925) Daemon Log is logging: D_ALWAYS D_ERROR
02/29/16 22:10:19 (pid:14925) DaemonCore: command socket at <10.0.2.15:48600?noUDP>
02/29/16 22:10:19 (pid:14925) DaemonCore: private command socket at <10.0.2.15:48600>
02/29/16 22:10:20 (pid:14925) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#133839
02/29/16 22:10:20 (pid:14925) Communicating with shadow <130.246.180.120:9818?noUDP&sock=2939_634c_2051>
02/29/16 22:10:20 (pid:14925) Submitting machine is "lcggwms02.gridpp.rl.ac.uk"
02/29/16 22:10:20 (pid:14925) setting the orig job name in starter
02/29/16 22:10:20 (pid:14925) setting the orig job iwd in starter
02/29/16 22:10:20 (pid:14925) Chirp config summary: IO false, Updates false, Delayed updates true.
02/29/16 22:10:20 (pid:14925) Initialized IO Proxy.
02/29/16 22:10:20 (pid:14925) Done setting resource limits
02/29/16 22:10:20 (pid:14925) FILETRANSFER: "/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
02/29/16 22:10:20 (pid:14925) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
02/29/16 22:10:24 (pid:14925) File transfer completed successfully.
02/29/16 22:10:24 (pid:14925) Job 1029183.0 set to execute immediately
02/29/16 22:10:24 (pid:14925) Starting a VANILLA universe job with ID: 1029183.0
02/29/16 22:10:24 (pid:14925) IWD: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_14925
02/29/16 22:10:24 (pid:14925) Output file: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_14925/_condor_stdout
02/29/16 22:10:24 (pid:14925) Error file: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_14925/_condor_stderr
02/29/16 22:10:24 (pid:14925) Renice expr "0" evaluated to 0
02/29/16 22:10:24 (pid:14925) Using wrapper /home/boinc/CMSRun/glide_q9Cf9U/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_14925/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=1683 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_1683.txt --runAndLumis=job_lumis_1683.json --lheInputFiles=False --firstEvent=420501 --firstLumi=5047 --lastEvent=420751 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {}
02/29/16 22:10:24 (pid:14925) Running job as user (null)
02/29/16 22:10:24 (pid:14925) Create_Process succeeded, pid=14931
02/29/16 22:42:37 (pid:14925) condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.
02/29/16 22:42:37 (pid:14925) IO: Failed to read packet header
02/29/16 22:43:37 (pid:14925) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector lcggwms02.gridpp.rl.ac.uk:9620.
02/29/16 22:43:37 (pid:14925) IO: Failed to read packet header
02/29/16 22:43:37 (pid:14925) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9620
02/29/16 22:43:37 (pid:14925) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9620 failed; will try to reconnect in 60 seconds.
02/29/16 22:44:38 (pid:14925) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#133839
02/29/16 22:47:38 (pid:14925) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11
02/29/16 22:47:38 (pid:14925) Buf::write(): condor_write() failed
02/29/16 22:52:38 (pid:14925) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
02/29/16 22:52:38 (pid:14925) Buf::write(): condor_write() failed
02/29/16 22:57:39 (pid:14925) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/29/16 22:57:39 (pid:14925) Buf::write(): condor_write() failed
02/29/16 23:02:40 (pid:14925) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/29/16 23:02:40 (pid:14925) Buf::write(): condor_write() failed
02/29/16 23:07:40 (pid:14925) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/29/16 23:07:40 (pid:14925) Buf::write(): condor_write() failed
02/29/16 23:12:41 (pid:14925) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/29/16 23:12:41 (pid:14925) Buf::write(): condor_write() failed
02/29/16 23:17:42 (pid:14925) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/29/16 23:17:42 (pid:14925) Buf::write(): condor_write() failed
02/29/16 23:22:42 (pid:14925) condor_write(): Socket closed when trying to write 564 bytes to <130.246.180.120:9818>, fd is 11
02/29/16 23:22:42 (pid:14925) Buf::write(): condor_write() failed
02/29/16 23:27:42 (pid:14925) condor_write(): Socket closed when trying to write 564 bytes to <130.246.180.120:9818>, fd is 11
02/29/16 23:27:42 (pid:14925) Buf::write(): condor_write() failed
02/29/16 23:32:43 (pid:14925) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/29/16 23:32:43 (pid:14925) Buf::write(): condor_write() failed
02/29/16 23:37:44 (pid:14925) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
02/29/16 23:37:44 (pid:14925) Buf::write(): condor_write() failed
02/29/16 23:40:35 (pid:14925) Process exited, pid=14931, status=0
02/29/16 23:40:35 (pid:14925) condor_write(): Socket closed when trying to write 618 bytes to <130.246.180.120:9818>, fd is 11
02/29/16 23:40:35 (pid:14925) Buf::write(): condor_write() failed
02/29/16 23:40:38 (pid:14925) condor_write(): Socket closed when trying to write 193 bytes to <130.246.180.120:9818>, fd is 11
02/29/16 23:40:38 (pid:14925) Buf::write(): condor_write() failed
02/29/16 23:40:38 (pid:14925) Failed to send job exit status to shadow
02/29/16 23:40:38 (pid:14925) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
02/29/16 23:40:38 (pid:14925) Returning from CStarter::JobReaper()
03/01/16 00:50:27 (pid:14925) Got SIGQUIT. Performing fast shutdown.
03/01/16 00:50:27 (pid:14925) ShutdownFast all jobs.
03/01/16 00:50:27 (pid:14925) condor_write(): Socket closed when trying to write 193 bytes to <130.246.180.120:9818>, fd is 11
03/01/16 00:50:27 (pid:14925) Buf::write(): condor_write() failed
03/01/16 00:50:27 (pid:14925) Failed to send job exit status to shadow
03/01/16 00:50:27 (pid:14925) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
03/01/16 00:51:06 (pid:21133) ******************************************************
03/01/16 00:51:06 (pid:21133) ** condor_starter (CONDOR_STARTER) STARTING UP
03/01/16 00:51:06 (pid:21133) ** /home/boinc/CMSRun/glide_q9Cf9U/main/condor/sbin/condor_starter
03/01/16 00:51:06 (pid:21133) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
03/01/16 00:51:06 (pid:21133) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
03/01/16 00:51:06 (pid:21133) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
03/01/16 00:51:06 (pid:21133) ** $CondorPlatform: x86_64_RedHat5 $
03/01/16 00:51:06 (pid:21133) ** PID = 21133
03/01/16 00:51:06 (pid:21133) ** Log last touched 3/1 00:50:41
03/01/16 00:51:06 (pid:21133) ******************************************************
03/01/16 00:51:06 (pid:21133) Using config source: /home/boinc/CMSRun/glide_q9Cf9U/condor_config
03/01/16 00:51:06 (pid:21133) config Macros = 217, Sorted = 217, StringBytes = 10779, TablesBytes = 7852
03/01/16 00:51:06 (pid:21133) CLASSAD_CACHING is OFF
03/01/16 00:51:06 (pid:21133) Daemon Log is logging: D_ALWAYS D_ERROR
03/01/16 00:51:06 (pid:21133) DaemonCore: command socket at <10.0.2.15:51596?noUDP>
03/01/16 00:51:06 (pid:21133) DaemonCore: private command socket at <10.0.2.15:51596>
03/01/16 00:51:06 (pid:21133) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#133939
03/01/16 00:51:06 (pid:21133) Communicating with shadow <130.246.180.120:9818?noUDP&sock=2939_634c_2133>
03/01/16 00:51:06 (pid:21133) Submitting machine is "lcggwms02.gridpp.rl.ac.uk"
03/01/16 00:51:06 (pid:21133) setting the orig job name in starter
03/01/16 00:51:06 (pid:21133) setting the orig job iwd in starter
03/01/16 00:51:06 (pid:21133) Chirp config summary: IO false, Updates false, Delayed updates true.
03/01/16 00:51:06 (pid:21133) Initialized IO Proxy.
03/01/16 00:51:06 (pid:21133) Done setting resource limits
03/01/16 00:51:06 (pid:21133) FILETRANSFER: "/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
03/01/16 00:51:06 (pid:21133) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
03/01/16 00:51:10 (pid:21133) File transfer completed successfully.
03/01/16 00:51:10 (pid:21133) Job 1029140.0 set to execute immediately
03/01/16 00:51:10 (pid:21133) Starting a VANILLA universe job with ID: 1029140.0
03/01/16 00:51:10 (pid:21133) IWD: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_21133
03/01/16 00:51:10 (pid:21133) Output file: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_21133/_condor_stdout
03/01/16 00:51:10 (pid:21133) Error file: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_21133/_condor_stderr
03/01/16 00:51:10 (pid:21133) Renice expr "0" evaluated to 0
03/01/16 00:51:10 (pid:21133) Using wrapper /home/boinc/CMSRun/glide_q9Cf9U/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_21133/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=1634 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_1634.txt --runAndLumis=job_lumis_1634.json --lheInputFiles=False --firstEvent=408251 --firstLumi=4900 --lastEvent=408501 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {}
03/01/16 00:51:10 (pid:21133) Running job as user (null)
03/01/16 00:51:10 (pid:21133) Create_Process succeeded, pid=21140
03/01/16 01:47:39 (pid:21133) Process exited, pid=21140, status=0
03/01/16 01:47:40 (pid:21133) Got SIGQUIT. Performing fast shutdown.
03/01/16 01:47:40 (pid:21133) ShutdownFast all jobs.
03/01/16 01:47:41 (pid:21133) **** condor_starter (condor_STARTER) pid 21133 EXITING WITH STATUS 0



EXTRACT FROM STARDLOG:

03/01/16 00:50:21 (pid:7634) Response problem from schedd <130.246.180.120:9818?noUDP&sock=2939_634c> on ALIVE job 1029183.0.
03/01/16 00:50:27 (pid:7634) State change: claim no longer recognized by the schedd - removing claim
03/01/16 00:50:27 (pid:7634) Changing state and activity: Claimed/Busy -> Preempting/Killing
03/01/16 00:50:57 (pid:7634) starter (pid 14925) is not responding to the request to hardkill its job. The startd will now directly hard kill the starter and all its decendents.
03/01/16 00:50:57 (pid:7634) Starter pid 14925 died on signal 9 (signal 9 (Killed))
03/01/16 00:50:57 (pid:7634) State change: starter exited
03/01/16 00:50:57 (pid:7634) SECMAN: SetSessionExpiration failed to find session <10.0.2.15:44134>#1456770033#1
03/01/16 00:50:57 (pid:7634) State change: No preempting claim, returning to owner
03/01/16 00:50:57 (pid:7634) Changing state and activity: Preempting/Killing -> Owner/Idle
03/01/16 00:50:57 (pid:7634) State change: IS_OWNER is false
03/01/16 00:50:57 (pid:7634) Changing state: Owner -> Unclaimed
03/01/16 00:50:57 (pid:7634) State change: RunBenchmarks is TRUE
03/01/16 00:50:57 (pid:7634) Changing activity: Idle -> Benchmarking
03/01/16 00:50:57 (pid:7634) BenchMgr:StartBenchmarks()
03/01/
ID: 2143 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 482
Message 2145 - Posted: 1 Mar 2016, 12:07:53 UTC

It looks like you suspended the task not during processing the events, but during loading packet information for the cmsRun just started. That could be critical and caused the failure.
ID: 2145 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2146 - Posted: 1 Mar 2016, 12:14:37 UTC - in response to Message 2145.  
Last modified: 1 Mar 2016, 12:17:07 UTC

Not correct:

Begin processing the 29th record. Run 1, Event 420529, LumiSection 5047 at 29-Feb-2016 22:19:55.366 CET
Begin processing the 30th record. Run 1, Event 420530, LumiSection 5047 at 29-Feb-2016 22:19:59.189 CET
Begin processing the 31st record. Run 1, Event 420531, LumiSection 5047 at 29-Feb-2016 22:20:05.174 CET
Begin processing the 32nd record. Run 1, Event 420532, LumiSection 5047 at 29-Feb-2016 22:20:13.592 CET
Begin processing the 33rd record. Run 1, Event 420533, LumiSection 5047 at 29-Feb-2016 22:20:48.787 CET
Begin processing the 34th record. Run 1, Event 420534, LumiSection 5047 at 29-Feb-2016 22:21:09.296 CET
Begin processing the 35th record. Run 1, Event 420535, LumiSection 5047 at 29-Feb-2016 22:21:09.306 CET
Begin processing the 36th record. Run 1, Event 420536, LumiSection 5047 at 29-Feb-2016 22:42:42.785 CET

Begin processing the 37th record. Run 1, Event 420537, LumiSection 5047 at 29-Feb-2016 22:43:00.068 CET
Begin processing the 38th record. Run 1, Event 420538, LumiSection 5047 at 29-Feb-2016 22:43:13.093 CET
Begin processing the 39th record. Run 1, Event 420539, LumiSection 5047 at 29-Feb-2016 22:43:47.101 CET
Begin processing the 40th record. Run 1, Event 420540, LumiSection 5047 at 29-Feb-2016 22:44:16.164 CET
Begin processing the 41st record. Run 1, Event 420541, LumiSection 5047 at 29-Feb-2016 22:44:18.219 CET
Begin processing the 42nd record. Run 1, Event 420542, LumiSection 5047 at 29-Feb-2016 22:44:20.965 CET
Begin p
ID: 2146 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2147 - Posted: 1 Mar 2016, 14:37:29 UTC
Last modified: 1 Mar 2016, 14:39:33 UTC

I had a job running. I unplugged the internet from 14.33-14.43 local.


03/01/16 14:14:50 (pid:8055) Using wrapper /home/boinc/CMSRun/glide_OztfD8/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_OztfD8/execute/dir_8055/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=2101 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_2101.txt --runAndLumis=job_lumis_2101.json --lheInputFiles=False --firstEvent=525001 --firstLumi=6301 --lastEvent=525251 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {}
03/01/16 14:14:50 (pid:8055) Running job as user (null)
03/01/16 14:14:50 (pid:8055) Create_Process succeeded, pid=8061
03/01/16 14:40:01 (pid:8055) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11
03/01/16 14:40:01 (pid:8055) Buf::write(): condor_write() failed
03/01/16 14:45:02 (pid:8055) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
03/01/16 14:45:02 (pid:8055) Buf::write(): condor_write() failed
03/01/16 14:50:04 (pid:8055) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
03/01/16 14:50:04 (pid:8055) Buf::write(): condor_write() failed
03/01/16 14:55:04 (pid:8055) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
03/01/16 14:55:04 (pid:8055) Buf::write(): condor_write() failed
03/01/16 15:00:04 (pid:8055) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
03/01/16 15:00:04 (pid:8055) Buf::write(): condor_write() failed
03/01/16 15:05:04 (pid:8055) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11


Then then job finished and uploaded and was reported a success.

======== Stageout at Tue Mar 1 14:20:02 GMT 2016 FINISHING (short status 0) ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Mar 1 14:20:02 GMT 2016 on 277-617-13328 with (short) status 0 ========
Local time: Tue Mar 1 15:20:02 CET 2016
Short exit status: 0
Job Running time in seconds: 3912





03/01/16 15:05:04 (pid:8055) Buf::write(): condor_write() failed
03/01/16 15:10:04 (pid:8055) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
03/01/16 15:10:04 (pid:8055) Buf::write(): condor_write() failed
03/01/16 15:15:05 (pid:8055) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9622
03/01/16 15:15:05 (pid:8055) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9622 failed; will try to reconnect in 60 seconds.
03/01/16 15:15:05 (pid:8055) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
03/01/16 15:15:05 (pid:8055) Buf::write(): condor_write() failed
03/01/16 15:16:15 (pid:8055) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9622 as ccbid 130.246.180.120:9622#128708
03/01/16 15:20:02 (pid:8055) Process exited, pid=8061, status=0
03/01/16 15:20:02 (pid:8055) condor_write(): Socket closed when trying to write 617 bytes to <130.246.180.120:9818>, fd is 11
03/01/16 15:20:02 (pid:8055) Buf::write(): condor_write() failed
03/01/16 15:20:08 (pid:8055) condor_write(): Socket closed when trying to write 193 bytes to <130.246.180.120:9818>, fd is 11
03/01/16 15:20:08 (pid:8055) Buf::write(): condor_write() failed
03/01/16 15:20:08 (pid:8055) Failed to send job exit status to shadow
03/01/16 15:20:08 (pid:8055) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
03/01/16 15:20:08 (pid:8055) Returning from CStarter::JobReaper()
03/01/16 15:20:40 (pid:8055) Accepted request to reconnect from <130.246.180.120:9818>
03/01/16 15:20:40 (pid:8055) Ignoring old shadow <130.246.180.120:9818?noUDP&sock=2939_634c_2562>
03/01/16 15:20:40 (pid:8055) Communicating with shadow <130.246.180.120:9818?noUDP&sock=2939_634c_2562>
03/01/16 15:20:40 (pid:8055) Job cleanup finished, now Starter is exiting
03/01/16 15:20:41 (pid:8055) **** condor_starter (condor_STARTER) pid 8055 EXITING WITH STATUS 0
03/01/16 15:20:41 (pid:11173) ******************************************************
03/01/16 15:20:41 (pid:11173) ** condor_starter (CONDOR_STARTER) STARTING UP
03/01/16 15:20:41 (pid:11173) ** /home/boinc/CMSRun/glide_OztfD8/main/condor/sbin/condor_starter
03/01/16 15:20:41 (pid:11173) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
03/01/16 15:20:41 (pid:11173) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
03/01/16 15:20:41 (pid:11173) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
03/01/16 15:20:41 (pid:11173) ** $CondorPlatform: x86_64_RedHat5 $
03/01/16 15:20:41 (pid:11173) ** PID = 11173
03/01/16 15:20:41 (pid:11173) ** Log last touched 3/1 15:20:41



This means, despite the failed condor write errors upload and verification worked.
It continued starting a new job in the same run.(However , the interruption was only 10 min)
ID: 2147 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar

Send message
Joined: 20 Jan 15
Posts: 1129
Credit: 7,874,101
RAC: 198
Message 2149 - Posted: 1 Mar 2016, 15:15:53 UTC - in response to Message 2147.  

I had a job running. I unplugged the internet from 14.33-14.43 local.

Then then job finished and uploaded and was reported a success.

======== Stageout at Tue Mar 1 14:20:02 GMT 2016 FINISHING (short status 0) ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Mar 1 14:20:02 GMT 2016 on 277-617-13328 with (short) status 0 ========
Local time: Tue Mar 1 15:20:02 CET 2016
Short exit status: 0
Job Running time in seconds: 3912

Yep, I've got that log at RAL successfully, and the result is on the Data-Bridge. Dashboard has it as a success too.
ID: 2149 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2150 - Posted: 1 Mar 2016, 17:27:54 UTC

I suggest to set the CCB_Heartbeat_Interval to 0.

Comments?
ID: 2150 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 482
Message 2151 - Posted: 1 Mar 2016, 19:15:37 UTC

I did a new test with suspend period of 1hr and 25 minutes
Job 2162 succeeded after resume and I got in the same glidein a new Job (2281)
ID: 2151 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
m
Volunteer tester

Send message
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 544
Message 2152 - Posted: 1 Mar 2016, 23:02:37 UTC
Last modified: 1 Mar 2016, 23:10:03 UTC

Job 2302 suspended for ~1h45min (2006 GMT to 2150GMT) did not resume.

From MasterLog:-

03/01/16 19:49:15 (pid:7949) CLASSAD_CACHING is OFF
03/01/16 19:49:15 (pid:7949) Daemon Log is logging: D_ALWAYS D_ERROR
03/01/16 19:49:15 (pid:7949) DaemonCore: command socket at <10.0.2.15:55887?noUDP>
03/01/16 19:49:15 (pid:7949) DaemonCore: private command socket at <10.0.2.15:55887>
03/01/16 19:49:15 (pid:7949) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#134589
03/01/16 19:49:15 (pid:7949) Master restart (GRACEFUL) is watching /home/boinc/CMSRun/glide_FZB3IS/main/condor/sbin/condor_master (mtime:1456861495)
03/01/16 19:49:16 (pid:7949) Started DaemonCore process "/home/boinc/CMSRun/glide_FZB3IS/main/condor/sbin/condor_startd", pid and pgroup = 7953
03/01/16 21:50:34 (pid:7949) ERROR: Child pid 7953 appears hung! Killing it hard.

03/01/16 21:50:34 (pid:7949) condor_read() failed: recv(fd=13) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <10.0.2.15:50156>.
03/01/16 21:50:34 (pid:7949) IO: Failed to read packet header
03/01/16 21:50:34 (pid:7949) AUTHENTICATE: handshake failed!
03/01/16 21:50:34 (pid:7949) DC_AUTHENTICATE: required authentication of 10.0.2.15 failed: AUTHENTICATE:1002:Failure performing handshake
03/01/16 21:50:34 (pid:7949) DefaultReaper unexpectedly called on pid 7953, status 9.
03/01/16 21:50:34 (pid:7949) The STARTD (pid 7953) was killed because it was no longer responding
03/01/16 21:50:34 (pid:7949) Sending obituary for "/home/boinc/CMSRun/glide_FZB3IS/main/condor/sbin/condor_startd"
03/01/16 21:50:34 (pid:7949) restarting /home/boinc/CMSRun/glide_FZB3IS/main/condor/sbin/condor_startd in 10 seconds
03/01/16 21:50:34 (pid:7949) The DaemonShutdown expression "(STARTD_StartTime =?= 0)" evaluated to TRUE: starting graceful shutdown
03/01/16 21:50:34 (pid:7949) Got SIGTERM. Performing graceful shutdown.
03/01/16 21:50:34 (pid:7949) All daemons are gone. Exiting.
03/01/16 21:50:34 (pid:7949) **** condor_master (condor_MASTER) pid 7949 EXITING WITH STATUS 99
ID: 2152 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1064
Credit: 326,570
RAC: 95
Message 2153 - Posted: 1 Mar 2016, 23:02:53 UTC - in response to Message 2150.  

Hopefully we will be getting some guidance from the Condor developers soon. We should probably wait to see what they suggest.
ID: 2153 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Previous · 1 · 2 · 3 · 4 · 5 · Next

Message boards : News : Suspend/Resume


©2024 CERN