Suspend/Resume

Author	Message
Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1180 Credit: 815,336 RAC: 482	Message 2131 - Posted: 29 Feb 2016, 21:32:49 UTC - in response to Message 2129. OK. We are targeting suspending up to 1 hour. After the 1st job started with a new run, I paused the VM for 30 minutes. Results after resume: - Events went on processing (job 1610) - Job result uploaded OK - New Job (1671) started without long interruption in the same glidein run - Postprocessing lasted a bit longer (or was I impatient) but finally - Job finished successful on Dashboard. ID: 2131 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 2132 - Posted: 29 Feb 2016, 21:47:06 UTC - in response to Message 2131. Last modified: 29 Feb 2016, 21:48:17 UTC Good news! What are you trying next 40min? I will try the next one up(50min). I am trying 21 min, but not finished, yet. ID: 2132 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1180 Credit: 815,336 RAC: 482	Message 2133 - Posted: 29 Feb 2016, 22:01:51 UTC - in response to Message 2132. What are you trying next 40min? I try 360min of suspending myself ;) In 1 hour it's finally March. New month, new day, new chances! ID: 2133 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 2134 - Posted: 29 Feb 2016, 22:05:43 UTC - in response to Message 2133. I try 360min of suspending myself ;) 6 hours, that is ambitious. ID: 2134 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 2135 - Posted: 29 Feb 2016, 23:05:40 UTC After the 1st job started with a new run, I paused the VM for 30 minutes. I did a test with 21min suspend and it failed; same as before. Are you sure, it worked with 30min? ID: 2135 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 2136 - Posted: 1 Mar 2016, 0:30:05 UTC I found, that the CCB_Heartbeat also serves to keep the network connection to the server open.However, if there is no traffic on this line for a while, the ISP server considers the line dead and disconnects. This is independent of any cern-server or host settings. One solution could be to set the CCB_Heartbeat_Interval to 0. According to the manual, that is possible. This should also cause a proper reconnect to the server instead of trying to send information over a line, that is presumed open(but closed). The other option would be to investigate other CCB related parameters, which i do not want to get into, at this point. This "line dead" time could be different between ISPs, which would explain, why different volunteers have different timeouts. ID: 2136 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1180 Credit: 815,336 RAC: 482	Message 2137 - Posted: 1 Mar 2016, 7:44:28 UTC - in response to Message 2135. Last modified: 1 Mar 2016, 7:59:28 UTC Are you sure, it worked with 30min? Yeah, it was absolutely error free and job 1610 has finished state on the dashboard as next job 1671 has. Begin processing the 2nd record. Run 1, Event 402252, LumiSection 4828 at 29-Feb-2016 19:08:56.158 CET Begin processing the 3rd record. Run 1, Event 402253, LumiSection 4828 at 29-Feb-2016 19:46:40.065 CET Now testing with a suspend period of 55 minutes. ID: 2137 · Rating: 0 · rate: / Reply Quote

Laurence Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1064 Credit: 326,570 RAC: 95	Message 2138 - Posted: 1 Mar 2016, 8:54:27 UTC - in response to Message 2136. It should try to reconnect. Do you have the StarterLog available? ID: 2138 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1180 Credit: 815,336 RAC: 482	Message 2139 - Posted: 1 Mar 2016, 9:10:03 UTC - in response to Message 2138. Last modified: 1 Mar 2016, 9:15:52 UTC It should try to reconnect. Do you have the StarterLog available? I suppose you mean Rasputin, but in the StarterLog is only 02/29/16 19:04:17 (pid:3363) Create_Process succeeded, pid=3367 02/29/16 22:03:59 (pid:3363) Process exited, pid=3367, status=0 No sign of a reconnect. The 30 minutes suspend period is in that time frame. See former post. Off topic: Could it be that the 77MB root file is uploaded twice? ID: 2139 · Rating: 0 · rate: / Reply Quote

m Volunteer tester Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 544	Message 2141 - Posted: 1 Mar 2016, 10:07:02 UTC - in response to Message 2136. Last modified: 1 Mar 2016, 10:37:01 UTC I found, that the CCB_Heartbeat also serves to keep the network connection to the server open.However, if there is no traffic on this line for a while, the ISP server considers the line dead and disconnects. This is independent of any cern-server or host settings. One solution could be to set the CCB_Heartbeat_Interval to 0. According to the manual, that is possible. This should also cause a proper reconnect to the server instead of trying to send information over a line, that is presumed open(but closed). The other option would be to investigate other CCB related parameters, which i do not want to get into, at this point. This "line dead" time could be different between ISPs, which would explain, why different volunteers have different timeouts. It might not be your ISP. My router (the PPP part) has this facility, this is from the setup help:- "Idle Timeout An idle Internet connection will be terminated after this time period. If this value is zero (0), then the connection will be "kept alive" by re-connecting immediately whenever the connection is lost." It implies that the connection is always allowed to die at some point. Edit: suppose that the ISP could do it as well but I don't think much of my chances of getting them to tell me if they do and what the timeout is. ID: 2141 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1180 Credit: 815,336 RAC: 482	Message 2142 - Posted: 1 Mar 2016, 10:31:32 UTC - in response to Message 2137. Now testing with a suspend period of 55 minutes. The job survived the 55 minutes suspend and is on Dashboard recorded as finished properly. 1 strange thing: Start 01-03 06:45:40 Finish 01-03 10:06:21 WallTime 24:00:01 This is maybe caused while I was not the initial receiver of the job. The WNIp is from someone else. Next Job started OK, but is was in a new Glide run, cause the former run had run more than 6 hours. ID: 2142 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 2143 - Posted: 1 Mar 2016, 11:27:42 UTC Last modified: 1 Mar 2016, 12:01:13 UTC I have the Starter log. First, i had a task suspended for 18min-all fine(job1653). (start20.48local) Then i suspended for 21min- fail(job1683).(start 22.10local) I left it and after about 1hour, it started a new task and finished normally (job1634).(start00.51local). Then it created a new run and continued. EDIT:see bottom for timeout reason. STARTERLOG: 02/29/16 19:20:33 (pid:7637) FILETRANSFER: "/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring 02/29/16 19:20:33 (pid:7637) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring 02/29/16 19:20:33 (pid:7637) WARNING: Initializing plugins returned: FILETRANSFER:1:"/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring 02/29/16 19:20:42 (pid:7647) **************************************************** 02/29/16 19:20:42 (pid:7647) condor_starter (CONDOR_STARTER) STARTING UP 02/29/16 19:20:42 (pid:7647) /home/boinc/CMSRun/glide_q9Cf9U/main/condor/sbin/condor_starter 02/29/16 19:20:42 (pid:7647) SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1) 02/29/16 19:20:42 (pid:7647) Configuration: subsystem:STARTER local:<NONE> class:DAEMON 02/29/16 19:20:42 (pid:7647) $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $ 02/29/16 19:20:42 (pid:7647) $CondorPlatform: x86_64_RedHat5 $ 02/29/16 19:20:42 (pid:7647) PID = 7647 02/29/16 19:20:42 (pid:7647) Log last touched 2/29 19:20:33 02/29/16 19:20:42 (pid:7647) ************************************************** 02/29/16 19:20:42 (pid:7647) Using config source: /home/boinc/CMSRun/glide_q9Cf9U/condor_config 02/29/16 19:20:42 (pid:7647) config Macros = 217, Sorted = 217, StringBytes = 10777, TablesBytes = 7852 02/29/16 19:20:42 (pid:7647) CLASSAD_CACHING is OFF 02/29/16 19:20:42 (pid:7647) Daemon Log is logging: D_ALWAYS D_ERROR 02/29/16 19:20:42 (pid:7647) DaemonCore: command socket at <10.0.2.15:37517?noUDP> 02/29/16 19:20:42 (pid:7647) DaemonCore: private command socket at <10.0.2.15:37517> 02/29/16 19:20:42 (pid:7647) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#133741 02/29/16 19:20:42 (pid:7647) Communicating with shadow <130.246.180.120:9818?noUDP&sock=2939_634c_1979> 02/29/16 19:20:42 (pid:7647) Submitting machine is "lcggwms02.gridpp.rl.ac.uk" 02/29/16 19:20:42 (pid:7647) setting the orig job name in starter 02/29/16 19:20:42 (pid:7647) setting the orig job iwd in starter 02/29/16 19:20:42 (pid:7647) Chirp config summary: IO false, Updates false, Delayed updates true. 02/29/16 19:20:42 (pid:7647) Initialized IO Proxy. 02/29/16 19:20:42 (pid:7647) Done setting resource limits 02/29/16 19:20:42 (pid:7647) FILETRANSFER: "/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring 02/29/16 19:20:42 (pid:7647) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring 02/29/16 19:20:46 (pid:7647) File transfer completed successfully. 02/29/16 19:20:46 (pid:7647) Job 1029118.0 set to execute immediately 02/29/16 19:20:46 (pid:7647) Starting a VANILLA universe job with ID: 1029118.0 02/29/16 19:20:46 (pid:7647) IWD: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_7647 02/29/16 19:20:46 (pid:7647) Output file: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_7647/_condor_stdout 02/29/16 19:20:46 (pid:7647) Error file: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_7647/_condor_stderr 02/29/16 19:20:46 (pid:7647) Renice expr "0" evaluated to 0 02/29/16 19:20:46 (pid:7647) Using wrapper /home/boinc/CMSRun/glide_q9Cf9U/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_7647/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=1616 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_1616.txt --runAndLumis=job_lumis_1616.json --lheInputFiles=False --firstEvent=403751 --firstLumi=4846 --lastEvent=404001 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {} 02/29/16 19:20:46 (pid:7647) Running job as user (null) 02/29/16 19:20:46 (pid:7647) Create_Process succeeded, pid=7654 02/29/16 20:48:07 (pid:7647) Process exited, pid=7654, status=0 02/29/16 20:48:08 (pid:7647) Got SIGQUIT. Performing fast shutdown. 02/29/16 20:48:08 (pid:7647) ShutdownFast all jobs. 02/29/16 20:48:08 (pid:7647) condor_starter (condor_STARTER) pid 7647 EXITING WITH STATUS 0 02/29/16 20:48:09 (pid:11892) ************************************************** 02/29/16 20:48:09 (pid:11892) condor_starter (CONDOR_STARTER) STARTING UP 02/29/16 20:48:09 (pid:11892) /home/boinc/CMSRun/glide_q9Cf9U/main/condor/sbin/condor_starter 02/29/16 20:48:09 (pid:11892) SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1) 02/29/16 20:48:09 (pid:11892) Configuration: subsystem:STARTER local:<NONE> class:DAEMON 02/29/16 20:48:09 (pid:11892) $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $ 02/29/16 20:48:09 (pid:11892) $CondorPlatform: x86_64_RedHat5 $ 02/29/16 20:48:09 (pid:11892) PID = 11892 02/29/16 20:48:09 (pid:11892) Log last touched 2/29 20:48:08 02/29/16 20:48:09 (pid:11892) ************************************************** 02/29/16 20:48:09 (pid:11892) Using config source: /home/boinc/CMSRun/glide_q9Cf9U/condor_config 02/29/16 20:48:09 (pid:11892) config Macros = 217, Sorted = 217, StringBytes = 10779, TablesBytes = 7852 02/29/16 20:48:09 (pid:11892) CLASSAD_CACHING is OFF 02/29/16 20:48:09 (pid:11892) Daemon Log is logging: D_ALWAYS D_ERROR 02/29/16 20:48:09 (pid:11892) DaemonCore: command socket at <10.0.2.15:49153?noUDP> 02/29/16 20:48:09 (pid:11892) DaemonCore: private command socket at <10.0.2.15:49153> 02/29/16 20:48:11 (pid:11892) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#133789 02/29/16 20:48:11 (pid:11892) Communicating with shadow <130.246.180.120:9818?noUDP&sock=2939_634c_2021> 02/29/16 20:48:11 (pid:11892) Submitting machine is "lcggwms02.gridpp.rl.ac.uk" 02/29/16 20:48:11 (pid:11892) setting the orig job name in starter 02/29/16 20:48:11 (pid:11892) setting the orig job iwd in starter 02/29/16 20:48:11 (pid:11892) Chirp config summary: IO false, Updates false, Delayed updates true. 02/29/16 20:48:11 (pid:11892) Initialized IO Proxy. 02/29/16 20:48:11 (pid:11892) Done setting resource limits 02/29/16 20:48:11 (pid:11892) FILETRANSFER: "/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring 02/29/16 20:48:11 (pid:11892) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring 02/29/16 20:48:15 (pid:11892) File transfer completed successfully. 02/29/16 20:48:15 (pid:11892) Job 1029154.0 set to execute immediately 02/29/16 20:48:15 (pid:11892) Starting a VANILLA universe job with ID: 1029154.0 02/29/16 20:48:15 (pid:11892) IWD: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_11892 02/29/16 20:48:15 (pid:11892) Output file: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_11892/_condor_stdout 02/29/16 20:48:15 (pid:11892) Error file: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_11892/_condor_stderr 02/29/16 20:48:15 (pid:11892) Renice expr "0" evaluated to 0 02/29/16 20:48:15 (pid:11892) Using wrapper /home/boinc/CMSRun/glide_q9Cf9U/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_11892/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=1653 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_1653.txt --runAndLumis=job_lumis_1653.json --lheInputFiles=False --firstEvent=413001 --firstLumi=4957 --lastEvent=413251 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {} 02/29/16 20:48:15 (pid:11892) Running job as user (null) 02/29/16 20:48:15 (pid:11892) Create_Process succeeded, pid=11899 02/29/16 21:11:33 (pid:11892) condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>. 02/29/16 21:11:33 (pid:11892) IO: Failed to read packet header 02/29/16 21:12:20 (pid:11892) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector lcggwms02.gridpp.rl.ac.uk:9620. 02/29/16 21:12:20 (pid:11892) IO: Failed to read packet header 02/29/16 21:12:20 (pid:11892) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9620 02/29/16 21:12:20 (pid:11892) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9620 failed; will try to reconnect in 60 seconds. 02/29/16 21:13:21 (pid:11892) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#133789 02/29/16 21:16:33 (pid:11892) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11 02/29/16 21:16:33 (pid:11892) Buf::write(): condor_write() failed 02/29/16 21:16:53 (pid:11892) Accepted request to reconnect from <130.246.180.120:9818> 02/29/16 21:16:53 (pid:11892) Ignoring old shadow <130.246.180.120:9818?noUDP&sock=2939_634c_2021> 02/29/16 21:16:53 (pid:11892) Communicating with shadow <130.246.180.120:9818?noUDP&sock=2939_634c_2021> 02/29/16 22:10:16 (pid:11892) Process exited, pid=11899, status=0 02/29/16 22:10:18 (pid:11892) Got SIGQUIT. Performing fast shutdown. 02/29/16 22:10:18 (pid:11892) ShutdownFast all jobs. 02/29/16 22:10:18 (pid:11892) condor_starter (condor_STARTER) pid 11892 EXITING WITH STATUS 0 02/29/16 22:10:19 (pid:14925) ************************************************** 02/29/16 22:10:19 (pid:14925) condor_starter (CONDOR_STARTER) STARTING UP 02/29/16 22:10:19 (pid:14925) /home/boinc/CMSRun/glide_q9Cf9U/main/condor/sbin/condor_starter 02/29/16 22:10:19 (pid:14925) SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1) 02/29/16 22:10:19 (pid:14925) Configuration: subsystem:STARTER local:<NONE> class:DAEMON 02/29/16 22:10:19 (pid:14925) $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $ 02/29/16 22:10:19 (pid:14925) $CondorPlatform: x86_64_RedHat5 $ 02/29/16 22:10:19 (pid:14925) PID = 14925 02/29/16 22:10:19 (pid:14925) Log last touched 2/29 22:10:18 02/29/16 22:10:19 (pid:14925) ************************************************** 02/29/16 22:10:19 (pid:14925) Using config source: /home/boinc/CMSRun/glide_q9Cf9U/condor_config 02/29/16 22:10:19 (pid:14925) config Macros = 217, Sorted = 217, StringBytes = 10779, TablesBytes = 7852 02/29/16 22:10:19 (pid:14925) CLASSAD_CACHING is OFF 02/29/16 22:10:19 (pid:14925) Daemon Log is logging: D_ALWAYS D_ERROR 02/29/16 22:10:19 (pid:14925) DaemonCore: command socket at <10.0.2.15:48600?noUDP> 02/29/16 22:10:19 (pid:14925) DaemonCore: private command socket at <10.0.2.15:48600> 02/29/16 22:10:20 (pid:14925) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#133839 02/29/16 22:10:20 (pid:14925) Communicating with shadow <130.246.180.120:9818?noUDP&sock=2939_634c_2051> 02/29/16 22:10:20 (pid:14925) Submitting machine is "lcggwms02.gridpp.rl.ac.uk" 02/29/16 22:10:20 (pid:14925) setting the orig job name in starter 02/29/16 22:10:20 (pid:14925) setting the orig job iwd in starter 02/29/16 22:10:20 (pid:14925) Chirp config summary: IO false, Updates false, Delayed updates true. 02/29/16 22:10:20 (pid:14925) Initialized IO Proxy. 02/29/16 22:10:20 (pid:14925) Done setting resource limits 02/29/16 22:10:20 (pid:14925) FILETRANSFER: "/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring 02/29/16 22:10:20 (pid:14925) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring 02/29/16 22:10:24 (pid:14925) File transfer completed successfully. 02/29/16 22:10:24 (pid:14925) Job 1029183.0 set to execute immediately 02/29/16 22:10:24 (pid:14925) Starting a VANILLA universe job with ID: 1029183.0 02/29/16 22:10:24 (pid:14925) IWD: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_14925 02/29/16 22:10:24 (pid:14925) Output file: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_14925/_condor_stdout 02/29/16 22:10:24 (pid:14925) Error file: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_14925/_condor_stderr 02/29/16 22:10:24 (pid:14925) Renice expr "0" evaluated to 0 02/29/16 22:10:24 (pid:14925) Using wrapper /home/boinc/CMSRun/glide_q9Cf9U/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_14925/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=1683 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_1683.txt --runAndLumis=job_lumis_1683.json --lheInputFiles=False --firstEvent=420501 --firstLumi=5047 --lastEvent=420751 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {} 02/29/16 22:10:24 (pid:14925) Running job as user (null) 02/29/16 22:10:24 (pid:14925) Create_Process succeeded, pid=14931 02/29/16 22:42:37 (pid:14925) condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>. 02/29/16 22:42:37 (pid:14925) IO: Failed to read packet header 02/29/16 22:43:37 (pid:14925) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector lcggwms02.gridpp.rl.ac.uk:9620. 02/29/16 22:43:37 (pid:14925) IO: Failed to read packet header 02/29/16 22:43:37 (pid:14925) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9620 02/29/16 22:43:37 (pid:14925) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9620 failed; will try to reconnect in 60 seconds. 02/29/16 22:44:38 (pid:14925) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#133839 02/29/16 22:47:38 (pid:14925) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11 02/29/16 22:47:38 (pid:14925) Buf::write(): condor_write() failed 02/29/16 22:52:38 (pid:14925) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11 02/29/16 22:52:38 (pid:14925) Buf::write(): condor_write() failed 02/29/16 22:57:39 (pid:14925) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11 02/29/16 22:57:39 (pid:14925) Buf::write(): condor_write() failed 02/29/16 23:02:40 (pid:14925) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11 02/29/16 23:02:40 (pid:14925) Buf::write(): condor_write() failed 02/29/16 23:07:40 (pid:14925) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11 02/29/16 23:07:40 (pid:14925) Buf::write(): condor_write() failed 02/29/16 23:12:41 (pid:14925) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11 02/29/16 23:12:41 (pid:14925) Buf::write(): condor_write() failed 02/29/16 23:17:42 (pid:14925) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11 02/29/16 23:17:42 (pid:14925) Buf::write(): condor_write() failed 02/29/16 23:22:42 (pid:14925) condor_write(): Socket closed when trying to write 564 bytes to <130.246.180.120:9818>, fd is 11 02/29/16 23:22:42 (pid:14925) Buf::write(): condor_write() failed 02/29/16 23:27:42 (pid:14925) condor_write(): Socket closed when trying to write 564 bytes to <130.246.180.120:9818>, fd is 11 02/29/16 23:27:42 (pid:14925) Buf::write(): condor_write() failed 02/29/16 23:32:43 (pid:14925) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11 02/29/16 23:32:43 (pid:14925) Buf::write(): condor_write() failed 02/29/16 23:37:44 (pid:14925) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11 02/29/16 23:37:44 (pid:14925) Buf::write(): condor_write() failed 02/29/16 23:40:35 (pid:14925) Process exited, pid=14931, status=0 02/29/16 23:40:35 (pid:14925) condor_write(): Socket closed when trying to write 618 bytes to <130.246.180.120:9818>, fd is 11 02/29/16 23:40:35 (pid:14925) Buf::write(): condor_write() failed 02/29/16 23:40:38 (pid:14925) condor_write(): Socket closed when trying to write 193 bytes to <130.246.180.120:9818>, fd is 11 02/29/16 23:40:38 (pid:14925) Buf::write(): condor_write() failed 02/29/16 23:40:38 (pid:14925) Failed to send job exit status to shadow 02/29/16 23:40:38 (pid:14925) JobExit() failed, waiting for job lease to expire or for a reconnect attempt 02/29/16 23:40:38 (pid:14925) Returning from CStarter::JobReaper() 03/01/16 00:50:27 (pid:14925) Got SIGQUIT. Performing fast shutdown. 03/01/16 00:50:27 (pid:14925) ShutdownFast all jobs. 03/01/16 00:50:27 (pid:14925) condor_write(): Socket closed when trying to write 193 bytes to <130.246.180.120:9818>, fd is 11 03/01/16 00:50:27 (pid:14925) Buf::write(): condor_write() failed 03/01/16 00:50:27 (pid:14925) Failed to send job exit status to shadow 03/01/16 00:50:27 (pid:14925) JobExit() failed, waiting for job lease to expire or for a reconnect attempt 03/01/16 00:51:06 (pid:21133) ************************************************** 03/01/16 00:51:06 (pid:21133) condor_starter (CONDOR_STARTER) STARTING UP 03/01/16 00:51:06 (pid:21133) /home/boinc/CMSRun/glide_q9Cf9U/main/condor/sbin/condor_starter 03/01/16 00:51:06 (pid:21133) SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1) 03/01/16 00:51:06 (pid:21133) Configuration: subsystem:STARTER local:<NONE> class:DAEMON 03/01/16 00:51:06 (pid:21133) $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $ 03/01/16 00:51:06 (pid:21133) $CondorPlatform: x86_64_RedHat5 $ 03/01/16 00:51:06 (pid:21133) PID = 21133 03/01/16 00:51:06 (pid:21133) Log last touched 3/1 00:50:41 03/01/16 00:51:06 (pid:21133) ************************************************** 03/01/16 00:51:06 (pid:21133) Using config source: /home/boinc/CMSRun/glide_q9Cf9U/condor_config 03/01/16 00:51:06 (pid:21133) config Macros = 217, Sorted = 217, StringBytes = 10779, TablesBytes = 7852 03/01/16 00:51:06 (pid:21133) CLASSAD_CACHING is OFF 03/01/16 00:51:06 (pid:21133) Daemon Log is logging: D_ALWAYS D_ERROR 03/01/16 00:51:06 (pid:21133) DaemonCore: command socket at <10.0.2.15:51596?noUDP> 03/01/16 00:51:06 (pid:21133) DaemonCore: private command socket at <10.0.2.15:51596> 03/01/16 00:51:06 (pid:21133) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#133939 03/01/16 00:51:06 (pid:21133) Communicating with shadow <130.246.180.120:9818?noUDP&sock=2939_634c_2133> 03/01/16 00:51:06 (pid:21133) Submitting machine is "lcggwms02.gridpp.rl.ac.uk" 03/01/16 00:51:06 (pid:21133) setting the orig job name in starter 03/01/16 00:51:06 (pid:21133) setting the orig job iwd in starter 03/01/16 00:51:06 (pid:21133) Chirp config summary: IO false, Updates false, Delayed updates true. 03/01/16 00:51:06 (pid:21133) Initialized IO Proxy. 03/01/16 00:51:06 (pid:21133) Done setting resource limits 03/01/16 00:51:06 (pid:21133) FILETRANSFER: "/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring 03/01/16 00:51:06 (pid:21133) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_q9Cf9U/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring 03/01/16 00:51:10 (pid:21133) File transfer completed successfully. 03/01/16 00:51:10 (pid:21133) Job 1029140.0 set to execute immediately 03/01/16 00:51:10 (pid:21133) Starting a VANILLA universe job with ID: 1029140.0 03/01/16 00:51:10 (pid:21133) IWD: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_21133 03/01/16 00:51:10 (pid:21133) Output file: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_21133/_condor_stdout 03/01/16 00:51:10 (pid:21133) Error file: /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_21133/_condor_stderr 03/01/16 00:51:10 (pid:21133) Renice expr "0" evaluated to 0 03/01/16 00:51:10 (pid:21133) Using wrapper /home/boinc/CMSRun/glide_q9Cf9U/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_q9Cf9U/execute/dir_21133/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=1634 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_1634.txt --runAndLumis=job_lumis_1634.json --lheInputFiles=False --firstEvent=408251 --firstLumi=4900 --lastEvent=408501 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {} 03/01/16 00:51:10 (pid:21133) Running job as user (null) 03/01/16 00:51:10 (pid:21133) Create_Process succeeded, pid=21140 03/01/16 01:47:39 (pid:21133) Process exited, pid=21140, status=0 03/01/16 01:47:40 (pid:21133) Got SIGQUIT. Performing fast shutdown. 03/01/16 01:47:40 (pid:21133) ShutdownFast all jobs. 03/01/16 01:47:41 (pid:21133) ** condor_starter (condor_STARTER) pid 21133 EXITING WITH STATUS 0 EXTRACT FROM STARDLOG: 03/01/16 00:50:21 (pid:7634) Response problem from schedd <130.246.180.120:9818?noUDP&sock=2939_634c> on ALIVE job 1029183.0. 03/01/16 00:50:27 (pid:7634) State change: claim no longer recognized by the schedd - removing claim 03/01/16 00:50:27 (pid:7634) Changing state and activity: Claimed/Busy -> Preempting/Killing 03/01/16 00:50:57 (pid:7634) starter (pid 14925) is not responding to the request to hardkill its job. The startd will now directly hard kill the starter and all its decendents. 03/01/16 00:50:57 (pid:7634) Starter pid 14925 died on signal 9 (signal 9 (Killed)) 03/01/16 00:50:57 (pid:7634) State change: starter exited 03/01/16 00:50:57 (pid:7634) SECMAN: SetSessionExpiration failed to find session <10.0.2.15:44134>#1456770033#1 03/01/16 00:50:57 (pid:7634) State change: No preempting claim, returning to owner 03/01/16 00:50:57 (pid:7634) Changing state and activity: Preempting/Killing -> Owner/Idle 03/01/16 00:50:57 (pid:7634) State change: IS_OWNER is false 03/01/16 00:50:57 (pid:7634) Changing state: Owner -> Unclaimed 03/01/16 00:50:57 (pid:7634) State change: RunBenchmarks is TRUE 03/01/16 00:50:57 (pid:7634) Changing activity: Idle -> Benchmarking 03/01/16 00:50:57 (pid:7634) BenchMgr:StartBenchmarks() 03/01/ ID: 2143 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1180 Credit: 815,336 RAC: 482	Message 2145 - Posted: 1 Mar 2016, 12:07:53 UTC It looks like you suspended the task not during processing the events, but during loading packet information for the cmsRun just started. That could be critical and caused the failure. ID: 2145 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 2146 - Posted: 1 Mar 2016, 12:14:37 UTC - in response to Message 2145. Last modified: 1 Mar 2016, 12:17:07 UTC Not correct: Begin processing the 29th record. Run 1, Event 420529, LumiSection 5047 at 29-Feb-2016 22:19:55.366 CET Begin processing the 30th record. Run 1, Event 420530, LumiSection 5047 at 29-Feb-2016 22:19:59.189 CET Begin processing the 31st record. Run 1, Event 420531, LumiSection 5047 at 29-Feb-2016 22:20:05.174 CET Begin processing the 32nd record. Run 1, Event 420532, LumiSection 5047 at 29-Feb-2016 22:20:13.592 CET Begin processing the 33rd record. Run 1, Event 420533, LumiSection 5047 at 29-Feb-2016 22:20:48.787 CET Begin processing the 34th record. Run 1, Event 420534, LumiSection 5047 at 29-Feb-2016 22:21:09.296 CET Begin processing the 35th record. Run 1, Event 420535, LumiSection 5047 at 29-Feb-2016 22:21:09.306 CET Begin processing the 36th record. Run 1, Event 420536, LumiSection 5047 at 29-Feb-2016 22:42:42.785 CET Begin processing the 37th record. Run 1, Event 420537, LumiSection 5047 at 29-Feb-2016 22:43:00.068 CET Begin processing the 38th record. Run 1, Event 420538, LumiSection 5047 at 29-Feb-2016 22:43:13.093 CET Begin processing the 39th record. Run 1, Event 420539, LumiSection 5047 at 29-Feb-2016 22:43:47.101 CET Begin processing the 40th record. Run 1, Event 420540, LumiSection 5047 at 29-Feb-2016 22:44:16.164 CET Begin processing the 41st record. Run 1, Event 420541, LumiSection 5047 at 29-Feb-2016 22:44:18.219 CET Begin processing the 42nd record. Run 1, Event 420542, LumiSection 5047 at 29-Feb-2016 22:44:20.965 CET Begin p ID: 2146 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 2147 - Posted: 1 Mar 2016, 14:37:29 UTC Last modified: 1 Mar 2016, 14:39:33 UTC I had a job running. I unplugged the internet from 14.33-14.43 local. 03/01/16 14:14:50 (pid:8055) Using wrapper /home/boinc/CMSRun/glide_OztfD8/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_OztfD8/execute/dir_8055/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=2101 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_2101.txt --runAndLumis=job_lumis_2101.json --lheInputFiles=False --firstEvent=525001 --firstLumi=6301 --lastEvent=525251 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {} 03/01/16 14:14:50 (pid:8055) Running job as user (null) 03/01/16 14:14:50 (pid:8055) Create_Process succeeded, pid=8061 03/01/16 14:40:01 (pid:8055) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11 03/01/16 14:40:01 (pid:8055) Buf::write(): condor_write() failed 03/01/16 14:45:02 (pid:8055) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11 03/01/16 14:45:02 (pid:8055) Buf::write(): condor_write() failed 03/01/16 14:50:04 (pid:8055) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11 03/01/16 14:50:04 (pid:8055) Buf::write(): condor_write() failed 03/01/16 14:55:04 (pid:8055) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11 03/01/16 14:55:04 (pid:8055) Buf::write(): condor_write() failed 03/01/16 15:00:04 (pid:8055) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11 03/01/16 15:00:04 (pid:8055) Buf::write(): condor_write() failed 03/01/16 15:05:04 (pid:8055) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11 Then then job finished and uploaded and was reported a success. ======== Stageout at Tue Mar 1 14:20:02 GMT 2016 FINISHING (short status 0) ======== ======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Mar 1 14:20:02 GMT 2016 on 277-617-13328 with (short) status 0 ======== Local time: Tue Mar 1 15:20:02 CET 2016 Short exit status: 0 Job Running time in seconds: 3912 03/01/16 15:05:04 (pid:8055) Buf::write(): condor_write() failed 03/01/16 15:10:04 (pid:8055) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11 03/01/16 15:10:04 (pid:8055) Buf::write(): condor_write() failed 03/01/16 15:15:05 (pid:8055) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9622 03/01/16 15:15:05 (pid:8055) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9622 failed; will try to reconnect in 60 seconds. 03/01/16 15:15:05 (pid:8055) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11 03/01/16 15:15:05 (pid:8055) Buf::write(): condor_write() failed 03/01/16 15:16:15 (pid:8055) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9622 as ccbid 130.246.180.120:9622#128708 03/01/16 15:20:02 (pid:8055) Process exited, pid=8061, status=0 03/01/16 15:20:02 (pid:8055) condor_write(): Socket closed when trying to write 617 bytes to <130.246.180.120:9818>, fd is 11 03/01/16 15:20:02 (pid:8055) Buf::write(): condor_write() failed 03/01/16 15:20:08 (pid:8055) condor_write(): Socket closed when trying to write 193 bytes to <130.246.180.120:9818>, fd is 11 03/01/16 15:20:08 (pid:8055) Buf::write(): condor_write() failed 03/01/16 15:20:08 (pid:8055) Failed to send job exit status to shadow 03/01/16 15:20:08 (pid:8055) JobExit() failed, waiting for job lease to expire or for a reconnect attempt 03/01/16 15:20:08 (pid:8055) Returning from CStarter::JobReaper() 03/01/16 15:20:40 (pid:8055) Accepted request to reconnect from <130.246.180.120:9818> 03/01/16 15:20:40 (pid:8055) Ignoring old shadow <130.246.180.120:9818?noUDP&sock=2939_634c_2562> 03/01/16 15:20:40 (pid:8055) Communicating with shadow <130.246.180.120:9818?noUDP&sock=2939_634c_2562> 03/01/16 15:20:40 (pid:8055) Job cleanup finished, now Starter is exiting 03/01/16 15:20:41 (pid:8055) ** condor_starter (condor_STARTER) pid 8055 EXITING WITH STATUS 0 03/01/16 15:20:41 (pid:11173) ************************************************** 03/01/16 15:20:41 (pid:11173) condor_starter (CONDOR_STARTER) STARTING UP 03/01/16 15:20:41 (pid:11173) /home/boinc/CMSRun/glide_OztfD8/main/condor/sbin/condor_starter 03/01/16 15:20:41 (pid:11173) SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1) 03/01/16 15:20:41 (pid:11173) Configuration: subsystem:STARTER local:<NONE> class:DAEMON 03/01/16 15:20:41 (pid:11173) $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $ 03/01/16 15:20:41 (pid:11173) $CondorPlatform: x86_64_RedHat5 $ 03/01/16 15:20:41 (pid:11173) PID = 11173 03/01/16 15:20:41 (pid:11173) ** Log last touched 3/1 15:20:41 This means, despite the failed condor write errors upload and verification worked. It continued starting a new job in the same run.(However , the interruption was only 10 min) ID: 2147 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,874,101 RAC: 198	Message 2149 - Posted: 1 Mar 2016, 15:15:53 UTC - in response to Message 2147. I had a job running. I unplugged the internet from 14.33-14.43 local. Then then job finished and uploaded and was reported a success. ======== Stageout at Tue Mar 1 14:20:02 GMT 2016 FINISHING (short status 0) ======== ======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Mar 1 14:20:02 GMT 2016 on 277-617-13328 with (short) status 0 ======== Local time: Tue Mar 1 15:20:02 CET 2016 Short exit status: 0 Job Running time in seconds: 3912 Yep, I've got that log at RAL successfully, and the result is on the Data-Bridge. Dashboard has it as a success too. ID: 2149 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 2150 - Posted: 1 Mar 2016, 17:27:54 UTC I suggest to set the CCB_Heartbeat_Interval to 0. Comments? ID: 2150 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1180 Credit: 815,336 RAC: 482	Message 2151 - Posted: 1 Mar 2016, 19:15:37 UTC I did a new test with suspend period of 1hr and 25 minutes Job 2162 succeeded after resume and I got in the same glidein a new Job (2281) ID: 2151 · Rating: 0 · rate: / Reply Quote

m Volunteer tester Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 544	Message 2152 - Posted: 1 Mar 2016, 23:02:37 UTC Last modified: 1 Mar 2016, 23:10:03 UTC Job 2302 suspended for ~1h45min (2006 GMT to 2150GMT) did not resume. From MasterLog:- 03/01/16 19:49:15 (pid:7949) CLASSAD_CACHING is OFF 03/01/16 19:49:15 (pid:7949) Daemon Log is logging: D_ALWAYS D_ERROR 03/01/16 19:49:15 (pid:7949) DaemonCore: command socket at <10.0.2.15:55887?noUDP> 03/01/16 19:49:15 (pid:7949) DaemonCore: private command socket at <10.0.2.15:55887> 03/01/16 19:49:15 (pid:7949) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#134589 03/01/16 19:49:15 (pid:7949) Master restart (GRACEFUL) is watching /home/boinc/CMSRun/glide_FZB3IS/main/condor/sbin/condor_master (mtime:1456861495) 03/01/16 19:49:16 (pid:7949) Started DaemonCore process "/home/boinc/CMSRun/glide_FZB3IS/main/condor/sbin/condor_startd", pid and pgroup = 7953 03/01/16 21:50:34 (pid:7949) ERROR: Child pid 7953 appears hung! Killing it hard. 03/01/16 21:50:34 (pid:7949) condor_read() failed: recv(fd=13) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <10.0.2.15:50156>. 03/01/16 21:50:34 (pid:7949) IO: Failed to read packet header 03/01/16 21:50:34 (pid:7949) AUTHENTICATE: handshake failed! 03/01/16 21:50:34 (pid:7949) DC_AUTHENTICATE: required authentication of 10.0.2.15 failed: AUTHENTICATE:1002:Failure performing handshake 03/01/16 21:50:34 (pid:7949) DefaultReaper unexpectedly called on pid 7953, status 9. 03/01/16 21:50:34 (pid:7949) The STARTD (pid 7953) was killed because it was no longer responding 03/01/16 21:50:34 (pid:7949) Sending obituary for "/home/boinc/CMSRun/glide_FZB3IS/main/condor/sbin/condor_startd" 03/01/16 21:50:34 (pid:7949) restarting /home/boinc/CMSRun/glide_FZB3IS/main/condor/sbin/condor_startd in 10 seconds 03/01/16 21:50:34 (pid:7949) The DaemonShutdown expression "(STARTD_StartTime =?= 0)" evaluated to TRUE: starting graceful shutdown 03/01/16 21:50:34 (pid:7949) Got SIGTERM. Performing graceful shutdown. 03/01/16 21:50:34 (pid:7949) All daemons are gone. Exiting. 03/01/16 21:50:34 (pid:7949) **** condor_master (condor_MASTER) pid 7949 EXITING WITH STATUS 99 ID: 2152 · Rating: 0 · rate: / Reply Quote

Laurence Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1064 Credit: 326,570 RAC: 95	Message 2153 - Posted: 1 Mar 2016, 23:02:53 UTC - in response to Message 2150. Hopefully we will be getting some guidance from the Condor developers soon. We should probably wait to see what they suggest. ID: 2153 · Rating: 0 · rate: / Reply Quote

Development for LHC@home