Thread 'Suspend/Resume/Disconnect'

Author	Message
Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2448 - Posted: 18 Mar 2016, 16:11:32 UTC I have tested 10min disconnect from internet. It failed, connection was not reestablished.Job was finished and uploaded, but stuck in WNPostproc(Where it is going to time out and rerun by a different host). 03/18/16 16:35:48 (pid:7659) condor_write(): Socket closed when trying to write 192 bytes to <130.246.180.120:9818>, fd is 8 03/18/16 16:35:48 (pid:7659) Buf::write(): condor_write() failed 03/18/16 16:35:48 (pid:7659) Failed to send job exit status to shadow 03/18/16 16:35:48 (pid:7659) JobExit() failed, waiting for job lease to expire or for a reconnect attempt 03/18/16 16:35:48 (pid:7659) Returning from CStarter::JobReaper() Previously, it was possible, now it is not. Getting a new job is also a problem.Even 2h after the end of the disconnect, no new work has been started. I suggest to change the Claim_Worklife back to where it was. ID: 2448 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2503 - Posted: 22 Mar 2016, 15:23:56 UTC Last modified: 22 Mar 2016, 15:44:25 UTC Strange phenomenon: I suspended a task for 15 min.(LAIM on) After resume, i looked at the logs. The events where processed line after line,BUT there was no gap in time! The events were produced, as i watched, with a time stamp, that was 15min behind real time. How does that work? Does the vm no use the host system time? EDIT: The system time for this vm is now 15 min behind. Another VM, that was also running has the correct time. EDIT2: The task was ended prematurely. ID: 2503 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 2513 - Posted: 22 Mar 2016, 22:46:39 UTC - in response to Message 2503. I believe I've seen that behaviour too, but eventually I wasn't sure. I've no idea how or how often the VM checks its clock (against ntp?). ID: 2513 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2514 - Posted: 22 Mar 2016, 22:50:08 UTC - in response to Message 2513. EDIT2: The task was ended prematurely. I will have to check, if that is related, or coincidence. ID: 2514 · Rating: 0 · rate: / Reply Quote

m Volunteer tester Send message Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 2522 - Posted: 23 Mar 2016, 11:20:31 UTC - in response to Message 2513. .... I've no idea how or how often the VM checks its clock (against ntp?). I don't have any hosts working right now, but from memory if you watch the console messages as the VM boots it tries to contact an NTP server (pool.ntp.org). Also from memory, the Windows hosts here fail but the Linux hosts succeed so perhaps this interferes with setting up whatever NTP arrangement the VM uses. ID: 2522 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2527 - Posted: 23 Mar 2016, 13:16:55 UTC Last modified: 23 Mar 2016, 13:29:08 UTC EDIT2: The task was ended prematurely. I will have to check, if that is related, or coincidence. This is officially a problem, now. I suspended a task, resumed. Clock jumped, so time was correct. Next job, suspended-resumed, time was delayed by the amount of suspend time. Job finished, reported success, but then task exited prematurely. 2016-03-23 14:06:21 (2868): Guest Log: [INFO] CMS glidein Run 3 ended 2016-03-23 14:06:21 (2868): Guest Log: Log extracts for Run 3 jobs 2016-03-23 14:06:21 (2868): Guest Log: [INFO] No output. Shutting down! 2016-03-23 14:06:21 (2868): VM Completion File Detected. 2016-03-23 14:06:21 (2868): Powering off VM. 2016-03-23 14:06:22 (2868): Successfully stopped VM. ID: 2527 · Rating: 0 · rate: / Reply Quote

Laurence CERN Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 2552 - Posted: 24 Mar 2016, 12:57:18 UTC - in response to Message 2527. Current setings CLAIM_WORKLIFE = ifThenElse(DynamicSlot =?= true,3600,-1) NOT_RESPONDING_TIMEOUT = 10800 CCB_HEARTBEAT_INTERVAL 0 ALIVE_INTERVAL = 1800 MAX_CLAIM_ALIVES_MISSED = 6 JobLeaseDuration = 7200 ID: 2552 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2553 - Posted: 24 Mar 2016, 13:14:51 UTC - in response to Message 2552. CLAIM_WORKLIFE = ifThenElse(DynamicSlot =?= true,3600,-1) I just want to point out, that the line used to be different.Please recheck. http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=100&postid=2434#2434 ID: 2553 · Rating: 0 · rate: / Reply Quote

Laurence CERN Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 2584 - Posted: 4 Apr 2016, 21:02:11 UTC - in response to Message 2553. Last modified: 4 Apr 2016, 21:02:39 UTC Today we had the pleasure of a visit from Miron Livny, the leader of the HTCondor project. He gave a very good talk on The Principals and Practice of Distributed High Throughput Computing, which among other things explained some of the ideas behind HTCondor. A video of the talk should be available at the following URL in a day or so. http://indico.cern.ch/event/512911/ While he was here, we took the opportunity to discuss the suspend/resume issue. ID: 2584 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2600 - Posted: 8 Apr 2016, 7:57:25 UTC If a task is suspended(and resumed), the run (job details) it is on, is not showing in the final log. At the same time (because of that?)the run will end and the (boinc)task exits with "No Output". I have had at least 3 cases, when that happend. ID: 2600 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,863 RAC: 78	Message 2620 - Posted: 9 Apr 2016, 14:10:46 UTC After a long time testing again to see how the situation is now. Suspended job 8328 with LAIM off from/to: vLHCathome-dev 09 Apr 10:13:10 task wu_1459860208_1306_0 suspended by user vLHCathome-dev 09 Apr 11:58:12 task wu_1459860208_1306_0 resumed by user The job 8328 runs on and did 75 events in 40 minutes and then it was decided to kill the job. StarterLog: 04/09/16 10:09:25 (pid:11076) Create_Process succeeded, pid=11081 04/09/16 11:59:36 (pid:11076) condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>. 04/09/16 11:59:36 (pid:11076) IO: Failed to read packet header 04/09/16 12:01:35 (pid:11076) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector lcggwms02.gridpp.rl.ac.uk:9620. 04/09/16 12:01:35 (pid:11076) IO: Failed to read packet header 04/09/16 12:01:35 (pid:11076) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9620 04/09/16 12:01:35 (pid:11076) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9620 failed; will try to reconnect in 60 seconds. 04/09/16 12:02:36 (pid:11076) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#161201 04/09/16 12:04:37 (pid:11076) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11 04/09/16 12:04:37 (pid:11076) Buf::write(): condor_write() failed 04/09/16 12:09:38 (pid:11076) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11 04/09/16 12:09:38 (pid:11076) Buf::write(): condor_write() failed 04/09/16 12:14:38 (pid:11076) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11 04/09/16 12:14:38 (pid:11076) Buf::write(): condor_write() failed 04/09/16 12:19:38 (pid:11076) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11 04/09/16 12:19:38 (pid:11076) Buf::write(): condor_write() failed 04/09/16 12:24:39 (pid:11076) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11 04/09/16 12:24:39 (pid:11076) Buf::write(): condor_write() failed 04/09/16 12:29:39 (pid:11076) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11 04/09/16 12:29:39 (pid:11076) Buf::write(): condor_write() failed 04/09/16 12:34:41 (pid:11076) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11 04/09/16 12:34:41 (pid:11076) Buf::write(): condor_write() failed 04/09/16 12:38:51 (pid:11076) Got SIGQUIT. Performing fast shutdown. 04/09/16 12:38:51 (pid:11076) ShutdownFast all jobs. 04/09/16 12:38:51 (pid:11076) Process exited, pid=11081, signal=9 04/09/16 12:38:51 (pid:11076) condor_write(): Socket closed when trying to write 666 bytes to <130.246.180.120:9818>, fd is 11 04/09/16 12:38:51 (pid:11076) Buf::write(): condor_write() failed 04/09/16 12:38:51 (pid:11076) condor_write(): Socket closed when trying to write 149 bytes to <130.246.180.120:9818>, fd is 11 04/09/16 12:38:51 (pid:11076) Buf::write(): condor_write() failed 04/09/16 12:38:51 (pid:11076) Failed to send job exit status to shadow 04/09/16 12:38:51 (pid:11076) JobExit() failed, waiting for job lease to expire or for a reconnect attempt 04/09/16 12:38:51 (pid:11076) Returning from CStarter::JobReaper() StartdLog: 04/09/16 12:39:21 (pid:8055) starter (pid 11076) is not responding to the request to hardkill its job. The startd will now directly hard kill the starter and all its decendents. 04/09/16 12:39:21 (pid:8055) Starter pid 11076 died on signal 9 (signal 9 (Killed)) 04/09/16 12:39:21 (pid:8055) State change: starter exited 04/09/16 12:39:21 (pid:8055) SECMAN: SetSessionExpiration failed to find session <10.0.2.15:40168>#1460181508#1 04/09/16 12:39:21 (pid:8055) State change: No preempting claim, returning to owner 04/09/16 12:39:21 (pid:8055) Changing state and activity: Preempting/Killing -> Owner/Idle 04/09/16 12:39:21 (pid:8055) State change: IS_OWNER is false 04/09/16 12:39:21 (pid:8055) Changing state: Owner -> Unclaimed 04/09/16 12:39:21 (pid:8055) State change: RunBenchmarks is TRUE 04/09/16 12:39:21 (pid:8055) Changing activity: Idle -> Benchmarking 04/09/16 12:39:21 (pid:8055) BenchMgr:StartBenchmarks() 04/09/16 12:39:25 (pid:8055) State change: RunBenchmarks is TRUE 04/09/16 12:39:25 (pid:8055) Changing activity: Benchmarking -> Idle 04/09/16 12:39:41 (pid:8055) State change: benchmarks completed 04/09/16 12:39:41 (pid:8055) No resources have been claimed for 30 seconds 04/09/16 12:39:41 (pid:8055) Shutting down Condor on this machine. 04/09/16 12:39:41 (pid:8055) Got SIGTERM. Performing graceful shutdown. 04/09/16 12:39:41 (pid:8055) shutdown graceful 04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs 04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs 04/09/16 12:39:41 (pid:8055) Killing job mips 04/09/16 12:39:41 (pid:8055) Killing job kflops 04/09/16 12:39:41 (pid:8055) Deleting cron job manager 04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs 04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs 04/09/16 12:39:41 (pid:8055) CronJobList: Deleting all jobs 04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs 04/09/16 12:39:41 (pid:8055) CronJobList: Deleting all jobs 04/09/16 12:39:41 (pid:8055) Deleting benchmark job mgr 04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs 04/09/16 12:39:41 (pid:8055) Killing job mips 04/09/16 12:39:41 (pid:8055) Killing job kflops 04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs 04/09/16 12:39:41 (pid:8055) Killing job mips 04/09/16 12:39:41 (pid:8055) Killing job kflops 04/09/16 12:39:41 (pid:8055) CronJobList: Deleting all jobs 04/09/16 12:39:41 (pid:8055) CronJobList: Deleting job 'mips' 04/09/16 12:39:41 (pid:8055) CronJob: Deleting job 'mips' (/home/boinc/CMSRun/glide_Qi05YT/main/condor/libexec/condor_mips), timer -1 04/09/16 12:39:41 (pid:8055) CronJobList: Deleting job 'kflops' 04/09/16 12:39:41 (pid:8055) CronJob: Deleting job 'kflops' (/home/boinc/CMSRun/glide_Qi05YT/main/condor/libexec/condor_kflops), timer -1 04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs 04/09/16 12:39:41 (pid:8055) CronJobList: Deleting all jobs 04/09/16 12:39:41 (pid:8055) All resources are free, exiting. 04/09/16 12:39:41 (pid:8055) **** condor_startd (condor_STARTD) pid 8055 EXITING WITH STATUS 0 ID: 2620 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 3242 - Posted: 4 May 2016, 11:25:58 UTC Last modified: 4 May 2016, 11:28:39 UTC After suspending a tasks for 15min, there are no more entries in stderr.txt in the slot folder. Job started, Job finished etc is not shown after resume. The logs in "show graphivs appear normal. 2016-05-04 09:53:49 (3372): Guest Log: [INFO] CMS application starting. Check log files. 2016-05-04 09:54:49 (3372): Guest Log: [INFO] New Job Starting 2016-05-04 09:54:49 (3372): Guest Log: [INFO] Condor JobID: 1121023 2016-05-04 11:13:57 (3372): VM state change detected. (old = 'running', new = 'paused') 2016-05-04 11:30:50 (3372): VM state change detected. (old = 'paused', new = 'running') 2016-05-04 11:31:30 (3372): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000 2016-05-04 11:49:41 (3372): Status Report: Job Duration: '64800.000000' 2016-05-04 11:49:41 (3372): Status Report: Elapsed Time: '6009.148701' 2016-05-04 11:49:41 (3372): Status Report: CPU Time: '5104.890625' ID: 3242 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 3244 - Posted: 4 May 2016, 11:38:28 UTC Last modified: 4 May 2016, 11:40:35 UTC Actually, the task, that was not suspended is doing the same. (suspend) was from 11.14 to 11.30 local time. 2016-05-04 11:26:49 (2104): Guest Log: [INFO] New Job Starting 2016-05-04 11:26:49 (2104): Guest Log: [INFO] Condor JobID: 1121116 2016-05-04 11:34:50 (2104): Status Report: Job Duration: '64800.000000' 2016-05-04 11:34:50 (2104): Status Report: Elapsed Time: '6001.548033' 2016-05-04 11:34:50 (2104): Status Report: CPU Time: '5099.093750' 2016-05-04 11:39:00 (2104): Guest Log: BIOS: VirtualBox 5.0.18 2016-05-04 13:14:58 (2104): Status Report: Job Duration: '64800.000000' 2016-05-04 13:14:58 (2104): Status Report: Elapsed Time: '12009.233246' 2016-05-04 13:14:58 (2104): Status Report: CPU Time: '10469.375000' 2016-05-04 13:32:50 (2104): Guest Log: 2.2.0.0 3366 0 21720 2615 14 1 1585739 10240001 2 65024 0 9 77.7778 10741 0 http://cernvmfs.gridpp.rl.ac.uk/cvmfs/grid.cern.ch http://128.142.168.202:3125 1 ID: 3244 · Rating: 0 · rate: / Reply Quote

Laurence CERN Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 3248 - Posted: 4 May 2016, 13:37:22 UTC - in response to Message 3244. I am currently testing Suspend/Resume. Pausing (LAIM) for 30 minutes went fine and am now trying for 60 minutes. ID: 3248 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 3249 - Posted: 4 May 2016, 15:57:18 UTC - in response to Message 3248. You will have to explicitly state on the website, that LAIM has to be turned on. Otherwise, it means nothing. For production purposes, worst case scenario is to be assumed. That means, it also should work, to some degree, with LAIM off. ID: 3249 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 3250 - Posted: 4 May 2016, 16:23:29 UTC The vm (task), i suspended for 15min is now, 7 hours later, still 15min behind the host clock. ID: 3250 · Rating: 0 · rate: / Reply Quote

Laurence CERN Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 3251 - Posted: 4 May 2016, 17:39:38 UTC - in response to Message 2448. Please try this with the Theory app. This is where I am working. But don't try until tomorrow as I still need to do an update. ID: 3251 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 3252 - Posted: 4 May 2016, 17:55:58 UTC I will try. A second suspend cause the job to be abandoned and a new one started. Time is now in sync with host. ID: 3252 · Rating: 0 · rate: / Reply Quote

Laurence CERN Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 3254 - Posted: 4 May 2016, 21:58:56 UTC - in response to Message 3252. Last modified: 4 May 2016, 21:59:47 UTC The current issue is that if the TCP connection breaks to the STARTD, it fails to reconnect with the same ccbid and fails with the message on the server this ccbid has no reconnect info! The job may continue till the end but in the StarterLog you will see the following. 05/04/16 19:25:51 Failed to send job exit status to shadow 05/04/16 19:25:51 JobExit() failed, waiting for job lease to expire or for a reconnect attempt EDIT: This is for the Theory app so we should switch this discussion there. ID: 3254 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 3517 - Posted: 28 May 2016, 19:31:09 UTC Found this in starter.log: 05/28/16 16:34:32 (pid:6956) Create_Process succeeded, pid=6960 05/28/16 18:34:47 (pid:6956) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9623 05/28/16 18:34:47 (pid:6956) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds. 05/28/16 18:35:31 (pid:6956) Lost connection to shadow, waiting 7200 secs for reconnect 05/28/16 18:35:48 (pid:6956) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#92895 05/28/16 18:36:21 (pid:6956) Accepted request to reconnect from <:0> 05/28/16 18:36:21 (pid:6956) Ignoring old shadow <130.246.180.120:9818?noUDP&sock=50946_f1aa_98018> 05/28/16 18:36:21 (pid:6956) Communicating with shadow <130.246.180.120:9818?noUDP&sock=50946_f1aa_98018> 05/28/16 18:36:21 (pid:6956) Recovered connection to shadow after 50 seconds 05/28/16 18:59:01 (pid:6956) Process exited, pid=6960, status=0 05/28/16 18:59:03 (pid:6956) Got SIGQUIT. Performing fast shutdown. ID: 3517 · Rating: 0 · rate: / Reply Quote

Development for LHC@home