Message boards : CMS Application : Suspend/Resume/Disconnect
Message board moderation
Author | Message |
---|---|
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 ![]() |
I have tested 10min disconnect from internet. It failed, connection was not reestablished.Job was finished and uploaded, but stuck in WNPostproc(Where it is going to time out and rerun by a different host). 03/18/16 16:35:48 (pid:7659) condor_write(): Socket closed when trying to write 192 bytes to <130.246.180.120:9818>, fd is 8 03/18/16 16:35:48 (pid:7659) Buf::write(): condor_write() failed 03/18/16 16:35:48 (pid:7659) Failed to send job exit status to shadow 03/18/16 16:35:48 (pid:7659) JobExit() failed, waiting for job lease to expire or for a reconnect attempt 03/18/16 16:35:48 (pid:7659) Returning from CStarter::JobReaper() Previously, it was possible, now it is not. Getting a new job is also a problem.Even 2h after the end of the disconnect, no new work has been started. I suggest to change the Claim_Worklife back to where it was. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 ![]() |
Strange phenomenon: I suspended a task for 15 min.(LAIM on) After resume, i looked at the logs. The events where processed line after line,BUT there was no gap in time! The events were produced, as i watched, with a time stamp, that was 15min behind real time. How does that work? Does the vm no use the host system time? EDIT: The system time for this vm is now 15 min behind. Another VM, that was also running has the correct time. EDIT2: The task was ended prematurely. |
![]() ![]() Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 ![]() |
I believe I've seen that behaviour too, but eventually I wasn't sure. I've no idea how or how often the VM checks its clock (against ntp?). ![]() |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 ![]() |
EDIT2: The task was ended prematurely. I will have to check, if that is related, or coincidence. |
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 ![]() ![]() |
.... I've no idea how or how often the VM checks its clock (against ntp?). I don't have any hosts working right now, but from memory if you watch the console messages as the VM boots it tries to contact an NTP server (pool.ntp.org). Also from memory, the Windows hosts here fail but the Linux hosts succeed so perhaps this interferes with setting up whatever NTP arrangement the VM uses. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 ![]() |
EDIT2: The task was ended prematurely. This is officially a problem, now. I suspended a task, resumed. Clock jumped, so time was correct. Next job, suspended-resumed, time was delayed by the amount of suspend time. Job finished, reported success, but then task exited prematurely. 2016-03-23 14:06:21 (2868): Guest Log: [INFO] CMS glidein Run 3 ended |
![]() ![]() Send message Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 19 ![]() |
Current setings CLAIM_WORKLIFE = ifThenElse(DynamicSlot =?= true,3600,-1) NOT_RESPONDING_TIMEOUT = 10800 CCB_HEARTBEAT_INTERVAL 0 ALIVE_INTERVAL = 1800 MAX_CLAIM_ALIVES_MISSED = 6 JobLeaseDuration = 7200 |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 ![]() |
CLAIM_WORKLIFE = ifThenElse(DynamicSlot =?= true,3600,-1) I just want to point out, that the line used to be different.Please recheck. http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=100&postid=2434#2434 |
![]() ![]() Send message Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 19 ![]() |
Today we had the pleasure of a visit from Miron Livny, the leader of the HTCondor project. He gave a very good talk on The Principals and Practice of Distributed High Throughput Computing, which among other things explained some of the ideas behind HTCondor. A video of the talk should be available at the following URL in a day or so. http://indico.cern.ch/event/512911/ While he was here, we took the opportunity to discuss the suspend/resume issue. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 ![]() |
If a task is suspended(and resumed), the run (job details) it is on, is not showing in the final log. At the same time (because of that?)the run will end and the (boinc)task exits with "No Output". I have had at least 3 cases, when that happend. |
Send message Joined: 13 Feb 15 Posts: 1217 Credit: 906,817 RAC: 1,469 ![]() ![]() ![]() |
After a long time testing again to see how the situation is now. Suspended job 8328 with LAIM off from/to: vLHCathome-dev 09 Apr 10:13:10 task wu_1459860208_1306_0 suspended by user vLHCathome-dev 09 Apr 11:58:12 task wu_1459860208_1306_0 resumed by user The job 8328 runs on and did 75 events in 40 minutes and then it was decided to kill the job. StarterLog: 04/09/16 10:09:25 (pid:11076) Create_Process succeeded, pid=11081 04/09/16 11:59:36 (pid:11076) condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>. 04/09/16 11:59:36 (pid:11076) IO: Failed to read packet header 04/09/16 12:01:35 (pid:11076) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector lcggwms02.gridpp.rl.ac.uk:9620. 04/09/16 12:01:35 (pid:11076) IO: Failed to read packet header 04/09/16 12:01:35 (pid:11076) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9620 04/09/16 12:01:35 (pid:11076) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9620 failed; will try to reconnect in 60 seconds. 04/09/16 12:02:36 (pid:11076) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#161201 04/09/16 12:04:37 (pid:11076) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11 04/09/16 12:04:37 (pid:11076) Buf::write(): condor_write() failed 04/09/16 12:09:38 (pid:11076) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11 04/09/16 12:09:38 (pid:11076) Buf::write(): condor_write() failed 04/09/16 12:14:38 (pid:11076) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11 04/09/16 12:14:38 (pid:11076) Buf::write(): condor_write() failed 04/09/16 12:19:38 (pid:11076) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11 04/09/16 12:19:38 (pid:11076) Buf::write(): condor_write() failed 04/09/16 12:24:39 (pid:11076) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11 04/09/16 12:24:39 (pid:11076) Buf::write(): condor_write() failed 04/09/16 12:29:39 (pid:11076) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11 04/09/16 12:29:39 (pid:11076) Buf::write(): condor_write() failed 04/09/16 12:34:41 (pid:11076) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11 04/09/16 12:34:41 (pid:11076) Buf::write(): condor_write() failed 04/09/16 12:38:51 (pid:11076) Got SIGQUIT. Performing fast shutdown. 04/09/16 12:38:51 (pid:11076) ShutdownFast all jobs. 04/09/16 12:38:51 (pid:11076) Process exited, pid=11081, signal=9 04/09/16 12:38:51 (pid:11076) condor_write(): Socket closed when trying to write 666 bytes to <130.246.180.120:9818>, fd is 11 04/09/16 12:38:51 (pid:11076) Buf::write(): condor_write() failed 04/09/16 12:38:51 (pid:11076) condor_write(): Socket closed when trying to write 149 bytes to <130.246.180.120:9818>, fd is 11 04/09/16 12:38:51 (pid:11076) Buf::write(): condor_write() failed 04/09/16 12:38:51 (pid:11076) Failed to send job exit status to shadow 04/09/16 12:38:51 (pid:11076) JobExit() failed, waiting for job lease to expire or for a reconnect attempt 04/09/16 12:38:51 (pid:11076) Returning from CStarter::JobReaper() StartdLog: 04/09/16 12:39:21 (pid:8055) starter (pid 11076) is not responding to the request to hardkill its job. The startd will now directly hard kill the starter and all its decendents. 04/09/16 12:39:21 (pid:8055) Starter pid 11076 died on signal 9 (signal 9 (Killed)) 04/09/16 12:39:21 (pid:8055) State change: starter exited 04/09/16 12:39:21 (pid:8055) SECMAN: SetSessionExpiration failed to find session <10.0.2.15:40168>#1460181508#1 04/09/16 12:39:21 (pid:8055) State change: No preempting claim, returning to owner 04/09/16 12:39:21 (pid:8055) Changing state and activity: Preempting/Killing -> Owner/Idle 04/09/16 12:39:21 (pid:8055) State change: IS_OWNER is false 04/09/16 12:39:21 (pid:8055) Changing state: Owner -> Unclaimed 04/09/16 12:39:21 (pid:8055) State change: RunBenchmarks is TRUE 04/09/16 12:39:21 (pid:8055) Changing activity: Idle -> Benchmarking 04/09/16 12:39:21 (pid:8055) BenchMgr:StartBenchmarks() 04/09/16 12:39:25 (pid:8055) State change: RunBenchmarks is TRUE 04/09/16 12:39:25 (pid:8055) Changing activity: Benchmarking -> Idle 04/09/16 12:39:41 (pid:8055) State change: benchmarks completed 04/09/16 12:39:41 (pid:8055) No resources have been claimed for 30 seconds 04/09/16 12:39:41 (pid:8055) Shutting down Condor on this machine. 04/09/16 12:39:41 (pid:8055) Got SIGTERM. Performing graceful shutdown. 04/09/16 12:39:41 (pid:8055) shutdown graceful 04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs 04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs 04/09/16 12:39:41 (pid:8055) Killing job mips 04/09/16 12:39:41 (pid:8055) Killing job kflops 04/09/16 12:39:41 (pid:8055) Deleting cron job manager 04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs 04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs 04/09/16 12:39:41 (pid:8055) CronJobList: Deleting all jobs 04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs 04/09/16 12:39:41 (pid:8055) CronJobList: Deleting all jobs 04/09/16 12:39:41 (pid:8055) Deleting benchmark job mgr 04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs 04/09/16 12:39:41 (pid:8055) Killing job mips 04/09/16 12:39:41 (pid:8055) Killing job kflops 04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs 04/09/16 12:39:41 (pid:8055) Killing job mips 04/09/16 12:39:41 (pid:8055) Killing job kflops 04/09/16 12:39:41 (pid:8055) CronJobList: Deleting all jobs 04/09/16 12:39:41 (pid:8055) CronJobList: Deleting job 'mips' 04/09/16 12:39:41 (pid:8055) CronJob: Deleting job 'mips' (/home/boinc/CMSRun/glide_Qi05YT/main/condor/libexec/condor_mips), timer -1 04/09/16 12:39:41 (pid:8055) CronJobList: Deleting job 'kflops' 04/09/16 12:39:41 (pid:8055) CronJob: Deleting job 'kflops' (/home/boinc/CMSRun/glide_Qi05YT/main/condor/libexec/condor_kflops), timer -1 04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs 04/09/16 12:39:41 (pid:8055) CronJobList: Deleting all jobs 04/09/16 12:39:41 (pid:8055) All resources are free, exiting. 04/09/16 12:39:41 (pid:8055) **** condor_startd (condor_STARTD) pid 8055 EXITING WITH STATUS 0 |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 ![]() |
After suspending a tasks for 15min, there are no more entries in stderr.txt in the slot folder. Job started, Job finished etc is not shown after resume. The logs in "show graphivs appear normal. 2016-05-04 09:53:49 (3372): Guest Log: [INFO] CMS application starting. Check log files. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 ![]() |
Actually, the task, that was not suspended is doing the same. (suspend) was from 11.14 to 11.30 local time. 2016-05-04 11:26:49 (2104): Guest Log: [INFO] New Job Starting |
![]() ![]() Send message Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 19 ![]() |
I am currently testing Suspend/Resume. Pausing (LAIM) for 30 minutes went fine and am now trying for 60 minutes. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 ![]() |
You will have to explicitly state on the website, that LAIM has to be turned on. Otherwise, it means nothing. For production purposes, worst case scenario is to be assumed. That means, it also should work, to some degree, with LAIM off. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 ![]() |
The vm (task), i suspended for 15min is now, 7 hours later, still 15min behind the host clock. |
![]() ![]() Send message Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 19 ![]() |
Please try this with the Theory app. This is where I am working. But don't try until tomorrow as I still need to do an update. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 ![]() |
I will try. A second suspend cause the job to be abandoned and a new one started. Time is now in sync with host. |
![]() ![]() Send message Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 19 ![]() |
The current issue is that if the TCP connection breaks to the STARTD, it fails to reconnect with the same ccbid and fails with the message on the server this ccbid has no reconnect info! The job may continue till the end but in the StarterLog you will see the following. 05/04/16 19:25:51 Failed to send job exit status to shadow EDIT: This is for the Theory app so we should switch this discussion there. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 ![]() |
Found this in starter.log: 05/28/16 16:34:32 (pid:6956) Create_Process succeeded, pid=6960 05/28/16 18:34:47 (pid:6956) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9623 05/28/16 18:34:47 (pid:6956) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds. 05/28/16 18:35:31 (pid:6956) Lost connection to shadow, waiting 7200 secs for reconnect 05/28/16 18:35:48 (pid:6956) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#92895 05/28/16 18:36:21 (pid:6956) Accepted request to reconnect from <:0> 05/28/16 18:36:21 (pid:6956) Ignoring old shadow <130.246.180.120:9818?noUDP&sock=50946_f1aa_98018> 05/28/16 18:36:21 (pid:6956) Communicating with shadow <130.246.180.120:9818?noUDP&sock=50946_f1aa_98018> 05/28/16 18:36:21 (pid:6956) Recovered connection to shadow after 50 seconds 05/28/16 18:59:01 (pid:6956) Process exited, pid=6960, status=0 05/28/16 18:59:03 (pid:6956) Got SIGQUIT. Performing fast shutdown. |
©2025 CERN