Message boards : CMS Application : Suspend/Resume/Disconnect
Message board moderation

To post messages, you must log in.

1 · 2 · Next

AuthorMessage
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2448 - Posted: 18 Mar 2016, 16:11:32 UTC

I have tested 10min disconnect from internet.

It failed, connection was not reestablished.Job was finished and uploaded, but stuck in WNPostproc(Where it is going to time out and rerun by a different host).

03/18/16 16:35:48 (pid:7659) condor_write(): Socket closed when trying to write 192 bytes to <130.246.180.120:9818>, fd is 8
03/18/16 16:35:48 (pid:7659) Buf::write(): condor_write() failed
03/18/16 16:35:48 (pid:7659) Failed to send job exit status to shadow
03/18/16 16:35:48 (pid:7659) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
03/18/16 16:35:48 (pid:7659) Returning from CStarter::JobReaper()

Previously, it was possible, now it is not.
Getting a new job is also a problem.Even 2h after the end of the disconnect, no new work has been started.

I suggest to change the Claim_Worklife back to where it was.
ID: 2448 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2503 - Posted: 22 Mar 2016, 15:23:56 UTC
Last modified: 22 Mar 2016, 15:44:25 UTC

Strange phenomenon:
I suspended a task for 15 min.(LAIM on)
After resume, i looked at the logs.
The events where processed line after line,BUT there was no gap in time!
The events were produced, as i watched, with a time stamp, that was 15min behind real time.

How does that work? Does the vm no use the host system time?

EDIT: The system time for this vm is now 15 min behind.
Another VM, that was also running has the correct time.


EDIT2: The task was ended prematurely.
ID: 2503 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar

Send message
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 2513 - Posted: 22 Mar 2016, 22:46:39 UTC - in response to Message 2503.  

I believe I've seen that behaviour too, but eventually I wasn't sure. I've no idea how or how often the VM checks its clock (against ntp?).
ID: 2513 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2514 - Posted: 22 Mar 2016, 22:50:08 UTC - in response to Message 2513.  

EDIT2: The task was ended prematurely.


I will have to check, if that is related, or coincidence.
ID: 2514 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
m
Volunteer tester

Send message
Joined: 20 Mar 15
Posts: 243
Credit: 874,518
RAC: 399
Message 2522 - Posted: 23 Mar 2016, 11:20:31 UTC - in response to Message 2513.  

.... I've no idea how or how often the VM checks its clock (against ntp?).


I don't have any hosts working right now, but from memory if you watch the console messages as the VM boots it tries to contact an NTP server (pool.ntp.org).
Also from memory, the Windows hosts here fail but the Linux hosts succeed so perhaps this interferes with setting up whatever NTP arrangement the VM uses.
ID: 2522 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2527 - Posted: 23 Mar 2016, 13:16:55 UTC
Last modified: 23 Mar 2016, 13:29:08 UTC

EDIT2: The task was ended prematurely.



I will have to check, if that is related, or coincidence.


This is officially a problem, now.
I suspended a task, resumed. Clock jumped, so time was correct.

Next job, suspended-resumed, time was delayed by the amount of suspend time.
Job finished, reported success, but then task exited prematurely.

2016-03-23 14:06:21 (2868): Guest Log: [INFO] CMS glidein Run 3 ended
2016-03-23 14:06:21 (2868): Guest Log: Log extracts for Run 3 jobs
2016-03-23 14:06:21 (2868): Guest Log: [INFO] No output. Shutting down!
2016-03-23 14:06:21 (2868): VM Completion File Detected.
2016-03-23 14:06:21 (2868): Powering off VM.
2016-03-23 14:06:22 (2868): Successfully stopped VM.
ID: 2527 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 2552 - Posted: 24 Mar 2016, 12:57:18 UTC - in response to Message 2527.  

Current setings


CLAIM_WORKLIFE = ifThenElse(DynamicSlot =?= true,3600,-1)
NOT_RESPONDING_TIMEOUT = 10800
CCB_HEARTBEAT_INTERVAL 0
ALIVE_INTERVAL = 1800
MAX_CLAIM_ALIVES_MISSED = 6
JobLeaseDuration = 7200
ID: 2552 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2553 - Posted: 24 Mar 2016, 13:14:51 UTC - in response to Message 2552.  

CLAIM_WORKLIFE = ifThenElse(DynamicSlot =?= true,3600,-1)

I just want to point out, that the line used to be different.Please recheck.

http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=100&postid=2434#2434
ID: 2553 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 2584 - Posted: 4 Apr 2016, 21:02:11 UTC - in response to Message 2553.  
Last modified: 4 Apr 2016, 21:02:39 UTC

Today we had the pleasure of a visit from Miron Livny, the leader of the HTCondor project. He gave a very good talk on The Principals and Practice of Distributed High Throughput Computing, which among other things explained some of the ideas behind HTCondor. A video of the talk should be available at the following URL in a day or so.

http://indico.cern.ch/event/512911/

While he was here, we took the opportunity to discuss the suspend/resume issue.
ID: 2584 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2600 - Posted: 8 Apr 2016, 7:57:25 UTC

If a task is suspended(and resumed), the run (job details) it is on, is not showing in the final log.
At the same time (because of that?)the run will end and the (boinc)task exits with "No Output".

I have had at least 3 cases, when that happend.
ID: 2600 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 2620 - Posted: 9 Apr 2016, 14:10:46 UTC

After a long time testing again to see how the situation is now.

Suspended job 8328 with LAIM off from/to:
vLHCathome-dev 09 Apr 10:13:10 task wu_1459860208_1306_0 suspended by user
vLHCathome-dev 09 Apr 11:58:12 task wu_1459860208_1306_0 resumed by user


The job 8328 runs on and did 75 events in 40 minutes and then it was decided to kill the job.

StarterLog:
04/09/16 10:09:25 (pid:11076) Create_Process succeeded, pid=11081
04/09/16 11:59:36 (pid:11076) condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.
04/09/16 11:59:36 (pid:11076) IO: Failed to read packet header
04/09/16 12:01:35 (pid:11076) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector lcggwms02.gridpp.rl.ac.uk:9620.
04/09/16 12:01:35 (pid:11076) IO: Failed to read packet header
04/09/16 12:01:35 (pid:11076) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9620
04/09/16 12:01:35 (pid:11076) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9620 failed; will try to reconnect in 60 seconds.
04/09/16 12:02:36 (pid:11076) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#161201
04/09/16 12:04:37 (pid:11076) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11
04/09/16 12:04:37 (pid:11076) Buf::write(): condor_write() failed
04/09/16 12:09:38 (pid:11076) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11
04/09/16 12:09:38 (pid:11076) Buf::write(): condor_write() failed
04/09/16 12:14:38 (pid:11076) condor_write(): Socket closed when trying to write 561 bytes to <130.246.180.120:9818>, fd is 11
04/09/16 12:14:38 (pid:11076) Buf::write(): condor_write() failed
04/09/16 12:19:38 (pid:11076) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
04/09/16 12:19:38 (pid:11076) Buf::write(): condor_write() failed
04/09/16 12:24:39 (pid:11076) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
04/09/16 12:24:39 (pid:11076) Buf::write(): condor_write() failed
04/09/16 12:29:39 (pid:11076) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
04/09/16 12:29:39 (pid:11076) Buf::write(): condor_write() failed
04/09/16 12:34:41 (pid:11076) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
04/09/16 12:34:41 (pid:11076) Buf::write(): condor_write() failed
04/09/16 12:38:51 (pid:11076) Got SIGQUIT. Performing fast shutdown.
04/09/16 12:38:51 (pid:11076) ShutdownFast all jobs.
04/09/16 12:38:51 (pid:11076) Process exited, pid=11081, signal=9
04/09/16 12:38:51 (pid:11076) condor_write(): Socket closed when trying to write 666 bytes to <130.246.180.120:9818>, fd is 11
04/09/16 12:38:51 (pid:11076) Buf::write(): condor_write() failed
04/09/16 12:38:51 (pid:11076) condor_write(): Socket closed when trying to write 149 bytes to <130.246.180.120:9818>, fd is 11
04/09/16 12:38:51 (pid:11076) Buf::write(): condor_write() failed
04/09/16 12:38:51 (pid:11076) Failed to send job exit status to shadow
04/09/16 12:38:51 (pid:11076) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
04/09/16 12:38:51 (pid:11076) Returning from CStarter::JobReaper()


StartdLog:
04/09/16 12:39:21 (pid:8055) starter (pid 11076) is not responding to the request to hardkill its job. The startd will now directly hard kill the starter and all its decendents.
04/09/16 12:39:21 (pid:8055) Starter pid 11076 died on signal 9 (signal 9 (Killed))
04/09/16 12:39:21 (pid:8055) State change: starter exited
04/09/16 12:39:21 (pid:8055) SECMAN: SetSessionExpiration failed to find session <10.0.2.15:40168>#1460181508#1
04/09/16 12:39:21 (pid:8055) State change: No preempting claim, returning to owner
04/09/16 12:39:21 (pid:8055) Changing state and activity: Preempting/Killing -> Owner/Idle
04/09/16 12:39:21 (pid:8055) State change: IS_OWNER is false
04/09/16 12:39:21 (pid:8055) Changing state: Owner -> Unclaimed
04/09/16 12:39:21 (pid:8055) State change: RunBenchmarks is TRUE
04/09/16 12:39:21 (pid:8055) Changing activity: Idle -> Benchmarking
04/09/16 12:39:21 (pid:8055) BenchMgr:StartBenchmarks()
04/09/16 12:39:25 (pid:8055) State change: RunBenchmarks is TRUE
04/09/16 12:39:25 (pid:8055) Changing activity: Benchmarking -> Idle
04/09/16 12:39:41 (pid:8055) State change: benchmarks completed
04/09/16 12:39:41 (pid:8055) No resources have been claimed for 30 seconds
04/09/16 12:39:41 (pid:8055) Shutting down Condor on this machine.
04/09/16 12:39:41 (pid:8055) Got SIGTERM. Performing graceful shutdown.
04/09/16 12:39:41 (pid:8055) shutdown graceful
04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs
04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs
04/09/16 12:39:41 (pid:8055) Killing job mips
04/09/16 12:39:41 (pid:8055) Killing job kflops
04/09/16 12:39:41 (pid:8055) Deleting cron job manager
04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs
04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs
04/09/16 12:39:41 (pid:8055) CronJobList: Deleting all jobs
04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs
04/09/16 12:39:41 (pid:8055) CronJobList: Deleting all jobs
04/09/16 12:39:41 (pid:8055) Deleting benchmark job mgr
04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs
04/09/16 12:39:41 (pid:8055) Killing job mips
04/09/16 12:39:41 (pid:8055) Killing job kflops
04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs
04/09/16 12:39:41 (pid:8055) Killing job mips
04/09/16 12:39:41 (pid:8055) Killing job kflops
04/09/16 12:39:41 (pid:8055) CronJobList: Deleting all jobs
04/09/16 12:39:41 (pid:8055) CronJobList: Deleting job 'mips'
04/09/16 12:39:41 (pid:8055) CronJob: Deleting job 'mips' (/home/boinc/CMSRun/glide_Qi05YT/main/condor/libexec/condor_mips), timer -1
04/09/16 12:39:41 (pid:8055) CronJobList: Deleting job 'kflops'
04/09/16 12:39:41 (pid:8055) CronJob: Deleting job 'kflops' (/home/boinc/CMSRun/glide_Qi05YT/main/condor/libexec/condor_kflops), timer -1
04/09/16 12:39:41 (pid:8055) Cron: Killing all jobs
04/09/16 12:39:41 (pid:8055) CronJobList: Deleting all jobs
04/09/16 12:39:41 (pid:8055) All resources are free, exiting.
04/09/16 12:39:41 (pid:8055) **** condor_startd (condor_STARTD) pid 8055 EXITING WITH STATUS 0
ID: 2620 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3242 - Posted: 4 May 2016, 11:25:58 UTC
Last modified: 4 May 2016, 11:28:39 UTC

After suspending a tasks for 15min, there are no more entries in stderr.txt in the slot folder.
Job started, Job finished etc is not shown after resume.

The logs in "show graphivs appear normal.

2016-05-04 09:53:49 (3372): Guest Log: [INFO] CMS application starting. Check log files.
2016-05-04 09:54:49 (3372): Guest Log: [INFO] New Job Starting
2016-05-04 09:54:49 (3372): Guest Log: [INFO] Condor JobID: 1121023
2016-05-04 11:13:57 (3372): VM state change detected. (old = 'running', new = 'paused')
2016-05-04 11:30:50 (3372): VM state change detected. (old = 'paused', new = 'running')
2016-05-04 11:31:30 (3372): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000
2016-05-04 11:49:41 (3372): Status Report: Job Duration: '64800.000000'
2016-05-04 11:49:41 (3372): Status Report: Elapsed Time: '6009.148701'
2016-05-04 11:49:41 (3372): Status Report: CPU Time: '5104.890625'
ID: 3242 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3244 - Posted: 4 May 2016, 11:38:28 UTC
Last modified: 4 May 2016, 11:40:35 UTC

Actually, the task, that was not suspended is doing the same.

(suspend) was from 11.14 to 11.30 local time.

2016-05-04 11:26:49 (2104): Guest Log: [INFO] New Job Starting
2016-05-04 11:26:49 (2104): Guest Log: [INFO] Condor JobID: 1121116
2016-05-04 11:34:50 (2104): Status Report: Job Duration: '64800.000000'
2016-05-04 11:34:50 (2104): Status Report: Elapsed Time: '6001.548033'
2016-05-04 11:34:50 (2104): Status Report: CPU Time: '5099.093750'
2016-05-04 11:39:00 (2104): Guest Log: BIOS: VirtualBox 5.0.18
2016-05-04 13:14:58 (2104): Status Report: Job Duration: '64800.000000'
2016-05-04 13:14:58 (2104): Status Report: Elapsed Time: '12009.233246'
2016-05-04 13:14:58 (2104): Status Report: CPU Time: '10469.375000'
2016-05-04 13:32:50 (2104): Guest Log: 2.2.0.0 3366 0 21720 2615 14 1 1585739 10240001 2 65024 0 9 77.7778 10741 0 http://cernvmfs.gridpp.rl.ac.uk/cvmfs/grid.cern.ch http://128.142.168.202:3125 1
ID: 3244 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 3248 - Posted: 4 May 2016, 13:37:22 UTC - in response to Message 3244.  

I am currently testing Suspend/Resume. Pausing (LAIM) for 30 minutes went fine and am now trying for 60 minutes.
ID: 3248 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3249 - Posted: 4 May 2016, 15:57:18 UTC - in response to Message 3248.  

You will have to explicitly state on the website, that LAIM has to be turned on.
Otherwise, it means nothing.
For production purposes, worst case scenario is to be assumed.
That means, it also should work, to some degree, with LAIM off.
ID: 3249 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3250 - Posted: 4 May 2016, 16:23:29 UTC

The vm (task), i suspended for 15min is now, 7 hours later, still 15min behind the host clock.
ID: 3250 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 3251 - Posted: 4 May 2016, 17:39:38 UTC - in response to Message 2448.  

Please try this with the Theory app. This is where I am working. But don't try until tomorrow as I still need to do an update.
ID: 3251 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3252 - Posted: 4 May 2016, 17:55:58 UTC

I will try.
A second suspend cause the job to be abandoned and a new one started.
Time is now in sync with host.
ID: 3252 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 3254 - Posted: 4 May 2016, 21:58:56 UTC - in response to Message 3252.  
Last modified: 4 May 2016, 21:59:47 UTC

The current issue is that if the TCP connection breaks to the STARTD, it fails to reconnect with the same ccbid and fails with the message on the server this ccbid has no reconnect info! The job may continue till the end but in the StarterLog you will see the following.

05/04/16 19:25:51 Failed to send job exit status to shadow
05/04/16 19:25:51 JobExit() failed, waiting for job lease to expire or for a reconnect attempt


EDIT: This is for the Theory app so we should switch this discussion there.
ID: 3254 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3517 - Posted: 28 May 2016, 19:31:09 UTC

Found this in starter.log:

05/28/16 16:34:32 (pid:6956) Create_Process succeeded, pid=6960
05/28/16 18:34:47 (pid:6956) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9623
05/28/16 18:34:47 (pid:6956) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
05/28/16 18:35:31 (pid:6956) Lost connection to shadow, waiting 7200 secs for reconnect
05/28/16 18:35:48 (pid:6956) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#92895
05/28/16 18:36:21 (pid:6956) Accepted request to reconnect from <:0>
05/28/16 18:36:21 (pid:6956) Ignoring old shadow <130.246.180.120:9818?noUDP&sock=50946_f1aa_98018>
05/28/16 18:36:21 (pid:6956) Communicating with shadow <130.246.180.120:9818?noUDP&sock=50946_f1aa_98018>
05/28/16 18:36:21 (pid:6956) Recovered connection to shadow after 50 seconds
05/28/16 18:59:01 (pid:6956) Process exited, pid=6960, status=0
05/28/16 18:59:03 (pid:6956) Got SIGQUIT. Performing fast shutdown.
ID: 3517 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
1 · 2 · Next

Message boards : CMS Application : Suspend/Resume/Disconnect


©2024 CERN