Message boards : News : Suspend/Resume
Message board moderation

To post messages, you must log in.

Previous · 1 · 2 · 3 · 4 · 5

AuthorMessage
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2309 - Posted: 10 Mar 2016, 22:43:21 UTC - in response to Message 2308.  

Thanks Crystal for testing that.
Interesting.
Could you post the first few lines from starter_log(after the task was resumed)?

Is it an OS thing?
Is it the ISP?

I could try 1h30min.Why that should work, when 45min don't,it should not.
ID: 2309 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 2310 - Posted: 10 Mar 2016, 22:48:33 UTC - in response to Message 2302.  

The effort hasn't been put on hold, just moved from the fast track to the medium speed track. It is turning out to be a little more complicated than just adjusting one parameter and isn't exactly quick to test. I really appreciate all your efforts on this one! As Condor will be central to most of what we are doing, getting this issue nailed has a big impact.
ID: 2310 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 2311 - Posted: 10 Mar 2016, 22:50:36 UTC - in response to Message 2308.  

Crystal, please could you try just over 2 hours?
ID: 2311 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 9
Message 2313 - Posted: 11 Mar 2016, 6:29:19 UTC - in response to Message 2309.  

Could you post the first few lines from starter_log(after the task was resumed)?

Good morning (here). Logs meanwhile gone while task finished.
Result with the jobs done: http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=117217
ID: 2313 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 9
Message 2314 - Posted: 11 Mar 2016, 6:36:11 UTC - in response to Message 2311.  

Crystal, please could you try just over 2 hours?

Current task is over 6 hours and will be killed after current job, so I'll test after a new task and job has started. TTYL.

CP
ID: 2314 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 2316 - Posted: 11 Mar 2016, 8:55:15 UTC - in response to Message 2313.  

I have put the log for wu_1457619778_27_0 here:

http://lfield.web.cern.ch/lfield/StarterLog

To me it looks like it is not resuming correctly.

03/10/16 17:52:48 (pid:7588) Create_Process succeeded, pid=7592
03/10/16 19:04:42 (pid:7588) Process exited, pid=7592, status=0
03/10/16 19:04:43 (pid:7588) Got SIGQUIT. Performing fast shutdown.
03/10/16 19:04:43 (pid:7588) ShutdownFast all jobs.

03/10/16 19:04:46 (pid:9599) Create_Process succeeded, pid=9603
03/10/16 22:34:18 (pid:9599) Process exited, pid=9603, status=0
03/10/16 22:34:18 (pid:9599) Got SIGQUIT. Performing fast shutdown.
03/10/16 22:34:18 (pid:9599) ShutdownFast all jobs.

03/10/16 22:34:21 (pid:12131) Running job as user (null)
03/10/16 22:34:21 (pid:12131) Create_Process succeeded, pid=12135
03/11/16 00:50:47 (pid:12131) Process exited, pid=12135, status=0
03/11/16 00:50:47 (pid:12131) Got SIGQUIT. Performing fast shutdown.
03/11/16 00:50:47 (pid:12131) ShutdownFast all jobs.
ID: 2316 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 9
Message 2320 - Posted: 11 Mar 2016, 10:03:13 UTC - in response to Message 2316.  

03/10/16 19:04:46 (pid:9599) Create_Process succeeded, pid=9603
03/10/16 22:34:18 (pid:9599) Process exited, pid=9603, status=0
03/10/16 22:34:18 (pid:9599) Got SIGQUIT. Performing fast shutdown.
03/10/16 22:34:18 (pid:9599) ShutdownFast all jobs.

It resumes correctly and returns a valid job. Above is the one with the 90 minutes suspend and resume.

From the result:
2016-03-10 19:13:38 (8660): VM state change detected. (old = 'running', new = 'paused')
2016-03-10 20:43:41 (8660): VM state change detected. (old = 'paused', new = 'running')

(mind that the host and guest clocks are a bit out of sync)

and

2016-03-11 00:55:48 (8660): Guest Log: ======== gWMS-CMSRunAnalysis.sh STARTING at Thu Mar 10 18:04:46 GMT 2016 on 38-37-6640 ========
2016-03-11 00:55:48 (8660): Guest Log: jobNumber: 9215
2016-03-11 00:55:48 (8660): Guest Log: .....
2016-03-11 00:55:48 (8660): Guest Log: ====== Thu Mar 10 21:34:18 2016: Finished remote stageout of user output files (status 0).
2016-03-11 00:55:48 (8660): Guest Log: Will not inject transfer requests to ASO for the user output files, because they were staged out directly to the permanent storage.
2016-03-11 00:55:48 (8660): Guest Log: ====== Thu Mar 10 21:34:18 2016: cmscp.py FINISHING (status 0).
2016-03-11 00:55:48 (8660): Guest Log: ======== Stageout at Thu Mar 10 21:34:18 GMT 2016 FINISHING (short status 0) ========
2016-03-11 00:55:48 (8660): Guest Log: ======== gWMS-CMSRunAnalysis.sh FINISHING at Thu Mar 10 21:34:18 GMT 2016 on 38-37-6640 with (short) status 0 ========
2016-03-11 00:55:48 (8660): Guest Log: Local time: Thu Mar 10 22:34:18 CET 2016
2016-03-11 00:55:48 (8660): Guest Log: Short exit status: 0
2016-03-11 00:55:48 (8660): Guest Log: Job Running time in seconds: 12572


The test with a suspend of 2 hours and 4 minutes kills the job:
03/11/16 10:41:16 (pid:7583) State change: claim lease expired (condor_schedd gone?)
03/11/16 10:41:16 (pid:7583) Changing state and activity: Claimed/Busy -> Preempting/Killing

and a new cmsRun is started.
ID: 2320 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2321 - Posted: 11 Mar 2016, 10:18:37 UTC - in response to Message 2320.  
Last modified: 11 Mar 2016, 10:21:55 UTC

When i suspend a boinc tasks, the VM is stopped.

2016-03-10 18:30:23 (2416): Status Report: Job Duration: '129600.000000'
2016-03-10 18:30:23 (2416): Status Report: Elapsed Time: '6008.000799'
2016-03-10 18:30:23 (2416): Status Report: CPU Time: '5444.062500'
2016-03-10 19:40:20 (2416): Stopping VM.
2016-03-10 19:40:35 (2416): Successfully stopped VM.
2016-03-10 20:25:13 (3224): vboxwrapper (7.7.26184): starting
2016-03-10 20:25:13 (3224): Feature: Checkpoint interval offset (185 seconds)
2016-03-10 20:25:13 (3224): Detected: VirtualBox COM Interface (Version: 5.0.16)
2016-03-10 20:25:13 (3224): Detected: Anonymous Platform Enabled
2016-03-10 20:25:13 (3224): Detected: Minimum checkpoint interval (600.000000 seconds)
2016-03-10 20:25:13 (3224): Detected: Heatbeat check (file: '$s' every 0.000000 seconds)
2016-03-10 20:25:13 (3224): Starting VM. (boinc_4917bfd85daa2da7, slot#5)


Why is yours just paused? ( Do you have : Leave app in memory enabled?)

From the result:
2016-03-10 19:13:38 (8660): VM state change detected. (old = 'running', new = 'paused')
2016-03-10 20:43:41 (8660): VM state change detected. (old = 'paused', new = 'running')
(mind that the host and guest clocks are a bit out of sync)
ID: 2321 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 9
Message 2322 - Posted: 11 Mar 2016, 10:27:15 UTC - in response to Message 2321.  

Why is yours just paused? ( Do you have : Leave app in memory enabled?)

That's true.
Normally I would test twice (LAIM on/off), but for the moment I tested with LAIM on,
because for the timeout's we're testing here, it should not make a difference, cause guest and server can't reach each other when the VM is 'only' paused.
ID: 2322 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2323 - Posted: 11 Mar 2016, 10:31:09 UTC - in response to Message 2322.  

At least, that is different to mine.
Let's see, if that causes the difference in behavior.
ID: 2323 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2330 - Posted: 11 Mar 2016, 11:46:15 UTC - in response to Message 2323.  
Last modified: 11 Mar 2016, 11:48:04 UTC

It seems LAIM (Leave application in memory while suspended) Boinc setting has a significant impact on the suspend duration,after which it correctly resumes.
I enabled it, and it looks like i can suspend for longer than before.

I believe, that a paused VM leaves the network ports open, while a saved VM does not.

That is the difference between LAIM on/off.

I still have to wait for the job to be confirmed valid,but i am certain, it will.
ID: 2330 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 9
Message 2347 - Posted: 11 Mar 2016, 18:28:28 UTC

Indeed, with LAIM off and a suspend (at 18:37:19), I see much more info after the resume (at 19:18:41) about lost connection.

03/11/16 18:12:17 (pid:10545) Running job as user (null)
03/11/16 18:12:17 (pid:10545) Create_Process succeeded, pid=10550
03/11/16 19:19:45 (pid:10545) condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.
03/11/16 19:19:45 (pid:10545) IO: Failed to read packet header
03/11/16 19:19:45 (pid:10545) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector lcggwms02.gridpp.rl.ac.uk:9620.
03/11/16 19:19:45 (pid:10545) IO: Failed to read packet header
03/11/16 19:19:45 (pid:10545) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9620
03/11/16 19:19:45 (pid:10545) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9620 failed; will try to reconnect in 60 seconds.
03/11/16 19:20:45 (pid:10545) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#139143


But the job runs on. Could repeat this with a suspend > 60 minutes.
ID: 2347 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2348 - Posted: 11 Mar 2016, 18:41:51 UTC - in response to Message 2347.  
Last modified: 11 Mar 2016, 18:47:22 UTC

I have tried (with LAIM on) 45min---perfectly OK.
1h55min--failed- claim expired.

EDIT:I am going to try 1h45min LAIM ON
ID: 2348 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 9
Message 2350 - Posted: 11 Mar 2016, 21:23:07 UTC

Suspending 68min with LAIM off results in premature ending the job and BOINC-task: http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=117566
ID: 2350 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2351 - Posted: 11 Mar 2016, 21:32:11 UTC
Last modified: 11 Mar 2016, 21:36:31 UTC

1h45min suspend with LAIM on Failed.

03/11/16 22:15:14 (pid:7588) condor_read() failed: recv(fd=12) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <10.0.2.15:49685>.
03/11/16 22:15:14 (pid:7588) IO: Failed to read packet header
03/11/16 22:15:14 (pid:7588) AUTHENTICATE: handshake failed!
03/11/16 22:15:14 (pid:7588) DC_AUTHENTICATE: required authentication of 10.0.2.15 failed: AUTHENTICATE:1002:Failure performing handshake
03/11/16 22:15:14 (pid:7588) Starter pid 7629 exited with status 0
03/11/16 22:15:14 (pid:7588) State change: starter exited
03/11/16 22:15:14 (pid:7588) State change: No preempting claim, returning to owner
03/11/16 22:15:14 (pid:7588) Changing state and activity: Preempting/Killing -> Owner/Idle


Boinc-task was ended: 1 job failed out of 1.
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=117664

It was the first job in that task.

NOTE: IT did not run a single event after resume.
ID: 2351 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 2394 - Posted: 15 Mar 2016, 15:54:54 UTC - in response to Message 2351.  
Last modified: 15 Mar 2016, 15:58:13 UTC

The condor configuration has been updated and has been pushed to CVMFS. Should be available with new glideins (runs) this evening. We are targeting suspending up to 2 hours with LAIM on. Here are the current settings.

CLAIM_WORKLIFE = -1 was ifThenElse(DynamicSlot =?= true,3600,-1)
NOT_RESPONDING_TIMEOUT = 10800
CCB_HEARTBEAT_INTERVAL 0
ALIVE_INTERVAL = 1800
MAX_CLAIM_ALIVES_MISSED = 6
JobLeaseDuration = 7200

The settings are shown in the glidein-stderr file from the graphics/logs.
ID: 2394 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 2472 - Posted: 21 Mar 2016, 12:41:01 UTC - in response to Message 2394.  

CLAIM_WORKLIFE set back to 3600
ID: 2472 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2476 - Posted: 21 Mar 2016, 13:35:24 UTC - in response to Message 2472.  

Thanks Laurence,
CLAIM_WORKLIFE default is actually 1200sec (20min).
Will see, how it goes.
ID: 2476 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 9
Message 4579 - Posted: 22 Dec 2016, 7:08:43 UTC

The overnight suspending of a CMS-job did not work anymore.
The yesterday evening's job (~40% done) run after the resume 20 minutes and was killed then with error code 2.

StartLog:
12/22/16 07:29:26 condor_write(): Socket closed when trying to write 4006 bytes to collector vccondor01.cern.ch, fd is 9, errno=104 Connection reset by peer
12/22/16 07:29:26 Buf::write(): condor_write() failed
12/22/16 07:29:26 CCBListener: no activity from CCB server in 32440s; assuming connection is dead.
12/22/16 07:29:26 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
12/22/16 07:30:26 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#1601902
12/22/16 07:50:06 Starter pid 4134 exited with status 2
12/22/16 07:50:06 State change: starter exited


From StarterLog:
12/22/16 07:30:06 (pid:4134) CCBListener: no activity from CCB server in 32480s; assuming connection is dead.
12/22/16 07:30:06 (pid:4134) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
12/22/16 07:30:06 (pid:4134) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.82.11:4080>.
12/22/16 07:30:06 (pid:4134) IO: Failed to read packet header
12/22/16 07:30:06 (pid:4134) Lost connection to shadow, waiting 1200 secs for reconnect
12/22/16 07:31:06 (pid:4134) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#1601927
12/22/16 07:50:06 (pid:4134) No reconnect from shadow for 1200 seconds, aborting job execution!
12/22/16 07:50:06 (pid:4134) **** condor_starter (condor_STARTER) pid 4134 EXITING WITH STATUS 2
ID: 4579 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 4592 - Posted: 29 Dec 2016, 23:13:14 UTC - in response to Message 4579.  

The overnight suspending of a CMS-job did not work anymore.


Here is the long answer ...

HTCondor is a distributed system composed of a number of services that communicate using ClassAds (modelled after classified adverts in newspapers). The Collector is the central information repository to which all other services register with a ClassAd. The Sched is essentially the job queue and job ClassAds are sent to the Collector. The Startd runs in the VM and sends a machine ClassAd to the Collector. The Negotiator periodically looks at the information in the Collector and matches jobs with slots on machines. When a match is made, a Shadow process if forked off from the Sched to handle interactions with the job slot.

So for suspend/resume to work, settings need to be adjusted in a few places. For the new WMAgent workloads,the Sched is new and managed by CMS so these settings have not been adjusted. Initially we planned to add this to the dev project in January but as Rasputin seemed keen, we pushed it out before the annual shut down.

The short answer is ...
It should be working for CMS in LHC@home and is not working here as this is experimental.

Thanks for reporting this and we will address it soon.
ID: 4592 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Previous · 1 · 2 · 3 · 4 · 5

Message boards : News : Suspend/Resume


©2024 CERN