Message boards :
News :
Suspend/Resume
Message board moderation
Previous · 1 · 2 · 3 · 4 · 5
Author | Message |
---|---|
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks Crystal for testing that. Interesting. Could you post the first few lines from starter_log(after the task was resumed)? Is it an OS thing? Is it the ISP? I could try 1h30min.Why that should work, when 45min don't,it should not. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
The effort hasn't been put on hold, just moved from the fast track to the medium speed track. It is turning out to be a little more complicated than just adjusting one parameter and isn't exactly quick to test. I really appreciate all your efforts on this one! As Condor will be central to most of what we are doing, getting this issue nailed has a big impact. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Crystal, please could you try just over 2 hours? |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9 |
Could you post the first few lines from starter_log(after the task was resumed)? Good morning (here). Logs meanwhile gone while task finished. Result with the jobs done: http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=117217 |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9 |
Crystal, please could you try just over 2 hours? Current task is over 6 hours and will be killed after current job, so I'll test after a new task and job has started. TTYL. CP |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
I have put the log for wu_1457619778_27_0 here: http://lfield.web.cern.ch/lfield/StarterLog To me it looks like it is not resuming correctly. 03/10/16 17:52:48 (pid:7588) Create_Process succeeded, pid=7592 03/10/16 19:04:42 (pid:7588) Process exited, pid=7592, status=0 03/10/16 19:04:43 (pid:7588) Got SIGQUIT. Performing fast shutdown. 03/10/16 19:04:43 (pid:7588) ShutdownFast all jobs. 03/10/16 19:04:46 (pid:9599) Create_Process succeeded, pid=9603 03/10/16 22:34:18 (pid:9599) Process exited, pid=9603, status=0 03/10/16 22:34:18 (pid:9599) Got SIGQUIT. Performing fast shutdown. 03/10/16 22:34:18 (pid:9599) ShutdownFast all jobs. 03/10/16 22:34:21 (pid:12131) Running job as user (null) 03/10/16 22:34:21 (pid:12131) Create_Process succeeded, pid=12135 03/11/16 00:50:47 (pid:12131) Process exited, pid=12135, status=0 03/11/16 00:50:47 (pid:12131) Got SIGQUIT. Performing fast shutdown. 03/11/16 00:50:47 (pid:12131) ShutdownFast all jobs. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9 |
03/10/16 19:04:46 (pid:9599) Create_Process succeeded, pid=9603 It resumes correctly and returns a valid job. Above is the one with the 90 minutes suspend and resume. From the result: 2016-03-10 19:13:38 (8660): VM state change detected. (old = 'running', new = 'paused') 2016-03-10 20:43:41 (8660): VM state change detected. (old = 'paused', new = 'running') (mind that the host and guest clocks are a bit out of sync) and 2016-03-11 00:55:48 (8660): Guest Log: ======== gWMS-CMSRunAnalysis.sh STARTING at Thu Mar 10 18:04:46 GMT 2016 on 38-37-6640 ======== 2016-03-11 00:55:48 (8660): Guest Log: jobNumber: 9215 2016-03-11 00:55:48 (8660): Guest Log: ..... 2016-03-11 00:55:48 (8660): Guest Log: ====== Thu Mar 10 21:34:18 2016: Finished remote stageout of user output files (status 0). 2016-03-11 00:55:48 (8660): Guest Log: Will not inject transfer requests to ASO for the user output files, because they were staged out directly to the permanent storage. 2016-03-11 00:55:48 (8660): Guest Log: ====== Thu Mar 10 21:34:18 2016: cmscp.py FINISHING (status 0). 2016-03-11 00:55:48 (8660): Guest Log: ======== Stageout at Thu Mar 10 21:34:18 GMT 2016 FINISHING (short status 0) ======== 2016-03-11 00:55:48 (8660): Guest Log: ======== gWMS-CMSRunAnalysis.sh FINISHING at Thu Mar 10 21:34:18 GMT 2016 on 38-37-6640 with (short) status 0 ======== 2016-03-11 00:55:48 (8660): Guest Log: Local time: Thu Mar 10 22:34:18 CET 2016 2016-03-11 00:55:48 (8660): Guest Log: Short exit status: 0 2016-03-11 00:55:48 (8660): Guest Log: Job Running time in seconds: 12572 The test with a suspend of 2 hours and 4 minutes kills the job: 03/11/16 10:41:16 (pid:7583) State change: claim lease expired (condor_schedd gone?) 03/11/16 10:41:16 (pid:7583) Changing state and activity: Claimed/Busy -> Preempting/Killing and a new cmsRun is started. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
When i suspend a boinc tasks, the VM is stopped. 2016-03-10 18:30:23 (2416): Status Report: Job Duration: '129600.000000' 2016-03-10 18:30:23 (2416): Status Report: Elapsed Time: '6008.000799' 2016-03-10 18:30:23 (2416): Status Report: CPU Time: '5444.062500' 2016-03-10 19:40:20 (2416): Stopping VM. 2016-03-10 19:40:35 (2416): Successfully stopped VM. 2016-03-10 20:25:13 (3224): vboxwrapper (7.7.26184): starting 2016-03-10 20:25:13 (3224): Feature: Checkpoint interval offset (185 seconds) 2016-03-10 20:25:13 (3224): Detected: VirtualBox COM Interface (Version: 5.0.16) 2016-03-10 20:25:13 (3224): Detected: Anonymous Platform Enabled 2016-03-10 20:25:13 (3224): Detected: Minimum checkpoint interval (600.000000 seconds) 2016-03-10 20:25:13 (3224): Detected: Heatbeat check (file: '$s' every 0.000000 seconds) 2016-03-10 20:25:13 (3224): Starting VM. (boinc_4917bfd85daa2da7, slot#5) Why is yours just paused? ( Do you have : Leave app in memory enabled?) From the result: |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9 |
Why is yours just paused? ( Do you have : Leave app in memory enabled?) That's true. Normally I would test twice (LAIM on/off), but for the moment I tested with LAIM on, because for the timeout's we're testing here, it should not make a difference, cause guest and server can't reach each other when the VM is 'only' paused. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
At least, that is different to mine. Let's see, if that causes the difference in behavior. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
It seems LAIM (Leave application in memory while suspended) Boinc setting has a significant impact on the suspend duration,after which it correctly resumes. I enabled it, and it looks like i can suspend for longer than before. I believe, that a paused VM leaves the network ports open, while a saved VM does not. That is the difference between LAIM on/off. I still have to wait for the job to be confirmed valid,but i am certain, it will. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9 |
Indeed, with LAIM off and a suspend (at 18:37:19), I see much more info after the resume (at 19:18:41) about lost connection. 03/11/16 18:12:17 (pid:10545) Running job as user (null) 03/11/16 18:12:17 (pid:10545) Create_Process succeeded, pid=10550 03/11/16 19:19:45 (pid:10545) condor_read() failed: recv(fd=11) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>. 03/11/16 19:19:45 (pid:10545) IO: Failed to read packet header 03/11/16 19:19:45 (pid:10545) condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector lcggwms02.gridpp.rl.ac.uk:9620. 03/11/16 19:19:45 (pid:10545) IO: Failed to read packet header 03/11/16 19:19:45 (pid:10545) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9620 03/11/16 19:19:45 (pid:10545) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9620 failed; will try to reconnect in 60 seconds. 03/11/16 19:20:45 (pid:10545) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#139143 But the job runs on. Could repeat this with a suspend > 60 minutes. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I have tried (with LAIM on) 45min---perfectly OK. 1h55min--failed- claim expired. EDIT:I am going to try 1h45min LAIM ON |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9 |
Suspending 68min with LAIM off results in premature ending the job and BOINC-task: http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=117566 |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
1h45min suspend with LAIM on Failed. 03/11/16 22:15:14 (pid:7588) condor_read() failed: recv(fd=12) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <10.0.2.15:49685>. 03/11/16 22:15:14 (pid:7588) IO: Failed to read packet header 03/11/16 22:15:14 (pid:7588) AUTHENTICATE: handshake failed! 03/11/16 22:15:14 (pid:7588) DC_AUTHENTICATE: required authentication of 10.0.2.15 failed: AUTHENTICATE:1002:Failure performing handshake 03/11/16 22:15:14 (pid:7588) Starter pid 7629 exited with status 0 03/11/16 22:15:14 (pid:7588) State change: starter exited 03/11/16 22:15:14 (pid:7588) State change: No preempting claim, returning to owner 03/11/16 22:15:14 (pid:7588) Changing state and activity: Preempting/Killing -> Owner/Idle Boinc-task was ended: 1 job failed out of 1. http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=117664 It was the first job in that task. NOTE: IT did not run a single event after resume. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
The condor configuration has been updated and has been pushed to CVMFS. Should be available with new glideins (runs) this evening. We are targeting suspending up to 2 hours with LAIM on. Here are the current settings. CLAIM_WORKLIFE = -1 was ifThenElse(DynamicSlot =?= true,3600,-1) NOT_RESPONDING_TIMEOUT = 10800 CCB_HEARTBEAT_INTERVAL 0 ALIVE_INTERVAL = 1800 MAX_CLAIM_ALIVES_MISSED = 6 JobLeaseDuration = 7200 The settings are shown in the glidein-stderr file from the graphics/logs. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
CLAIM_WORKLIFE set back to 3600 |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks Laurence, CLAIM_WORKLIFE default is actually 1200sec (20min). Will see, how it goes. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9 |
The overnight suspending of a CMS-job did not work anymore. The yesterday evening's job (~40% done) run after the resume 20 minutes and was killed then with error code 2. StartLog: 12/22/16 07:29:26 condor_write(): Socket closed when trying to write 4006 bytes to collector vccondor01.cern.ch, fd is 9, errno=104 Connection reset by peer 12/22/16 07:29:26 Buf::write(): condor_write() failed 12/22/16 07:29:26 CCBListener: no activity from CCB server in 32440s; assuming connection is dead. 12/22/16 07:29:26 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds. 12/22/16 07:30:26 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#1601902 12/22/16 07:50:06 Starter pid 4134 exited with status 2 12/22/16 07:50:06 State change: starter exited From StarterLog: 12/22/16 07:30:06 (pid:4134) CCBListener: no activity from CCB server in 32480s; assuming connection is dead. 12/22/16 07:30:06 (pid:4134) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds. 12/22/16 07:30:06 (pid:4134) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.82.11:4080>. 12/22/16 07:30:06 (pid:4134) IO: Failed to read packet header 12/22/16 07:30:06 (pid:4134) Lost connection to shadow, waiting 1200 secs for reconnect 12/22/16 07:31:06 (pid:4134) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#1601927 12/22/16 07:50:06 (pid:4134) No reconnect from shadow for 1200 seconds, aborting job execution! 12/22/16 07:50:06 (pid:4134) **** condor_starter (condor_STARTER) pid 4134 EXITING WITH STATUS 2 |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
The overnight suspending of a CMS-job did not work anymore. Here is the long answer ... HTCondor is a distributed system composed of a number of services that communicate using ClassAds (modelled after classified adverts in newspapers). The Collector is the central information repository to which all other services register with a ClassAd. The Sched is essentially the job queue and job ClassAds are sent to the Collector. The Startd runs in the VM and sends a machine ClassAd to the Collector. The Negotiator periodically looks at the information in the Collector and matches jobs with slots on machines. When a match is made, a Shadow process if forked off from the Sched to handle interactions with the job slot. So for suspend/resume to work, settings need to be adjusted in a few places. For the new WMAgent workloads,the Sched is new and managed by CMS so these settings have not been adjusted. Initially we planned to add this to the dev project in January but as Rasputin seemed keen, we pushed it out before the annual shut down. The short answer is ... It should be working for CMS in LHC@home and is not working here as this is experimental. Thanks for reporting this and we will address it soon. |
©2024 CERN