Graceful Shutdown Now Implemented
Author | Message |
---|---|
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
The graceful shutdown of VMs has now been implemented. When a VM is older than 24 hours, it will shut itself down after the current run has finished, using the completion_trigger_file method. More precisely, a file is placed in a directory shared between the host and guest, which signals to the BOINC client that the task has ended. To verify that the VM was shut down gracefully, look for the message VM Completion File Detected in the stderr_txt of the task. This required a new app version to be released (v46.22) that contains the following changes to the job description:
|
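For anyone who would rather check this from a script than by eye, here is a minimal sketch. It only assumes what the post above states: that a graceful shutdown leaves the text VM Completion File Detected somewhere in the task's stderr output. The function name and the idea of passing the stderr text in as a string are illustrative assumptions, not part of BOINC or the project.

```python
# Marker text quoted in the announcement above.
COMPLETION_MARKER = "VM Completion File Detected"


def was_gracefully_shut_down(stderr_text: str) -> bool:
    """Return True if the task's stderr text contains the completion marker.

    Hypothetical helper: how you obtain the stderr text (copying it from the
    task page, reading a saved file, ...) is up to you.
    """
    return COMPLETION_MARKER in stderr_text
```

Paste the stderr output of a finished task into a string and pass it to `was_gracefully_shut_down`; a result of `False` only means the marker was not found in the text you supplied.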
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks for the info. However, it would have been nice if you had informed us BEFORE the release. Would you mind checking why no new VM is booting? I have not been able to get one to run all day. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
I have just connected to the project with my laptop and it seems fine at least for Linux. Please let me know if it is failing on the download or if there is an error when the VM is booting. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Download is fine. VM not booting. Please see this thread: http://boincai05.cern.ch/CMS-dev/forum_thread.php?id=78&postid=1751#1751 If you need more info, please let me know. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9 |
"I have just connected to the project with my laptop and it seems fine at least for Linux. Please let me know if it is failing on the download or if there is an error when the VM is booting." The VM is booting normally. Last lines after cms.cern.ch: Activating Fuse module Starting httpd: httpd: Could not reliably determine the server's fully qualified domain name, using 127.0.0.1 for ServerName OK Starting vmcontext_epilog ... bootlogd: no process killed. I think that's all normal procedure. Then the screen is blanked and stays blank forever, where normally the BOINC username, userid and hostid are displayed for authorization. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
This is just an issue with that console. The VM should be working. |
Send message Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 594 |
I've downloaded a job on a Windows box; the VM boots up but then does nothing. No CPU usage is reported by Windows, and 'top' itself is the top process in the console. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9 |
"This is just an issue with that console. The VM should be working." Yeah, the VM is working, but idling. I (we) don't get jobs to process. In the machine logs, there's only a boot.log, nothing else. With 'top' I never see a process created for the user boinc. Maybe an issue with credentials, proxy and/or authorization to break through CERN's firewall. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Maybe only cm-volunteers that also did a vlhc cms-task are "blacklisted"? Glidein is never even started. |
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
"Maybe only cm-volunteers that also did a vlhc cms-task are 'blacklisted'?" Not working for me either (on Windows; CMS is not running on Linux here at the moment), and I've not attempted any vLHC CMS yet. I have the same symptoms as CP. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9 |
"I have the same symptoms as CP." I do not know whether Laurence changed something at his end, but I finally have a job running. I retried what I had done before without success, and this time it worked: I suspended the task in BOINC, discarded the saved state and booted the VM. Once I got a job, I saved the VM and resumed the BOINC task. |
Send message Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 594 |
"I suspended the task in BOINC, discarded the saved state and booted the VM." Thanks, I just did this and it worked. I suspended the task in BOINC, discarded the saved state in VB manager, and resumed the BOINC task; it fired up and got work. |
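The posters above did this through the BOINC Manager and the VirtualBox GUI. For those who prefer the command line, the same three steps (suspend the task, discard the saved state, resume the task) might be sketched as below. This is only a sketch under assumptions: it relies on the `boinccmd --task <url> <name> suspend|resume` and `VBoxManage discardstate <vm>` commands, the project URL, task name and VM name are hypothetical placeholders, and the function only builds the command lines rather than running them.

```python
import shlex
from typing import List


def recovery_commands(project_url: str, task_name: str, vm_name: str) -> List[List[str]]:
    """Build (but do not run) the commands for the workaround described above.

    All three arguments are placeholders you must replace with your own values.
    """
    return [
        # 1. Suspend the task in BOINC.
        ["boinccmd", "--task", project_url, task_name, "suspend"],
        # 2. Discard the VM's saved state in VirtualBox.
        ["VBoxManage", "discardstate", vm_name],
        # 3. Resume the task so the VM boots fresh.
        ["boinccmd", "--task", project_url, task_name, "resume"],
    ]


# Hypothetical example values, for illustration only.
for cmd in recovery_commands("http://example.invalid/project/", "example_task_0", "boinc_example_vm"):
    print(shlex.join(cmd))
```

Printing the commands first, instead of executing them, lets you check the task and VM names before touching anything.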
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Initially, it did not work. But when I started the VM manually, the firewall asked for permission to connect to something. I confirmed. The task was still not working, and when I paused it in the VM, the task errored out in BOINC. I aborted the CMS task and started a new one, and now it works! |
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
Thanks, CP, but I'm still having trouble. On Linux, this is from glidein_stderr:
Setting X509_USER_PROXY to canonical path /tmp/x509up_u500
signature.ebsf2f.sha1: OK
signature.f8abcq.sha1: OK
signature.ebeeXx.sha1: OK
Mon Feb 1 14:24:25 GMT 2016 Failed to load file 'signature.ebeeXx.sha1' from 'http://lcggwms02.gridpp.rl.ac.uk:8319/vofrontend/stage/frontend_frontend_service-v3_2_7/group_main'.
Mon Feb 1 14:24:26 GMT 2016 Sleeping 301
Mon Feb 1 14:29:27 GMT 2016 Sleeping 285
Mon Feb 1 14:34:12 GMT 2016 Sleeping 258
Mon Feb 1 14:38:46 GMT 2016 Sleeping 261
Maybe it's a transient network thing and it might sort itself out... not hopeful... I'll try again this evening when I've more time. A second try got further, but:
condor_vars.eb89he.lst: OK
Signature OK for client_group:condor_vars.eb89he.lst.
untar.eb89he.cfg: OK
Signature OK for client_group:untar.eb89he.cfg.
Mon Feb 1 14:51:25 GMT 2016 Failed to load file 'nodes.blacklist' from 'http://lcggwms02.gridpp.rl.ac.uk:8319/vofrontend/stage/frontend_frontend_service-v3_2_7/group_main'.
Mon Feb 1 14:51:27 GMT 2016 Sleeping 337
Mon Feb 1 14:57:09 GMT 2016 Sleeping 250
I'll have to turn it off and try again later. |
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
Running OK (cmsRun using 80-90% CPU) on Windows, except that I can't see any cmsRun output: there are no cmsRun logs on the webserver and the F5 console is blank. After resetting the project and rebooting the VM to get all the files loaded OK, it is now also running OK on Linux, where both the cmsRun logs and the F5 console are OK. |
Send message Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 594 |
As the 24 hours elapsed mark ticked by, event 235 was being processed. The job continued and didn't abort early. However, a new job has been started! I'm not sure if this is because there was some downtime whilst the saved state of the VM was deleted, or because the VM wasn't doing anything until after the saved state was deleted and a new VM started. No evidence yet of a shutdown file at 40+ minutes past 24 hours elapsed. Will watch this space... |
Send message Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 594 |
Completed that job as well and have now got a TEST_HELIX job to do! |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
So, it is already running the 2nd job after the 24h deadline? |
Send message Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 594 |
Elapsed is now at 26 hours and 50 minutes, and it is still running the second job after 24 hours. I'm hopeful that it is working from the start time of the VM that was restarted yesterday morning, which is fast approaching in less than 10 minutes! If this job completes and it downloads more work (not a new 24-hour WU), then something is definitely wrong. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Within the VM we have runs, each of which processes about three jobs. The VM should terminate after the current run has completed, so it may take a few hours after the 24-hour deadline has passed. If we get past 36 hours, then something has gone wrong. |
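The timing rule described in this thread can be summarised in a small sketch. It encodes only what the posts say: a VM older than 24 hours shuts down once its current run finishes, and anything past 36 hours indicates a problem. The function, the status names and the idea of passing in the VM age and run state are illustrative assumptions, not the project's actual code.

```python
from enum import Enum

SHUTDOWN_AGE_HOURS = 24  # VM becomes eligible for shutdown after this age
OVERDUE_AGE_HOURS = 36   # past this age, something has gone wrong


class VmStatus(Enum):
    KEEP_RUNNING = "younger than 24 hours, keep taking work"
    FINISHING_RUN = "past 24 hours, waiting for the current run to finish"
    SHUT_DOWN = "past 24 hours and run finished, shut down"
    OVERDUE = "past 36 hours, something has gone wrong"


def classify_vm(age_hours: float, run_in_progress: bool) -> VmStatus:
    """Classify a VM according to the 24/36-hour rule from the thread."""
    if age_hours <= SHUTDOWN_AGE_HOURS:
        return VmStatus.KEEP_RUNNING
    if age_hours > OVERDUE_AGE_HOURS:
        return VmStatus.OVERDUE
    if run_in_progress:
        return VmStatus.FINISHING_RUN
    return VmStatus.SHUT_DOWN
```

With the figures reported above, a VM at 26 hours 50 minutes that is still processing a job falls into the "finishing run" case, which matches the expected behaviour rather than a fault.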
©2024 CERN