Thread 'Graceful Shutdown Now Implemented'

Author	Message
Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 1758 - Posted: 31 Jan 2016, 22:20:59 UTC Last modified: 31 Jan 2016, 22:21:35 UTC
The graceful shutdown of VMs has now been implemented. Once a VM is older than 24 hours, it shuts itself down after the current run has finished, using the completion_trigger_file method. More precisely, a file is placed in a directory shared between the host and the guest, which signals to the BOINC client that the task has ended. To verify that the VM was shut down gracefully, look for the message VM Completion File Detected in the stderr_txt of the task.
This required a new app version (v46.22) to be released, with the following changes to the job description:
- Set the completion trigger file to be shutdown
- Enabled the shared directory
- Increased the job duration to 36 hours (so that the BOINC client does not shut the VM down first, while still keeping a limit as protection)
- Copied the init_data.xml to the shared directory (to support later improvements)
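
The mechanism described above can be illustrated with a minimal sketch. This is not the project's actual guest script: the function names, the `now` parameter and the shared directory path passed in are assumptions made for illustration. Only the 24-hour threshold and the trigger file name `shutdown` come from the post, and since the post does not say what the file should contain, the sketch leaves it empty.

```python
import time
from pathlib import Path
from typing import Optional

MAX_VM_AGE_SECONDS = 24 * 60 * 60  # 24-hour threshold quoted in the post
TRIGGER_FILE_NAME = "shutdown"     # completion trigger file name quoted in the post


def should_shut_down(vm_start: float, now: float) -> bool:
    """Return True once the VM is older than 24 hours."""
    return now - vm_start > MAX_VM_AGE_SECONDS


def signal_completion(shared_dir: Path) -> Path:
    """Create an empty trigger file in the shared directory (hypothetical layout)."""
    shared_dir.mkdir(parents=True, exist_ok=True)
    trigger = shared_dir / TRIGGER_FILE_NAME
    trigger.touch()
    return trigger


def after_run_finished(vm_start: float, shared_dir: Path,
                       now: Optional[float] = None) -> bool:
    """Call when a run ends; write the trigger file only if the VM is old enough.

    Returns True if the trigger file was written, False if the VM should
    carry on and start another run.
    """
    current = time.time() if now is None else now
    if should_shut_down(vm_start, current):
        signal_completion(shared_dir)
        return True
    return False
```

The age check happens only at the end of a run, which is why a VM is expected to keep working for a while after the 24-hour mark rather than stop part-way through a job.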

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1759 - Posted: 31 Jan 2016, 22:40:28 UTC - in response to Message 1758.
Thanks for the info. However, it would have been nice if you had informed us BEFORE the release. Would you mind checking why no new VM is booting? I have not been able to get one to run all day.

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 1760 - Posted: 31 Jan 2016, 23:43:04 UTC
I have just connected to the project with my laptop and it seems fine, at least for Linux. Please let me know if it is failing on the download or if there is an error when the VM is booting.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1761 - Posted: 31 Jan 2016, 23:46:13 UTC - in response to Message 1760.
Download is fine. VM not booting. Please see this thread:
http://boincai05.cern.ch/CMS-dev/forum_thread.php?id=78&postid=1751#1751
If you need more info, please let me know.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,047,486 RAC: 56	Message 1762 - Posted: 1 Feb 2016, 7:57:13 UTC - in response to Message 1760.
> I have just connected to the project with my laptop and it seems fine at least for Linux. Please let me know if it is failing on the download or if there is an error when the VM is booting.
The VM is booting normally. Last lines after cms.cern.ch:
  Activating Fuse module
  Starting httpd: httpd: Could not reliably determine the server's fully qualified domain name, using 127.0.0.1 for ServerName OK
  Starting vmcontext_epilog ...
  bootlogd: no process killed
I think that's all normal procedure. Then the screen is blanked and stays blank forever, where normally the BOINC username, userid and hostid are displayed for authorization.

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 1763 - Posted: 1 Feb 2016, 8:38:03 UTC - in response to Message 1762.
This is just an issue with that console. The VM should be working.

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 1764 - Posted: 1 Feb 2016, 9:10:19 UTC - in response to Message 1763.
I've downloaded a job on a Windows box; the VM boots up but then does nothing. No CPU usage is reported by Windows, and 'top' is top in the console.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,047,486 RAC: 56	Message 1765 - Posted: 1 Feb 2016, 9:15:51 UTC - in response to Message 1763. Last modified: 1 Feb 2016, 9:16:48 UTC
> This is just an issue with that console. The VM should be working.
Yeah, the VM is working, but idling. I (we) don't get jobs to process. In the machine logs, there's only a boot.log, nothing else. With 'top' I never see a process created for the user boinc. Maybe an issue with credentials, proxy and/or authorization to break through CERN's firewall.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1767 - Posted: 1 Feb 2016, 10:41:30 UTC Last modified: 1 Feb 2016, 10:41:42 UTC
Maybe only cm-volunteers that also did a vlhc cms-task are "blacklisted"? Glidein is never even started.

m Volunteer tester Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 1768 - Posted: 1 Feb 2016, 10:52:03 UTC - in response to Message 1767.
> Maybe only cm-volunteers that also did a vlhc cms-task are "blacklisted"? Glidein is never even started.
Not working for me either (on Windows; CMS is not running on Linux here at the moment), and I've not attempted any vLHC CMS yet. I have the same symptoms as CP.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,047,486 RAC: 56	Message 1769 - Posted: 1 Feb 2016, 11:28:28 UTC - in response to Message 1768.
> I have the same symptoms as CP.
I do not know whether Laurence changed something at his end, but I finally have a job running. I retried what I had done before without success, and this time it worked: I suspended the task in BOINC, discarded the saved state and booted the VM. Once I got a job, I saved the VM and resumed the BOINC task.

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 1770 - Posted: 1 Feb 2016, 11:52:08 UTC - in response to Message 1769.
> I suspended the task in BOINC, discarded the saved state and booted the VM. After I got a job this time, saved the VM and resumed the BOINC-task.
Thanks, I just did this and it worked... I suspended the task in BOINC, discarded the saved state in VB manager, and resumed the BOINC-task and it fired up and got work.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1771 - Posted: 1 Feb 2016, 11:55:02 UTC - in response to Message 1770. Last modified: 1 Feb 2016, 11:56:50 UTC
Initially, it did not work. But when I started the VM manually, the firewall asked for permission to connect to something. I confirmed. The task was still not working, and when I paused it in the VM, the task errored out in BOINC. I aborted the CMS task and started a new one, and now it works!

m Volunteer tester Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 1779 - Posted: 1 Feb 2016, 14:54:36 UTC Last modified: 1 Feb 2016, 15:03:54 UTC
Thanks, CP, but still having trouble. On Linux, this is from glidein_stderr:
  Setting X509_USER_PROXY to canonical path /tmp/x509up_u500
  signature.ebsf2f.sha1: OK
  signature.f8abcq.sha1: OK
  signature.ebeeXx.sha1: OK
  Mon Feb 1 14:24:25 GMT 2016 Failed to load file 'signature.ebeeXx.sha1' from 'http://lcggwms02.gridpp.rl.ac.uk:8319/vofrontend/stage/frontend_frontend_service-v3_2_7/group_main'.
  Mon Feb 1 14:24:26 GMT 2016 Sleeping 301
  Mon Feb 1 14:29:27 GMT 2016 Sleeping 285
  Mon Feb 1 14:34:12 GMT 2016 Sleeping 258
  Mon Feb 1 14:38:46 GMT 2016 Sleeping 261
Maybe a transient network thing and it might sort itself out.... not hopeful... try again this evening when I've more time. A second try got further but...
  condor_vars.eb89he.lst: OK
  Signature OK for client_group:condor_vars.eb89he.lst.
  untar.eb89he.cfg: OK
  Signature OK for client_group:untar.eb89he.cfg.
  Mon Feb 1 14:51:25 GMT 2016 Failed to load file 'nodes.blacklist' from 'http://lcggwms02.gridpp.rl.ac.uk:8319/vofrontend/stage/frontend_frontend_service-v3_2_7/group_main'.
  Mon Feb 1 14:51:27 GMT 2016 Sleeping 337
  Mon Feb 1 14:57:09 GMT 2016 Sleeping 250
I'll have to turn it off and try again later.

m Volunteer tester Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 1794 - Posted: 2 Feb 2016, 3:20:28 UTC
Running OK (cmsRun using 80-90% CPU) on Windows except that I can't see any cmsRun output. There are no cmsRun logs on the webserver. The F5 console is blank.
After resetting the project and rebooting the VM to get all the files loaded OK... now also running OK on Linux. Both cmsRun logs and F5 console are OK.

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 1796 - Posted: 2 Feb 2016, 9:33:44 UTC - in response to Message 1794.
As 24 hours elapsed ticked by, event 235 was being processed. The job continued and didn't abort early. However, a new job has been started! I'm not sure whether this is because there was some downtime while the saved state of the VM was deleted, or because the VM wasn't doing anything until after the saved state was deleted and a new VM started. No evidence yet of a shutdown file at 40+ minutes past 24 hours elapsed. Will watch this space...
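
One way to look for evidence of a graceful shutdown, as a minimal sketch: the announcement says the message VM Completion File Detected should appear in the task's stderr_txt, so a plain substring search over that text is enough. The function name is made up for illustration, and how the stderr text is obtained (copied from the task page, for example) is left to the reader.

```python
# Marker text quoted in the announcement post (Message 1758).
COMPLETION_MARKER = "VM Completion File Detected"


def was_gracefully_shut_down(stderr_text: str) -> bool:
    """Return True if the task's stderr output contains the completion marker."""
    return COMPLETION_MARKER in stderr_text


if __name__ == "__main__":
    # Invented sample text, only to exercise the check; not a real task log.
    sample = "wrapper output line\nVM Completion File Detected.\n"
    print(was_gracefully_shut_down(sample))
```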

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 1797 - Posted: 2 Feb 2016, 11:00:04 UTC - in response to Message 1796.
Completed that job as well and have now got a TEST_HELIX job to do!

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1798 - Posted: 2 Feb 2016, 11:36:29 UTC - in response to Message 1797. Last modified: 2 Feb 2016, 11:37:18 UTC
So, it is already running the 2nd job after the 24h deadline?

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 1799 - Posted: 2 Feb 2016, 11:43:04 UTC - in response to Message 1798.
Elapsed is now at 26 hours and 50 minutes, still running the second job after 24 hours. Hopeful that it is working from the start time of the restarted VM from yesterday morning, which is fast approaching in less than 10 minutes! If this job completes and downloads more work (not a new 24 hour WU) then something is definitely wrong.

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 1800 - Posted: 2 Feb 2016, 12:59:35 UTC - in response to Message 1799.
Within the VM we have runs, each of which runs about 3 or so jobs. The VM should terminate after the run has completed, so it may take a few hours after the 24-hour deadline has passed. If we get past 36 hours then something has gone wrong.
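
The timeline in this reply can be summarised as a small sketch for anyone watching a task's elapsed time. The function name and the status labels are invented for illustration; only the 24-hour and 36-hour figures come from the thread.

```python
SHUTDOWN_AFTER_HOURS = 24.0  # VM age after which it stops once the current run ends
HARD_LIMIT_HOURS = 36.0      # job duration limit; beyond this something has gone wrong


def vm_status(elapsed_hours: float) -> str:
    """Classify a still-running VM by its elapsed time (labels are hypothetical)."""
    if elapsed_hours <= SHUTDOWN_AFTER_HOURS:
        return "normal"                 # VM keeps fetching new runs
    if elapsed_hours <= HARD_LIMIT_HOURS:
        return "finishing-current-run"  # should shut down when the run completes
    return "overdue"                    # past 36 hours: worth reporting


if __name__ == "__main__":
    # PDW's report above: 26 hours and 50 minutes elapsed.
    print(vm_status(26 + 50 / 60))
```

Under these thresholds, a task at 26 hours 50 minutes is still inside the expected window, since a run of about 3 jobs may take a few hours to finish after the 24-hour mark.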

Development for LHC@home