Message boards : News : Graceful Shutdown Now Implemented

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 334,882
RAC: 0
Message 1758 - Posted: 31 Jan 2016, 22:20:59 UTC
Last modified: 31 Jan 2016, 22:21:35 UTC

The graceful shutdown of VMs has now been implemented. Once the VM is older than 24 hours, it will shut itself down after the current run has finished, using the completion_trigger_file method. More precisely, a file is placed in a directory shared between the host and the guest, which signals to the BOINC client that the task has ended. To verify that the VM was shut down gracefully, look for the message "VM Completion File Detected" in the stderr_txt of the task. This required a new app version (v46.22) to be released, containing the following changes to the job description:

  • Set the completion trigger file to be shutdown
  • Enabled the shared directory
  • Increased the job duration to 36 hours (long enough that the BOINC client does not shut the VM down first, while still keeping a limit in place as protection)
  • Copied the init_data.xml to the shared directory (to support later improvements)
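
For readers who want to picture the mechanism, here is a minimal Python sketch of the logic described above. It is an illustration under stated assumptions, not the project's actual code: the function names, the empty file content and the temporary directory standing in for the shared directory are all hypothetical. Only the 24-hour threshold, the trigger file name "shutdown" and the stderr message come from the announcement.

```python
import os
import tempfile

MAX_VM_AGE_SECONDS = 24 * 60 * 60  # 24-hour threshold from the announcement
TRIGGER_FILE_NAME = "shutdown"     # completion trigger file name from the announcement
COMPLETION_MESSAGE = "VM Completion File Detected"


def should_shut_down(vm_age_seconds, run_in_progress):
    """Return True once the VM is older than 24 hours and no run is active."""
    return vm_age_seconds > MAX_VM_AGE_SECONDS and not run_in_progress


def write_trigger_file(shared_dir):
    """Create the trigger file in the shared directory and return its path.

    The file is left empty here; what the real file contains is not
    described in the announcement, so this is an assumption.
    """
    path = os.path.join(shared_dir, TRIGGER_FILE_NAME)
    with open(path, "w"):
        pass
    return path


def was_shut_down_gracefully(stderr_text):
    """Check a task's stderr text for the graceful-shutdown message."""
    return COMPLETION_MESSAGE in stderr_text


if __name__ == "__main__":
    # A temporary directory stands in for the host/guest shared directory.
    with tempfile.TemporaryDirectory() as shared_dir:
        if should_shut_down(25 * 3600, run_in_progress=False):
            print("trigger file written:", write_trigger_file(shared_dir))
```

The check is made between runs rather than on a timer, which is why a VM can stay up for a while past the 24-hour mark.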

ID: 1758 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1759 - Posted: 31 Jan 2016, 22:40:28 UTC - in response to Message 1758.  

Thanks for the info.
However, it would have been nice if you had informed us BEFORE the release.

Would you mind checking why no new VM is booting?
I have not been able to get one to run all day.
ID: 1759 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 334,882
RAC: 0
Message 1760 - Posted: 31 Jan 2016, 23:43:04 UTC

I have just connected to the project with my laptop and it seems fine at least for Linux. Please let me know if it is failing on the download or if there is an error when the VM is booting.
ID: 1760 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1761 - Posted: 31 Jan 2016, 23:46:13 UTC - in response to Message 1760.  

Download is fine.
VM not booting.
Please see this thread:
http://boincai05.cern.ch/CMS-dev/forum_thread.php?id=78&postid=1751#1751

If you need more info, please let me know.
ID: 1761 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 859,751
RAC: 30
Message 1762 - Posted: 1 Feb 2016, 7:57:13 UTC - in response to Message 1760.  

I have just connected to the project with my laptop and it seems fine at least for Linux. Please let me know if it is failing on the download or if there is an error when the VM is booting.

The VM is booting normally.
Last lines after cms.cern.ch: Activating Fuse module
Starting httpd: httpd: Could not reliably determine the server's fully qualified domain name, using 127.0.0.1 for ServerName OK
Starting vmcontext_epilog ...
bootlogd: no process killed

I think that's all normal procedure.

Then the screen goes blank and stays blank forever, where normally the BOINC username, userid and hostid are displayed for authorization.
ID: 1762 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 334,882
RAC: 0
Message 1763 - Posted: 1 Feb 2016, 8:38:03 UTC - in response to Message 1762.  

This is just an issue with that console. The VM should be working.
ID: 1763 · Rating: 0
PDW
Joined: 20 May 15
Posts: 217
Credit: 6,105,818
RAC: 9,104
Message 1764 - Posted: 1 Feb 2016, 9:10:19 UTC - in response to Message 1763.  

I've downloaded a job on a Windows box; the VM boots up but then does nothing.
No CPU usage is reported by Windows, and 'top' itself is the top process in the console.
ID: 1764 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 859,751
RAC: 30
Message 1765 - Posted: 1 Feb 2016, 9:15:51 UTC - in response to Message 1763.  
Last modified: 1 Feb 2016, 9:16:48 UTC

This is just an issue with that console. The VM should be working.

Yeah, the VM is working, but idling. I (we) don't get jobs to process.
In the machine logs, there's only a boot.log, nothing else.
With 'top' I never see a process created for the user boinc.

Maybe it is an issue with credentials, the proxy and/or authorization to break through CERN's firewall.
ID: 1765 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1767 - Posted: 1 Feb 2016, 10:41:30 UTC
Last modified: 1 Feb 2016, 10:41:42 UTC

Maybe only cm-volunteers that also did a vlhc cms-task are "blacklisted"?
Glidein is never even started.
ID: 1767 · Rating: 0
m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 0
Message 1768 - Posted: 1 Feb 2016, 10:52:03 UTC - in response to Message 1767.  

Maybe only cm-volunteers that also did a vlhc cms-task are "blacklisted"?
Glidein is never even started.
Not working for me either (on Windows; CMS is not running on Linux here at the moment), and I've not attempted any vLHC CMS tasks yet.
I have the same symptoms as CP.
ID: 1768 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 859,751
RAC: 30
Message 1769 - Posted: 1 Feb 2016, 11:28:28 UTC - in response to Message 1768.  

I have the same symptoms as CP.

I do not know whether Laurence changed something at his end, but I finally have a job running.
I retried what I had done before without success, and this time it worked.

I suspended the task in BOINC, discarded the saved state and booted the VM.
After I got a job this time, saved the VM and resumed the BOINC-task.
ID: 1769 · Rating: 0
PDW
Joined: 20 May 15
Posts: 217
Credit: 6,105,818
RAC: 9,104
Message 1770 - Posted: 1 Feb 2016, 11:52:08 UTC - in response to Message 1769.  

I suspended the task in BOINC, discarded the saved state and booted the VM.
After I got a job this time, saved the VM and resumed the BOINC-task.

Thanks, I just did this and it worked...

I suspended the task in BOINC, discarded the saved state in VB manager, and resumed the BOINC-task and it fired up and got work.
ID: 1770 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1771 - Posted: 1 Feb 2016, 11:55:02 UTC - in response to Message 1770.  
Last modified: 1 Feb 2016, 11:56:50 UTC

Initially, it did not work.
But when I started the VM manually, the firewall asked for permission to connect to something. I confirmed.
The task was still not working, and when I paused it in the VM, the task errored out in BOINC.
I aborted the CMS task and started a new one, and now it works!
ID: 1771 · Rating: 0
m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 0
Message 1779 - Posted: 1 Feb 2016, 14:54:36 UTC
Last modified: 1 Feb 2016, 15:03:54 UTC

Thanks, CP, but I'm still having trouble. On Linux, this is from glidein_stderr:

Setting X509_USER_PROXY to canonical path /tmp/x509up_u500
signature.ebsf2f.sha1: OK
signature.f8abcq.sha1: OK
signature.ebeeXx.sha1: OK
Mon Feb 1 14:24:25 GMT 2016 Failed to load file 'signature.ebeeXx.sha1' from 'http://lcggwms02.gridpp.rl.ac.uk:8319/vofrontend/stage/frontend_frontend_service-v3_2_7/group_main'.
Mon Feb 1 14:24:26 GMT 2016 Sleeping 301
Mon Feb 1 14:29:27 GMT 2016 Sleeping 285
Mon Feb 1 14:34:12 GMT 2016 Sleeping 258
Mon Feb 1 14:38:46 GMT 2016 Sleeping 261


Maybe it is a transient network problem that will sort itself out. I'm not hopeful, but I will try again this evening when I have more time.

A second try got further but...

condor_vars.eb89he.lst: OK
Signature OK for client_group:condor_vars.eb89he.lst.
untar.eb89he.cfg: OK
Signature OK for client_group:untar.eb89he.cfg.
Mon Feb 1 14:51:25 GMT 2016 Failed to load file 'nodes.blacklist' from 'http://lcggwms02.gridpp.rl.ac.uk:8319/vofrontend/stage/frontend_frontend_service-v3_2_7/group_main'.
Mon Feb 1 14:51:27 GMT 2016 Sleeping 337
Mon Feb 1 14:57:09 GMT 2016 Sleeping 250

I'll have to turn it off and try again later.
ID: 1779 · Rating: 0
m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 0
Message 1794 - Posted: 2 Feb 2016, 3:20:28 UTC

Running OK (cmsRun using 80-90%CPU) on Windows except that I can't see any cmsRun output.
There are no cmsRun logs on the webserver. The F5 console is blank.

After resetting the project and rebooting the VM to get all the files loaded OK... now also running OK on Linux. Both cmsRun logs and F5 console are OK.
ID: 1794 · Rating: 0
PDW
Joined: 20 May 15
Posts: 217
Credit: 6,105,818
RAC: 9,104
Message 1796 - Posted: 2 Feb 2016, 9:33:44 UTC - in response to Message 1794.  

As the elapsed time ticked past 24 hours, event 235 was being processed.
The job continued and didn't abort early.
However, a new job has been started!

I'm not sure whether this is because there was some downtime while the saved state of the VM was deleted, or because the VM wasn't doing anything until after the saved state was deleted and a new VM started.

No evidence yet of a shutdown file at 40+ minutes past 24 hours elapsed.
Will watch this space...
ID: 1796 · Rating: 0
PDW
Joined: 20 May 15
Posts: 217
Credit: 6,105,818
RAC: 9,104
Message 1797 - Posted: 2 Feb 2016, 11:00:04 UTC - in response to Message 1796.  

Completed that job as well and have now got a TEST_HELIX job to do!
ID: 1797 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1798 - Posted: 2 Feb 2016, 11:36:29 UTC - in response to Message 1797.  
Last modified: 2 Feb 2016, 11:37:18 UTC

So, it is already running the 2nd job after the 24h deadline?
ID: 1798 · Rating: 0
PDW
Joined: 20 May 15
Posts: 217
Credit: 6,105,818
RAC: 9,104
Message 1799 - Posted: 2 Feb 2016, 11:43:04 UTC - in response to Message 1798.  

Elapsed is now at 26 hours and 50 minutes, still running second job after 24 hours.

I'm hopeful that it is counting from the start time of the VM that was restarted yesterday morning; that 24-hour mark is less than 10 minutes away!

If this job completes and downloads more work (not a new 24 hour WU) then something is definitely wrong.
ID: 1799 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 334,882
RAC: 0
Message 1800 - Posted: 2 Feb 2016, 12:59:35 UTC - in response to Message 1799.  

Within the VM, work is organised into runs of about three jobs each. The VM should terminate after the current run has completed, so it may take a few hours after the 24-hour deadline has passed. If we get past 36 hours, then something has gone wrong.
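
The timing described in this post can be sketched in a few lines of Python. This is only an illustration: the function name and its status labels are made up for the example, and only the 24-hour and 36-hour figures come from this thread.

```python
SHUTDOWN_AFTER_HOURS = 24      # VM stops taking new runs once older than this
JOB_DURATION_LIMIT_HOURS = 36  # job duration limit set in the job description


def classify_vm_age(elapsed_hours):
    """Map a VM's elapsed time to the behaviour expected at that point."""
    if elapsed_hours <= SHUTDOWN_AFTER_HOURS:
        return "running"
    if elapsed_hours <= JOB_DURATION_LIMIT_HOURS:
        # Past the deadline, but the current run is allowed to finish.
        return "finishing current run"
    return "overdue"


if __name__ == "__main__":
    # 26 hours 50 minutes, the elapsed time reported earlier in the thread.
    print(classify_vm_age(26 + 50 / 60))
```

So a task still running a job at 26 hours 50 minutes falls in the expected window, and only a task past 36 hours would indicate a problem.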
ID: 1800 · Rating: 0


©2024 CERN