Thread 'Graceful Shutdown Now Implemented'

Author	Message
PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 1801 - Posted: 2 Feb 2016, 13:17:54 UTC - in response to Message 1800.
Now at 28:19 elapsed. Completed TEST_HELIX about 5 minutes ago, they seem to last 2 hours (looked at a previous one). Have now started another TEST_HELIX job. Strangely stderr.txt updated about 20 minutes before the last job ended (I thought they updated when a job completed)...
2016-02-02 12:43:53 (836): Status Report: Job Duration: '129600.000000'
2016-02-02 12:43:53 (836): Status Report: Elapsed Time: '100824.546089'
2016-02-02 12:43:53 (836): Status Report: CPU Time: '82464.983418'
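The Status Report values in the stderr.txt excerpt above are plain seconds, which are easier to compare with the elapsed time shown by BOINC once converted to hours and minutes. A minimal sketch, assuming the log lines always have the form shown in the excerpt; the helper names `parse_status_report` and `as_hms` are illustrative only and not part of any BOINC tool:

```python
import re

# Lines copied from the stderr.txt excerpt in the post above.
STDERR_EXCERPT = """\
2016-02-02 12:43:53 (836): Status Report: Job Duration: '129600.000000'
2016-02-02 12:43:53 (836): Status Report: Elapsed Time: '100824.546089'
2016-02-02 12:43:53 (836): Status Report: CPU Time: '82464.983418'
"""

# Assumed line shape: "... Status Report: <field name>: '<seconds>'"
STATUS_RE = re.compile(r"Status Report: (?P<name>[^:]+): '(?P<seconds>[0-9.]+)'")


def parse_status_report(text):
    """Return {field name: seconds} for every Status Report line in text."""
    return {m.group("name"): float(m.group("seconds"))
            for m in STATUS_RE.finditer(text)}


def as_hms(seconds):
    """Format a number of seconds as H:MM:SS, dropping the fractional part."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


report = parse_status_report(STDERR_EXCERPT)
for name, seconds in report.items():
    print(f"{name}: {as_hms(seconds)}")
```

Read this way, the reported Job Duration of 129600 seconds is 36 hours, and the Elapsed Time at the moment of the report was just over 28 hours.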

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1802 - Posted: 2 Feb 2016, 13:20:02 UTC - in response to Message 1800. Last modified: 2 Feb 2016, 13:26:11 UTC I have had runs with up to 8 valid jobs. You are saying it finishes the last RUN, not the last JOB, when going past the 24h mark?

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 1803 - Posted: 2 Feb 2016, 13:31:04 UTC - in response to Message 1802. Yes, it finishes the last RUN when going past the 24h mark.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1804 - Posted: 2 Feb 2016, 13:36:41 UTC - in response to Message 1803. Is there a way to track the "TEST_HELIX job" outcome on the dashboard?

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 1805 - Posted: 2 Feb 2016, 13:48:10 UTC - in response to Message 1804.
So if the original VM (that started approx 9:00 CET) had not been deleted then it would have stopped the WU after the last job in run-4 below...
[DIR] Parent Directory                       -
[ ]   boot.log       01-Feb-2016 12:44   11K
[ ]   cron-stderr    01-Feb-2016 12:44  9.4K
[TXT] cron-stdout    02-Feb-2016 11:53  4.0K
[DIR] run-1/         01-Feb-2016 12:46     -
[DIR] run-2/         01-Feb-2016 19:51     -
[DIR] run-3/         02-Feb-2016 02:38     -
[DIR] run-4/         02-Feb-2016 04:50     -
[DIR] run-5/         02-Feb-2016 11:54     -
As it now thinks it started at 12:44 CET yesterday I assume it will stop when run-5 has its last job complete.
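The reasoning in the post above (the task ends with the run that is in progress when the 24h mark passes) can be checked against the directory listing. This is only a sketch of the rule as described in this thread, not the project's actual shutdown code; the function name `last_run_before_deadline` is hypothetical, and the run timestamps are assumed to be in the same timezone as the VM start time:

```python
from datetime import datetime, timedelta

# Run directory timestamps taken from the listing above.
# Assumption: they share a timezone with the VM start time used below.
RUN_STARTS = {
    "run-1": datetime(2016, 2, 1, 12, 46),
    "run-2": datetime(2016, 2, 1, 19, 51),
    "run-3": datetime(2016, 2, 2, 2, 38),
    "run-4": datetime(2016, 2, 2, 4, 50),
    "run-5": datetime(2016, 2, 2, 11, 54),
}


def last_run_before_deadline(vm_start, run_starts, limit=timedelta(hours=24)):
    """Return the latest run that began no later than vm_start + limit.

    That run is the one still in progress when the limit passes, so under
    the rule described in the thread it is the last run to be completed.
    Returns None if no run started before the deadline.
    """
    deadline = vm_start + limit
    started = [(when, name) for name, when in run_starts.items()
               if when <= deadline]
    return max(started)[1] if started else None


# VM start as the wrapper now sees it (12:44 yesterday).
print(last_run_before_deadline(datetime(2016, 2, 1, 12, 44), RUN_STARTS))
# VM start of the original, deleted VM (approx 9:00).
print(last_run_before_deadline(datetime(2016, 2, 1, 9, 0), RUN_STARTS))
```

With a 12:44 start the deadline falls inside run-5, and with the original 9:00 start it falls inside run-4, which matches the two cases described in the post.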

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 1806 - Posted: 2 Feb 2016, 13:55:21 UTC - in response to Message 1805. Yes. That is correct. (if it works)

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 1807 - Posted: 2 Feb 2016, 14:00:16 UTC - in response to Message 1804. I think that you can see the outcome in the CMS dashboard but I am not familiar with it. [url]http://dashb-cms-job-task.cern.ch/dashboard/request.py/taskmonitoring#action=taskJobs&usergridname=undefined&taskmonid=wmagent_riahi_TEST_HELIX_0911-T3_CH_VolunteerBackfill_160125_144505_9930&what=all[/url] These jobs may fail as we are testing the integration with the WMAgent, which is the main tool that submits the simulation jobs. Hopefully using this tool will free up Ivan from keeping us supplied with jobs so he can spend more time on the message boards :)

rbpeake Joined: 15 Apr 15 Posts: 48 Credit: 1,246,583 RAC: 5,455	Message 1812 - Posted: 2 Feb 2016, 16:21:45 UTC F5 is not working, but Windows task manager shows the CPU is working. Is this a glitch?

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 1813 - Posted: 2 Feb 2016, 16:30:32 UTC - in response to Message 1812. [quote]F5 is not working, but Windows task manager shows the CPU is working. Is this a glitch?[/quote] Can you look at the logs in boincmgr with the (misleading) "Show graphics" button?

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 1814 - Posted: 2 Feb 2016, 16:34:10 UTC - in response to Message 1813. F5 probably doesn't work because you will be running a TEST_HELIX job. Mine still shows the last 'Ivan' job from this morning, nothing but HELIX since then...

rbpeake Joined: 15 Apr 15 Posts: 48 Credit: 1,246,583 RAC: 5,455	Message 1815 - Posted: 2 Feb 2016, 16:44:40 UTC - in response to Message 1813.
[quote]Can you look at the logs in boincmgr with the (misleading) "Show graphics" button?[/quote]
Here is an excerpt from the boot log:
Tue Feb 2 09:22:26 2016: grid.cern.ch: Restoring chunk tables... done
Tue Feb 2 09:22:26 2016: grid.cern.ch: Restoring inode generation... done
Tue Feb 2 09:22:26 2016: grid.cern.ch: Restoring open files counter... done
Tue Feb 2 09:22:26 2016: grid.cern.ch: Releasing saved glue buffer
Tue Feb 2 09:22:26 2016: grid.cern.ch: Releasing chunk tables
Tue Feb 2 09:22:26 2016: grid.cern.ch: Releasing saved inode generation info
Tue Feb 2 09:22:26 2016: grid.cern.ch: Releasing open files counter
Tue Feb 2 09:22:26 2016: grid.cern.ch: Activating Fuse module
Nothing in the cron-stderr log. From the cron-stdout log:
type : RFC 3820 compliant impersonation proxy
strength : 1024 bits
path : /tmp/x509up_u500
timeleft : 129:59:57 (5.4 days)
09:23:06 -0500 2016-02-02 [INFO] Downloading glidein
09:23:11 -0500 2016-02-02 [INFO] Running glidein (check logs)

rbpeake Joined: 15 Apr 15 Posts: 48 Credit: 1,246,583 RAC: 5,455	Message 1820 - Posted: 2 Feb 2016, 18:35:14 UTC F5 works again with the second run. Definitely a glitch! 😉

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 1826 - Posted: 3 Feb 2016, 10:20:44 UTC - in response to Message 1806. [quote]Yes. That is correct. (if it works)[/quote] Sorry, had to go out last night before it completed. It eventually completed at 5:30pm (UTC) with an elapsed time of 32:39.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 1827 - Posted: 3 Feb 2016, 11:03:49 UTC - in response to Message 1820. [quote]F5 works again with the second run. Definitely a glitch! 😉[/quote] I've got one doing the same thing. From the Condor log it's one of Hassen's WMAgent jobs. There are no cmsRun logs. These jobs are having a problem starting up, unable to contact the Frontier server that holds the "conditions database" with the parameters for the detector at the era being simulated, so the simulation crashes. I don't know why there is such a delay in reporting the crash.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,047,486 RAC: 56	Message 1828 - Posted: 3 Feb 2016, 12:18:08 UTC - in response to Message 1827.
[quote]I've got one doing the same thing. From the Condor log it's one of Hassen's WMAgent jobs. There are no cmsRun logs. These jobs are having a problem starting up, unable to contact the Frontier server that holds the "conditions database" with the parameters for the detector at the era being simulated, so the simulation crashes. I don't know why there is such a delay in reporting the crash.[/quote]
I'm only getting Hassen Riahi's riahi_TEST_HELIX_0911-T3_CH_VolunteerBackfill tasks, but they all end successfully. The success is in _condor_stdout. Last part of 1 job:
Copying 96833 bytes file:///home/boinc/CMSRun/glide_lwaQNe/execute/dir_15927/job/WMTaskSpace/logArch1/logArchive.tar.gz => https://data-bridge-test.cern.ch/myfed/cms-boinc/output/unmerged/logs/prod/2016/2/3/riahi_TEST_HELIX_0911-T3_CH_VolunteerBackfill_160125_144505_9930/Production/0000/1/9e1b9ff6-c5f8-11e5-a1da-001dd8b71c94-4-1-logArchive.tar.gz
gfal-copy exit status: 0
Command exited with status: 0
===> Stage Out Successful: {'SEName': 'data-bridge-test.cern.ch', 'LFN': '/store/unmerged/logs/prod/2016/2/3/riahi_TEST_HELIX_0911-T3_CH_VolunteerBackfill_160125_144505_9930/Production/0000/1/9e1b9ff6-c5f8-11e5-a1da-001dd8b71c94-4-1-logArchive.tar.gz', 'GUID': None, 'StageOutCommand': 'gfal2', 'PFN': 'https://data-bridge-test.cern.ch/myfed/cms-boinc/output/unmerged/logs/prod/2016/2/3/riahi_TEST_HELIX_0911-T3_CH_VolunteerBackfill_160125_144505_9930/Production/0000/1/9e1b9ff6-c5f8-11e5-a1da-001dd8b71c94-4-1-logArchive.tar.gz'}
Startup.py : 2016-02-03T11:33:08 : completing task
Startup.py : 2016-02-03T11:33:08 : shutting down monitor
WMAgent bootstrap : Wed Feb 3 10:33:08 UTC 2016 : WMAgent finished the job, is copying the pickled report
total 44
-rw-r--r-- 1 boinc cms 20690 Feb 3 11:33 Report.1.pkl
-rw-r--r-- 1 boinc cms 382 Feb 3 10:09 __init__.py
-rw-r--r-- 1 boinc cms 618 Feb 3 10:09 __init__.pyc
drwxr-xr-x 3 boinc cms 4096 Feb 3 11:21 cmsRun1
drwxr-xr-x 2 boinc cms 4096 Feb 3 11:33 logArch1
drwxr-xr-x 2 boinc cms 4096 Feb 3 11:33 stageOut1
-rw-r--r-- 1 boinc cms 20690 Feb 3 11:33 WMTaskSpace/Report.1.pkl
-rw-r--r-- 1 boinc cms 382 Feb 3 10:09 WMTaskSpace/__init__.py
-rw-r--r-- 1 boinc cms 618 Feb 3 10:09 WMTaskSpace/__init__.pyc
WMTaskSpace/cmsRun1:
total 62824
drwxr-xr-x 15 boinc cms 4096 Feb 3 10:09 CMSSW_4_1_8_patch14
-rw-r--r-- 1 boinc cms 22293 Feb 3 11:21 FrameworkJobReport.xml
-rw-r--r-- 1 boinc cms 393174 Feb 3 10:09 PSet.pkl
-rw-r--r-- 1 boinc cms 132 Feb 3 10:09 PSet.py
-rw-r--r-- 1 boinc cms 8431 Feb 3 10:09 PSet.pyc
-rw-r--r-- 1 boinc cms 63818875 Feb 3 11:21 RAWSIMoutput.root
-rw-r--r-- 1 boinc cms 7132 Feb 3 11:33 Report.pkl
-rw-r--r-- 1 boinc cms 341 Feb 3 10:09 __init__.py
-rw-r--r-- 1 boinc cms 605 Feb 3 10:09 __init__.pyc
-rw-r--r-- 1 boinc cms 1798 Feb 3 10:09 cmsRun1-main.sh
-rw-r--r-- 1 boinc cms 276 Feb 3 10:09 cmsRun1-stderr.log
-rw-r--r-- 1 boinc cms 31429 Feb 3 11:21 cmsRun1-stdout.log
-rw-r--r-- 1 boinc cms 6 Feb 3 10:09 process.id
-rw-r--r-- 1 boinc cms 88 Feb 3 10:09 scramOutput.log
WMTaskSpace/logArch1:
total 112
-rw-r--r-- 1 boinc cms 4523 Feb 3 11:33 Report.pkl
-rw-r--r-- 1 boinc cms 342 Feb 3 10:09 __init__.py
-rw-r--r-- 1 boinc cms 608 Feb 3 11:33 __init__.pyc
-rw-r--r-- 1 boinc cms 96833 Feb 3 11:33 logArchive.tar.gz
WMTaskSpace/stageOut1:
total 20
-rw-r--r-- 1 boinc cms 8697 Feb 3 11:33 Report.pkl
-rw-r--r-- 1 boinc cms 343 Feb 3 10:09 __init__.py
-rw-r--r-- 1 boinc cms 611 Feb 3 11:21 __init__.pyc
WMAgent bootstrap : Wed Feb 3 10:33:08 UTC 2016 : WMAgent is finished. The job had an exit code of 0

m Volunteer tester Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 1829 - Posted: 3 Feb 2016, 13:32:40 UTC - in response to Message 1827. [quote][quote]F5 works again with the second run. Definitely a glitch! 😉[/quote] I've got one doing the same thing. From the Condor log it's one of Hassen's WMAgent jobs. There are no cmsRun logs. These jobs are having a problem starting up,[/quote] I've had this on "non-helix" jobs. On both Linux and Windows. BT's network problems can't have helped. [quote]unable to contact the Frontier server that holds the "conditions database" with the parameters for the detector at the era being simulated, so the simulation crashes. I don't know why there is such a delay in reporting the crash.[/quote] Maybe CERN IT are up to something. I was in the habit of checking the reports from this to see if there were uplink problems but I've now been shut out...

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 1830 - Posted: 3 Feb 2016, 13:45:53 UTC - in response to Message 1828. [quote]I'm only getting Hassen Riahi's riahi_TEST_HELIX_0911-T3_CH_VolunteerBackfill tasks, but they all end successfully. The success is in _condor_stdout. Last part of 1 job: [...] WMAgent bootstrap : Wed Feb 3 10:33:08 UTC 2016 : WMAgent is finished. The job had an exit code of 0[/quote] That looks like copying the tarball of the logs succeeded; a bit further up you'll see attempts to copy a root file which, as far as I can see on my machine, don't succeed. So the problem is (mostly) something in the implementation of the F5 console, not that the job isn't running.

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 1831 - Posted: 3 Feb 2016, 14:01:09 UTC - in response to Message 1830. The F5 console is probably not working with the other jobs as the log file is different. Once I know which one it is I can make it work again.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,047,486 RAC: 56	Message 1832 - Posted: 3 Feb 2016, 14:18:21 UTC - in response to Message 1830. [quote]That looks like copying the tarball of the logs succeeded; a bit further up you'll see attempts to copy a root file which as far as I can see on my machine don't succeed.[/quote] You're right! It seems it failed after 3 attempts to copy the 61MB results root-file with exit: 151 - Error during attempted file stageout. Can't Hassen remove his batch until this is solved? We have tested enough on his batch now and don't want to waste more time, since the outcome is obviously a failure. Now that a new CMS application has started with Run 3, I get your jobs again.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 1833 - Posted: 3 Feb 2016, 15:16:24 UTC - in response to Message 1831. [quote]The F5 console is probably not working with the other jobs as the log file is different. Once I know which one it is I can make it work again.[/quote] I found my current one at /home/boinc/CMSRun/glide_hshdGp/execute/dir_23289/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log This seems to overwrite any previous job's output; this is the fourth job in this run.
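Because the glide_ and dir_ parts of that path differ between the two paths quoted in this thread, a console script would have to search for the log rather than hard-code it. A minimal sketch, assuming the layout stays as in those two paths; the wildcard positions and the function name `newest_cmsrun_log` are assumptions, and this is not how the F5 console is actually implemented:

```python
import glob
import os

# Pattern generalised from the two paths quoted in this thread.
# Assumption: only the glide_* and dir_* components vary between jobs.
LOG_PATTERN = "glide_*/execute/dir_*/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log"


def newest_cmsrun_log(base_dir="/home/boinc/CMSRun"):
    """Return the most recently modified cmsRun1-stdout.log, or None."""
    candidates = glob.glob(os.path.join(base_dir, LOG_PATTERN))
    if not candidates:
        return None
    # The log is overwritten by each job, so the newest modification time
    # points at the job that is currently running (or ran last).
    return max(candidates, key=os.path.getmtime)


print(newest_cmsrun_log())
```

On a machine without a running CMS VM this simply prints None; inside the VM it would print the path of whichever log was written most recently.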
