Graceful Shutdown Now Implemented
Author | Message |
---|---|
Send message Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 82 |
Now at 28:19 elapsed. Completed TEST_HELIX about 5 minutes ago, they seem to last 2 hours (looked at a previous one). Have now started another TEST_HELIX job. Strangely stderr.txt updated about 20 minutes before the last job ended, (I thought they updated when a job completed)... 2016-02-02 12:43:53 (836): Status Report: Job Duration: '129600.000000' 2016-02-02 12:43:53 (836): Status Report: Elapsed Time: '100824.546089' 2016-02-02 12:43:53 (836): Status Report: CPU Time: '82464.983418' |
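For anyone wanting to read those Status Report values as hours and minutes, here is a minimal sketch. It assumes the three stderr.txt lines quoted above are representative of the format and that the quoted numbers are seconds; the regular expression and the helper names (`parse_status`, `hhmm`) are made up for illustration and are not part of the wrapper.

```python
import re

# The three lines quoted from stderr.txt in the post above.
LOG = """\
2016-02-02 12:43:53 (836): Status Report: Job Duration: '129600.000000'
2016-02-02 12:43:53 (836): Status Report: Elapsed Time: '100824.546089'
2016-02-02 12:43:53 (836): Status Report: CPU Time: '82464.983418'
"""

# Assumed line format: "Status Report: <name>: '<seconds>'"
PATTERN = re.compile(r"Status Report: (?P<key>[^:]+): '(?P<secs>[0-9.]+)'")


def parse_status(text):
    """Map each Status Report field name to its value in seconds."""
    return {m["key"]: float(m["secs"]) for m in PATTERN.finditer(text)}


def hhmm(seconds):
    """Format a number of seconds as H:MM, dropping leftover seconds."""
    minutes = int(seconds // 60)
    return f"{minutes // 60}:{minutes % 60:02d}"


status = parse_status(LOG)
for key, secs in status.items():
    print(f"{key}: {hhmm(secs)}")
```

Read this way, the logged elapsed time is 28:00, which would fit stderr.txt having been written roughly 20 minutes before the 28:19 mentioned above, and the job duration works out to 36 hours.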
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I have had runs with up to 8 valid jobs. You are saying, it finishes the last RUN not the last JOB, when going past the 24h mark? |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Yes, it finishes the last RUN when going past the 24h mark. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Is there a way to track the "TEST_HELIX job" outcome on dashboard? |
Send message Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 82 |
So if the original VM (that started approx 9:00 CET) had not been deleted then it would have stopped the WU after the last job in run-4 below...
[DIR] Parent Directory -
[ ] boot.log 01-Feb-2016 12:44 11K
[ ] cron-stderr 01-Feb-2016 12:44 9.4K
[TXT] cron-stdout 02-Feb-2016 11:53 4.0K
[DIR] run-1/ 01-Feb-2016 12:46 -
[DIR] run-2/ 01-Feb-2016 19:51 -
[DIR] run-3/ 02-Feb-2016 02:38 -
[DIR] run-4/ 02-Feb-2016 04:50 -
[DIR] run-5/ 02-Feb-2016 11:54 -
As it now thinks it started at 12:44 CET yesterday I assume it will stop when run-5 has its last job complete. |
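The reasoning in this post can be sketched in a few lines. This is only an illustration of the rule as described in the thread (the run in progress at the 24h mark is the last one), not the project's actual shutdown code. It assumes the directory timestamps in the listing are the run start times; the function name `run_at_deadline` is hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

# Directory timestamps from the listing, treated here as run start times
# (an assumption; they could also be last-modified times).
RUN_STARTS = {
    "run-1": datetime(2016, 2, 1, 12, 46),
    "run-2": datetime(2016, 2, 1, 19, 51),
    "run-3": datetime(2016, 2, 2, 2, 38),
    "run-4": datetime(2016, 2, 2, 4, 50),
    "run-5": datetime(2016, 2, 2, 11, 54),
}


def run_at_deadline(vm_start, run_starts, limit=timedelta(hours=24)):
    """Return the run that is in progress when vm_start + limit is reached.

    Returns None if no run had started by then.
    """
    deadline = vm_start + limit
    ordered = sorted(run_starts.items(), key=lambda item: item[1])
    times = [start for _, start in ordered]
    # Index of the last run that started at or before the deadline.
    idx = bisect_right(times, deadline) - 1
    return ordered[idx][0] if idx >= 0 else None


# Original VM, started at about 09:00 on 1 Feb.
print(run_at_deadline(datetime(2016, 2, 1, 9, 0), RUN_STARTS))
# Restarted VM, which thinks it started at 12:44 on 1 Feb.
print(run_at_deadline(datetime(2016, 2, 1, 12, 44), RUN_STARTS))
```

With these timestamps the first call picks run-4 and the second picks run-5, matching the two cases described above. The sketch cannot tell whether a further run would have started before the mark if the listing had been taken later.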
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Yes. That is correct. (if it works) |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
I think that you can see the outcome in the CMS dashboard but I am not familiar with it. http://dashb-cms-job-task.cern.ch/dashboard/request.py/taskmonitoring#action=taskJobs&usergridname=undefined&taskmonid=wmagent_riahi_TEST_HELIX_0911-T3_CH_VolunteerBackfill_160125_144505_9930&what=all These jobs may fail as we are testing the integration with the WMAgent which is the main tool that submits the simulation jobs. Hopefully using this tool will free up Ivan from keeping us supplied with jobs so he can spend more time on the message boards :) |
Send message Joined: 15 Apr 15 Posts: 38 Credit: 227,251 RAC: 0 |
F5 is not working, but Windows task manager shows the CPU is working. Is this a glitch? |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 6 |
F5 is not working, but Windows task manager shows the CPU is working. Is this a glitch? Can you look at the logs in boincmgr with the (misleading) "Show graphics" button? |
Send message Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 82 |
F5 probably doesn't work because you will be running a TEST_HELIX job. Mine still shows the last 'Ivan' job from this morning, nothing but HELIX since then... |
Send message Joined: 15 Apr 15 Posts: 38 Credit: 227,251 RAC: 0 |
Can you look at the logs in boincmgr with the (misleading) "Show graphics" button? Here is an excerpt from the boot log:
Tue Feb 2 09:22:26 2016: grid.cern.ch: Restoring inode generation... done
Tue Feb 2 09:22:26 2016: grid.cern.ch: Restoring open files counter... done
Tue Feb 2 09:22:26 2016: grid.cern.ch: Releasing saved glue buffer
Tue Feb 2 09:22:26 2016: grid.cern.ch: Releasing chunk tables
Tue Feb 2 09:22:26 2016: grid.cern.ch: Releasing saved inode generation info
Tue Feb 2 09:22:26 2016: grid.cern.ch: Releasing open files counter
Tue Feb 2 09:22:26 2016: grid.cern.ch: Activating Fuse module
strength : 1024 bits
path : /tmp/x509up_u500
timeleft : 129:59:57 (5.4 days)
09:23:06 -0500 2016-02-02 [INFO] Downloading glidein
09:23:11 -0500 2016-02-02 [INFO] Running glidein (check logs) |
Send message Joined: 15 Apr 15 Posts: 38 Credit: 227,251 RAC: 0 |
F5 works again with the second run. Definitely a glitch! 😉 |
Send message Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 82 |
Yes. That is correct. (if it works) Sorry, had to go out last night before it completed. It eventually completed at 5:30pm (UTC) with an elapsed time of 32:39. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 6 |
F5 works again with the second run. Definitely a glitch! 😉 I've got one doing the same thing. From the Condor log it's one of Hassen's WMAgent jobs. There are no cmsRun logs. These jobs are having a problem starting up, unable to contact the Frontier server that holds the "conditions database" with the parameters for the detector at the era being simulated, so the simulation crashes. I don't know why there is such a delay in reporting the crash. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 875,820 RAC: 738 |
I've got one doing the same thing. From the Condor log it's one of Hassen's WMAgent jobs. There are no cmsRun logs. These jobs are having a problem starting up, unable to contact the Frontier server that holds the "conditions database" with the parameters for the detector at the era being simulated, so the simulation crashes. I don't know why there is such a delay in reporting the crash.
I'm only getting Hassen Riahi's riahi_TEST_HELIX_0911-T3_CH_VolunteerBackfill tasks, but they all end successfully. The success is in _condor_stdout. Last part of one job:
Copying 96833 bytes file:///home/boinc/CMSRun/glide_lwaQNe/execute/dir_15927/job/WMTaskSpace/logArch1/logArchive.tar.gz => https://data-bridge-test.cern.ch/myfed/cms-boinc/output/unmerged/logs/prod/2016/2/3/riahi_TEST_HELIX_0911-T3_CH_VolunteerBackfill_160125_144505_9930/Production/0000/1/9e1b9ff6-c5f8-11e5-a1da-001dd8b71c94-4-1-logArchive.tar.gz
gfal-copy exit status: 0
Command exited with status: 0
===> Stage Out Successful: {'SEName': 'data-bridge-test.cern.ch', 'LFN': '/store/unmerged/logs/prod/2016/2/3/riahi_TEST_HELIX_0911-T3_CH_VolunteerBackfill_160125_144505_9930/Production/0000/1/9e1b9ff6-c5f8-11e5-a1da-001dd8b71c94-4-1-logArchive.tar.gz', 'GUID': None, 'StageOutCommand': 'gfal2', 'PFN': 'https://data-bridge-test.cern.ch/myfed/cms-boinc/output/unmerged/logs/prod/2016/2/3/riahi_TEST_HELIX_0911-T3_CH_VolunteerBackfill_160125_144505_9930/Production/0000/1/9e1b9ff6-c5f8-11e5-a1da-001dd8b71c94-4-1-logArchive.tar.gz'}
Startup.py : 2016-02-03T11:33:08 : completing task
Startup.py : 2016-02-03T11:33:08 : shutting down monitor
WMAgent bootstrap : Wed Feb 3 10:33:08 UTC 2016 : WMAgent finished the job, is copying the pickled report
total 44
-rw-r--r-- 1 boinc cms 20690 Feb 3 11:33 Report.1.pkl
-rw-r--r-- 1 boinc cms 382 Feb 3 10:09 __init__.py
-rw-r--r-- 1 boinc cms 618 Feb 3 10:09 __init__.pyc
drwxr-xr-x 3 boinc cms 4096 Feb 3 11:21 cmsRun1
drwxr-xr-x 2 boinc cms 4096 Feb 3 11:33 logArch1
drwxr-xr-x 2 boinc cms 4096 Feb 3 11:33 stageOut1
-rw-r--r-- 1 boinc cms 20690 Feb 3 11:33 WMTaskSpace/Report.1.pkl
-rw-r--r-- 1 boinc cms 382 Feb 3 10:09 WMTaskSpace/__init__.py
-rw-r--r-- 1 boinc cms 618 Feb 3 10:09 WMTaskSpace/__init__.pyc
WMTaskSpace/cmsRun1:
total 62824
drwxr-xr-x 15 boinc cms 4096 Feb 3 10:09 CMSSW_4_1_8_patch14
-rw-r--r-- 1 boinc cms 22293 Feb 3 11:21 FrameworkJobReport.xml
-rw-r--r-- 1 boinc cms 393174 Feb 3 10:09 PSet.pkl
-rw-r--r-- 1 boinc cms 132 Feb 3 10:09 PSet.py
-rw-r--r-- 1 boinc cms 8431 Feb 3 10:09 PSet.pyc
-rw-r--r-- 1 boinc cms 63818875 Feb 3 11:21 RAWSIMoutput.root
-rw-r--r-- 1 boinc cms 7132 Feb 3 11:33 Report.pkl
-rw-r--r-- 1 boinc cms 341 Feb 3 10:09 __init__.py
-rw-r--r-- 1 boinc cms 605 Feb 3 10:09 __init__.pyc
-rw-r--r-- 1 boinc cms 1798 Feb 3 10:09 cmsRun1-main.sh
-rw-r--r-- 1 boinc cms 276 Feb 3 10:09 cmsRun1-stderr.log
-rw-r--r-- 1 boinc cms 31429 Feb 3 11:21 cmsRun1-stdout.log
-rw-r--r-- 1 boinc cms 6 Feb 3 10:09 process.id
-rw-r--r-- 1 boinc cms 88 Feb 3 10:09 scramOutput.log
WMTaskSpace/logArch1:
total 112
-rw-r--r-- 1 boinc cms 4523 Feb 3 11:33 Report.pkl
-rw-r--r-- 1 boinc cms 342 Feb 3 10:09 __init__.py
-rw-r--r-- 1 boinc cms 608 Feb 3 11:33 __init__.pyc
-rw-r--r-- 1 boinc cms 96833 Feb 3 11:33 logArchive.tar.gz
WMTaskSpace/stageOut1:
total 20
-rw-r--r-- 1 boinc cms 8697 Feb 3 11:33 Report.pkl
-rw-r--r-- 1 boinc cms 343 Feb 3 10:09 __init__.py
-rw-r--r-- 1 boinc cms 611 Feb 3 11:21 __init__.pyc
WMAgent bootstrap : Wed Feb 3 10:33:08 UTC 2016 : WMAgent is finished. The job had an exit code of 0 |
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
F5 works again with the second run. Definitely a glitch! 😉 I've had this on "non-helix" jobs. On both Linux and Windows. BT's network problems can't have helped. unable to contact the Frontier server that holds the "conditions database" with the parameters for the detector at the era being simulated, so the simulation crashes. I don't know why there is such a delay in reporting the crash. Maybe CERN IT are up to something. I was in the habit of checking the reports from this to see if there were uplink problems but I've now been shut out... |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 6 |
I've got one doing the same thing. From the Condor log it's one of Hassen's WMAgent jobs. There are no cmsRun logs. These jobs are having a problem starting up, unable to contact the Frontier server that holds the "conditions database" with the parameters for the detector at the era being simulated, so the simulation crashes. I don't know why there is such a delay in reporting the crash. That looks like copying the tarball of the logs succeeded; a bit further up you'll see attempts to copy a root file which as far as I can see on my machine don't succeed. So the problem is (mostly) something in the implementation of the F5 console, not that the job isn't running. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
The F5 console is probably not working with the other jobs as the log file is different. Once I know which one it is I can make it work again. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 875,820 RAC: 738 |
That looks like copying the tarball of the logs succeeded; a bit further up you'll see attempts to copy a root file which as far as I can see on my machine don't succeed. You're right! It seems it failed after 3 attempts to copy the 61MB results root-file with exit: 151 - Error during attempted file stageout. Can't Hassen remove his batch until this is solved? We have tested enough on his batch now and don't want to waste more time, since the outcome is obviously failure. Now that a new CMS application has started with Run 3, I get your jobs again. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 6 |
The F5 console is probably not working with the other jobs as the log file is different. Once I know which one it is I can make it work again. I found my current one at /home/boinc/CMSRun/glide_hshdGp/execute/dir_23289/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log This seems to overwrite any previous job's output; this is the fourth job in this run. |
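Since the glide_ and dir_ parts of that path appear to change from job to job, one way to locate the current log is a wildcard search. The sketch below is only a guess at the layout based on the single path quoted above: the wildcard positions and the default base directory are assumptions, and `newest_cmsrun_log` is a hypothetical helper, not something the VM provides.

```python
import glob
import os


def newest_cmsrun_log(base="/home/boinc/CMSRun"):
    """Return the most recently modified cmsRun1-stdout.log under base.

    Returns None if nothing matches the assumed layout.
    """
    # Assumed layout, generalised from the one path quoted in the post.
    pattern = os.path.join(
        base, "glide_*", "execute", "dir_*", "job",
        "WMTaskSpace", "cmsRun1", "cmsRun1-stdout.log",
    )
    candidates = glob.glob(pattern)
    if not candidates:
        return None
    # Pick the file that was written to most recently.
    return max(candidates, key=os.path.getmtime)


if __name__ == "__main__":
    print(newest_cmsrun_log())
```

If the log really is overwritten by each new job in the run, the console would also need to notice when the file shrinks or restarts rather than only following new lines.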
©2025 CERN