Message boards : News : Graceful Shutdown Now Implemented

PDW
Joined: 20 May 15
Posts: 217
Credit: 5,968,227
RAC: 13
Message 1801 - Posted: 2 Feb 2016, 13:17:54 UTC - in response to Message 1800.  

Now at 28:19 elapsed.
Completed a TEST_HELIX job about 5 minutes ago; they seem to last 2 hours (I looked at a previous one).
Have now started another TEST_HELIX job.

Strangely, stderr.txt updated about 20 minutes before the last job ended (I thought it updated when a job completed)...

2016-02-02 12:43:53 (836): Status Report: Job Duration: '129600.000000'
2016-02-02 12:43:53 (836): Status Report: Elapsed Time: '100824.546089'
2016-02-02 12:43:53 (836): Status Report: CPU Time: '82464.983418'
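The values in the status report are in seconds. A minimal sketch for turning them into h:mm:ss, assuming the line format shown above; the helper names are made up for illustration and are not part of the BOINC wrapper:

```python
import re

# The three status lines exactly as they appear in stderr.txt above.
LOG = """\
2016-02-02 12:43:53 (836): Status Report: Job Duration: '129600.000000'
2016-02-02 12:43:53 (836): Status Report: Elapsed Time: '100824.546089'
2016-02-02 12:43:53 (836): Status Report: CPU Time: '82464.983418'
"""

# Captures the field name and the quoted number of seconds.
PATTERN = re.compile(r"Status Report: (?P<key>[^:]+): '(?P<seconds>[0-9.]+)'")


def parse_status_report(text):
    """Return {field name: seconds} for every status line found."""
    return {m.group("key"): float(m.group("seconds")) for m in PATTERN.finditer(text)}


def hms(seconds):
    """Format a duration in seconds as h:mm:ss (fraction dropped)."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


report = parse_status_report(LOG)
for key, value in report.items():
    print(f"{key}: {hms(value)}")
# Job Duration: 36:00:00
# Elapsed Time: 28:00:24
# CPU Time: 22:54:24
```

So the job duration limit in this report is 36 hours, and the elapsed time at 12:43:53 was just over 28 hours.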
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1802 - Posted: 2 Feb 2016, 13:20:02 UTC - in response to Message 1800.  
Last modified: 2 Feb 2016, 13:26:11 UTC

I have had runs with up to 8 valid jobs.
Are you saying it finishes the last RUN, not the last JOB, when going past the 24h mark?
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 334,882
RAC: 2
Message 1803 - Posted: 2 Feb 2016, 13:31:04 UTC - in response to Message 1802.  

Yes, it finishes the last RUN when going past the 24h mark.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1804 - Posted: 2 Feb 2016, 13:36:41 UTC - in response to Message 1803.  

Is there a way to track the outcome of the "TEST_HELIX job" on the dashboard?
PDW
Joined: 20 May 15
Posts: 217
Credit: 5,968,227
RAC: 13
Message 1805 - Posted: 2 Feb 2016, 13:48:10 UTC - in response to Message 1804.  

So if the original VM (that started approx 9:00 CET) had not been deleted then it would have stopped the WU after the last job in run-4 below...

[DIR] Parent Directory -
[ ] boot.log 01-Feb-2016 12:44 11K
[ ] cron-stderr 01-Feb-2016 12:44 9.4K
[TXT] cron-stdout 02-Feb-2016 11:53 4.0K
[DIR] run-1/ 01-Feb-2016 12:46 -
[DIR] run-2/ 01-Feb-2016 19:51 -
[DIR] run-3/ 02-Feb-2016 02:38 -
[DIR] run-4/ 02-Feb-2016 04:50 -
[DIR] run-5/ 02-Feb-2016 11:54 -

As it now thinks it started at 12:44 CET yesterday, I assume it will stop when the last job in run-5 completes.
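The reasoning above can be sketched in a few lines of Python. This assumes the directory timestamps in the listing are the run start times, that they are in the same time zone as the VM start times, and that the task stops at the end of whichever run is in progress at the 24-hour mark. The 09:00 start on 1 Feb for the original VM is only approximate, and the function name is hypothetical:

```python
from datetime import datetime, timedelta

# Run directory timestamps copied from the listing above.
RUN_STARTS = {
    "run-1": datetime(2016, 2, 1, 12, 46),
    "run-2": datetime(2016, 2, 1, 19, 51),
    "run-3": datetime(2016, 2, 2, 2, 38),
    "run-4": datetime(2016, 2, 2, 4, 50),
    "run-5": datetime(2016, 2, 2, 11, 54),
}


def run_in_progress_at_limit(vm_start, run_starts, limit=timedelta(hours=24)):
    """Return the latest run that started at or before vm_start + limit."""
    deadline = vm_start + limit
    started = [(start, name) for name, start in run_starts.items() if start <= deadline]
    return max(started)[1] if started else None


# Original VM, started at roughly 09:00 on 1 Feb (approximate).
print(run_in_progress_at_limit(datetime(2016, 2, 1, 9, 0), RUN_STARTS))    # run-4
# Recreated VM, which thinks it started at 12:44 on 1 Feb.
print(run_in_progress_at_limit(datetime(2016, 2, 1, 12, 44), RUN_STARTS))  # run-5
```

The listing only covers the runs that existed when it was copied, so a later run would not show up here.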
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 334,882
RAC: 2
Message 1806 - Posted: 2 Feb 2016, 13:55:21 UTC - in response to Message 1805.  

Yes. That is correct. (if it works)
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 334,882
RAC: 2
Message 1807 - Posted: 2 Feb 2016, 14:00:16 UTC - in response to Message 1804.  

I think that you can see the outcome in the CMS dashboard but I am not familiar with it.

http://dashb-cms-job-task.cern.ch/dashboard/request.py/taskmonitoring#action=taskJobs&usergridname=undefined&taskmonid=wmagent_riahi_TEST_HELIX_0911-T3_CH_VolunteerBackfill_160125_144505_9930&what=all

These jobs may fail as we are testing the integration with the WMAgent, which is the main tool that submits the simulation jobs. Hopefully using this tool will free up Ivan from keeping us supplied with jobs so he can spend more time on the message boards :)
rbpeake

Joined: 15 Apr 15
Posts: 38
Credit: 227,251
RAC: 0
Message 1812 - Posted: 2 Feb 2016, 16:21:45 UTC

F5 is not working, but Windows Task Manager shows the CPU is working. Is this a glitch?
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1138
Credit: 8,066,292
RAC: 2,982
Message 1813 - Posted: 2 Feb 2016, 16:30:32 UTC - in response to Message 1812.  

F5 is not working, but Windows Task Manager shows the CPU is working. Is this a glitch?

Can you look at the logs in boincmgr with the (misleading) "Show graphics" button?
PDW
Joined: 20 May 15
Posts: 217
Credit: 5,968,227
RAC: 13
Message 1814 - Posted: 2 Feb 2016, 16:34:10 UTC - in response to Message 1813.  

F5 probably doesn't work because you will be running a TEST_HELIX job.
Mine still shows the last 'Ivan' job from this morning, nothing but HELIX since then...
rbpeake

Joined: 15 Apr 15
Posts: 38
Credit: 227,251
RAC: 0
Message 1815 - Posted: 2 Feb 2016, 16:44:40 UTC - in response to Message 1813.  

Can you look at the logs in boincmgr with the (misleading) "Show graphics" button?

Here is an excerpt from the boot log:
    Tue Feb 2 09:22:26 2016: grid.cern.ch: Restoring chunk tables... done
    Tue Feb 2 09:22:26 2016: grid.cern.ch: Restoring inode generation... done
    Tue Feb 2 09:22:26 2016: grid.cern.ch: Restoring open files counter... done
    Tue Feb 2 09:22:26 2016: grid.cern.ch: Releasing saved glue buffer
    Tue Feb 2 09:22:26 2016: grid.cern.ch: Releasing chunk tables
    Tue Feb 2 09:22:26 2016: grid.cern.ch: Releasing saved inode generation info
    Tue Feb 2 09:22:26 2016: grid.cern.ch: Releasing open files counter
    Tue Feb 2 09:22:26 2016: grid.cern.ch: Activating Fuse module


Nothing in the cron-stderr log.

From the cron-stdout log:

    type : RFC 3820 compliant impersonation proxy
    strength : 1024 bits
    path : /tmp/x509up_u500
    timeleft : 129:59:57 (5.4 days)
    09:23:06 -0500 2016-02-02 [INFO] Downloading glidein
    09:23:11 -0500 2016-02-02 [INFO] Running glidein (check logs)

rbpeake

Joined: 15 Apr 15
Posts: 38
Credit: 227,251
RAC: 0
Message 1820 - Posted: 2 Feb 2016, 18:35:14 UTC

F5 works again with the second run. Definitely a glitch! 😉
PDW
Joined: 20 May 15
Posts: 217
Credit: 5,968,227
RAC: 13
Message 1826 - Posted: 3 Feb 2016, 10:20:44 UTC - in response to Message 1806.  

Yes. That is correct. (if it works)

Sorry, had to go out last night before it completed.

It eventually completed at 5:30pm (UTC) with an elapsed time of 32:39.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1138
Credit: 8,066,292
RAC: 2,982
Message 1827 - Posted: 3 Feb 2016, 11:03:49 UTC - in response to Message 1820.  

F5 works again with the second run. Definitely a glitch! 😉

I've got one doing the same thing. From the Condor log it's one of Hassen's WMAgent jobs. There are no cmsRun logs. These jobs are having a problem starting up, unable to contact the Frontier server that holds the "conditions database" with the parameters for the detector at the era being simulated, so the simulation crashes. I don't know why there is such a delay in reporting the crash.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 854,677
RAC: 8
Message 1828 - Posted: 3 Feb 2016, 12:18:08 UTC - in response to Message 1827.  

I've got one doing the same thing. From the Condor log it's one of Hassen's WMAgent jobs. There are no cmsRun logs. These jobs are having a problem starting up, unable to contact the Frontier server that holds the "conditions database" with the parameters for the detector at the era being simulated, so the simulation crashes. I don't know why there is such a delay in reporting the crash.

I'm only getting Hassen Riahi's riahi_TEST_HELIX_0911-T3_CH_VolunteerBackfill tasks, but they all end successfully.
The success is in _condor_stdout. Last part of 1 job:
Copying 96833 bytes file:///home/boinc/CMSRun/glide_lwaQNe/execute/dir_15927/job/WMTaskSpace/logArch1/logArchive.tar.gz => https://data-bridge-test.cern.ch/myfed/cms-boinc/output/unmerged/logs/prod/2016/2/3/riahi_TEST_HELIX_0911-T3_CH_VolunteerBackfill_160125_144505_9930/Production/0000/1/9e1b9ff6-c5f8-11e5-a1da-001dd8b71c94-4-1-logArchive.tar.gz
gfal-copy exit status: 0
Command exited with status: 0
===> Stage Out Successful: {'SEName': 'data-bridge-test.cern.ch', 'LFN': '/store/unmerged/logs/prod/2016/2/3/riahi_TEST_HELIX_0911-T3_CH_VolunteerBackfill_160125_144505_9930/Production/0000/1/9e1b9ff6-c5f8-11e5-a1da-001dd8b71c94-4-1-logArchive.tar.gz', 'GUID': None, 'StageOutCommand': 'gfal2', 'PFN': 'https://data-bridge-test.cern.ch/myfed/cms-boinc/output/unmerged/logs/prod/2016/2/3/riahi_TEST_HELIX_0911-T3_CH_VolunteerBackfill_160125_144505_9930/Production/0000/1/9e1b9ff6-c5f8-11e5-a1da-001dd8b71c94-4-1-logArchive.tar.gz'}
Startup.py : 2016-02-03T11:33:08 : completing task
Startup.py : 2016-02-03T11:33:08 : shutting down monitor
WMAgent bootstrap : Wed Feb 3 10:33:08 UTC 2016 : WMAgent finished the job, is copying the pickled report
total 44
-rw-r--r-- 1 boinc cms 20690 Feb 3 11:33 Report.1.pkl
-rw-r--r-- 1 boinc cms 382 Feb 3 10:09 __init__.py
-rw-r--r-- 1 boinc cms 618 Feb 3 10:09 __init__.pyc
drwxr-xr-x 3 boinc cms 4096 Feb 3 11:21 cmsRun1
drwxr-xr-x 2 boinc cms 4096 Feb 3 11:33 logArch1
drwxr-xr-x 2 boinc cms 4096 Feb 3 11:33 stageOut1
-rw-r--r-- 1 boinc cms 20690 Feb 3 11:33 WMTaskSpace/Report.1.pkl
-rw-r--r-- 1 boinc cms 382 Feb 3 10:09 WMTaskSpace/__init__.py
-rw-r--r-- 1 boinc cms 618 Feb 3 10:09 WMTaskSpace/__init__.pyc

WMTaskSpace/cmsRun1:
total 62824
drwxr-xr-x 15 boinc cms 4096 Feb 3 10:09 CMSSW_4_1_8_patch14
-rw-r--r-- 1 boinc cms 22293 Feb 3 11:21 FrameworkJobReport.xml
-rw-r--r-- 1 boinc cms 393174 Feb 3 10:09 PSet.pkl
-rw-r--r-- 1 boinc cms 132 Feb 3 10:09 PSet.py
-rw-r--r-- 1 boinc cms 8431 Feb 3 10:09 PSet.pyc
-rw-r--r-- 1 boinc cms 63818875 Feb 3 11:21 RAWSIMoutput.root
-rw-r--r-- 1 boinc cms 7132 Feb 3 11:33 Report.pkl
-rw-r--r-- 1 boinc cms 341 Feb 3 10:09 __init__.py
-rw-r--r-- 1 boinc cms 605 Feb 3 10:09 __init__.pyc
-rw-r--r-- 1 boinc cms 1798 Feb 3 10:09 cmsRun1-main.sh
-rw-r--r-- 1 boinc cms 276 Feb 3 10:09 cmsRun1-stderr.log
-rw-r--r-- 1 boinc cms 31429 Feb 3 11:21 cmsRun1-stdout.log
-rw-r--r-- 1 boinc cms 6 Feb 3 10:09 process.id
-rw-r--r-- 1 boinc cms 88 Feb 3 10:09 scramOutput.log

WMTaskSpace/logArch1:
total 112
-rw-r--r-- 1 boinc cms 4523 Feb 3 11:33 Report.pkl
-rw-r--r-- 1 boinc cms 342 Feb 3 10:09 __init__.py
-rw-r--r-- 1 boinc cms 608 Feb 3 11:33 __init__.pyc
-rw-r--r-- 1 boinc cms 96833 Feb 3 11:33 logArchive.tar.gz

WMTaskSpace/stageOut1:
total 20
-rw-r--r-- 1 boinc cms 8697 Feb 3 11:33 Report.pkl
-rw-r--r-- 1 boinc cms 343 Feb 3 10:09 __init__.py
-rw-r--r-- 1 boinc cms 611 Feb 3 11:21 __init__.pyc
WMAgent bootstrap : Wed Feb 3 10:33:08 UTC 2016 : WMAgent is finished. The job had an exit code of 0
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 0
Message 1829 - Posted: 3 Feb 2016, 13:32:40 UTC - in response to Message 1827.  

F5 works again with the second run. Definitely a glitch! 😉

I've got one doing the same thing. From the Condor log it's one of Hassen's WMAgent jobs. There are no cmsRun logs. These jobs are having a problem starting up,

I've had this on "non-helix" jobs. On both Linux and Windows. BT's network problems can't have helped.

unable to contact the Frontier server that holds the "conditions database" with the parameters for the detector at the era being simulated, so the simulation crashes. I don't know why there is such a delay in reporting the crash.

Maybe CERN IT are up to something. I was in the habit of checking the reports from this to see if there were uplink problems, but I've now been shut out...
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1138
Credit: 8,066,292
RAC: 2,982
Message 1830 - Posted: 3 Feb 2016, 13:45:53 UTC - in response to Message 1828.  


That looks like copying the tarball of the logs succeeded; a bit further up you'll see attempts to copy a root file which as far as I can see on my machine don't succeed. So the problem is (mostly) something in the implementation of the F5 console, not that the job isn't running.
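Since the last part of _condor_stdout only shows the final copy, it can help to scan the whole file for every copy status. A rough sketch, assuming each copy attempt logs a "gfal-copy exit status: N" line and the job ends with a "The job had an exit code of N" line, as in the excerpt; the sample text holds only lines quoted in this thread, and the function name is made up:

```python
import re

# Lines taken from the log excerpt quoted in this thread.
SAMPLE = """\
gfal-copy exit status: 0
Command exited with status: 0
WMAgent bootstrap : Wed Feb 3 10:33:08 UTC 2016 : WMAgent is finished. The job had an exit code of 0
"""

COPY_STATUS = re.compile(r"gfal-copy exit status: (\d+)")
JOB_EXIT = re.compile(r"The job had an exit code of (\d+)")


def summarize(log_text):
    """Collect every gfal-copy status plus the final job exit code."""
    copy_statuses = [int(status) for status in COPY_STATUS.findall(log_text)]
    match = JOB_EXIT.search(log_text)
    return {
        "copy_statuses": copy_statuses,
        "failed_copies": sum(1 for status in copy_statuses if status != 0),
        "job_exit": int(match.group(1)) if match else None,
    }


print(summarize(SAMPLE))
```

Run against the full _condor_stdout rather than this excerpt, a non-zero entry in copy_statuses would flag a failed copy even when the last one succeeded.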
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 334,882
RAC: 2
Message 1831 - Posted: 3 Feb 2016, 14:01:09 UTC - in response to Message 1830.  

The F5 console is probably not working with the other jobs as the log file is different. Once I know which one it is I can make it work again.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 854,677
RAC: 8
Message 1832 - Posted: 3 Feb 2016, 14:18:21 UTC - in response to Message 1830.  

That looks like copying the tarball of the logs succeeded; a bit further up you'll see attempts to copy a root file which as far as I can see on my machine don't succeed.

You're right!
It seems it failed after 3 attempts to copy the 61MB results root file, with exit 151 - Error during attempted file stageout.
Can't Hassen remove his batch until this is solved? We have tested enough on his batch now and don't want to waste more time, since the outcome is obviously failure.

Now a new CMS application started with Run 3, I get your jobs again.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1138
Credit: 8,066,292
RAC: 2,982
Message 1833 - Posted: 3 Feb 2016, 15:16:24 UTC - in response to Message 1831.  

The F5 console is probably not working with the other jobs as the log file is different. Once I know which one it is I can make it work again.

I found my current one at /home/boinc/CMSRun/glide_hshdGp/execute/dir_23289/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log
This seems to overwrite any previous job's output; this is the fourth job in this run.
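For anyone wanting to find the equivalent file on their own VM, here is a small sketch. It assumes the glide_* and dir_* parts of the path vary per run and job, which is a guess based on the two paths posted in this thread, and the helper names are hypothetical:

```python
import glob
import os

# Wildcards replace the parts of the path that differed between the two
# paths seen in this thread (glide_lwaQNe/dir_15927, glide_hshdGp/dir_23289).
PATTERN = (
    "/home/boinc/CMSRun/glide_*/execute/dir_*/job/"
    "WMTaskSpace/cmsRun1/cmsRun1-stdout.log"
)


def newest_log(pattern=PATTERN):
    """Return the most recently modified file matching pattern, or None."""
    candidates = glob.glob(pattern)
    return max(candidates, key=os.path.getmtime) if candidates else None


def tail(path, lines=20):
    """Return the last `lines` lines of a text file."""
    with open(path, errors="replace") as handle:
        return handle.readlines()[-lines:]


path = newest_log()
if path:
    print(path)
    print("".join(tail(path)), end="")
else:
    print("no cmsRun1-stdout.log found")
```

This has to be run inside the VM, where the /home/boinc tree lives.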


©2024 CERN