Message boards : Number crunching : New(?) failure mode

Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,945,813
RAC: 2,949
Message 1676 - Posted: 28 Jan 2016, 11:04:21 UTC

After having seen a job just stop mid-way with no error message yesterday, which I initially put down to the memory-swapping problem I had at the time, I've paid closer attention and found a few more. They occur in both a task started from CMS-dev and a task from vLHC. It might still be a memory issue, since I'm running close to the limit with two VMs at once, but has anyone else noticed this?
The cmsRun-stdout.log just stops after an arbitrary number of events. The _condor_stdout file just has a "Killed" line as below, always at "line 75" as far as I've seen. Its size is around 30 KB, rather than the usual 120 KB or so, because it doesn't include the cmsRun log. No log gets returned to the Condor server.

2016-01-28 09:29:59,023:INFO:CMSSW: Invoking command: python /home/boinc/CMSRun/glide_bnkoTJ/execute/dir_17337/TweakPSet.py --location=/home/boinc/CMSRun/glide_bnkoTJ/execute/dir_17337 --inputFile='job_input_file_list_2337.txt' --runAndLumis='job_lumis_2337.json' --firstEvent=350401 --lastEvent=350551 --firstLumi=4673 --firstRun=1 --seeding=AutomaticSeeding --lheInputFiles=False --oneEventMode=0 --eventsPerLumi=100

2016-01-28 09:30:01,852:INFO:CMSSW:Executing CMSSW. args: ['/bin/bash', '/home/boinc/CMSRun/glide_bnkoTJ/execute/dir_17337/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', '']
/home/boinc/CMSRun/glide_bnkoTJ/execute/dir_17337/condor_exec.exe: line 75: 17393 Killed sh ./CMSRunAnalysis.sh "$@" --oneEventMode=$CRAB_oneEventMode

real 25m54.684s
user 0m0.031s
sys 0m0.282s
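
For anyone who wants to check their own _condor_stdout files for the same symptom, here is a minimal sketch that picks out the shell's "Killed" notice. The pattern is modelled only on the single line quoted above, so treat the exact layout as an assumption; `find_killed` and `KILLED_RE` are hypothetical names, not part of any CMS or Condor tool.

```python
import re

# Pattern modelled on the one "Killed" line quoted above (layout is an assumption):
#   <script path>: line <N>: <PID> Killed <command>
KILLED_RE = re.compile(
    r"^(?P<script>\S+): line (?P<line>\d+):\s+(?P<pid>\d+)\s+Killed\s+(?P<command>.*)$"
)


def find_killed(log_text):
    """Return one dict per 'Killed' notice found in the given log text."""
    hits = []
    for raw in log_text.splitlines():
        match = KILLED_RE.match(raw.strip())
        if match:
            hits.append({
                "script": match.group("script"),
                "line": int(match.group("line")),
                "pid": int(match.group("pid")),
                "command": match.group("command"),
            })
    return hits


# Sample copied from the excerpt above.
sample = (
    "/home/boinc/CMSRun/glide_bnkoTJ/execute/dir_17337/condor_exec.exe: "
    'line 75: 17393 Killed sh ./CMSRunAnalysis.sh "$@" '
    "--oneEventMode=$CRAB_oneEventMode\n"
)

for hit in find_killed(sample):
    print(hit["line"], hit["pid"], hit["command"])
```

Reading a real file into `log_text` and passing it to `find_killed` would list the script line number and PID of each killed process, which makes it easier to compare reports between hosts.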

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1677 - Posted: 28 Jan 2016, 11:08:31 UTC

Yes, I have had one like this.
Job 6212 in the last batch.
It just stopped at event 115.
Logs are gone, unfortunately.
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,875,016
RAC: 16,090
Message 1678 - Posted: 28 Jan 2016, 11:35:05 UTC - in response to Message 1677.  

I have found 2 more that just stopped as they were about to start processing rather than half-way through the events...

2016-01-27 20:44:59,659:INFO:CMSSW: Invoking command: python /home/boinc/CMSRun/glide_KtqIXk/execute/dir_17430/TweakPSet.py --location=/home/boinc/CMSRun/glide_KtqIXk/execute/dir_17430 --inputFile='job_input_file_list_1240.txt' --runAndLumis='job_lumis_1240.json' --firstEvent=185851 --lastEvent=186001 --firstLumi=2479 --firstRun=1 --seeding=AutomaticSeeding --lheInputFiles=False --oneEventMode=0 --eventsPerLumi=100

2016-01-27 20:45:01,475:INFO:CMSSW:Executing CMSSW. args: ['/bin/bash', '/home/boinc/CMSRun/glide_KtqIXk/execute/dir_17430/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', '']
Job Running time in seconds: 7
Job runtime is less than 20minutes. Sleeping 1193
/home/boinc/CMSRun/glide_KtqIXk/execute/dir_17430/condor_exec.exe: line 12: 17723 Killed sh ./CMSRunAnalysis.sh "$@" --oneEventMode=$CRAB_oneEventMode
2016-01-27 20:47:03,775:CRITICAL:CMSSW:Error running cmsRun
{'arguments': ['/bin/bash', '/home/boinc/CMSRun/glide_KtqIXk/execute/dir_17430/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', '']}
Return code: -9



2016-01-28 03:20:01,340:INFO:CMSSW: Invoking command: python /home/boinc/CMSRun/glide_IMWZED/execute/dir_12358/TweakPSet.py --location=/home/boinc/CMSRun/glide_IMWZED/execute/dir_12358 --inputFile='job_input_file_list_1851.txt' --runAndLumis='job_lumis_1851.json' --firstEvent=277501 --lastEvent=277651 --firstLumi=3701 --firstRun=1 --seeding=AutomaticSeeding --lheInputFiles=False --oneEventMode=0 --eventsPerLumi=100

2016-01-28 03:20:03,133:INFO:CMSSW:Executing CMSSW. args: ['/bin/bash', '/home/boinc/CMSRun/glide_IMWZED/execute/dir_12358/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', '']
Job Running time in seconds: 6
Job runtime is less than 20minutes. Sleeping 1194
/home/boinc/CMSRun/glide_IMWZED/execute/dir_12358/condor_exec.exe: line 12: 12662 Killed sh ./CMSRunAnalysis.sh "$@" --oneEventMode=$CRAB_oneEventMode
2016-01-28 03:22:03,145:CRITICAL:CMSSW:Error running cmsRun
{'arguments': ['/bin/bash', '/home/boinc/CMSRun/glide_IMWZED/execute/dir_12358/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', '']}
Return code: -9
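
A note on reading "Return code: -9": if the wrapper reports the code the way Python's subprocess module does (an assumption; the log format looks like Python logging, but nothing here confirms it), a negative value means the child was terminated by that signal number. Signal 9 is SIGKILL, which a process cannot catch, so the job would have no chance to write an error of its own. The sketch below only decodes the number; `describe_return_code` is a hypothetical helper, and it says nothing about who sent the signal.

```python
import signal


def describe_return_code(code):
    """Decode a subprocess-style return code (negative means 'killed by signal')."""
    if code < 0:
        # signal.Signals maps the number to its symbolic name on this platform.
        return "killed by signal %d (%s)" % (-code, signal.Signals(-code).name)
    return "exited with status %d" % code


print(describe_return_code(-9))
print(describe_return_code(0))
```

On Linux the first call reports SIGKILL, which fits the "Killed" notice printed by the shell in the same excerpt.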


Thought I had found one that had stopped halfway through, only to realise that it was the one currently being worked on. Doh!
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,945,813
RAC: 2,949
Message 1679 - Posted: 28 Jan 2016, 11:41:50 UTC - in response to Message 1678.  

Thought I had found one that had stopped halfway through, only to realise that it was the one currently being worked on. Doh!

Yes, you have to watch that... :-)
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,875,016
RAC: 16,090
Message 1680 - Posted: 28 Jan 2016, 11:48:55 UTC - in response to Message 1679.  

This is what the cmsRun-stdout.log has for those...

Beginning CMSSW wrapper script
slc6_amd64_gcc472 scramv1 CMSSW
Performing SCRAM setup...
Completed SCRAM setup
Retrieving SCRAM project...
Untarring /home/boinc/CMSRun/glide_IMWZED/execute/dir_12358/sandbox.tar.gz
Completed SCRAM project
Executing CMSSW
cmsRun -j FrameworkJobReport.xml PSet.py
Set Driver verbosity to -2
New QGSP_FTFP_BERT physics list, replaces LEP with FTF/P for p/n/pi (/K?) Thresholds:
1) between BERT and FTF/P over the interval 6 to 8 GeV.
2) between FTF/P and QGS/P over the interval 12 to 25 GeV.
-- quasiElastic was asked to be 1
Changed to 1 for QGS and to 0 (must be false) for FTF
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,945,813
RAC: 2,949
Message 1684 - Posted: 28 Jan 2016, 13:42:59 UTC - in response to Message 1680.  
Last modified: 28 Jan 2016, 14:34:49 UTC

This is what the cmsRun-stdout.log has for those...

...
Changed to 1 for QGS and to 0 (must be false) for FTF

OK, they are dying during startup, before the first event begins. Possibly while trying to download the conditions database from the Frontier server? I don't remember whether that comes before or during the first event; a lot of the initialisation is "lazy", i.e. it is only performed when the class is first called.
Edit: I just started up new tasks and all the net downloading occurred before the message came up for the first event, so it might be network glitches.
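
To illustrate what "lazy" initialisation means here, a generic Python sketch follows. It is not CMSSW code: `ConditionsClient` and its `conditions` attribute are hypothetical names, and the dictionary stands in for whatever expensive work (such as a download) would really happen.

```python
from functools import cached_property


class ConditionsClient:
    """Hypothetical stand-in for a class whose setup is deferred until first use."""

    def __init__(self):
        # Constructing the object does no expensive work.
        self.load_count = 0

    @cached_property
    def conditions(self):
        # Runs only on the first access; the result is cached afterwards.
        self.load_count += 1
        return {"source": "placeholder for an expensive download"}


client = ConditionsClient()
print(client.load_count)   # nothing has been loaded yet
client.conditions          # first access triggers the load
client.conditions          # second access reuses the cached value
print(client.load_count)
```

The point for this thread is that with this pattern a failure in the deferred step shows up at first use rather than at construction, which is why it can be hard to say from memory whether a given download happens before or during the first event.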
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,875,016
RAC: 16,090
Message 1685 - Posted: 28 Jan 2016, 16:40:50 UTC - in response to Message 1684.  

One more...

2016-01-28 14:45:01,528:INFO:CMSSW:Executing CMSSW. args: ['/bin/bash', '/home/boinc/CMSRun/glide_MUfjF9/execute/dir_31716/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', '']
Job Running time in seconds: 7
Job runtime is less than 20minutes. Sleeping 1193
/home/boinc/CMSRun/glide_MUfjF9/execute/dir_31716/condor_exec.exe: line 12: 32017 Killed sh ./CMSRunAnalysis.sh "$@" --oneEventMode=$CRAB_oneEventMode
2016-01-28 14:47:03,795:CRITICAL:CMSSW:Error running cmsRun
{'arguments': ['/bin/bash', '/home/boinc/CMSRun/glide_MUfjF9/execute/dir_31716/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', '']}
Return code: -9
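
The timestamps in the three short-lived excerpts in this thread allow a quick check of how long each job lasted between "Executing CMSSW" and "Error running cmsRun". The sketch below does that arithmetic with values copied from the logs. The 20-minute target is an assumption inferred from the "Job runtime is less than 20minutes" message, and `seconds_between` is a hypothetical helper.

```python
from datetime import datetime

# Timestamp layout used in the log excerpts, e.g. "2016-01-28 14:45:01,528".
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(start, end):
    """Elapsed seconds between two log timestamps."""
    t0 = datetime.strptime(start, STAMP_FORMAT)
    t1 = datetime.strptime(end, STAMP_FORMAT)
    return (t1 - t0).total_seconds()


# ("Executing CMSSW", "Error running cmsRun") pairs copied from the excerpts.
runs = {
    "glide_KtqIXk": ("2016-01-27 20:45:01,475", "2016-01-27 20:47:03,775"),
    "glide_IMWZED": ("2016-01-28 03:20:03,133", "2016-01-28 03:22:03,145"),
    "glide_MUfjF9": ("2016-01-28 14:45:01,528", "2016-01-28 14:47:03,795"),
}

for name, (start, end) in runs.items():
    print(name, round(seconds_between(start, end), 1))

# "Sleeping 1193" after a 7 s run is consistent with padding to 20 minutes.
TARGET_SECONDS = 20 * 60  # assumption inferred from the log message
print(TARGET_SECONDS - 7)
```

All three come out at roughly two minutes, far short of the announced sleep of about 1190 seconds, so the kill arrives long before the sleep would have finished. That is only an observation from the timestamps, not an explanation of the cause.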


The number of CMS jobs currently running has shot up!
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1686 - Posted: 28 Jan 2016, 16:52:16 UTC - in response to Message 1685.  
Last modified: 28 Jan 2016, 16:52:28 UTC

The number of CMS jobs currently running has shot up!


That is because they linked VLHC@home to it.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,945,813
RAC: 2,949
Message 1697 - Posted: 29 Jan 2016, 0:38:16 UTC - in response to Message 1686.  
Last modified: 29 Jan 2016, 0:42:48 UTC

The number of CMS jobs currently running has shot up!


That is because they linked VLHC@home to it.

Yes, I just realised that about 30 mins ago. I thought initially that it was due to vLHC sending two tasks at once, until we more than doubled the normal usage...

Edit: Oh! Dear! 2,500 users instead of about 100 on CMS-dev? I'm supposed to present a major CMS Upgrade review in 6 days and I haven't started yet due to CMS@Home interruptions. Please forgive me if I go quiet for a while.
