New(?) failure mode
Author | Message |
---|---|
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
After having seen a job just stop mid-way with no error message yesterday, which I initially put down to the memory-swapping problem I had at the time, I've paid closer attention and found a few more. They occur in both a task started from CMS-dev and a task from vLHC. It might still be a memory issue, since I'm running close to the limit with two VMs at once, but has anyone else noticed this? The cmsRun-stdout.log just stops after an arbitrary number of events. The _condor_stdout file just has a "killed" line as below, always "line 75" as far as I've seen. Its size is around 30 KB, rather than the usual 120 KB or so as it doesn't include the cmsRun log. No log gets returned to the Condor server. 2016-01-28 09:29:59,023:INFO:CMSSW: Invoking command: python /home/boinc/CMSRun/glide_bnkoTJ/execute/dir_17337/TweakPSet.py --location=/home/boinc/CMSRun/glide_bnkoTJ/execute/dir_17337 --inputFile='job_input_file_list_2337.txt' --runAndLumis='job_lumis_2337.json' --firstEvent=350401 --lastEvent=350551 --firstLumi=4673 --firstRun=1 --seeding=AutomaticSeeding --lheInputFiles=False --oneEventMode=0 --eventsPerLumi=100 2016-01-28 09:30:01,852:INFO:CMSSW:Executing CMSSW. args: ['/bin/bash', '/home/boinc/CMSRun/glide_bnkoTJ/execute/dir_17337/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', ''] /home/boinc/CMSRun/glide_bnkoTJ/execute/dir_17337/condor_exec.exe: line 75: 17393 Killed sh ./CMSRunAnalysis.sh "$@" --oneEventMode=$CRAB_oneEventMode real 25m54.684s user 0m0.031s sys 0m0.282s |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Yes, I have had one like this: job 6212 in the last batch. It just stopped at event 115. The logs are gone, unfortunately. |
Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 134 |
I have found 2 more that just stopped as they were about to start processing rather than half-way through the events... 2016-01-27 20:44:59,659:INFO:CMSSW: Invoking command: python /home/boinc/CMSRun/glide_KtqIXk/execute/dir_17430/TweakPSet.py --location=/home/boinc/CMSRun/glide_KtqIXk/execute/dir_17430 --inputFile='job_input_file_list_1240.txt' --runAndLumis='job_lumis_1240.json' --firstEvent=185851 --lastEvent=186001 --firstLumi=2479 --firstRun=1 --seeding=AutomaticSeeding --lheInputFiles=False --oneEventMode=0 --eventsPerLumi=100 2016-01-27 20:45:01,475:INFO:CMSSW:Executing CMSSW. args: ['/bin/bash', '/home/boinc/CMSRun/glide_KtqIXk/execute/dir_17430/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', ''] Job Running time in seconds: 7 Job runtime is less than 20minutes. Sleeping 1193 /home/boinc/CMSRun/glide_KtqIXk/execute/dir_17430/condor_exec.exe: line 12: 17723 Killed sh ./CMSRunAnalysis.sh "$@" --oneEventMode=$CRAB_oneEventMode 2016-01-27 20:47:03,775:CRITICAL:CMSSW:Error running cmsRun {'arguments': ['/bin/bash', '/home/boinc/CMSRun/glide_KtqIXk/execute/dir_17430/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', '']} Return code: -9 2016-01-28 03:20:01,340:INFO:CMSSW: Invoking command: python /home/boinc/CMSRun/glide_IMWZED/execute/dir_12358/TweakPSet.py --location=/home/boinc/CMSRun/glide_IMWZED/execute/dir_12358 --inputFile='job_input_file_list_1851.txt' --runAndLumis='job_lumis_1851.json' --firstEvent=277501 --lastEvent=277651 --firstLumi=3701 --firstRun=1 --seeding=AutomaticSeeding --lheInputFiles=False --oneEventMode=0 --eventsPerLumi=100 2016-01-28 03:20:03,133:INFO:CMSSW:Executing CMSSW. args: ['/bin/bash', '/home/boinc/CMSRun/glide_IMWZED/execute/dir_12358/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', ''] Job Running time in seconds: 6 Job runtime is less than 20minutes. Sleeping 1194 /home/boinc/CMSRun/glide_IMWZED/execute/dir_12358/condor_exec.exe: line 12: 12662 Killed sh ./CMSRunAnalysis.sh "$@" --oneEventMode=$CRAB_oneEventMode 2016-01-28 03:22:03,145:CRITICAL:CMSSW:Error running cmsRun {'arguments': ['/bin/bash', '/home/boinc/CMSRun/glide_IMWZED/execute/dir_12358/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', '']} Return code: -9 Thought I had found one that had stopped half way through only to realise that was the one it was currently working on Doh ! |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
"Thought I had found one that had stopped half way through only to realise that was the one it was currently working on Doh !" Yes, you have to watch that... :-) |
Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 134 |
This is what the cmsRun-stdout.log has for those... Beginning CMSSW wrapper script slc6_amd64_gcc472 scramv1 CMSSW Performing SCRAM setup... Completed SCRAM setup Retrieving SCRAM project... Untarring /home/boinc/CMSRun/glide_IMWZED/execute/dir_12358/sandbox.tar.gz Completed SCRAM project Executing CMSSW cmsRun -j FrameworkJobReport.xml PSet.py Set Driver verbosity to -2 New QGSP_FTFP_BERT physics list, replaces LEP with FTF/P for p/n/pi (/K?) Thresholds: 1) between BERT and FTF/P over the interval 6 to 8 GeV. 2) between FTF/P and QGS/P over the interval 12 to 25 GeV. -- quasiElastic was asked to be 1 Changed to 1 for QGS and to 0 (must be false) for FTF |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
"This is what the cmsRun-stdout.log has for those..." OK, they are dying during startup, before the first event begins. Possibly while trying to download the conditions database from the Frontier server? I don't remember if that comes before or during the first event; a lot of the initialisation is "lazy", performed only when a class is first used. Edit: I just started up new tasks and all the network downloading occurred before the message for the first event came up, so it might be network glitches. |
Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 134 |
One more... 2016-01-28 14:45:01,528:INFO:CMSSW:Executing CMSSW. args: ['/bin/bash', '/home/boinc/CMSRun/glide_MUfjF9/execute/dir_31716/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', ''] Job Running time in seconds: 7 Job runtime is less than 20minutes. Sleeping 1193 /home/boinc/CMSRun/glide_MUfjF9/execute/dir_31716/condor_exec.exe: line 12: 32017 Killed sh ./CMSRunAnalysis.sh "$@" --oneEventMode=$CRAB_oneEventMode 2016-01-28 14:47:03,795:CRITICAL:CMSSW:Error running cmsRun {'arguments': ['/bin/bash', '/home/boinc/CMSRun/glide_MUfjF9/execute/dir_31716/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', '']} Return code: -9 Number of CMS jobs currently running has shot up ! |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
"Number of CMS jobs currently running has shot up !" That is because they linked vLHC@home to it. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
"Number of CMS jobs currently running has shot up !" Yes, I just realised that about 30 mins ago. I thought initially that it was due to vLHC sending two tasks at once, until we more than doubled the normal usage... Edit: Oh! Dear! 2,500 users instead of about 100 on CMS-dev? I'm supposed to present a major CMS Upgrade review in 6 days and I haven't started yet due to CMS@Home interruptions. Please forgive me if I go quiet for a while. |
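
For anyone wanting to check their own logs for the pattern reported in this thread, here is a minimal Python sketch that scans the text of a `_condor_stdout`-style file for the shell's "Killed" lines and for "Return code:" entries. The function names (`scan_log`, `describe_return_code`) and the two regular expressions are hypothetical, written only against the excerpts quoted above. It also assumes the negative return code follows the convention of Python's `subprocess` module, where `-N` means the child was terminated by signal `N`; the thread does not confirm how the wrapper computes that value.

```python
import re
import signal

# Patterns modelled on the log excerpts quoted in this thread (assumed, not official formats).
KILLED_RE = re.compile(r"line (\d+):\s+(\d+)\s+Killed\b")
RETCODE_RE = re.compile(r"Return code:\s*(-?\d+)")


def describe_return_code(code):
    """Explain a return code, reading negative values as signal numbers (assumption)."""
    if code < 0:
        try:
            name = signal.Signals(-code).name
        except ValueError:
            # The signal number is not defined on this platform.
            name = "unknown signal"
        return f"terminated by signal {-code} ({name})"
    return f"exited with status {code}"


def scan_log(text):
    """Return (kind, detail) findings: all 'Killed' lines first, then all return codes."""
    findings = []
    for match in KILLED_RE.finditer(text):
        script_line, pid = match.groups()
        findings.append(("killed", f"pid {pid} reported at script line {script_line}"))
    for match in RETCODE_RE.finditer(text):
        findings.append(("return_code", describe_return_code(int(match.group(1)))))
    return findings


if __name__ == "__main__":
    sample = (
        "condor_exec.exe: line 12: 17723 Killed sh ./CMSRunAnalysis.sh "
        "Error running cmsRun Return code: -9"
    )
    for kind, detail in scan_log(sample):
        print(kind, "->", detail)
```

On Linux, signal 9 is SIGKILL, which is also what the kernel's out-of-memory killer sends. A "Killed" line with return code -9 is therefore consistent with the memory hypothesis raised in the first post, but it does not prove it, since anything that sends SIGKILL would leave the same trace.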
©2025 CERN