Message boards : Number crunching : cmsRun Fatal Exception

Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,116
Message 2163 - Posted: 2 Mar 2016, 18:47:45 UTC

First job, Run 1 in fresh VM

== CMSSW: Executing CMSSW
== CMSSW: cmsRun -j FrameworkJobReport.xml PSet.py
== CMSSW: ----- Begin Fatal Exception 02-Mar-2016 19:29:23 CET-----------------------
== CMSSW: An exception of category 'Incomplete configuration' occurred while
== CMSSW: [0] Constructing the EventProcessor
== CMSSW: [1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
== CMSSW: Exception Message:
== CMSSW: Valid site-local-config not found at /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml
== CMSSW: ----- End Fatal Exception -------------------------------------------------
== CMSSW: Complete
== CMSSW: process id is 7889 status is 65
======== CMSSW OUTPUT FINSHING ========
ERROR: Caught WMExecutionFailure - code = 65 - name = CmsRunFailure - detail = Error running cmsRun
{'arguments': ['/bin/bash', '/home/boinc/CMSRun/glide_yaapii/execute/dir_7495/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', '']}
Return code: 65

NOTE: FJR has exit code 8001 and WMCore reports 65; preferring the FJR one.
ERROR: Exceptional exit at Wed Mar 2 18:29:23 2016 (8001): CmsRunFailure
CMSSW error message follows.
Fatal Exception
An exception of category 'Incomplete configuration' occurred while
[0] Constructing the EventProcessor
[1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
Exception Message:
Valid site-local-config not found at /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml

ERROR: Traceback follows:
Traceback (most recent call last):
File "CMSRunAnalysis.py", line 890, in <module>
cmssw = executeCMSSWStack(opts, scram)
File "CMSRunAnalysis.py", line 688, in executeCMSSWStack
cmssw.execute()
File "/home/boinc/CMSRun/glide_yaapii/execute/dir_7495/WMCore.zip/WMCore/WMSpec/Steps/Executors/CMSSW.py", line 233, in execute
raise WMExecutionFailure(returncode, "CmsRunFailure", msg)
WMExecutionFailure: CmsRunFailure
Message: Error running cmsRun
{'arguments': ['/bin/bash', '/home/boinc/CMSRun/glide_yaapii/execute/dir_7495/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', '']}
Return code: 65

ModuleName : WMCore.WMSpec.Steps.WMExecutionFailure
MethodName : __init__
ClassInstance : None
FileName : /home/boinc/CMSRun/glide_yaapii/execute/dir_7495/WMCore.zip/WMCore/WMSpec/Steps/WMExecutionFailure.py
ClassName : None
LineNumber : 18
ErrorNr : 65

Traceback:



ERROR: Failed to record execution site name in the FJR from the site-local-config.xml
Traceback (most recent call last):
File "CMSRunAnalysis.py", line 386, in handleException
slc = SiteLocalConfig.loadSiteLocalConfig()
File "/home/boinc/CMSRun/glide_yaapii/execute/dir_7495/WMCore.zip/WMCore/Storage/SiteLocalConfig.py", line 51, in loadSiteLocalConfig
raise SiteConfigError(msg)
SiteConfigError: Unable to find site local config file:
/cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml

Dashboard end parameters: {'MonitorID': '160226_150549:ireid_crab_CMS_at_Home_MinBias_250evE', 'MonitorJobID': '2936_https://glidein.cern.ch/2936/160226:150549:ireid:crab:CMS:at:Home:MinBias:250evE_0', 'NEventsProcessed': 0, 'JobExitCode': 8001, 'ExeExitCode': 8001}
Not sending data to popularity service because no input sources found.
Dashboard popularity report: {'Basename': '', 'inputFiles': '', 'BasenameParent': '', 'inputBlocks': 'MCFakeBlock', 'parentFiles': ''}
==== Failure sleep STARTING at Wed Mar 2 18:29:24 2016 ====
Sleeping for 1097 seconds due to failure.
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 87
Message 2164 - Posted: 2 Mar 2016, 19:05:49 UTC - in response to Message 2163.  

Investigating ...
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 87
Message 2169 - Posted: 2 Mar 2016, 19:53:14 UTC - in response to Message 2164.  

We removed a temporary fix to try to speed up the boot time but it looks like it may still be needed. Have reverted back and we should see the result in a few hours.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2171 - Posted: 2 Mar 2016, 20:03:48 UTC

I suggest creating a "Change-log" thread.

It should list each change made, with the date and time it was applied.
That way, we could watch for changes in behavior and not be surprised by them.
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 87
Message 2173 - Posted: 2 Mar 2016, 20:16:09 UTC - in response to Message 2171.  

Done.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2174 - Posted: 2 Mar 2016, 20:23:40 UTC - in response to Message 2173.  

Thanks, Laurence.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,116
Message 2175 - Posted: 2 Mar 2016, 21:55:10 UTC - in response to Message 2169.  

Laurence wrote: "We removed a temporary fix to try to speed up the boot time but it looks like it may still be needed. Have reverted back and we should see the result in a few hours."

A few hours are over, rebooted the VM and it's working again.

Thanks
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,945,813
RAC: 2,949
Message 2177 - Posted: 2 Mar 2016, 22:24:20 UTC - in response to Message 2169.  

Laurence wrote: "We removed a temporary fix to try to speed up the boot time but it looks like it may still be needed. Have reverted back and we should see the result in a few hours."

Sorry, mea culpa, I suggested it. :-{(
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,945,813
RAC: 2,949
Message 2178 - Posted: 2 Mar 2016, 22:33:43 UTC - in response to Message 2175.  
Last modified: 2 Mar 2016, 22:35:33 UTC

Laurence wrote: "We removed a temporary fix to try to speed up the boot time but it looks like it may still be needed. Have reverted back and we should see the result in a few hours."

Crystal Pellet wrote: "A few hours are over, rebooted the VM and it's working again. Thanks"

I'm rather afraid that this occurs at the job level rather than the task level, so jobs will keep aborting until the VM is rebooted.
Strike (:-) that, I didn't think it through far enough.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,945,813
RAC: 2,949
Message 2179 - Posted: 3 Mar 2016, 0:54:56 UTC - in response to Message 2178.  

It's not getting better. Was I right the first time? If it's not picking siteinfo up properly from cvmfs, then it is a per-task problem, and the VM needs rebooting to pick up what we originally kludged into cvmfs to get around the problem. If it is picking it up from cvmfs, and the change should have percolated through by now, why is the failure rate still so high?
My Brian hurts!
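
One way to tell the two cases apart from inside a running VM would be to check whether the file named in the exception is visible and parses as XML. The following is only a rough sketch, not something the project ships: the path is copied from the error message in the first post, while the function name `check_site_local_config` and its return format are made up for illustration.

```python
import os
import xml.etree.ElementTree as ET

# Path taken verbatim from the cmsRun exception message in this thread.
SITE_LOCAL_CONFIG = (
    "/cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml"
)


def check_site_local_config(path=SITE_LOCAL_CONFIG):
    """Hypothetical helper: return (ok, reason) for the given config path."""
    # Case 1: the file is not visible at all (e.g. stale cvmfs view in the VM).
    if not os.path.isfile(path):
        return False, "missing: " + path
    # Case 2: the file exists but is not well-formed XML.
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError as exc:
        return False, "not well-formed XML: %s" % exc
    # Case 3: the file is present and parses; report its root element.
    return True, "found, root element <%s>" % root.tag


if __name__ == "__main__":
    ok, reason = check_site_local_config()
    print("OK" if ok else "FAIL", reason)
```

If this reports the file as missing before a reboot and found afterwards, that would point towards the per-VM explanation. It only checks that the file is there and well-formed, not that cmsRun would accept its contents.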