Increased failure rate "error 134 - Aborted"
Author | Message |
---|---|
Joined: 20 Mar 15 Posts: 242 Credit: 860,312 RAC: 14 |
My failure rate for "error 134" has increased by a factor of maybe 10 for task ....0046_Q compared to task .....0046_P. At least, that's what it looks like from my reading of Dashboard, making no allowance for the small numbers... or anything I've done. None of the jobs were sent for a retry, so I don't know whether they would have failed on another host, or whether a particular host is the culprit. Any ideas? |
Joined: 20 Jan 15 Posts: 1126 Credit: 7,849,048 RAC: 9 |
Can you point me to a task log or a job number? You've got a lot of hosts to trawl through... :-) Dashboard failure rate stays steady at just around 2%. |
Joined: 20 Jan 15 Posts: 1126 Credit: 7,849,048 RAC: 9 |
Error 134 seems to be this:
== CMSSW: EvtGen:Could not decay:pi0 with mass:0 will throw event away!
== CMSSW: EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0
== CMSSW: EvtGen:Will take first kinematically allowed decay in the decay table
== CMSSW: EvtGen:Could not decay:pi0 with mass:0 will throw event away!
== CMSSW: EvtGen:Your event has been rejected 10000 times!
== CMSSW: EvtGen:Will now abort.
== CMSSW: Complete
== CMSSW: process id is 6551 status is 134
which seems to be one of the problems when using pseudo-random-number generators (or more likely, in my experience, misusing them...). Since the overall error rate has not increased, I'd surmise that you are just unlucky to randomly (sorry!) hit a higher proportion of them than usual. The only thing that's changed between batches is the increase in the sequence letter. |
Joined: 20 Mar 15 Posts: 242 Credit: 860,312 RAC: 14 |
"Can you point me to a task log or a job number?" 920, 1394, 1361, 1325... Did about 17 jobs last night; one (7215) failed with 10034, so it seems OK now. I'm searching for these by IP, as I don't know how else to do it, so they might have been finished elsewhere. (Still troubled by the 12/18 hour timeout...) Thanks for taking a look, Ivan. |
Joined: 20 Jan 15 Posts: 1126 Credit: 7,849,048 RAC: 9 |
"Can you point me to a task log or a job number?"
OK, 920:
== CMSSW: %MSG-e FatalSystemSignal: AfterModConstruction 27-Aug-2016 02:26:20 BST pre-events
== CMSSW: A fatal system signal has occurred: segmentation violation
== CMSSW: %MSG
Difficult to see the problem here, there's just a stack traceback. Most probably this:
== CMSSW: #10 0x00002b72728df4aa in _dl_open (file=0x2b7275983178 "/cvmfs/cms.cern.ch/slc6_amd64_gcc481/cms/cmssw/CMSSW_7_1_22/biglib/slc6_amd64_gcc481/pluginSimulation.so", mode=-2147483391, caller_dlopen=0x2b7272ab2bc3
1394:
==== CMSRunAnalysis.py FINISHING at Sat Aug 27 10:53:32 2016 ====
Local time : Sat Aug 27 12:53:32 2016
+ jobrc=0
+ set +x
== The job had an exit code of 0
So far so good...
======== gWMS-CMSRunAnalysis.sh FINISHING at Sat Aug 27 10:54:20 GMT 2016 on 17748-81984-30398 with (short) status 0 ========
Local time: Sat Aug 27 12:54:20 CEST 2016
That looks good to me, and there's a result file on the DataBridge. Dashboard glitch?
1361:
======== gWMS-CMSRunAnalysis.sh FINISHING at Sat Aug 27 07:13:57 GMT 2016 on 48-87675-1065 with (short) status 0 ========
Local time: Sat Aug 27 09:13:57 CEST 2016
No file this time though, but gfal-copy returned exit status 0.
1325 -- Again:
== CMSSW: #10 0x00002b39adb5f4aa in _dl_open (file=0x2b39b0c03178 "/cvmfs/cms.cern.ch/slc6_amd64_gcc481/cms/cmssw/CMSSW_7_1_22/biglib/slc6_amd64_gcc481/pluginSimulation.so", mode=-2147483391, caller_dlopen=0x2b39add32bc3
I'd suspect you were having transient network problems around the times these happened. |
Joined: 20 Mar 15 Posts: 242 Credit: 860,312 RAC: 14 |
"I'd suspect you were having transient network problems around the times these happened." Well, it has been known... I'll keep an eye out. Thanks again. Just checked: the ADSL profile has gone down from 7.96M to 7.08M to 6.45M over the last few days. Maybe you're right. |
Joined: 20 Jan 15 Posts: 1126 Credit: 7,849,048 RAC: 9 |
"I'd suspect you were having transient network problems around the times these happened." Remember the A in ADSL stands for asymmetric. Your upload is probably around 1 Mbps, so probably 300-350 MB/hr. Our current result files average about 16 MB, so you might start having problems at around 20 jobs/hr, or fewer if you're also running other apps. |
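
For anyone wanting to redo the upload arithmetic from the last post with their own line speed, here is a rough sketch in Python. The 1 Mbps upload and 16 MB result size come from the post; the `efficiency` factor of 0.7 is an assumption chosen only so that the usable figure lands in the quoted 300-350 MB/hr range, and the function names are made up for illustration.

```python
def raw_mb_per_hour(upload_mbps):
    """Theoretical upload volume in MB/hr for a link speed in Mbps (decimal units)."""
    # Mbit/s -> MB/s is a division by 8; multiply by 3600 s for one hour.
    return upload_mbps * 3600 / 8


def max_jobs_per_hour(upload_mbps, result_mb, efficiency=0.7):
    """Return (usable MB/hr, result uploads per hour the link can sustain).

    efficiency is an assumed allowance for protocol overhead and line
    conditions; it is not a measured value.
    """
    usable = raw_mb_per_hour(upload_mbps) * efficiency
    return usable, usable / result_mb


if __name__ == "__main__":
    usable, jobs = max_jobs_per_hour(upload_mbps=1.0, result_mb=16.0)
    print(f"usable upload: {usable:.0f} MB/hr, sustainable results: {jobs:.1f}/hr")
```

A 1 Mbps link is 450 MB/hr in theory; with the assumed 70% efficiency that is roughly 315 MB/hr, or just under 20 results of 16 MB per hour, in line with the estimate above. Any other traffic on the upload path reduces that further.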
©2023 CERN