Message boards : CMS Application : Increased failure rate "error 134 - Aborted"

m
Volunteer tester

Joined: 20 Mar 15
Posts: 242
Credit: 856,216
RAC: 0
Message 4099 - Posted: 28 Aug 2016, 12:52:16 UTC

My failure rate "error 134" has increased by a factor of maybe 10 for the
task ....0046_Q compared to task .....0046_P. At least, that's what it looks
like from my reading of Dashboard and making no allowance for the small
numbers... or anything I've done.

None of the jobs were sent for a re-try, so I don't know whether they would have
failed on another host, nor whether a particular host is the culprit.

Any ideas?
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1093
Credit: 6,893,316
RAC: 0
Message 4101 - Posted: 29 Aug 2016, 10:45:53 UTC - in response to Message 4099.  
Last modified: 29 Aug 2016, 11:02:04 UTC

Can you point me to a task log or a job number? You've got a lot of hosts to trawl through... :-)

Dashboard failure rate stays steady at just around 2%.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1093
Credit: 6,893,316
RAC: 0
Message 4102 - Posted: 29 Aug 2016, 11:01:35 UTC - in response to Message 4101.  

Error 134 seems to be this:
== CMSSW: EvtGen:Could not decay:pi0 with mass:0 will throw event away!
== CMSSW: EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0
== CMSSW: EvtGen:Will take first kinematically allowed decay in the decay table
== CMSSW: EvtGen:Could not decay:pi0 with mass:0 will throw event away!
== CMSSW: EvtGen:Your event has been rejected 10000 times!
== CMSSW: EvtGen:Will now abort.
== CMSSW: Complete
== CMSSW: process id is 6551 status is 134

which seems to be one of the problems when using pseudo-random-number generators (or more likely, in my experience, misusing them...).
Since the overall error rate has not increased, I'd surmise that you are just unlucky to randomly (sorry!) hit a higher proportion of them than usual. The only thing that's changed between batches is the increase in the sequence letter.
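
As background on the number itself: on Linux, a status above 128 conventionally means the process was ended by a signal, and 134 = 128 + 6, where 6 is SIGABRT. That matches the "Aborted" in the thread title and the "Will now abort" line in the log. A minimal Python sketch of that decoding follows; the function name is made up for illustration, and it assumes the status CMSSW reports follows the usual shell convention.

```python
import signal


def describe_exit_status(status):
    """Interpret a shell-style exit status (0-255).

    Assumption: values above 128 encode 128 + signal number,
    as POSIX shells report them.
    """
    if status > 128:
        signum = status - 128
        try:
            return "killed by " + signal.Signals(signum).name
        except ValueError:
            return "killed by unknown signal %d" % signum
    return "exited with code %d" % status


print(describe_exit_status(134))
```

So the 134 says only that the program called abort(); the EvtGen lines above it are what explain why.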
m
Volunteer tester

Joined: 20 Mar 15
Posts: 242
Credit: 856,216
RAC: 0
Message 4103 - Posted: 29 Aug 2016, 11:34:53 UTC - in response to Message 4101.  
Last modified: 29 Aug 2016, 11:35:16 UTC

"Can you point me to a task log or a job number?"


920, 1394, 1361, 1325...

Did about 17 jobs last night, one (7215) failed 10034, so it seems OK now. I'm searching these by IP, don't know how else to do it, so they might have been finished elsewhere. (Still troubled by the 12/18 hour timeout...)

Thanks for taking a look, Ivan.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1093
Credit: 6,893,316
RAC: 0
Message 4104 - Posted: 29 Aug 2016, 17:33:50 UTC - in response to Message 4103.  
Last modified: 29 Aug 2016, 17:35:09 UTC

OK, 920:
== CMSSW: %MSG-e FatalSystemSignal: AfterModConstruction 27-Aug-2016 02:26:20 BST pre-events
== CMSSW: A fatal system signal has occurred: segmentation violation
== CMSSW: %MSG

Difficult to see the problem here, there's just a stack traceback. Most probably this:
== CMSSW: #10 0x00002b72728df4aa in _dl_open (file=0x2b7275983178 "/cvmfs/cms.cern.ch/slc6_amd64_gcc481/cms/cmssw/CMSSW_7_1_22/biglib/slc6_amd64_gcc481/pluginSimulation.so", mode=-2147483391, caller_dlopen=0x2b7272ab2bc3 , nsid=-2, argc=4, argv=, env=0x7ffc533ce220) at dl-open.c:569

1394:
==== CMSRunAnalysis.py FINISHING at Sat Aug 27 10:53:32 2016 ====
Local time : Sat Aug 27 12:53:32 2016
+ jobrc=0
+ set +x
== The job had an exit code of 0

So far so good...
======== gWMS-CMSRunAnalysis.sh FINISHING at Sat Aug 27 10:54:20 GMT 2016 on 17748-81984-30398 with (short) status 0 ========
Local time: Sat Aug 27 12:54:20 CEST 2016

That looks good to me, and there's a result file on the DataBridge. Dashboard glitch?

1361:
======== gWMS-CMSRunAnalysis.sh FINISHING at Sat Aug 27 07:13:57 GMT 2016 on 48-87675-1065 with (short) status 0 ========
Local time: Sat Aug 27 09:13:57 CEST 2016

No result file this time, though gfal-copy returned exit status 0.

1325 -- Again:
== CMSSW: #10 0x00002b39adb5f4aa in _dl_open (file=0x2b39b0c03178 "/cvmfs/cms.cern.ch/slc6_amd64_gcc481/cms/cmssw/CMSSW_7_1_22/biglib/slc6_amd64_gcc481/pluginSimulation.so", mode=-2147483391, caller_dlopen=0x2b39add32bc3 , nsid=-2, argc=4, argv=, env=0x7ffc264284f0) at dl-open.c:569

I'd suspect you were having transient network problems around the times these happened.
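
For anyone wanting to sort their own task logs the same way, here is a rough Python sketch that looks for the two failure signatures quoted in this thread and pulls out the reported status. The label names and the function are invented for illustration, and the patterns are copied from the excerpts above, so they may not match other CMSSW versions.

```python
import re

# Signatures copied from the log excerpts in this thread; labels are made up.
SIGNATURES = [
    ("evtgen_abort", re.compile(r"EvtGen:Will now abort")),
    ("segfault", re.compile(
        r"A fatal system signal has occurred: segmentation violation")),
]
STATUS_RE = re.compile(r"process id is \d+ status is (\d+)")


def classify_log(text):
    """Return (labels, status) for one task log passed in as a string."""
    labels = [name for name, pattern in SIGNATURES if pattern.search(text)]
    match = STATUS_RE.search(text)
    status = int(match.group(1)) if match else None
    return labels, status


sample = """== CMSSW: EvtGen:Your event has been rejected 10000 times!
== CMSSW: EvtGen:Will now abort.
== CMSSW: Complete
== CMSSW: process id is 6551 status is 134
"""
print(classify_log(sample))
```

A log with neither signature and a final status of 0, like 1394 and 1361 above, would come back with an empty label list, which points at the upload or the Dashboard rather than CMSSW itself.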
m
Volunteer tester

Joined: 20 Mar 15
Posts: 242
Credit: 856,216
RAC: 0
Message 4105 - Posted: 29 Aug 2016, 20:25:43 UTC - in response to Message 4104.  
Last modified: 29 Aug 2016, 20:35:23 UTC

"I'd suspect you were having transient network problems around the times these happened."

Well, it has been known... I'll keep an eye out. Thanks again.

Just checked, ADSL profile gone down from 7.96M.. 7.08M... 6.45M over the last few days. Maybe you're right.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1093
Credit: 6,893,316
RAC: 0
Message 4107 - Posted: 29 Aug 2016, 21:10:33 UTC - in response to Message 4105.  

Remember the A in ADSL stands for asymmetric. Your upload is probably around 1 Mbps, which works out to roughly 300-350 MB/hr. Our current result files average about 16 MB, so you might start having problems at around 20 jobs/hr, or fewer if you're also running other apps.
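
The arithmetic can be sketched in a few lines of Python. A raw 1 Mbps link is 450 MB/hr, so the 300-350 MB/hr figure implies some allowance for overhead; the 0.7 efficiency factor below is purely an assumption chosen to land in that range, and the function names are illustrative.

```python
def upload_capacity_mb_per_hour(upload_mbps, efficiency=0.7):
    """Usable upload volume in MB per hour.

    efficiency is an assumed fudge factor for protocol overhead and
    line conditions; it is not a measured value.
    """
    # megabits/s -> megabytes/s (divide by 8) -> per hour (times 3600)
    return upload_mbps / 8.0 * 3600.0 * efficiency


def max_jobs_per_hour(upload_mbps, result_mb=16.0, efficiency=0.7):
    """How many result files of result_mb MB fit into one hour of upload."""
    return upload_capacity_mb_per_hour(upload_mbps, efficiency) / result_mb


print(round(upload_capacity_mb_per_hour(1.0)))  # MB per hour at 1 Mbps
print(round(max_jobs_per_hour(1.0), 1))         # result files per hour
```

With these assumptions a 1 Mbps uplink comes out at about 315 MB/hr, or just under 20 result files an hour, and halving the available upload halves the job rate.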


©2020 CERN