Thread 'Increased failure rate "error 134"'

Author	Message
m Volunteer tester Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 4099 - Posted: 28 Aug 2016, 12:52:16 UTC My failure rate for "error 134" has increased by a factor of maybe 10 for the task ....0046_Q compared with task .....0046_P. At least, that's what it looks like from my reading of Dashboard, making no allowance for the small numbers... or anything I've done. None of the jobs were sent for a re-try, so I don't know if they would have failed on another host, nor if a particular host is the culprit. Any ideas?

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1155 Credit: 8,370,445 RAC: 786	Message 4101 - Posted: 29 Aug 2016, 10:45:53 UTC - in response to Message 4099. Last modified: 29 Aug 2016, 11:02:04 UTC Can you point me to a task log or a job number? You've got a lot of hosts to trawl through... :-) Dashboard failure rate stays steady at just around 2%.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1155 Credit: 8,370,445 RAC: 786	Message 4102 - Posted: 29 Aug 2016, 11:01:35 UTC - in response to Message 4101.
Error 134 seems to be this:
== CMSSW: EvtGen:Could not decay:pi0 with mass:0 will throw event away!
== CMSSW: EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0
== CMSSW: EvtGen:Will take first kinematically allowed decay in the decay table
== CMSSW: EvtGen:Could not decay:pi0 with mass:0 will throw event away!
== CMSSW: EvtGen:Your event has been rejected 10000 times!
== CMSSW: EvtGen:Will now abort.
== CMSSW: Complete
== CMSSW: process id is 6551 status is 134
which seems to be one of the problems when using pseudo-random-number generators (or more likely, in my experience, misusing them...). Since the overall error rate has not increased, I'd surmise that you are just unlucky to randomly (sorry!) hit a higher proportion of them than usual. The only thing that's changed between batches is the increase in the sequence letter.
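A note on the number itself: when a process is killed by a signal, a POSIX shell conventionally reports 128 plus the signal number, so a status of 134 would correspond to signal 6 (SIGABRT on Linux), which fits the "Will now abort" line in the log. The sketch below decodes a status that way. It assumes the job wrapper reports status as a shell would, and the signal table and function name are illustrative only, not part of CMSSW or BOINC.

```python
# Conventional Linux signal numbers (assumption: Linux host; other platforms differ).
LINUX_SIGNALS = {6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (hypothetical helper)."""
    if status == 0:
        return "success"
    if status > 128:
        # Shell convention: 128 + N means "terminated by signal N".
        signum = status - 128
        name = LINUX_SIGNALS.get(signum, f"signal {signum}")
        return f"killed by {name}"
    return f"exited with code {status}"


print(describe_exit_status(134))
```

Under those assumptions, 134 decodes to an abort rather than a crash such as a segmentation fault, which would show up as 139.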

m Volunteer tester Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 4103 - Posted: 29 Aug 2016, 11:34:53 UTC - in response to Message 4101. Last modified: 29 Aug 2016, 11:35:16 UTC ivan wrote: "Can you point me to a task log or a job number?" 920, 1394, 1361, 1325... Did about 17 jobs last night, one (7215) failed with 10034, so it seems OK now. I'm searching these by IP, don't know how else to do it, so they might have been finished elsewhere. (Still troubled by the 12/18 hour timeout...) Thanks for taking a look, Ivan.
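To put the small numbers in perspective, here is a rough sketch of how likely at least one failure is in a batch of 17 jobs if each job fails independently at the roughly 2% overall rate quoted from Dashboard. Independence and a constant per-job rate are assumptions, and the function name is made up for illustration.

```python
def prob_at_least_one_failure(n_jobs: int, failure_rate: float) -> float:
    """P(at least one failure) for independent jobs with a fixed failure rate."""
    # The complement of "every job succeeds".
    return 1.0 - (1.0 - failure_rate) ** n_jobs


p = prob_at_least_one_failure(17, 0.02)
print(f"{p:.2f}")  # prints 0.29
```

So under these assumptions, roughly three nights in ten would show at least one failed job out of 17, which is why a single failure says little about a change in the underlying rate.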

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1155 Credit: 8,370,445 RAC: 786	Message 4104 - Posted: 29 Aug 2016, 17:33:50 UTC - in response to Message 4103. Last modified: 29 Aug 2016, 17:35:09 UTC
m wrote: "920, 1394, 1361, 1325... Did about 17 jobs last night, one (7215) failed 10034, so it seems OK now. I'm searching these by IP, don't know how else to do it, so they might have been finished elsewhere. (Still troubled by the 12/18 hour timeout...) Thanks for taking a look, Ivan."
OK, 920:
== CMSSW: %MSG-e FatalSystemSignal: AfterModConstruction 27-Aug-2016 02:26:20 BST pre-events
== CMSSW: A fatal system signal has occurred: segmentation violation
== CMSSW: %MSG
Difficult to see the problem here, there's just a stack traceback. Most probably this:
== CMSSW: #10 0x00002b72728df4aa in _dl_open (file=0x2b7275983178 "/cvmfs/cms.cern.ch/slc6_amd64_gcc481/cms/cmssw/CMSSW_7_1_22/biglib/slc6_amd64_gcc481/pluginSimulation.so", mode=-2147483391, caller_dlopen=0x2b7272ab2bc3 , nsid=-2, argc=4, argv=, env=0x7ffc533ce220) at dl-open.c:569
1394:
==== CMSRunAnalysis.py FINISHING at Sat Aug 27 10:53:32 2016 ====
Local time : Sat Aug 27 12:53:32 2016
+ jobrc=0
+ set +x
== The job had an exit code of 0
So far so good...
======== gWMS-CMSRunAnalysis.sh FINISHING at Sat Aug 27 10:54:20 GMT 2016 on 17748-81984-30398 with (short) status 0 ========
Local time: Sat Aug 27 12:54:20 CEST 2016
That looks good to me, and there's a result file on the DataBridge. Dashboard glitch?
1361:
======== gWMS-CMSRunAnalysis.sh FINISHING at Sat Aug 27 07:13:57 GMT 2016 on 48-87675-1065 with (short) status 0 ========
Local time: Sat Aug 27 09:13:57 CEST 2016
No file this time though, but gfal-copy returned exit status 0.
1325 -- Again:
== CMSSW: #10 0x00002b39adb5f4aa in _dl_open (file=0x2b39b0c03178 "/cvmfs/cms.cern.ch/slc6_amd64_gcc481/cms/cmssw/CMSSW_7_1_22/biglib/slc6_amd64_gcc481/pluginSimulation.so", mode=-2147483391, caller_dlopen=0x2b39add32bc3 , nsid=-2, argc=4, argv=, env=0x7ffc264284f0) at dl-open.c:569
I'd suspect you were having transient network problems around the times these happened.
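For anyone wanting to triage their own job logs the same way, the following is a minimal sketch that scans log text for the failure signatures and status lines quoted in this thread. The patterns are copied from the excerpts above; the labels, function name, and the idea that these patterns cover every case are assumptions, not anything provided by CMSSW or the project.

```python
import re

# Patterns taken from the log excerpts in this thread; labels are illustrative.
SIGNATURES = [
    ("segfault", re.compile(r"FatalSystemSignal|segmentation violation")),
    ("evtgen_abort", re.compile(r"EvtGen:Will now abort")),
]

# Matches "status is 134", "exit code of 0" and "(short) status 0".
STATUS_RE = re.compile(r"(?:status is|exit code of|\(short\) status)\s+(\d+)")


def summarize_log(text: str) -> dict:
    """Return the known failure signatures and numeric statuses found in a log."""
    labels = [label for label, pattern in SIGNATURES if pattern.search(text)]
    statuses = [int(value) for value in STATUS_RE.findall(text)]
    return {"signatures": labels, "statuses": statuses}


sample = (
    "== CMSSW: EvtGen:Will now abort.\n"
    "== CMSSW: process id is 6551 status is 134\n"
)
print(summarize_log(sample))
```

A log with no matching signature and only zero statuses, like job 1394 above, would point towards a reporting problem rather than a job failure.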

m Volunteer tester Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 4105 - Posted: 29 Aug 2016, 20:25:43 UTC - in response to Message 4104. Last modified: 29 Aug 2016, 20:35:23 UTC ivan wrote: "I'd suspect you were having transient network problems around the times these happened." Well, it has been known... I'll keep an eye out. Thanks again. Just checked, ADSL profile gone down from 7.96M... 7.08M... 6.45M over the last few days. Maybe you're right.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1155 Credit: 8,370,445 RAC: 786	Message 4107 - Posted: 29 Aug 2016, 21:10:33 UTC - in response to Message 4105. m wrote: "Well, it has been known... I'll keep an eye out. Thanks again. Just checked, ADSL profile gone down from 7.96M.. 7.08M... 6.45M over the last few days. Maybe you're right." Remember the A in ADSL stands for asymmetric. Your upload is probably around 1 Mbps, so probably 300-350 MB/hr. Our current result files average about 16 MB, so you might start having problems around 20 jobs/hr -- fewer if you're also running other apps.
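The arithmetic behind that estimate can be sketched as follows. The 1 Mbps uplink and 16 MB result size come from the post above; the 0.75 efficiency factor is an assumption picked to land inside the quoted 300-350 MB/hr range (protocol overhead and line conditions vary), and megabytes are taken as decimal.

```python
def upload_mb_per_hour(uplink_mbps: float, efficiency: float = 1.0) -> float:
    """Convert an uplink rate in megabits/s to usable megabytes per hour."""
    # megabits/s -> megabytes/s (divide by 8) -> per hour (times 3600)
    return uplink_mbps / 8.0 * 3600.0 * efficiency


def max_jobs_per_hour(uplink_mbps: float, result_mb: float,
                      efficiency: float = 1.0) -> float:
    """Upper bound on result uploads per hour for a given result file size."""
    return upload_mb_per_hour(uplink_mbps, efficiency) / result_mb


print(upload_mb_per_hour(1.0))             # raw line rate: 450.0 MB/hr
print(upload_mb_per_hour(1.0, 0.75))       # assumed 75% efficiency: 337.5 MB/hr
print(max_jobs_per_hour(1.0, 16.0, 0.75))  # about 21 jobs/hr
```

Any other traffic sharing the uplink lowers the effective efficiency and so the number of results that can be returned per hour.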

Development for LHC@home