Message boards : CMS Application : Error rate going up

Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 2947 - Posted: 22 Apr 2016, 16:01:19 UTC

We may run short of jobs, either here or on beta, for a while (probably on beta). We've changed some job requirements, unfortunately mid-batch, so I'm taking a crash course on HTCondor commands to try to get jobs matched with available slots.
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 875,205
RAC: 450
Message 3046 - Posted: 27 Apr 2016, 10:50:14 UTC

It knows Ivan is busy...
Not only have the errors increased, but although jobs from hosts here aren't showing up as failures, they are either being terminated very soon after starting...

2016-04-27 05:22:31 (3152): Guest Log: [INFO] Requesting an X509 credential from vLHC@home-dev
2016-04-27 05:22:41 (3152): Guest Log: [INFO] CMS application starting. Check log files.
2016-04-27 05:22:51 (3152): Guest Log: [INFO] Condor exited with 0
2016-04-27 05:22:51 (3152): Guest Log: [INFO] Shutting Down.
2016-04-27 05:22:51 (3152): VM Completion File Detected.
2016-04-27 05:22:51 (3152): VM Completion Message: Condor exited with 0


or are shown on dashboard as "cooloff" or "aborted".

Not sure if it's related, but I'm no longer seeing any F4 or F5 output (apart from the headers).

I've just changed some Linux hosts to BOINC 7.4.25 in order to try the project_max_concurrent limit (new in 7.4.9), but the problem affects different BOINC versions on both Linux and Windows.
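To put a number on "terminated very soon after starting", the timestamps in a log excerpt like the one above can be compared. This is a minimal Python sketch that assumes the line layout shown in the excerpt. The function name and the two marker strings are chosen for illustration and are not part of BOINC or the CMS wrapper.

```python
from datetime import datetime

# Marker strings taken from the log excerpt; treating them as stable is an assumption.
START_MARKER = "CMS application starting"
EXIT_MARKER = "Condor exited with"


def _timestamp(line):
    # The first 19 characters hold the timestamp, e.g. "2016-04-27 05:22:41".
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def seconds_between_start_and_exit(log_text):
    """Return seconds from application start to Condor exit, or None if either line is missing."""
    started = exited = None
    for line in log_text.splitlines():
        if START_MARKER in line and started is None:
            started = _timestamp(line)
        elif EXIT_MARKER in line and started is not None and exited is None:
            exited = _timestamp(line)
    if started is None or exited is None:
        return None
    return (exited - started).total_seconds()


excerpt = """\
2016-04-27 05:22:41 (3152): Guest Log: [INFO] CMS application starting. Check log files.
2016-04-27 05:22:51 (3152): Guest Log: [INFO] Condor exited with 0
2016-04-27 05:22:51 (3152): VM Completion Message: Condor exited with 0
"""
print(seconds_between_start_and_exit(excerpt))  # 10.0
```

A job that really ran its payload would show a much larger gap than the 10 seconds seen here.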
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 875,205
RAC: 450
Message 3048 - Posted: 27 Apr 2016, 12:35:11 UTC

Meant to add this to the original message... sorry 'bout that.
Go here and move your mouse over the chart to see the numbers for the last 24 hrs.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 3050 - Posted: 27 Apr 2016, 14:05:32 UTC - in response to Message 3046.  

(Some of) these seem to be the result of miscommunications. Notice the exit code of most of them is 0. The one I looked at in detail finished last Thursday and wrote its output to the data bridge, but the post-processing gave it as a failure and flagged it for retry. On the retry it hit a machine that's got a screwed up VM and terminated immediately. On the third try it also exited with code 0 (but didn't overwrite the first data file), but post-processing also gave it a fail. I'm tempted to terminate all the 160419_185901:ireid_crab_CMS_at_Home_MinBias_250ev10Ke batch, but I'll let them live for now.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3053 - Posted: 27 Apr 2016, 15:02:13 UTC

I just noticed that the results are not uploading. It goes straight to getting a new job.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 3056 - Posted: 27 Apr 2016, 15:39:57 UTC - in response to Message 3053.  

I just noticed that the results are not uploading. It goes straight to getting a new job.

Ah, that's because of this retry issue:
gfal-copy error: 17 (File exists) - Destination davs://data-bridge-test.cern.ch/myfed/cms-boinc/output/user/ireid/CMS_at_Home/CRAB3_MinBias/160419_185901/0008/step1_8279.root exists and overwrite is not set
gfal-copy exit status: 0

So gfal still exits with 0 and the job gets status 0 too. So then post-processing gets:
Wed, 27 Apr 2016 15:55:40 BST(+0000):INFO:RetryJob Job and stageout wrappers finished successfully (exit code 0).
Wed, 27 Apr 2016 15:55:40 BST(+0000):ERROR:RetryJob Payload job was successful, but wrapper exited with non-zero status 1 (stageout failure)?

I think I will kill this batch; it's going to take too long to drain out otherwise. Sorry for the inconvenience. Look for a new small batch later tonight (I guess) to see if I can get -dev on the same footing as -beta.
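Because gfal-copy reports the error in its output but still exits with 0, a wrapper that trusts only the exit status cannot see the failure. This is a minimal Python sketch of a stricter check, assuming the error text looks like the line quoted above. `stageout_failed` is a hypothetical helper, not part of gfal or the CRAB wrappers, and the destination URL in the sample is a placeholder.

```python
import re

# Matches lines such as "gfal-copy error: 17 (File exists) - Destination ..."
ERROR_LINE = re.compile(r"^gfal-copy error: (\d+) \(([^)]*)\)", re.MULTILINE)


def stageout_failed(exit_status, output):
    """Return (failed, reason), treating a reported error as a failure even when the exit status is 0."""
    match = ERROR_LINE.search(output)
    if match:
        return True, f"error {match.group(1)}: {match.group(2)}"
    if exit_status != 0:
        return True, f"exit status {exit_status}"
    return False, ""


# Placeholder destination; only the shape of the message follows the log above.
output = (
    "gfal-copy error: 17 (File exists) - Destination davs://example.invalid/step1.root "
    "exists and overwrite is not set\n"
    "gfal-copy exit status: 0\n"
)
print(stageout_failed(0, output))  # (True, 'error 17: File exists')
```

Whether "File exists" on a retry should count as a failure or as an already completed stage-out is a policy decision that this sketch does not make.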
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 3069 - Posted: 28 Apr 2016, 14:42:04 UTC

The spurious errors are continuing: jobs finish and stage out OK, but post-processing gives them an error of 1, so they get requeued. The requeue then runs OK but reports that it can't write the file (still finishing with status 0), and post-processing gives an error of 2.
CERN IT is investigating.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 3092 - Posted: 28 Apr 2016, 23:12:13 UTC - in response to Message 3069.  

CERN IT is investigating.

Hopefully a fix is in; we'll see in a few hours.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 3094 - Posted: 29 Apr 2016, 3:48:07 UTC - in response to Message 3092.  

Yep, looks like jobs are getting the right exit code now.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3128 - Posted: 30 Apr 2016, 18:02:40 UTC
Last modified: 30 Apr 2016, 18:13:48 UTC

Since about 17:00 UTC the error rate has been going through the roof.

All "unknown".

EDIT: It looks like only jobs being run for the first time pass; everything else fails.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3131 - Posted: 30 Apr 2016, 19:53:42 UTC

Even jobs that ran at the first attempt are failing.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 3143 - Posted: 1 May 2016, 10:18:34 UTC - in response to Message 3128.  

Tja, I forgot to renew the proxy yesterday, almost certainly a mea culpa!
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3145 - Posted: 1 May 2016, 10:25:46 UTC
Last modified: 1 May 2016, 10:27:28 UTC

What about the unfinished small batches?
59 and 91 jobs left (out of 100).
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 3146 - Posted: 1 May 2016, 10:29:26 UTC - in response to Message 3145.  

What about the unfinished small batches?
59 and 91 jobs left (out of 100).

I've killed them; they were just small test cases while I worked out how to cope with our rationalised naming scheme. No data lost: everything we are doing is Monte Carlo generation, so one random number is as good as the next.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3147 - Posted: 1 May 2016, 10:31:52 UTC

Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 3290 - Posted: 7 May 2016, 12:05:04 UTC - in response to Message 3143.  

Tja, I forgot to renew the proxy yesterday, almost certainly a mea culpa!

To forestall a repeat occurrence, I just gave the current batch a new 8-day proxy cert. :-)
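A reminder could also be automated by checking how much lifetime the proxy has left. This is a minimal Python sketch, assuming the remaining lifetime in seconds is obtained elsewhere (for example from a tool such as voms-proxy-info, which is not called here). The function name and the one-day threshold are illustrative assumptions, not the project's actual procedure.

```python
from datetime import timedelta


def needs_renewal(seconds_left, threshold=timedelta(days=1)):
    """Return True when the proxy expires within the threshold or has already expired."""
    return timedelta(seconds=seconds_left) <= threshold


# An 8-day proxy checked right after creation, and again with 6 hours left.
print(needs_renewal(8 * 24 * 3600))  # False
print(needs_renewal(6 * 3600))       # True
```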
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3320 - Posted: 12 May 2016, 16:17:40 UTC
Last modified: 12 May 2016, 16:19:56 UTC

9 Fails in 16 min---that is a VERY high rate!

EDIT: I also noticed that uploaded jobs are not marked as such even 1 hour after upload.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 3322 - Posted: 12 May 2016, 19:48:18 UTC - in response to Message 3320.  

9 Fails in 16 min---that is a VERY high rate!

EDIT: I also noticed that uploaded jobs are not marked as such even 1 hour after upload.

Yeah, job rate's gone way up -- will check.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 3323 - Posted: 12 May 2016, 20:01:23 UTC
Last modified: 12 May 2016, 20:02:19 UTC

Looks like a cvmfs problem:
== CMSSW: ----- Begin Fatal Exception 12-May-2016 21:09:51 CEST-----------------------
== CMSSW: An exception of category 'PluginLibraryLoadError' occurred while
== CMSSW: [0] Constructing the EventProcessor
== CMSSW: [1] Constructing module: class=Pythia6GeneratorFilter label='generator'
== CMSSW: Exception Message:
== CMSSW: unable to load /cvmfs/cms.cern.ch/slc6_amd64_gcc472/cms/cmssw/CMSSW_6_2_0_SLHC26/lib/slc6_amd64_gcc472/pluginTauolappInterface.so because /cvmfs/cms.cern.ch/slc6_amd64_gcc472/cms/cmssw/CMSSW_6_2_0_SLHC26/lib/slc6_amd64_gcc472/pluginTauolappInterface.so: cannot open shared object file: Input/output error
== CMSSW: ----- End Fatal Exception --------------------------------------------

I'll inform the usual suspects.
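For triage across many job logs, the exception category and message can be pulled out of blocks like the one above. This is a minimal Python sketch, assuming the Begin/End Fatal Exception layout shown here, with or without the "== CMSSW: " prefix. `parse_fatal_exception` is a hypothetical helper, not a CMSSW tool, and the sample message is abridged.

```python
import re

PREFIX = re.compile(r"^== CMSSW: ")
CATEGORY = re.compile(r"An exception of category '([^']+)'")


def parse_fatal_exception(log_text):
    """Return (category, message) from the first fatal-exception block, or None if there is none."""
    inside = False        # between the Begin and End marker lines
    in_message = False    # after the "Exception Message:" line
    category = None
    message = []
    for raw in log_text.splitlines():
        line = PREFIX.sub("", raw).strip()
        if "Begin Fatal Exception" in line:
            inside = True
        elif "End Fatal Exception" in line and inside:
            return category, " ".join(message)
        elif inside and in_message:
            message.append(line)
        elif inside and line.startswith("Exception Message:"):
            in_message = True
        elif inside:
            found = CATEGORY.search(line)
            if found:
                category = found.group(1)
    return None


# Abridged sample in the same layout as the log above.
sample = """\
== CMSSW: ----- Begin Fatal Exception 12-May-2016 21:09:51 CEST-----
== CMSSW: An exception of category 'PluginLibraryLoadError' occurred while
== CMSSW: [0] Constructing the EventProcessor
== CMSSW: Exception Message:
== CMSSW: unable to load pluginTauolappInterface.so: Input/output error
== CMSSW: ----- End Fatal Exception -----
"""
print(parse_fatal_exception(sample))
```

Counting the returned categories per hour would show quickly whether a spike comes from one cause, as with the cvmfs read error here, or from several.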
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3324 - Posted: 12 May 2016, 20:07:24 UTC
Last modified: 12 May 2016, 20:09:25 UTC

Thanks, Ivan.
I have 4 jobs that finished long ago but are still listed as "running" or "pending" several hours after they were uploaded.
Jobs processed after these are confirmed "finished".

Got this in running.log:
Untarring /var/lib/condor/execute/dir_8198/sandbox.tar.gz
Completed SCRAM project
Executing CMSSW
cmsRun -j FrameworkJobReport.xml PSet.py
----- Begin Fatal Exception 12-May-2016 21:54:21 CEST-----------------------
An exception of category 'Incomplete configuration' occurred while
[0] Constructing the EventProcessor
[1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
Exception Message:
Valid site-local-config not found at /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml
----- End Fatal Exception -------------------------------------------------
Complete
process id is 8463 status is 65
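The exception above and the one in the previous message both name files under /cvmfs that could not be loaded, so probing the path from inside the VM may help separate a missing file from an I/O failure. This is a minimal Python sketch. `probe_path` is a hypothetical helper and the status labels are chosen for illustration.

```python
import errno


def probe_path(path):
    """Try to read one byte from path and classify the outcome."""
    try:
        with open(path, "rb") as handle:
            handle.read(1)
        return "readable"
    except FileNotFoundError:
        return "missing"
    except OSError as exc:
        if exc.errno == errno.EIO:
            # Same condition as "Input/output error" in the CMSSW message.
            return "io-error"
        return f"other-error ({errno.errorcode.get(exc.errno, exc.errno)})"


# Path copied from the log above; the result depends on the machine it runs on.
print(probe_path("/cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml"))
```

A "readable" result would not prove that the file content is valid, only that the file system delivered it.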