Message boards : CMS Application : Error rate going up
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 7
We may run short of jobs, either here or on beta, for a while (probably on beta). We've changed some job requirements, unfortunately mid-batch, so I'm taking a crash course on HTCondor commands to try to get jobs matched with available slots.
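A minimal sketch of the usual first diagnostics when idle jobs won't match slots (the job ID 12345.0 below is just a placeholder):

    # summarise the slots currently advertised in the pool
    condor_status -avail

    # explain why a particular idle job is not matching any slot
    condor_q -better-analyze 12345.0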
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0
It knows Ivan is busy... Not only have the errors increased, but although jobs from hosts here aren't showing up as failures, they are either being terminated very soon after starting:

2016-04-27 05:22:31 (3152): Guest Log: [INFO] Requesting an X509 credential from vLHC@home-dev
2016-04-27 05:22:41 (3152): Guest Log: [INFO] CMS application starting. Check log files.
2016-04-27 05:22:51 (3152): Guest Log: [INFO] Condor exited with 0
2016-04-27 05:22:51 (3152): Guest Log: [INFO] Shutting Down.
2016-04-27 05:22:51 (3152): VM Completion File Detected.
2016-04-27 05:22:51 (3152): VM Completion Message: Condor exited with 0

or are shown on the dashboard as "cooloff" or "aborted". Not sure if it's related, but I'm no longer seeing any F4 or F5 output (apart from the headers). I've just changed some Linux hosts to BOINC 7.4.25 in order to try the project_max_concurrent limit (new in 7.4.9), but the problem affects different BOINC versions on both Linux and Windows.
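For reference, project_max_concurrent is set in an app_config.xml placed in the project's folder under the BOINC data directory; a minimal sketch (the limit of 2 is only an illustrative value):

    <app_config>
      <project_max_concurrent>2</project_max_concurrent>
    </app_config>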
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0
Meant to add this to the original message... sorry 'bout that. Go here and move your mouse over the chart to see the numbers for the last 24 hrs.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 7
(Some of) these seem to be the result of miscommunications. Notice that the exit code of most of them is 0. The one I looked at in detail finished last Thursday and wrote its output to the data bridge, but the post-processing marked it as a failure and flagged it for retry. On the retry it hit a machine with a screwed-up VM and terminated immediately. On the third try it also exited with code 0 (but didn't overwrite the first data file), yet post-processing again marked it as a fail. I'm tempted to terminate the whole 160419_185901:ireid_crab_CMS_at_Home_MinBias_250ev10Ke batch, but I'll let them live for now.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
I just noticed that the results are not uploading; it goes straight to getting a new job.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 7
I just noticed that the results are not uploading. It goes straight to getting a new job.
Ah, that's because of this retry issue:

gfal-copy error: 17 (File exists) - Destination davs://data-bridge-test.cern.ch/myfed/cms-boinc/output/user/ireid/CMS_at_Home/CRAB3_MinBias/160419_185901/0008/step1_8279.root exists and overwrite is not set
gfal-copy exit status: 0

So gfal still exits with 0 and the job gets status 0 too. The post-processing then reports:

Wed, 27 Apr 2016 15:55:40 BST(+0000):INFO:RetryJob Job and stageout wrappers finished successfully (exit code 0).
Wed, 27 Apr 2016 15:55:40 BST(+0000):ERROR:RetryJob Payload job was successful, but wrapper exited with non-zero status 1 (stageout failure)?

I think I will kill this batch; it's going to take too long to drain out otherwise. Sorry for the inconvenience. Look for a new small batch later tonight (I guess) to see if I can get -dev on the same footing as -beta.
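For context, gfal-copy only overwrites an existing destination when forcing is requested, e.g. something along these lines (the local source path is a placeholder); whether the CRAB stage-out wrapper should force the overwrite on retries is of course a server-side decision:

    gfal-copy -f file://$PWD/step1_8279.root davs://data-bridge-test.cern.ch/myfed/cms-boinc/output/user/ireid/CMS_at_Home/CRAB3_MinBias/160419_185901/0008/step1_8279.root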
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 7
The spurious errors are continuing: jobs finish and stage out OK, but post-processing gives them an error of 1, so they get requeued. The requeue then runs OK but reports that it can't write the file (still finishing with status 0), and post-processing gives that an error 2. CERN IT is investigating.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 7
CERN IT is investigating.
Hopefully a fix is in; we'll see in a few hours.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 7
Yep, looks like jobs are getting the right exit code now.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Since about 17:00 UTC the error rate has been going through the roof. All "unknown". EDIT: It looks like only jobs on their first run pass; everything else fails.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Even jobs that ran on the first attempt are failing.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 7
Tja, I forgot to renew the proxy yesterday, almost certainly a mea culpa!
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
What about the unfinished small batches? 59 and 91 jobs left (out of 100).
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 7
What about the unfinished small batches?
I've killed them; they were just small test cases while I worked out how to cope with our rationalised naming scheme. No data lost: everything we are doing is Monte Carlo generation, and one random number is as good as the next.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Can you please check the results? http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=206&postid=3122#3122
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 7
Tja, I forgot to renew the proxy yesterday, almost certainly a mea culpa!
To forestall a repeat occurrence, I just gave the current batch a new 8-day proxy cert. :-)
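For the curious, a longer-lived proxy can be requested along these lines (a sketch; the VOMS server may cap the attribute lifetime it actually grants):

    # request a proxy valid for 8 days (192 hours)
    voms-proxy-init --voms cms --valid 192:00

    # check the remaining lifetime, in seconds
    voms-proxy-info --timeleft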
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
9 fails in 16 minutes; that is a VERY high rate! EDIT: I also noticed that uploaded jobs are not marked as such even an hour after upload.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 7
9 fails in 16 minutes; that is a VERY high rate!
Yeah, the job rate's gone way up; will check.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 7
Looks like a cvmfs problem:

== CMSSW: ----- Begin Fatal Exception 12-May-2016 21:09:51 CEST-----------------------
== CMSSW: An exception of category 'PluginLibraryLoadError' occurred while
== CMSSW:    [0] Constructing the EventProcessor
== CMSSW:    [1] Constructing module: class=Pythia6GeneratorFilter label='generator'
== CMSSW: Exception Message:
== CMSSW: unable to load /cvmfs/cms.cern.ch/slc6_amd64_gcc472/cms/cmssw/CMSSW_6_2_0_SLHC26/lib/slc6_amd64_gcc472/pluginTauolappInterface.so because /cvmfs/cms.cern.ch/slc6_amd64_gcc472/cms/cmssw/CMSSW_6_2_0_SLHC26/lib/slc6_amd64_gcc472/pluginTauolappInterface.so: cannot open shared object file: Input/output error
== CMSSW: ----- End Fatal Exception --------------------------------------------

I'll inform the usual suspects.
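For reference, on a machine with the cvmfs client installed the usual first checks for a repository problem like this are along these lines (the actual fix here has to happen on the CERN side, so this is purely diagnostic):

    # verify the CMS repository mounts and answers
    cvmfs_config probe cms.cern.ch

    # show cache, proxy and revision information for the repository
    cvmfs_config stat cms.cern.ch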
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Thanks, Ivan. I have 4 jobs, long finished, that are still listed as "running" or "pending" several hours after they were uploaded; jobs processed after these are confirmed "finished". Got this in running.log:

Untarring /var/lib/condor/execute/dir_8198/sandbox.tar.gz
Completed SCRAM project
Executing CMSSW
cmsRun -j FrameworkJobReport.xml PSet.py
----- Begin Fatal Exception 12-May-2016 21:54:21 CEST-----------------------
An exception of category 'Incomplete configuration' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
Exception Message:
Valid site-local-config not found at /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml
----- End Fatal Exception -------------------------------------------------
Complete process id is 8463 status is 65