Message boards : News : No new jobs
Author | Message |
---|---|
Joined: 20 Jan 15 Posts: 1129 Credit: 7,972,027 RAC: 2,920 |
"What is 151? An exit code?" Yes, that's a stage-out failure. Often, but not necessarily, it is a time-out in returning the result file to the data-bridge (one-hour limit). Recently it has often been caused by volunteers running too many hosts and overloading their upload link. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,972,027 RAC: 2,920 |
"We seem to have had a systemic increase in failures from mid-Friday until Sunday" There is that effect, to be sure. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,972,027 RAC: 2,920 |
New batch seems to be up and running. There are a couple of funnies on the CRAB side: it keeps asking me for my GRID certificate passphrase every time I run a command, even though I just set up an 8-day proxy, and the status query returns various errors. Things look to be working on the Condor server, though. |
Joined: 20 May 15 Posts: 217 Credit: 5,962,928 RAC: 12,885 |
These 150-event jobs complete quickly, sometimes too quickly...
Short exit status: 0
Job Running time in seconds: 1063
Job runtime is less than 20minutes. Sleeping 137
Not a long sleep time-wise, but it still represents more than 10% of the job. |
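The log excerpts quoted in this thread (1063 + 137, 8 + 1192, 1103 + 97) all add up to 1200 s, which suggests the wrapper pads short jobs up to a 20-minute minimum. A minimal Python sketch of that arithmetic, assuming the rule is simply "sleep for whatever is left of 1200 s"; the function and constant names are hypothetical and not taken from the actual wrapper script:

```python
# Assumed 20-minute minimum slot, inferred from the "less than 20minutes" log line.
MIN_SLOT_SECONDS = 20 * 60


def padding_sleep(runtime_seconds: int, min_slot: int = MIN_SLOT_SECONDS) -> int:
    """Seconds to sleep so that runtime + sleep reaches the minimum slot length."""
    return max(0, min_slot - runtime_seconds)


def sleep_fraction(runtime_seconds: int, min_slot: int = MIN_SLOT_SECONDS) -> float:
    """Share of the padded slot (runtime + sleep) that is spent sleeping."""
    sleep = padding_sleep(runtime_seconds, min_slot)
    total = runtime_seconds + sleep
    return sleep / total if total else 0.0


if __name__ == "__main__":
    # Runtimes taken from the log excerpts in this thread.
    for runtime in (1063, 8, 1103):
        print(runtime, padding_sleep(runtime), f"{sleep_fraction(runtime):.1%}")
```

For the 1063 s job this gives a 137 s sleep, a little over 11% of the 20-minute slot, which is consistent with the "more than 10%" remark.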
Joined: 20 May 15 Posts: 217 Credit: 5,962,928 RAC: 12,885 |
This job never really got started...
==== CMSSW JOB Execution started at Tue Jan 26 12:34:57 2016 ====
2016-01-26 12:34:58,508:INFO:CMSSW:User files are
2016-01-26 12:34:58,508:INFO:CMSSW:User sandboxes are sandbox.tar.gz
2016-01-26 12:34:58,508:INFO:CMSSW:CMSSW configured for 1 cores
2016-01-26 12:34:58,508:INFO:CMSSW:Executing CMSSW step
2016-01-26 12:34:58,508:INFO:CMSSW:Runing SCRAM
2016-01-26 12:34:59,085:INFO:CMSSW:Running PRE scripts
2016-01-26 12:34:59,085:INFO:CMSSW:RUNNING SCRAM SCRIPTS
2016-01-26 12:34:59,085:INFO:CMSSW: Invoking command: python /home/boinc/CMSRun/glide_VZ66Ns/execute/dir_15923/TweakPSet.py --location=/home/boinc/CMSRun/glide_VZ66Ns/execute/dir_15923 --inputFile='job_input_file_list_315.txt' --runAndLumis='job_lumis_315.json' --firstEvent=47101 --lastEvent=47251 --firstLumi=629 --firstRun=1 --seeding=AutomaticSeeding --lheInputFiles=False --oneEventMode=0 --eventsPerLumi=100
2016-01-26 12:35:00,461:INFO:CMSSW:Executing CMSSW. args: ['/bin/bash', '/home/boinc/CMSRun/glide_VZ66Ns/execute/dir_15923/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', '']
Job Running time in seconds: 8
Job runtime is less than 20minutes. Sleeping 1192
/home/boinc/CMSRun/glide_VZ66Ns/execute/dir_15923/condor_exec.exe: line 12: 16224 Killed sh ./CMSRunAnalysis.sh "$@" --oneEventMode=$CRAB_oneEventMode
2016-01-26 12:37:03,515:CRITICAL:CMSSW:Error running cmsRun
{'arguments': ['/bin/bash', '/home/boinc/CMSRun/glide_VZ66Ns/execute/dir_15923/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', '']}
Return code: -9 |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,972,027 RAC: 2,920 |
Hmm, something pathological. Was the machine otherwise busy? The log file is empty on my end. |
Joined: 20 May 15 Posts: 217 Credit: 5,962,928 RAC: 12,885 |
It has been busy doing the same thing for the last 24 hours; nothing new or different happened. Do you want me to post the _condor_stdout? It said it was going to sleep for 1192 (seconds, I assume), but maybe that bit got 'killed'? Anyway, the next job started up and finished okay...
======== gWMS-CMSRunAnalysis.sh STARTING at Tue Jan 26 12:37:07 GMT 2016 on 246-874-6332 ========
Local time : Tue Jan 26 13:37:07 CET 2016
Current system : Linux 246-874-6332 3.10.64-85.cernvm.x86_64 #1 SMP Fri Jan 9 09:53:29 CET 2015 x86_64 x86_64 x86_64 GNU/Linux
Arguments are -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=319 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_319.txt --runAndLumis=job_lumis_319.json --lheInputFiles=False --firstEvent=47701 --firstLumi=637 --lastEvent=47851 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {}
SCRAM_ARCH=slc6_amd64_gcc472
..snipped..
======== Stageout at Tue Jan 26 12:55:30 GMT 2016 FINISHING (short status 0) ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Jan 26 12:55:30 GMT 2016 on 246-874-6332 with (short) status 0 ========
Local time: Tue Jan 26 13:55:30 CET 2016
Short exit status: 0
Job Running time in seconds: 1103
Job runtime is less than 20minutes. Sleeping 97 |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,972,027 RAC: 2,920 |
No, that's fine I think. Just keep an eye out for more errors. It looks like bit-rot somewhere, so if it happens again I'd suggest resetting the project. Unless you're running right up against your memory limit; it might be worth double-checking that. |
Joined: 20 May 15 Posts: 217 Credit: 5,962,928 RAC: 12,885 |
Memory usage is low; system monitor reports 3.7 GB used out of 31.3 GB. Will keep an eye out; with them usually taking 20 minutes, it is easy to spot an odd one and investigate. Can you keep track of that work to see whether it fails again when repeated, i.e. whether the job itself is at fault? |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Just an upfront reminder: We are going to need a new batch of jobs fairly soon. |
Joined: 13 Feb 15 Posts: 1185 Credit: 850,198 RAC: 581 |
"Just an upfront reminder: We are going to need a new batch of jobs fairly soon." |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
The shutdown of the task when no jobs are available does not seem to be working. It is now more than an hour since the last job was uploaded, and the task is still running. Glidein is run again and again (run-4, run-5, etc.). Another issue is that, even if the shutdown works, BOINC would then start a new CMS task and shut it down, over and over again, until new jobs are available or BOINC tasks run out (unless you select "no new tasks"). |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,972,027 RAC: 2,920 |
"Just an upfront reminder: We are going to need a new batch of jobs fairly soon." Yeah, thanks. Just woke up -- the tail-end went a bit more quickly than I expected. New batch submitted and everyone wanting jobs now has one as far as I can tell. Sorry for the slight delay, but it is the weekend... :-) |
Joined: 13 Feb 15 Posts: 1185 Credit: 850,198 RAC: 581 |
Ivan is submitting the jobs for a new batch, 160213_102826:ireid_crab_CMS_at_Home_MinBias_250evB. It seems there will be another 10,000. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
No Problem. I should have said something sooner. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,972,027 RAC: 2,920 |
Yep, same as the last batch, only the name has been changed to protect my sanity... |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,972,027 RAC: 2,920 |
"No Problem." 'S'OK, I saw we were running down late last night, but didn't expect the end to be quite so soon. I'm awake now (FCVO "awake" -- must buy more coffee :-) and jobs are flowing. To paraphrase Heinlein, "The Jobs Must Roll"! |
Joined: 13 Feb 15 Posts: 1185 Credit: 850,198 RAC: 581 |
"... and jobs are flowing." Hopefully not a false start: 155 Running - 1 Failed - 54 ToRetry |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Last batch was not bad at all. Just over 2% failures. (But we can do better, can't we?) |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,972,027 RAC: 2,920 |
"... and jobs are flowing." We always seem to have a bad start-up; not sure why. Possibly it is related to a bit of manual trickery I have to do. Jobs as originally submitted don't have all the right characteristics for the project, and therefore aren't started by Condor. So I have to log on to the Condor server and run a script that modifies job characteristics to make them suitable. The script also creates a derivative (batch-specific) script which is run as a cron job every five minutes (because the modification only applies to jobs in the queue, and Condor maintains a queue of ~1,000 jobs, moving pending jobs into the active queue as running jobs finish). It's possible that jobs meet Condor's runnable criteria after some of the modifications are done (it takes several seconds to run the script) but before critical modifications are made. At the moment it's not that important, but if you happen to have a summer student willing to investigate this, please contact me. :-) [Edit] To-retry count is now zero! [/Edit] |
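The suspected race can be illustrated with a toy model. This is only a sketch of the hypothesis above, not the actual script or Condor's matchmaking: the attribute names, the `is_runnable` rule, and the edit lists are all hypothetical. It applies edits one at a time and reports the steps after which a job would look runnable while a critical attribute is still unset.

```python
from typing import Dict, List, Tuple

Job = Dict[str, object]
Edit = Tuple[str, object]

# Hypothetical attribute names: none of these come from the real script.
CRITICAL = {"memory_limit_mb", "target_pool"}


def is_runnable(job: Job) -> bool:
    """Toy stand-in for the scheduler's criteria: only a 'released' flag is checked."""
    return job.get("released") is True


def is_complete(job: Job) -> bool:
    """True once every critical attribute has been set."""
    return all(name in job for name in CRITICAL)


def unsafe_steps(edits: List[Edit]) -> List[int]:
    """Apply edits in order and return the step numbers after which the job is
    runnable but still missing a critical attribute (the race window)."""
    job: Job = {}
    window: List[int] = []
    for step, (name, value) in enumerate(edits, start=1):
        job[name] = value
        if is_runnable(job) and not is_complete(job):
            window.append(step)
    return window


# Enabling flag set first: the job looks runnable while edits are still pending.
flag_first: List[Edit] = [
    ("released", True), ("memory_limit_mb", 2048), ("target_pool", "volunteer"),
]
# Enabling flag set last: no step leaves a runnable but incomplete job.
flag_last: List[Edit] = [
    ("memory_limit_mb", 2048), ("target_pool", "volunteer"), ("released", True),
]

if __name__ == "__main__":
    print(unsafe_steps(flag_first))
    print(unsafe_steps(flag_last))
```

In this model, applying the enabling edit last closes the window. Whether the real modification script can be ordered that way is not something this thread establishes.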
©2024 CERN