Message boards : News : No new jobs
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,972,027
RAC: 2,920
Message 1636 - Posted: 26 Jan 2016, 1:47:49 UTC - in response to Message 1635.  

What is 151? An exit code?

Yes, that's a stage-out failure. It is often, but not necessarily, a time-out in returning the result file to the data-bridge (one-hour limit). Recently it has often been caused by volunteers running too many hosts and overloading their upload link.
ID: 1636 · Rating: 0
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,972,027
RAC: 2,920
Message 1637 - Posted: 26 Jan 2016, 1:48:51 UTC - in response to Message 1635.  

We seem to have had a systemic increase in failures from mid-Friday until Sunday,


I think that there is an automatic increase in failures towards the end of a batch, as inherently faulty jobs are passed through the 3 attempts before finally failing.
There are 126 failures that went through all 3 attempts, a handful that went through 2 attempts, and the rest (about 165) failed at the 1st attempt.

There is that effect, to be sure.
ID: 1637 · Rating: 0
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,972,027
RAC: 2,920
Message 1638 - Posted: 26 Jan 2016, 2:11:25 UTC - in response to Message 1634.  

New batch seems to be up and running. A couple of funnies on the CRAB side: it keeps asking me for my GRID certificate passphrase every time I run a command, even though I just set up an 8-day proxy, and the status query returns various errors. Things look to be working on the Condor server, though.
ID: 1638 · Rating: 0
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,962,928
RAC: 12,885
Message 1639 - Posted: 26 Jan 2016, 8:42:10 UTC - in response to Message 1638.  

These 150 event jobs complete quickly, sometimes too quickly...

Short exit status: 0
Job Running time in seconds: 1063
Job runtime is less than 20minutes. Sleeping 137

Not a long sleep time-wise, but it still represents more than 10% of the job.
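
For reference, the sleep values quoted in this thread are consistent with the wrapper padding every job to a 20-minute (1200 s) slot: 1063 + 137, 8 + 1192 and 1103 + 97 all sum to 1200. A minimal Python sketch of that arithmetic, assuming this is how the sleep is computed (the function name and constant are made up for illustration and are not taken from the actual wrapper script):

```python
# Assumption: the wrapper pads short jobs up to a fixed 20-minute slot.
# MIN_SLOT_SECONDS and padding_sleep are illustrative names, not the
# real script's identifiers.
MIN_SLOT_SECONDS = 20 * 60  # 1200 s


def padding_sleep(runtime_seconds, min_slot=MIN_SLOT_SECONDS):
    """Seconds to sleep so that runtime + sleep reaches min_slot."""
    return max(0, min_slot - runtime_seconds)


if __name__ == "__main__":
    # Runtimes taken from the log excerpts in this thread.
    for runtime in (1063, 8, 1103):
        print(runtime, "->", padding_sleep(runtime))

    # Sleep overhead relative to the 1063 s job.
    print(f"{padding_sleep(1063) / 1063:.1%}")
```

For the 1063 s job this gives a 137 s sleep, about 12.9% of the job's running time, which is where the "more than 10%" comes from.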
ID: 1639 · Rating: 0
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,962,928
RAC: 12,885
Message 1640 - Posted: 26 Jan 2016, 13:15:09 UTC - in response to Message 1639.  

This job never really got started...

==== CMSSW JOB Execution started at Tue Jan 26 12:34:57 2016 ====
2016-01-26 12:34:58,508:INFO:CMSSW:User files are
2016-01-26 12:34:58,508:INFO:CMSSW:User sandboxes are sandbox.tar.gz
2016-01-26 12:34:58,508:INFO:CMSSW:CMSSW configured for 1 cores
2016-01-26 12:34:58,508:INFO:CMSSW:Executing CMSSW step
2016-01-26 12:34:58,508:INFO:CMSSW:Runing SCRAM
2016-01-26 12:34:59,085:INFO:CMSSW:Running PRE scripts
2016-01-26 12:34:59,085:INFO:CMSSW:RUNNING SCRAM SCRIPTS
2016-01-26 12:34:59,085:INFO:CMSSW: Invoking command: python /home/boinc/CMSRun/glide_VZ66Ns/execute/dir_15923/TweakPSet.py --location=/home/boinc/CMSRun/glide_VZ66Ns/execute/dir_15923 --inputFile='job_input_file_list_315.txt' --runAndLumis='job_lumis_315.json' --firstEvent=47101 --lastEvent=47251 --firstLumi=629 --firstRun=1 --seeding=AutomaticSeeding --lheInputFiles=False --oneEventMode=0 --eventsPerLumi=100

2016-01-26 12:35:00,461:INFO:CMSSW:Executing CMSSW. args: ['/bin/bash', '/home/boinc/CMSRun/glide_VZ66Ns/execute/dir_15923/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', '']
Job Running time in seconds: 8
Job runtime is less than 20minutes. Sleeping 1192
/home/boinc/CMSRun/glide_VZ66Ns/execute/dir_15923/condor_exec.exe: line 12: 16224 Killed sh ./CMSRunAnalysis.sh "$@" --oneEventMode=$CRAB_oneEventMode
2016-01-26 12:37:03,515:CRITICAL:CMSSW:Error running cmsRun
{'arguments': ['/bin/bash', '/home/boinc/CMSRun/glide_VZ66Ns/execute/dir_15923/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', '']}
Return code: -9
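
A note on the "Return code: -9" above: if the wrapper follows the Python subprocess convention (an assumption; the log format suggests Python, but the thread does not say), a negative return code -N means the child was terminated by signal N. On Linux, -9 would then be SIGKILL, which matches the "Killed" line printed by the shell. A small sketch for decoding such codes (the helper name is hypothetical):

```python
import signal


def describe_return_code(rc):
    """Describe a child return code using the subprocess convention.

    Assumption: negative values mean "terminated by signal -rc", as
    documented for subprocess.Popen.returncode on POSIX systems.
    """
    if rc >= 0:
        return f"exited with status {rc}"
    try:
        sig = signal.Signals(-rc)
    except ValueError:
        # Signal number not known on this platform.
        return f"killed by signal {-rc}"
    return f"killed by signal {sig.value} ({sig.name})"


if __name__ == "__main__":
    print(describe_return_code(-9))
    print(describe_return_code(0))
```

SIGKILL cannot be caught by the process, so the job itself would have had no chance to log anything, which fits the otherwise empty output.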
ID: 1640 · Rating: 0
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,972,027
RAC: 2,920
Message 1641 - Posted: 26 Jan 2016, 14:19:26 UTC - in response to Message 1640.  

Hmm, something pathological. Was the machine otherwise busy? The log file is empty on my end.
ID: 1641 · Rating: 0
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,962,928
RAC: 12,885
Message 1642 - Posted: 26 Jan 2016, 14:46:35 UTC - in response to Message 1641.  

It has been busy doing the same thing for the last 24 hours; nothing new or different happened. Did you want the _condor_stdout posted?

It said it was going to sleep for 1192 (seconds, I assume), but maybe that bit got 'killed'?

Anyway, the next job started up and finished okay...

======== gWMS-CMSRunAnalysis.sh STARTING at Tue Jan 26 12:37:07 GMT 2016 on 246-874-6332 ========
Local time : Tue Jan 26 13:37:07 CET 2016
Current system : Linux 246-874-6332 3.10.64-85.cernvm.x86_64 #1 SMP Fri Jan 9 09:53:29 CET 2015 x86_64 x86_64 x86_64 GNU/Linux
Arguments are -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=319 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_319.txt --runAndLumis=job_lumis_319.json --lheInputFiles=False --firstEvent=47701 --firstLumi=637 --lastEvent=47851 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {}
SCRAM_ARCH=slc6_amd64_gcc472
..snipped..
======== Stageout at Tue Jan 26 12:55:30 GMT 2016 FINISHING (short status 0) ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Jan 26 12:55:30 GMT 2016 on 246-874-6332 with (short) status 0 ========
Local time: Tue Jan 26 13:55:30 CET 2016
Short exit status: 0
Job Running time in seconds: 1103
Job runtime is less than 20minutes. Sleeping 97
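
As an aside, the event and lumi arguments in the two job logs quoted in this thread (jobs 315 and 319) fit a simple pattern: 150 events per job, numbered from 1, with lastEvent equal to firstEvent + 150, and two lumi sections per job at 100 events per lumi. The sketch below reproduces those numbers. It is inferred from just these two jobs and is not taken from the CRAB source, so treat the formula and the function name as assumptions.

```python
import math


def job_ranges(job_number, events_per_job=150, events_per_lumi=100):
    """Guess the event/lumi arguments for a job number.

    Assumption: jobs are numbered from 1 and each job gets a contiguous
    block of events and ceil(events_per_job / events_per_lumi) lumis.
    """
    lumis_per_job = math.ceil(events_per_job / events_per_lumi)
    first_event = (job_number - 1) * events_per_job + 1
    return {
        "firstEvent": first_event,
        "lastEvent": first_event + events_per_job,
        "firstLumi": (job_number - 1) * lumis_per_job + 1,
    }


if __name__ == "__main__":
    print(job_ranges(315))
    print(job_ranges(319))
```

For job 315 this gives firstEvent=47101, lastEvent=47251 and firstLumi=629; for job 319 it gives 47701, 47851 and 637, matching the logs.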
ID: 1642 · Rating: 0
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,972,027
RAC: 2,920
Message 1643 - Posted: 26 Jan 2016, 15:29:37 UTC - in response to Message 1642.  

No, that's fine, I think. Just keep an eye out for more errors. It looks like bit-rot somewhere, so if it happens again I'd suggest resetting the project. Unless you're running right up against your memory limit; it might be worth double-checking that.
ID: 1643 · Rating: 0
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,962,928
RAC: 12,885
Message 1645 - Posted: 26 Jan 2016, 15:49:36 UTC - in response to Message 1643.  

Memory usage is low; the system monitor reports 3.7 GB used out of 31.3 GB.

Will keep an eye out. With them usually taking 20 minutes, it is easy to spot an odd one and investigate.

Can you keep track of that job to see whether it fails again when retried? That would show whether the job itself is at fault.
ID: 1645 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1975 - Posted: 12 Feb 2016, 23:08:03 UTC

Just an upfront reminder: We are going to need a new batch of jobs fairly soon.
ID: 1975 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 850,198
RAC: 581
Message 1976 - Posted: 13 Feb 2016, 8:06:16 UTC - in response to Message 1975.  

Just an upfront reminder: We are going to need a new batch of jobs fairly soon.
The final sprint (49 yards left on the obstacle course: 160130_122133:ireid_crab_CMS_at_Home_MinBias_250evA)
ID: 1976 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1977 - Posted: 13 Feb 2016, 9:42:54 UTC

The shutdown of the task when no jobs are available does not seem to be working.
It is now 1h+ since the last job was uploaded and the task is still running.
Glidein is run again and again (run-4, run-5, etc.).

Another issue is that, even if it worked, it would then start a new CMS task and shut it down, over and over again, until new jobs are available or the BOINC tasks run out (unless you select "no new tasks").
ID: 1977 · Rating: 0
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,972,027
RAC: 2,920
Message 1979 - Posted: 13 Feb 2016, 10:43:43 UTC - in response to Message 1975.  

Just an upfront reminder: We are going to need a new batch of jobs fairly soon.

Yeah, thanks. Just woke up -- the tail-end went a bit more quickly than I expected. New batch submitted and everyone wanting jobs now has one as far as I can tell. Sorry for the slight delay, but it is the weekend... :-)
ID: 1979 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 850,198
RAC: 581
Message 1980 - Posted: 13 Feb 2016, 10:44:19 UTC

Ivan is submitting the jobs for a new batch: 160213_102826:ireid_crab_CMS_at_Home_MinBias_250evB

It seems there will be another 10,000
ID: 1980 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1981 - Posted: 13 Feb 2016, 10:58:42 UTC - in response to Message 1979.  

No Problem.
I should have said something sooner.
ID: 1981 · Rating: 0
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,972,027
RAC: 2,920
Message 1982 - Posted: 13 Feb 2016, 10:59:00 UTC - in response to Message 1980.  

Yep, same as the last batch, only the name has been changed to protect my sanity...
ID: 1982 · Rating: 0
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,972,027
RAC: 2,920
Message 1983 - Posted: 13 Feb 2016, 11:05:02 UTC - in response to Message 1981.  

No Problem.
I should have said something sooner.

'S'OK, I saw we were running down late last night, but didn't expect the end to be quite so soon. I'm awake now (FCVO "awake" -- must buy more coffee :-) and jobs are flowing. To paraphrase Heinlein, "The Jobs Must Roll"!
ID: 1983 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 850,198
RAC: 581
Message 1984 - Posted: 13 Feb 2016, 11:07:26 UTC - in response to Message 1983.  

... and jobs are flowing.


Hopefully not a false start: 155 Running - 1 Failed - 54 ToRetry
ID: 1984 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1985 - Posted: 13 Feb 2016, 11:32:54 UTC - in response to Message 1983.  

Last batch was not bad at all. Just over 2% failures.
(But we can do better, can't we?)
ID: 1985 · Rating: 0
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,972,027
RAC: 2,920
Message 1988 - Posted: 13 Feb 2016, 11:52:04 UTC - in response to Message 1984.  
Last modified: 13 Feb 2016, 11:55:30 UTC

... and jobs are flowing.


Hopefully not a false start: 155 Running - 1 Failed - 54 ToRetry

We always seem to have a bad start-up, not sure why. Possibly it's related to a bit of manual trickery I have to do. Jobs as originally submitted don't have all the right characteristics for the project, and therefore aren't started by Condor. So I have to log on to the Condor server and run a script that modifies job characteristics to make them suitable. The script also creates a derivative (batch-specific) script which is run as a cron job every five minutes (because the modification only applies to jobs in the queue, and Condor maintains a queue of ~1,000 jobs, moving pending jobs into the active queue as running jobs finish). It's possible that jobs meet Condor's runnable criteria while the script is still running (it takes several seconds), after some modifications are done but before the critical ones are made. At the moment it's not that important, but if you happen to have a summer student willing to investigate this, please contact me. :-)
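
To make the suspected race concrete, here is a toy Python model. It is not HTCondor code, and the edit names are hypothetical placeholders rather than real ClassAd attributes. It simply lists the partial states in which a job would already look runnable to the scheduler while other edits are still pending, for a given order of edits.

```python
# Toy model only. The edit names below are invented for illustration
# and do not correspond to real HTCondor job attributes.
EDITS = ["make_runnable", "set_memory", "set_stageout"]


def risky_windows(edits, runnable_key="make_runnable"):
    """Return partial states where the job is runnable but not fully edited.

    Edits are applied one at a time; the state after the final edit is
    complete and therefore never counted as risky.
    """
    applied = set()
    windows = []
    for edit in edits[:-1]:
        applied.add(edit)
        if runnable_key in applied:
            windows.append(sorted(applied))
    return windows


if __name__ == "__main__":
    # Edit that makes the job runnable is applied first: risky states exist.
    print(risky_windows(EDITS))
    # Same edits, but the runnable edit is applied last: no risky state.
    print(risky_windows(list(reversed(EDITS))))
```

In this model, ordering the edits so that whichever one makes the job runnable comes last would close the window. Whether the real script can be reordered that way is an open question.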

[Edit] To-retry count is now zero! [/Edit]
ID: 1988 · Rating: 0


©2024 CERN