Thread 'No new jobs'

Author	Message
ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 1636 - Posted: 26 Jan 2016, 1:47:49 UTC - in response to Message 1635. What is 151? An exit code? Yes, that's a stage-out failure. Often, but not necessarily, a time-out in returning the result file to the data-bridge (one hour limit). Recently often caused by volunteers running too many hosts and overloading their upload link. ID: 1636 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 1637 - Posted: 26 Jan 2016, 1:48:51 UTC - in response to Message 1635. We seem to have had a systemic increase in failures from mid-Friday until Sunday. I think that there is an automatic increase in failures towards the end of a batch, as inherently faulty jobs are passed through the 3 attempts before finally failing. There are 126 failures that went through all 3 attempts, a handful that went through 2 attempts, and the rest (about 165) failed at the 1st attempt. There is that effect, to be sure. ID: 1637 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 1638 - Posted: 26 Jan 2016, 2:11:25 UTC - in response to Message 1634. New batch seems to be up and running. A couple of funnies on the CRAB side: it keeps asking me for my GRID certificate passphrase every time I run a command, even though I just set up an 8-day proxy, and the status query returns various errors. Things look to be working on the Condor server though. ID: 1638 · Rating: 0 · rate: / Reply Quote

PDW Send message Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 1639 - Posted: 26 Jan 2016, 8:42:10 UTC - in response to Message 1638. These 150 event jobs complete quickly, sometimes too quickly... Short exit status: 0 Job Running time in seconds: 1063 Job runtime is less than 20minutes. Sleeping 137 Not a long sleep time-wise, but it still represents more than 10% of the job. ID: 1639 · Rating: 0 · rate: / Reply Quote

PDW Send message Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 1640 - Posted: 26 Jan 2016, 13:15:09 UTC - in response to Message 1639. This job never really got started... ==== CMSSW JOB Execution started at Tue Jan 26 12:34:57 2016 ==== 2016-01-26 12:34:58,508:INFO:CMSSW:User files are 2016-01-26 12:34:58,508:INFO:CMSSW:User sandboxes are sandbox.tar.gz 2016-01-26 12:34:58,508:INFO:CMSSW:CMSSW configured for 1 cores 2016-01-26 12:34:58,508:INFO:CMSSW:Executing CMSSW step 2016-01-26 12:34:58,508:INFO:CMSSW:Runing SCRAM 2016-01-26 12:34:59,085:INFO:CMSSW:Running PRE scripts 2016-01-26 12:34:59,085:INFO:CMSSW:RUNNING SCRAM SCRIPTS 2016-01-26 12:34:59,085:INFO:CMSSW: Invoking command: python /home/boinc/CMSRun/glide_VZ66Ns/execute/dir_15923/TweakPSet.py --location=/home/boinc/CMSRun/glide_VZ66Ns/execute/dir_15923 --inputFile='job_input_file_list_315.txt' --runAndLumis='job_lumis_315.json' --firstEvent=47101 --lastEvent=47251 --firstLumi=629 --firstRun=1 --seeding=AutomaticSeeding --lheInputFiles=False --oneEventMode=0 --eventsPerLumi=100 2016-01-26 12:35:00,461:INFO:CMSSW:Executing CMSSW. args: ['/bin/bash', '/home/boinc/CMSRun/glide_VZ66Ns/execute/dir_15923/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', ''] Job Running time in seconds: 8 Job runtime is less than 20minutes. Sleeping 1192 /home/boinc/CMSRun/glide_VZ66Ns/execute/dir_15923/condor_exec.exe: line 12: 16224 Killed sh ./CMSRunAnalysis.sh "$@" --oneEventMode=$CRAB_oneEventMode 2016-01-26 12:37:03,515:CRITICAL:CMSSW:Error running cmsRun {'arguments': ['/bin/bash', '/home/boinc/CMSRun/glide_VZ66Ns/execute/dir_15923/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', '']} Return code: -9 ID: 1640 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 1641 - Posted: 26 Jan 2016, 14:19:26 UTC - in response to Message 1640. Hmm, something pathological. Was the machine otherwise busy? The log file is empty on my end. ID: 1641 · Rating: 0 · rate: / Reply Quote

PDW Send message Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 1642 - Posted: 26 Jan 2016, 14:46:35 UTC - in response to Message 1641. It has been busy doing the same thing for the last 24 hours, nothing new or different happened. Did you want the _condor_stdout posting ? It said it was going to sleep for 1192, I assume seconds, but maybe that bit got 'killed' ? Anyway, the next job started up and finished okay... ======== gWMS-CMSRunAnalysis.sh STARTING at Tue Jan 26 12:37:07 GMT 2016 on 246-874-6332 ======== Local time : Tue Jan 26 13:37:07 CET 2016 Current system : Linux 246-874-6332 3.10.64-85.cernvm.x86_64 #1 SMP Fri Jan 9 09:53:29 CET 2015 x86_64 x86_64 x86_64 GNU/Linux Arguments are -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=319 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_319.txt --runAndLumis=job_lumis_319.json --lheInputFiles=False --firstEvent=47701 --firstLumi=637 --lastEvent=47851 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {} SCRAM_ARCH=slc6_amd64_gcc472 ..snipped.. ======== Stageout at Tue Jan 26 12:55:30 GMT 2016 FINISHING (short status 0) ======== ======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Jan 26 12:55:30 GMT 2016 on 246-874-6332 with (short) status 0 ======== Local time: Tue Jan 26 13:55:30 CET 2016 Short exit status: 0 Job Running time in seconds: 1103 Job runtime is less than 20minutes. Sleeping 97 ID: 1642 · Rating: 0 · rate: / Reply Quote
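The three sleep values quoted in the logs above (137, 1192 and 97) are consistent with the job wrapper padding each job out to a 20-minute minimum. A minimal sketch of that arithmetic follows, assuming the sleep is simply the 20-minute floor minus the reported running time; the function and constant names are hypothetical and not taken from the actual wrapper script.

```python
# Assumption: the wrapper sleeps for whatever remains of a 20-minute minimum slot.
# The 20-minute figure comes from the "Job runtime is less than 20minutes" log line;
# the names below are illustrative only.
MIN_SLOT_SECONDS = 20 * 60


def padding_sleep(runtime_seconds: int, min_slot: int = MIN_SLOT_SECONDS) -> int:
    """Return how long to sleep so that runtime plus sleep reaches min_slot."""
    return max(0, min_slot - runtime_seconds)


if __name__ == "__main__":
    # Running times reported in Messages 1639, 1640 and 1642.
    for runtime in (1063, 8, 1103):
        print(f"runtime {runtime:>4} s -> sleep {padding_sleep(runtime)} s")
```

Under this assumption, the killed job in Message 1640 (8 seconds of running time) would have been due to sleep for almost the entire 20-minute slot.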

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 1643 - Posted: 26 Jan 2016, 15:29:37 UTC - in response to Message 1642. No, that's fine I think. Just keep an eye out for more errors. It looks like bit-rot somewhere so if it happens again I'd suggest resetting the project. Unless you're running right up against your memory limit, might be worth double-checking that. ID: 1643 · Rating: 0 · rate: / Reply Quote

PDW Send message Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 1645 - Posted: 26 Jan 2016, 15:49:36 UTC - in response to Message 1643. Memory usage is low; the system monitor reports 3.7 GB used out of 31.3 GB. Will keep an eye out. With jobs usually taking 20 minutes, it is easy to spot an odd one and investigate. Can you keep track of that job to see whether it fails again when repeated, which would show whether the job itself is at fault? ID: 1645 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1975 - Posted: 12 Feb 2016, 23:08:03 UTC Just an upfront reminder: We are going to need a new batch of jobs fairly soon. ID: 1975 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,826 RAC: 136	Message 1976 - Posted: 13 Feb 2016, 8:06:16 UTC - in response to Message 1975. Just an upfront reminder: We are going to need a new batch of jobs fairly soon. The final sprint (49 yards left on the obstacle course: 160130_122133:ireid_crab_CMS_at_Home_MinBias_250evA ) ID: 1976 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1977 - Posted: 13 Feb 2016, 9:42:54 UTC The shutdown of the task when no jobs are available does not seem to be working. It is now more than an hour since the last job was uploaded and the task is still running. Glidein is run again and again (run-4, run-5, etc.). Another issue is that, even if it worked, it would then start a new CMS task and shut it down, over and over again, until new jobs are available or the BOINC tasks run out (unless you select "no new tasks"). ID: 1977 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 1979 - Posted: 13 Feb 2016, 10:43:43 UTC - in response to Message 1975. Just an upfront reminder: We are going to need a new batch of jobs fairly soon. Yeah, thanks. Just woke up -- the tail-end went a bit more quickly than I expected. New batch submitted and everyone wanting jobs now has one as far as I can tell. Sorry for the slight delay, but it is the weekend... :-) ID: 1979 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,826 RAC: 136	Message 1980 - Posted: 13 Feb 2016, 10:44:19 UTC Ivan is submitting the jobs for a new batch: 160213_102826:ireid_crab_CMS_at_Home_MinBias_250evB It seems there will be another 10,000 ID: 1980 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1981 - Posted: 13 Feb 2016, 10:58:42 UTC - in response to Message 1979. No Problem. I should have said something sooner. ID: 1981 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 1982 - Posted: 13 Feb 2016, 10:59:00 UTC - in response to Message 1980. Yep, same as the last batch, only the name has been changed to protect my sanity... ID: 1982 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 1983 - Posted: 13 Feb 2016, 11:05:02 UTC - in response to Message 1981. No Problem. I should have said something sooner. 'S'OK, I saw we were running down late last night, but didn't expect the end to be quite so soon. I'm awake now (FCVO "awake" -- must buy more coffee :-) and jobs are flowing. To paraphrase Heinlein, "The Jobs Must Roll"! ID: 1983 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,826 RAC: 136	Message 1984 - Posted: 13 Feb 2016, 11:07:26 UTC - in response to Message 1983. ... and jobs are flowing. Hopefully not a false start: 155 Running - 1 Failed - 54 ToRetry ID: 1984 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1985 - Posted: 13 Feb 2016, 11:32:54 UTC - in response to Message 1983. Last batch was not bad at all. Just over 2% fails. (But we can do better, can't we?) ID: 1985 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 1988 - Posted: 13 Feb 2016, 11:52:04 UTC - in response to Message 1984. Last modified: 13 Feb 2016, 11:55:30 UTC ... and jobs are flowing. Hopefully not a false start: 155 Running - 1 Failed - 54 ToRetry We always seem to have a bad start-up, not sure why. Possibly related to a bit of manual trickery I have to do. Jobs as originally submitted don't have all the right characteristics for the project, and therefore aren't started by Condor. So, I have to log on to the Condor server and run a script that modifies job characteristics to make them suitable. The script also creates a derivative (batch-specific) script which is run as a cron job every five minutes (because the modification only applies to jobs in the queue, and Condor maintains a queue of ~1,000 jobs, moving pending jobs into the active queue as running jobs finish). It's possible that jobs meet Condor's runnable criteria partway through the modifications (it takes several seconds to run the script), after some changes have been applied but before the critical ones are made. At the moment it's not that important, but if you happen to have a summer student willing to investigate this, please contact me. :-) [Edit] To-retry count is now zero! [/Edit] ID: 1988 · Rating: 0 · rate: / Reply Quote
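The suspected start-up race described in Message 1988 can be illustrated with a toy model. This is only a sketch under stated assumptions: the attribute names, the runnable check and the "critical" change are all hypothetical stand-ins, not real Condor job attributes or the actual modification script. It shows how applying changes one at a time can leave a window in which a job looks runnable before it is fully prepared, and that reordering the changes so the enabling one comes last would close that window in this model.

```python
# Toy model only: every attribute name here is hypothetical.
Job = dict[str, object]


def is_runnable(job: Job) -> bool:
    """Hypothetical stand-in for the scheduler's criteria for starting a job."""
    return job.get("accepts_volunteer_hosts") is True


def is_fully_prepared(job: Job) -> bool:
    """Hypothetical: all modifications the project needs have been applied."""
    return is_runnable(job) and job.get("stageout_target") == "data-bridge"


# Changes applied one after another, as a script taking several seconds might.
MODIFICATIONS: list[tuple[str, object]] = [
    ("accepts_volunteer_hosts", True),   # this change makes the job runnable
    ("stageout_target", "data-bridge"),  # a "critical" change applied later
]


def race_windows(mods: list[tuple[str, object]]) -> list[Job]:
    """Return intermediate states where a job is runnable but not fully prepared."""
    job: Job = {}
    windows: list[Job] = []
    for key, value in mods:
        job[key] = value
        if is_runnable(job) and not is_fully_prepared(job):
            windows.append(dict(job))
    return windows


if __name__ == "__main__":
    print("original order:", race_windows(MODIFICATIONS))
    print("enabling change last:", race_windows(list(reversed(MODIFICATIONS))))
```

Whether the real script applies its changes in an order that opens such a window is exactly the open question raised in the post; the model only shows why the ordering could matter.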

Development for LHC@home