No new jobs
Author | Message |
---|---|
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Jobs are running low. Unless new ones are submitted, we are going to run out in about 8 hours. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
"Jobs are running low." Thanks, I was just about to check. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks, Ivan. How many jobs did you put on? |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
"Thanks, Ivan." Another 5,000 Minimum Bias jobs, which should see us through another five days or so. I'm making progress on getting things set up to transfer results from the Data Bridge, where they now reside, to GRID-accessible storage, but it's all a bit more work than I was originally led to believe... We're in the early stages of writing a paper on the work so far; I'll be sure to let you know when it comes out and how to access it. |
Send message Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 134 |
The number of Unknown-status jobs has shot up a bit on the status page! |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
"Number of Unknown-status jobs has shot up a bit on the status page !" Yes, I noticed that. However, they all returned their output. Dashboard says they all ran for 24h00m, so it looks like maybe someone's stable (anyone running 29 machines?) had some reporting glitch. Hmm, that may explain the spike in the graph of running jobs this past day. Most of them started between 1300 and 1400 yesterday. I'll have to dig around in the logs in case there's a clue as to which machines they were. |
Send message Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 134 |
I have found one job that seems to be running fine, except that the % done is at 1.753% after 19:24:10 elapsed, with 4:24 to go. CPU used is only 8:56 minutes. This was downloaded at 13:53:15 BST yesterday. The logs show it has been doing all the right stuff, but I guess something has caused it to be disconnected. I'm guessing that is what the increased Unknown-status jobs are. Did something happen at your end around that time? |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
"I have found 1 job that seems to be running fine apart from the % done is at 1.753% after 19:24:10 elapsed and 4:24 to go. CPU used is only 8:56 minutes." Can you tell me which machine that is? It's a bit hard to tell if it's at "my end", as the Condor machine serving up the jobs is in Oxfordshire and the data-bridge file-store is at CERN. :-) Can you poke around the logs and see if there's a job number? Actually, can you see the event numbers in the stdout? ((EvNo-1)/125)+1, rounded down, will give me the job number. Thanks |
Send message Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 134 |
From cmsRun-stdout.log:
Begin processing the 32nd record. Run 1, Event 538532, LumiSection 8617 at 01-Oct-2015 10:09:13.020 BST
From Scram log:
Beginning TweakPSet arguments: ['/home/boinc/CMSRun/glide_qKUK1C/execute/dir_13393/TweakPSet.py', '--location=/home/boinc/CMSRun/glide_qKUK1C/execute/dir_13393', '--inputFile=job_input_file_list_4309.txt', '--runAndLumis=job_lumis_4309.json', '--firstEvent=538501', '--lastEvent=538626', '--firstLumi=8617', '--firstRun=1', '--seeding=AutomaticSeeding', '--lheInputFiles=False', '--oneEventMode=0', '--eventsPerLumi=100']
Is that enough, or do you need more from somewhere else?
Edit: Machine ID is 549 |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
"From cmsRun-stdout.log:" Yep, got it, job 4309. It seemed to take a while to start up:
==== CMSSW Stack Execution FINISHING at Thu Oct 1 09:27:35 2015 ====...
== CMSSW: Begin processing the 1st record. Run 1, Event 538501, LumiSection 8617 at 01-Oct-2015 10:03:00.513 BST...
== CMSSW: Begin processing the 125th record. Run 1, Event 538625, LumiSection 8618 at 01-Oct-2015 10:27:32.984 BST...
======== CMSRunAnalysis.py FINISHING at Thu Oct 1 09:27:37 GMT 2015 ========...
Copying 41060845 bytes file:///home/boinc/CMSRun/glide_qKUK1C/execute/dir_13393/step1.root => https://data-bridge-test.cern.ch/myfed/cms-boinc/output//dpm/brunel.ac.uk/home/cms/store/user/ireid/CMS_at_Home/CRAB3_nBias/150927_090635/0004/step1_4309.root ...
======== Stageout at Thu Oct 1 09:32:34 GMT 2015 FINISHING (short status 0) ========
And the output is on the data-bridge. Also, Dashboard has it as "finished", so it doesn't seem to be problematic as far as CRAB3 is concerned. |
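The log excerpts in this thread are consistent with each job covering a block of 125 consecutive events, with job 4309 starting at event 538501. The sketch below turns that observation into a small helper. The block size and the 1-based numbering are assumptions inferred from these excerpts only, and the function names are hypothetical, not part of CMSSW or CRAB3.

```python
from typing import Tuple

# Assumption: 125 events per job, inferred from the "125th record" log line
# and from job 4309 starting at event 538501.
EVENTS_PER_JOB = 125


def job_number(event_no: int, events_per_job: int = EVENTS_PER_JOB) -> int:
    """Map a 1-based event number to a 1-based job number."""
    if event_no < 1:
        raise ValueError("event numbers are assumed to start at 1")
    return (event_no - 1) // events_per_job + 1


def event_range(job_no: int, events_per_job: int = EVENTS_PER_JOB) -> Tuple[int, int]:
    """Return the first and last event number (both inclusive) for a job."""
    first = (job_no - 1) * events_per_job + 1
    return first, first + events_per_job - 1


if __name__ == "__main__":
    # Event quoted from cmsRun-stdout.log in this thread.
    print(job_number(538532))   # 4309
    print(event_range(4309))    # (538501, 538625)
```

Subtracting one before the integer division keeps the last event of a block (538625 here) in the same job as the first, which a plain division would not.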
Send message Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 134 |
What about jobs 3920 and 3921? These were the first two jobs run, the second one starting only 2 minutes after the first one. |
Send message Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 134 |
It's now coming up to 26 hours elapsed and is running job 4538. It's still at 1.752% (1.753% is what it shows in Properties) and CPU used is still the same. What will make it think it has done enough and stop? |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
"It's now coming up to 26 hours elapsed and is running job 4538." 3920 and 3921 are both reported as finished and their output has arrived. Dashboard has 4538 as "pending" but it has produced its output file and the log file has come back to the server:
======== gWMS-CMSRunAnalysis.sh FINISHING at Thu Oct 1 14:59:16 GMT 2015 on 246-549-25760 with (short) status 0 ========
Local time: Thu Oct 1 15:59:16 BST 2015
Short exit status: 0
Job Running time in seconds: 2157
Do you see that in the log there? Are there any later logs? I've no idea what is holding it, but it should be safe to abort. |
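For anyone checking their own logs for the same lines, here is a minimal sketch that pulls the exit status and running time out of wrapper-log text like the excerpt quoted above. The regular expressions are written against those quoted lines only; the function name `summarize` and the field name `worker` are hypothetical, and the thread does not say what the `246-549-25760` token encodes.

```python
import re

# Patterns written against the log lines quoted in this thread; other
# wrapper versions may format these lines differently (assumption).
FINISH_RE = re.compile(
    r"FINISHING at (?P<when>.+?) on (?P<worker>\S+) "
    r"with \(short\) status (?P<status>\d+)"
)
RUNTIME_RE = re.compile(r"Job Running time in seconds: (?P<seconds>\d+)")


def summarize(log_text):
    """Return finish time, worker token, status and run time, or None."""
    finish = FINISH_RE.search(log_text)
    runtime = RUNTIME_RE.search(log_text)
    if finish is None or runtime is None:
        return None
    minutes, seconds = divmod(int(runtime.group("seconds")), 60)
    return {
        "finished_at": finish.group("when"),
        "worker": finish.group("worker"),
        "status": int(finish.group("status")),
        "runtime": f"{minutes}m{seconds:02d}s",
    }


SAMPLE = (
    "======== gWMS-CMSRunAnalysis.sh FINISHING at Thu Oct 1 14:59:16 GMT 2015 "
    "on 246-549-25760 with (short) status 0 ========\n"
    "Local time: Thu Oct 1 15:59:16 BST 2015\n"
    "Short exit status: 0\n"
    "Job Running time in seconds: 2157\n"
)

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

On the sample text, the 2157 seconds come out as roughly 36 minutes of running time with status 0, which matches the job having finished normally even though the BOINC task kept running.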
Send message Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 134 |
Yes, logs were present for 4538. It had moved on to 4562. I clicked Suspend, but the logs kept getting updated; after a few minutes I clicked Abort, and access to the logs went within a few seconds. So 4562 should be incomplete at your end. The old task has reported back in my list and has as many points as a typical judge in the Eurovision Song Contest gives one of our songs! A new task has been downloaded and started. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
"Yes, logs were present for 4538." OK, don't worry about 4562, it will time-out on the Condor server and resubmit. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
NOTE: An update to the job-submission server has meant that a new batch of jobs I submitted today has gone into limbo. I'm waiting to hear back from CERN when the problem is solved. I'm not sure if the batch will start up automagically when things are fixed or whether I'll have to resubmit. So, we're going to be out of jobs soon, unfortunately, and I don't have a timescale for restart. |
Send message Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
"... it will time-out on the Condor server and resubmit." How long is this time-out? |
Send message Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
"... So, we're going to be out of jobs soon, unfortunately, and I don't have a timescale for restart." Okay, I have set 4 of 5 machines to no new work. Please repost when new work comes through again. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
"... it will time-out on the Condor server and resubmit." Two hours. |
Send message Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
"How long is this time-out? ... Two hours." Ouch. That is way too short for normal BOINC users. As long as this is an alpha project, that will be okay, but later on you should rethink this. |