Message boards : News : No new jobs

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1115 - Posted: 27 Sep 2015, 7:49:47 UTC

Jobs are running low.
Unless new ones are submitted, we are going to run out in about 8 hours.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,888,015
RAC: 1,314
Message 1116 - Posted: 27 Sep 2015, 9:00:24 UTC - in response to Message 1115.  

Jobs are running low.
Unless new ones are submitted, we are going to run out in about 8 hours.

Thanks, I was just about to check.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1117 - Posted: 27 Sep 2015, 9:33:28 UTC - in response to Message 1116.  

Thanks, Ivan.
How many jobs did you put on?
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,888,015
RAC: 1,314
Message 1118 - Posted: 27 Sep 2015, 18:25:20 UTC - in response to Message 1117.  

Thanks, Ivan.
How many jobs did you put on?

Another 5,000 Minimum Bias jobs, which should see us through another five days or so. I'm making progress on getting things set up to transfer results from the Data Bridge, where they now reside, to GRID-accessible storage, but it's all a bit more work than I was originally led to believe...
We're in the early stages of writing a paper on the work so far; I'll be sure to let you know when it comes out and how to access it.
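
As a rough cross-check of the figures above: 5,000 jobs lasting about five days implies a consumption rate of roughly 1,000 jobs per day. A minimal Python sketch of that arithmetic follows. The function names are hypothetical, and the rate is only what those two numbers imply, not a measured value.

```python
def implied_rate(jobs, days):
    """Jobs consumed per day if `jobs` last `days` days."""
    return jobs / days


def hours_until_empty(queued_jobs, jobs_per_day):
    """Hours until the queue drains at a constant daily rate (assumption)."""
    return 24 * queued_jobs / jobs_per_day


# Figures quoted in the post: 5,000 jobs, about five days.
rate = implied_rate(5000, 5)
print("implied jobs per day:", rate)
print("hours for 5,000 jobs:", hours_until_empty(5000, rate))
```

The real rate depends on how many volunteer machines are attached at any time, so this is only a back-of-envelope estimate.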
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,657,874
RAC: 15,085
Message 1127 - Posted: 30 Sep 2015, 14:29:36 UTC - in response to Message 1118.  

The number of Unknown-status jobs has shot up a bit on the status page!
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,888,015
RAC: 1,314
Message 1128 - Posted: 30 Sep 2015, 18:10:25 UTC - in response to Message 1127.  

The number of Unknown-status jobs has shot up a bit on the status page!

Yes, I noticed that. However, they all returned their output. Dashboard says they all ran for 24h00m so it looks like maybe someone's stable (anyone running 29 machines?) had some reporting glitch. Hmm, that may explain the spike in the graph of running jobs this past day. Most of them started between 1300 and 1400 yesterday. I'll have to dig around in the logs in case there's a clue as to which machines they were.
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,657,874
RAC: 15,085
Message 1130 - Posted: 1 Oct 2015, 8:35:28 UTC - in response to Message 1128.  

I have found 1 job that seems to be running fine, except that the % done is at 1.753% after 19:24:10 elapsed, with 4:24 to go. CPU time used is only 8:56.

This was downloaded at 13:53:15 BST yesterday.

The logs show it has been doing all the right stuff, but I guess something has caused it to be disconnected. I'm guessing that is what the increased Unknown-status jobs are. Did something happen at your end around that time?
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,888,015
RAC: 1,314
Message 1131 - Posted: 1 Oct 2015, 9:03:09 UTC - in response to Message 1130.  
Last modified: 1 Oct 2015, 9:07:29 UTC

I have found 1 job that seems to be running fine, except that the % done is at 1.753% after 19:24:10 elapsed, with 4:24 to go. CPU time used is only 8:56.

This was downloaded at 13:53:15 BST yesterday.

The logs show it has been doing all the right stuff, but I guess something has caused it to be disconnected. I'm guessing that is what the increased Unknown-status jobs are. Did something happen at your end around that time?

Can you tell me which machine that is? Bit hard to tell if it's at "my end" as the Condor machine serving up the jobs is in Oxfordshire and the data-bridge file-store is at CERN. :-) Can you poke around the logs and see if there's a job number -- actually, in the stdout can you see the event numbers? (EvNo/125)-1 will give me the job number.
Thanks
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,657,874
RAC: 15,085
Message 1133 - Posted: 1 Oct 2015, 9:12:02 UTC - in response to Message 1131.  
Last modified: 1 Oct 2015, 9:18:45 UTC

From cmsRun-stdout.log:

Begin processing the 32nd record. Run 1, Event 538532, LumiSection 8617 at 01-Oct-2015 10:09:13.020 BST

From Scram log:
Beginning TweakPSet
arguments: ['/home/boinc/CMSRun/glide_qKUK1C/execute/dir_13393/TweakPSet.py', '--location=/home/boinc/CMSRun/glide_qKUK1C/execute/dir_13393', '--inputFile=job_input_file_list_4309.txt', '--runAndLumis=job_lumis_4309.json', '--firstEvent=538501', '--lastEvent=538626', '--firstLumi=8617', '--firstRun=1', '--seeding=AutomaticSeeding', '--lheInputFiles=False', '--oneEventMode=0', '--eventsPerLumi=100']

Is that enough, or do you need more from somewhere else?

Edit: Machine ID is 549
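
The job number can be cross-checked against the event range in the TweakPSet arguments. The sketch below is in Python and rests on stated assumptions: the numeric suffix in `job_input_file_list_4309.txt` is taken to be the job number, `--lastEvent` is treated as exclusive, and jobs are assumed to be numbered from 1 with a fixed number of events each. The helper names are hypothetical. Note that the offset differs slightly from the "(EvNo/125)-1" shorthand in the request; it was chosen because it reproduces 4309 for the values shown here.

```python
import re

# Subset of the TweakPSet arguments from the Scram log above.
ARGS = [
    "--inputFile=job_input_file_list_4309.txt",
    "--firstEvent=538501",
    "--lastEvent=538626",
]


def parse_args(args):
    """Turn ['--key=value', ...] into a dict of strings."""
    out = {}
    for arg in args:
        if arg.startswith("--") and "=" in arg:
            key, value = arg[2:].split("=", 1)
            out[key] = value
    return out


def job_from_event(event, events_per_job):
    """Assumed numbering: job n covers events (n-1)*N+1 .. n*N."""
    return (event - 1) // events_per_job + 1


opts = parse_args(ARGS)
first = int(opts["firstEvent"])
last = int(opts["lastEvent"])
events_per_job = last - first  # lastEvent treated as exclusive (assumption)
job_in_name = int(re.search(r"_(\d+)\.txt$", opts["inputFile"]).group(1))

# Event 538532 is the one seen in cmsRun-stdout.log.
print(events_per_job, job_in_name, job_from_event(538532, events_per_job))
```

If the derived job number and the suffix in the file name disagree for some other job, one of the assumptions above does not hold for that batch.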
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,888,015
RAC: 1,314
Message 1134 - Posted: 1 Oct 2015, 11:42:40 UTC - in response to Message 1133.  
Last modified: 2 Oct 2015, 21:53:05 UTC

From cmsRun-stdout.log:

Begin processing the 32nd record. Run 1, Event 538532, LumiSection 8617 at 01-Oct-2015 10:09:13.020 BST

From Scram log:
Beginning TweakPSet
arguments: ['/home/boinc/CMSRun/glide_qKUK1C/execute/dir_13393/TweakPSet.py', '--location=/home/boinc/CMSRun/glide_qKUK1C/execute/dir_13393', '--inputFile=job_input_file_list_4309.txt', '--runAndLumis=job_lumis_4309.json', '--firstEvent=538501', '--lastEvent=538626', '--firstLumi=8617', '--firstRun=1', '--seeding=AutomaticSeeding', '--lheInputFiles=False', '--oneEventMode=0', '--eventsPerLumi=100']

Is that enough, or do you need more from somewhere else?

Edit: Machine ID is 549

Yep, got it, job 4309. It seemed to take a while to start up:
==== CMSSW Stack Execution FINISHING at Thu Oct  1 09:27:35 2015 ====
...
== CMSSW:  Begin processing the 1st record. Run 1, Event 538501, LumiSection 8617 at 01-Oct-2015 10:03:00.513 BST
...
== CMSSW:  Begin processing the 125th record. Run 1, Event 538625, LumiSection 8618 at 01-Oct-2015 10:27:32.984 BST
...
======== CMSRunAnalysis.py FINISHING at Thu Oct  1 09:27:37 GMT 2015 ========
...
Copying 41060845 bytes file:///home/boinc/CMSRun/glide_qKUK1C/execute/dir_13393/step1.root => https://data-bridge-test.cern.ch/myfed/cms-boinc/output//dpm/brunel.ac.uk/home/cms/store/user/ireid/CMS_at_Home/CRAB3_nBias/150927_090635/0004/step1_4309.root
...
======== Stageout at Thu Oct  1 09:32:34 GMT 2015 FINISHING (short status 0) ========

And the output is on the data-bridge. Also, Dashboard has it as "finished" so it doesn't seem to be problematic as far as CRAB3 is concerned.
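
The log excerpt above mixes BST and GMT stamps, which makes the sequence harder to follow. A minimal Python sketch, assuming BST is a fixed UTC+1 offset, puts the two CMSSW record stamps on a common clock and measures the gap between them. The helper name is hypothetical, and parsing the month abbreviation assumes an English locale.

```python
from datetime import datetime, timedelta, timezone

# British Summer Time modelled as a fixed UTC+1 offset (assumption).
BST = timezone(timedelta(hours=1), "BST")


def parse_bst(stamp):
    """Parse e.g. '01-Oct-2015 10:03:00.513' as a BST timestamp."""
    return datetime.strptime(stamp, "%d-%b-%Y %H:%M:%S.%f").replace(tzinfo=BST)


# Stamps copied from the 1st and 125th record lines in the log excerpt.
first_record = parse_bst("01-Oct-2015 10:03:00.513")
last_record = parse_bst("01-Oct-2015 10:27:32.984")

processing = last_record - first_record
print("1st record (UTC):  ", first_record.astimezone(timezone.utc))
print("125th record (UTC):", last_record.astimezone(timezone.utc))
print("time between them: ", processing)
```

On that common clock the 125th record starts at about 09:27:33 UTC, a couple of seconds before the 09:27:35 FINISHING stamp, if that stamp is in GMT like the CMSRunAnalysis.py line.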
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,657,874
RAC: 15,085
Message 1135 - Posted: 1 Oct 2015, 12:50:11 UTC - in response to Message 1134.  

What about jobs 3920 and 3921?
These were the first 2 jobs run, the second one starting only 2 minutes after the first one.
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,657,874
RAC: 15,085
Message 1136 - Posted: 1 Oct 2015, 14:58:35 UTC - in response to Message 1135.  

It's now coming up to 26 hours elapsed and is running job 4538.
Still at 1.752% (1.753% is what it shows in Properties) and CPU used is still the same.

What will make it think it has done enough and stop?
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,888,015
RAC: 1,314
Message 1137 - Posted: 1 Oct 2015, 15:13:54 UTC - in response to Message 1136.  

It's now coming up to 26 hours elapsed and is running job 4538.
Still at 1.752% (1.753% is what it shows in Properties) and CPU used is still the same.

What will make it think it has done enough and stop?

3920 and 3921 are both reported as finished and their output has arrived.
Dashboard has 4538 as "pending", but it has produced its output file and the log file has come back to the server:
======== gWMS-CMSRunAnalysis.sh FINISHING at Thu Oct 1 14:59:16 GMT 2015 on 246-549-25760 with (short) status 0 ========
Local time: Thu Oct 1 15:59:16 BST 2015
Short exit status: 0
Job Running time in seconds: 2157

Do you see that in the log there? Are there any later logs?
I've no idea what is holding it, but it should be safe to abort.
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,657,874
RAC: 15,085
Message 1138 - Posted: 1 Oct 2015, 15:31:21 UTC - in response to Message 1137.  

Yes, logs were present for 4538.
It had moved on to 4562. I clicked Suspend, but the logs kept getting updated; after a few minutes I clicked Abort, and access to the logs went away within a few seconds. So 4562 should be incomplete at your end.

The old task has reported back in my list and has as many points as a typical judge in the Eurovision Song Contest gives one of our songs!

A new task has been downloaded and started.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,888,015
RAC: 1,314
Message 1139 - Posted: 1 Oct 2015, 16:29:41 UTC - in response to Message 1138.  

Yes, logs were present for 4538.
It had moved on to 4562. I clicked Suspend, but the logs kept getting updated; after a few minutes I clicked Abort, and access to the logs went away within a few seconds. So 4562 should be incomplete at your end.

The old task has reported back in my list and has as many points as a typical judge in the Eurovision Song Contest gives one of our songs!

A new task has been downloaded and started.

OK, don't worry about 4562; it will time out on the Condor server and be resubmitted.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,888,015
RAC: 1,314
Message 1140 - Posted: 1 Oct 2015, 16:33:22 UTC

NOTE: An update to the job-submission server has meant that a new batch of jobs I submitted today has gone into limbo. I'm waiting to hear back from CERN when the problem is solved. I'm not sure if the batch will start up automagically when things are fixed or whether I'll have to resubmit. So, we're going to be out of jobs soon, unfortunately, and I don't have a timescale for restart.
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 1141 - Posted: 1 Oct 2015, 16:37:47 UTC - in response to Message 1139.  
Last modified: 1 Oct 2015, 16:38:00 UTC

... it will time-out on the Condor server and resubmit.

How long is this time-out?
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 1142 - Posted: 1 Oct 2015, 16:40:28 UTC - in response to Message 1140.  

... So, we're going to be out of jobs soon, unfortunately, and I don't have a timescale for restart.

Okay, I have set 4 of 5 machines to no new work.

Please post again when new work is coming through.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,888,015
RAC: 1,314
Message 1143 - Posted: 1 Oct 2015, 17:45:09 UTC - in response to Message 1141.  

... it will time-out on the Condor server and resubmit.

How long is this time-out?

Two hours.
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 1144 - Posted: 1 Oct 2015, 17:56:08 UTC - in response to Message 1143.  

How long is this time-out? ... Two hours.


Ouch. Way too short for normal BOINC users.

As long as this is an alpha project this will be okay, but later on you should rethink this.


©2024 CERN