Message boards : News : No new jobs

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1217 - Posted: 10 Oct 2015, 19:21:18 UTC

Job queue is falling below 1000.
We are at 718 now.
I guess, at the current rate we will run out of jobs in 12h or so?
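
For anyone wanting to redo this estimate as the queue changes, here is a minimal Python sketch of the arithmetic. The function names and the 60 jobs/hour figure are assumptions for illustration only (that rate is simply what a 12 h guess implies for 718 jobs); the real rate depends on how many volunteer hosts are running at the time.

```python
def drain_rate(jobs_earlier: int, jobs_now: int, hours_elapsed: float) -> float:
    """Estimate jobs consumed per hour from two readings of the queue."""
    if hours_elapsed <= 0:
        raise ValueError("hours_elapsed must be positive")
    return (jobs_earlier - jobs_now) / hours_elapsed


def hours_until_empty(jobs_queued: int, jobs_per_hour: float) -> float:
    """Hours left before the queue drains, assuming a constant rate."""
    if jobs_per_hour <= 0:
        raise ValueError("jobs_per_hour must be positive")
    return jobs_queued / jobs_per_hour


if __name__ == "__main__":
    rate = 60.0  # hypothetical jobs/hour, not a measured value
    print(f"{hours_until_empty(718, rate):.1f} h left at {rate:.0f} jobs/h")
```

A constant rate is the weak point of this estimate: the number of running hosts varies over the day, so treat the result as a rough guide.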
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,901,648
RAC: 2,120
Message 1218 - Posted: 10 Oct 2015, 23:04:54 UTC - in response to Message 1217.  

Job queue is falling below 1000.
We are at 718 now.
I guess, at the current rate we will run out of jobs in 12h or so?

A bit more than that, I think. I'll put a new batch in tomorrow -- oops, this -- morning (it takes a while to get everything running, I'd rather not do it just when I'd prefer to be in bed). Hope it won't spoil anyone's Sunday if we miss a slot or two.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 846,901
RAC: 2,193
Message 1220 - Posted: 11 Oct 2015, 9:20:25 UTC - in response to Message 1218.  
Last modified: 11 Oct 2015, 9:33:13 UTC

Job queue is falling below 1000.
We are at 718 now.
I guess, at the current rate we will run out of jobs in 12h or so?

A bit more than that, I think. I'll put a new batch in tomorrow -- oops, this -- morning (it takes a while to get everything running, I'd rather not do it just when I'd prefer to be in bed). Hope it won't spoil anyone's Sunday if we miss a slot or two.

What's up, Ivan - Working on Sunday? ==> 151011_091025:ireid_crab_CMS_at_Home_TTbar19

Is it normal when >2500 Jobs are in running state directly after you submit them?
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,901,648
RAC: 2,120
Message 1221 - Posted: 11 Oct 2015, 10:25:46 UTC - in response to Message 1220.  

What's up, Ivan - Working on Sunday? ==> 151011_091025:ireid_crab_CMS_at_Home_TTbar19
It's a tough job, but someone's got to do it! Luckily my broadband's not playing up at the moment otherwise I'd have had to walk the 7 or 800 metres in to work. I caught it with 29 jobs still in the queue. :-) (That's GMT by the way, I don't get up at 0900 on Sundays!)

Is it normal when >2500 Jobs are in running state directly after you submit them?
That's one of the little mysteries of Dashboard. I'm sure Condor didn't tell it that! I half expect them all to go into Unknown like the last batch did.
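
On the GMT remark: the prefix of the task name quoted above looks like a submission timestamp. A small sketch of decoding it, assuming the layout is YYMMDD_HHMMSS followed by a colon and the task name (an assumption based on this one example and the GMT comment, not a documented format; the function name is made up for illustration):

```python
from datetime import datetime, timezone


def parse_task_name(task: str):
    """Split 'YYMMDD_HHMMSS:name' into a UTC datetime and the name part.

    The timestamp layout is an assumption inferred from the example task.
    """
    stamp, _, name = task.partition(":")
    submitted = datetime.strptime(stamp, "%y%m%d_%H%M%S").replace(tzinfo=timezone.utc)
    return submitted, name


if __name__ == "__main__":
    when, name = parse_task_name("151011_091025:ireid_crab_CMS_at_Home_TTbar19")
    print(when.isoformat(), name)
```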
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 181
Message 1226 - Posted: 12 Oct 2015, 11:27:36 UTC - in response to Message 1221.  

Is it normal when >2500 Jobs are in running state directly after you submit them?
That's one of the little mysteries of Dashboard. I'm sure Condor didn't tell it that! I half expect them all to go into Unknown like the last batch did.

...but it looks as though many more are failing in this task than the last. Look at the IPs.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,901,648
RAC: 2,120
Message 1228 - Posted: 12 Oct 2015, 13:53:45 UTC - in response to Message 1226.  

Is it normal when >2500 Jobs are in running state directly after you submit them?
That's one of the little mysteries of Dashboard. I'm sure Condor didn't tell it that! I half expect them all to go into Unknown like the last batch did

...but it looks as though many more are failing in this task than the last. Look at the IPs.

Hmm, 5 of the first 6 I looked at were the same IP address, or-67-237-248-3.dhcp.embarqhsd.net which seems to be from CenturyLink out of Monroe, LA.

== CMSSW: Begin processing the 1st record. Run 1, Event 10426, LumiSection 418 at 11-Oct-2015 19:14:44.312 CEST
== CMSSW: %MSG-e FatalSystemSignal: OscarProducer:g4SimHits 11-Oct-2015 19:14:44 CEST Run: 1 Event: 10426
== CMSSW: A fatal system signal has occurred: segmentation violation

and a few others I looked at were also segfaults in the first event; looks like that machine has memory problems or some such. Condor says:
[
Type = "NodeStatus";
Node = "Job418";
NodeStatus = 6; /* "STATUS_ERROR" */
StatusDetails = "POST script failed with status 2";
RetryCount = 0;
JobProcsQueued = 0;
JobProcsHeld = 0;
]

so I think it will be re-tried at a later time.
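
If anyone wants to pull the fields out of a record like the one above, here is a rough Python sketch. It is not a general ClassAd parser: it assumes one `Key = value;` attribute per line, with values that are either quoted strings or integers, exactly as laid out in the quoted record. The function name is hypothetical.

```python
import re

# One "Key = value;" attribute per line, optionally followed by a /* comment */.
_ATTR = re.compile(r'^\s*(\w+)\s*=\s*(.*?);\s*(?:/\*.*\*/)?\s*$')


def parse_node_status(text: str) -> dict:
    """Parse a simple one-attribute-per-line record into a dict.

    Handles only quoted strings and integers; any other value type
    raises ValueError from int().
    """
    record = {}
    for line in text.splitlines():
        match = _ATTR.match(line)
        if not match:
            continue  # skips the "[" and "]" delimiters and blank lines
        key, raw = match.groups()
        if raw.startswith('"') and raw.endswith('"'):
            record[key] = raw[1:-1]  # strip the surrounding quotes
        else:
            record[key] = int(raw)
    return record
```

With the record above this gives a dict in which `NodeStatus` is the integer 6 and `StatusDetails` is the "POST script failed with status 2" string, which makes it easy to filter a batch of such records for failed nodes.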
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,901,648
RAC: 2,120
Message 1229 - Posted: 12 Oct 2015, 14:11:18 UTC - in response to Message 1228.  

Through the magic of admin privileges, I have identified the user...
Phil

Joined: 9 Apr 15
Posts: 57
Credit: 230,221
RAC: 0
Message 1235 - Posted: 12 Oct 2015, 19:00:41 UTC - in response to Message 1229.  

Through the magic of admin privileges, I have identified the user...

Whew! Glad it's not me....
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 181
Message 1236 - Posted: 12 Oct 2015, 19:10:33 UTC - in response to Message 1229.  

Through the magic of admin privileges, I have identified the user...

Bearing in mind previous examples, it might be as well to treat IPs with a degree of circumspection. Although there are many failures with that IP, it could be that it is just where the jobs happened to be first sent. As a further example, there is a failure in your previous task 151002_120226:ireid_crab_CMS_at_Home_TTbar18 job 2922 with my IP but timed when none of my hosts were running (they start between 0000 and 0010 GMT)
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,901,648
RAC: 2,120
Message 1237 - Posted: 12 Oct 2015, 21:06:43 UTC - in response to Message 1236.  

Through the magic of admin privileges, I have identified the user...

Bearing in mind previous examples, it might be as well to treat IPs with a degree of circumspection. Although there are many failures with that IP, it could be that it is just where the jobs happened to be first sent. As a further example, there is a failure in your previous task 151002_120226:ireid_crab_CMS_at_Home_TTbar18 job 2922 with my IP but timed when none of my hosts were running (they start between 0000 and 0010 GMT)

Good point, and perhaps underlines the huge pinch of salt I take with Dashboard statistics. In any event, I'm waiting for a response from the user; the reported number of failures seems to have stabilised -- the graph on the Jobs page would seem to bear that out.
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 181
Message 1238 - Posted: 13 Oct 2015, 10:55:03 UTC - in response to Message 1173.  



We hope to make a large step towards that recognition when I give a presentation at CERN on the 15th


Just to wish you well with this.

jp.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,901,648
RAC: 2,120
Message 1239 - Posted: 13 Oct 2015, 12:59:00 UTC - in response to Message 1238.  



We hope to make a large step towards that recognition when I give a presentation at CERN on the 15th


Just to wish you well with this.

jp.

Thanks.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1309 - Posted: 23 Oct 2015, 9:49:04 UTC

Jobs are running very low.
36 remaining.
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 1311 - Posted: 23 Oct 2015, 10:53:31 UTC
Last modified: 23 Oct 2015, 10:53:50 UTC

My box is sitting idle here and I just found in http://localhost:54201/logs/run-1/glide_TZLLYf/StartdLog:

10/23/15 04:40:33 (pid:7593) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 8
10/23/15 04:40:33 (pid:7593) Buf::write(): condor_write() failed
10/23/15 04:50:57 (pid:7593) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 8
10/23/15 04:50:57 (pid:7593) Buf::write(): condor_write() failed

...

10/23/15 07:11:09 (pid:7593) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9619
10/23/15 07:11:09 (pid:7593) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9619 failed; will try to reconnect in 60 seconds.
10/23/15 07:12:10 (pid:7593) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9619 as ccbid 130.246.180.120:9619#101094
10/23/15 07:16:33 (pid:7593) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 8
10/23/15 07:16:33 (pid:7593) Buf::write(): condor_write() failed

...

10/23/15 12:38:57 (pid:7593) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 8
10/23/15 12:38:57 (pid:7593) Buf::write(): condor_write() failed
10/23/15 12:49:21 (pid:7593) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 8
10/23/15 12:49:21 (pid:7593) Buf::write(): condor_write() failed


So, what's going on?
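
For a long StartdLog it can help to summarise rather than read line by line. The sketch below is a rough Python example that assumes every line has the `MM/DD/YY HH:MM:SS (pid:N) message` layout seen in the excerpt above; the function names are made up for illustration. It counts messages by their prefix and lists the gaps between successive "Socket closed" errors.

```python
import re
from collections import Counter
from datetime import datetime

# Timestamp, pid and free-text message, as laid out in the excerpt above.
_LINE = re.compile(r'^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\) (.*)$')


def parse_log(text):
    """Yield (timestamp, pid, message) for each line matching the layout."""
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match:
            stamp, pid, message = match.groups()
            yield datetime.strptime(stamp, "%m/%d/%y %H:%M:%S"), int(pid), message


def summarise(text):
    """Return (counts by message prefix, gaps between 'Socket closed' errors)."""
    kinds = Counter()
    socket_errors = []
    for when, _pid, message in parse_log(text):
        kinds[message.split(": ", 1)[0]] += 1  # e.g. "condor_write()", "CCBListener"
        if "Socket closed" in message:
            socket_errors.append(when)
    gaps = [later - earlier for earlier, later in zip(socket_errors, socket_errors[1:])]
    return kinds, gaps
```

In the excerpt, the adjacent pairs of socket errors (04:40:33 and 04:50:57, then 12:38:57 and 12:49:21) are each 10 min 24 s apart, which looks like a periodic update to the collector failing each time, though that is only a guess from the timestamps.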
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,901,648
RAC: 2,120
Message 1313 - Posted: 23 Oct 2015, 11:57:30 UTC - in response to Message 1311.  

I've been getting server errors all morning trying to submit a new batch of jobs. A request has been sent to the CRAB experts, I'm waiting on a reply.
More jobs just as soon as I can submit them.
Profile tullio

Joined: 17 Aug 15
Posts: 62
Credit: 296,695
RAC: 0
Message 1314 - Posted: 23 Oct 2015, 13:04:45 UTC - in response to Message 1313.  
Last modified: 23 Oct 2015, 13:05:18 UTC

At least you are not CPDN. It's been out for a week.
Tullio
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,901,648
RAC: 2,120
Message 1317 - Posted: 23 Oct 2015, 16:47:17 UTC - in response to Message 1313.  

I've been getting server errors all morning trying to submit a new batch of jobs. A request has been sent to the CRAB experts, I'm waiting on a reply.
More jobs just as soon as I can submit them.

I found a better forum to ask my question. Turns out there was a new column added to the database but the preprod server we use wasn't updated. Have to wait for that, I guess; using the production server instead resulted in a communication failure to the Condor server at RAL.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,901,648
RAC: 2,120
Message 1320 - Posted: 24 Oct 2015, 10:19:47 UTC - in response to Message 1317.  

It's still not fixed, I'm afraid. :-(
newman

Joined: 15 Feb 15
Posts: 10
Credit: 16,387
RAC: 0
Message 1321 - Posted: 24 Oct 2015, 11:03:00 UTC

It seems my CMS WU is always idle and not getting a job. What does this log mean?

10/24/15 12:50:46 (pid:7822) FILETRANSFER: "/home/boinc/CMSRun/glide_ay7YHk/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/24/15 12:50:46 (pid:7822) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_ay7YHk/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_ay7YHk/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/24/15 12:50:46 (pid:7822) WARNING: Initializing plugins returned: FILETRANSFER:1:"/home/boinc/CMSRun/glide_ay7YHk/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,901,648
RAC: 2,120
Message 1323 - Posted: 24 Oct 2015, 12:13:26 UTC - in response to Message 1321.  
Last modified: 24 Oct 2015, 12:23:52 UTC

It seems my CMS WU is always idle and not getting a job. What does this log mean?

10/24/15 12:50:46 (pid:7822) FILETRANSFER: "/home/boinc/CMSRun/glide_ay7YHk/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/24/15 12:50:46 (pid:7822) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_ay7YHk/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_ay7YHk/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/24/15 12:50:46 (pid:7822) WARNING: Initializing plugins returned: FILETRANSFER:1:"/home/boinc/CMSRun/glide_ay7YHk/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring

I thought I'd responded to a similar query the other day, but I can't find it; maybe I forgot to click the Post button, I sometimes do...
See p16 of http://research.cs.wisc.edu/htcondor/CondorWeek2011/presentations/zmiller-cw2011-data-placement.pdf for an explanation of what that command is supposed to do. Since there's no response, it's likely that the plugin wasn't (down?)loaded properly so the answer as to why may appear earlier in the log file*. It could be due to there being no Condor jobs available, though that seems unlikely - however, I'm no expert as to the inner (nor indeed the outer!) workings of Condor.

*[Edit:] I just checked the logs we get back, and that doesn't appear in anything we get sent back. Not surprising, really; if curl_plugin is non-functional then it can't fetch a Condor job to run it => no job log. So if you or someone else can capture one of these logs for us we might be able to tell more then. [/Edit]
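
If someone can get a shell inside the VM, one way to capture more is to run the plugin by hand with the same `-classad` argument and see whether it prints anything. Below is a hedged Python sketch of that check. The helper names are hypothetical, the plugin path has to be taken from your own log (the `glide_...` directory name differs between the logs posted in this thread), and I haven't verified that Python is available inside the VM image.

```python
import subprocess


def probe_plugin(argv, timeout=10):
    """Run a command and return its stdout as stripped text.

    An empty string corresponds to the "did not produce any output"
    case reported in the log above.
    """
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return result.stdout.strip()


def describe(argv):
    """Return the plugin's output, or a short note on why there was none."""
    try:
        output = probe_plugin(argv)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # OSError covers a missing or non-executable file.
        return f"could not run plugin: {exc}"
    return output if output else "plugin ran but printed nothing"
```

For example, `describe(["/path/from/your/log/curl_plugin", "-classad"])` would distinguish a plugin that is missing or not executable from one that runs but stays silent, which is the distinction the log line alone doesn't make.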

