Thread 'No new jobs'

Author	Message
Rasputin42 · Volunteer tester · Joined: 16 Aug 15 · Posts: 967 · Credit: 1,216,795 · RAC: 0
Message 1217 - Posted: 10 Oct 2015, 19:21:18 UTC
Job queue is falling below 1000. We are at 718 now. I guess, at the current rate we will run out of jobs in 12h or so?

ivan · Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist · Joined: 20 Jan 15 · Posts: 1156 · Credit: 8,453,729 · RAC: 298
Message 1218 - Posted: 10 Oct 2015, 23:04:54 UTC - in response to Message 1217.
[quote]Job queue is falling below 1000. We are at 718 now. I guess, at the current rate we will run out of jobs in 12h or so?[/quote]
A bit more than that, I think. I'll put a new batch in tomorrow -- oops, this -- morning (it takes a while to get everything running, I'd rather not do it just when I'd prefer to be in bed). Hope it won't spoil anyone's Sunday if we miss a slot or two.

Crystal Pellet · Volunteer tester · Joined: 13 Feb 15 · Posts: 1279 · Credit: 1,045,772 · RAC: 141
Message 1220 - Posted: 11 Oct 2015, 9:20:25 UTC - in response to Message 1218. Last modified: 11 Oct 2015, 9:33:13 UTC
[quote][quote]Job queue is falling below 1000. We are at 718 now. I guess, at the current rate we will run out of jobs in 12h or so?[/quote]
A bit more than that, I think. I'll put a new batch in tomorrow -- oops, this -- morning (it takes a while to get everything running, I'd rather not do it just when I'd prefer to be in bed). Hope it won't spoil anyone's Sunday if we miss a slot or two.[/quote]
What's up, Ivan - Working on Sunday? ==> 151011_091025:ireid_crab_CMS_at_Home_TTbar19
Is it normal when >2500 Jobs are in running state directly after you submit them?

ivan · Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist · Joined: 20 Jan 15 · Posts: 1156 · Credit: 8,453,729 · RAC: 298
Message 1221 - Posted: 11 Oct 2015, 10:25:46 UTC - in response to Message 1220.
[quote]What's up, Ivan - Working on Sunday? ==> 151011_091025:ireid_crab_CMS_at_Home_TTbar19[/quote]
It's a tough job, but someone's got to do it! Luckily my broadband's not playing up at the moment, otherwise I'd have had to walk the 700 or 800 metres in to work. I caught it with 29 jobs still in the queue. :-) (That's GMT by the way, I don't get up at 0900 on Sundays!)
[quote]Is it normal when >2500 Jobs are in running state directly after you submit them?[/quote]
That's one of the little mysteries of Dashboard. I'm sure Condor didn't tell it that! I half expect them all to go into Unknown like the last batch did.

m · Volunteer tester · Joined: 20 Mar 15 · Posts: 243 · Credit: 901,716 · RAC: 0
Message 1226 - Posted: 12 Oct 2015, 11:27:36 UTC - in response to Message 1221.
[quote][quote]Is it normal when >2500 Jobs are in running state directly after you submit them?[/quote]
That's one of the little mysteries of Dashboard. I'm sure Condor didn't tell it that! I half expect them all to go into Unknown like the last batch did,[/quote]
...but it looks as though many more are failing in this task than the last. Look at the IPs.

ivan · Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist · Joined: 20 Jan 15 · Posts: 1156 · Credit: 8,453,729 · RAC: 298
Message 1228 - Posted: 12 Oct 2015, 13:53:45 UTC - in response to Message 1226.
[quote][quote][quote]Is it normal when >2500 Jobs are in running state directly after you submit them?[/quote]
That's one of the little mysteries of Dashboard. I'm sure Condor didn't tell it that! I half expect them all to go into Unknown like the last batch did[/quote]
...but it looks as though many more are failing in this task than the last. Look at the IPs.[/quote]
Hmm, 5 of the first 6 I looked at were the same IP address, or-67-237-248-3.dhcp.embarqhsd.net which seems to be from CenturyLink out of Monroe, LA.
== CMSSW: Begin processing the 1st record. Run 1, Event 10426, LumiSection 418 at 11-Oct-2015 19:14:44.312 CEST
== CMSSW: %MSG-e FatalSystemSignal: OscarProducer:g4SimHits 11-Oct-2015 19:14:44 CEST Run: 1 Event: 10426
== CMSSW: A fatal system signal has occurred: segmentation violation
and a few others I looked at were also segfaults in the first event; looks like that machine has memory problems or some such. Condor says:
[
Type = "NodeStatus";
Node = "Job418";
NodeStatus = 6; /* "STATUS_ERROR" */
StatusDetails = "POST script failed with status 2";
RetryCount = 0;
JobProcsQueued = 0;
JobProcsHeld = 0;
]
so I think it will be re-tried at a later time.
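A minimal sketch, assuming the record Condor returned is always a flat list of `Name = value;` pairs as in the post above, of how such a status entry could be read into a dictionary. The helper name `parse_record` is hypothetical and is not part of Condor or CRAB; only the record text itself comes from the thread.

```python
import re

# Record text copied from the post above.
RECORD = (
    '[ Type = "NodeStatus"; Node = "Job418"; NodeStatus = 6; /* "STATUS_ERROR" */ '
    'StatusDetails = "POST script failed with status 2"; RetryCount = 0; '
    'JobProcsQueued = 0; JobProcsHeld = 0; ]'
)

# One attribute: a name, "=", then either a quoted string or an integer, then ";".
ATTRIBUTE = re.compile(r'(\w+)\s*=\s*(?:"([^"]*)"|(-?\d+))\s*;')


def parse_record(text):
    """Parse a flat `[ Name = value; ... ]` record into a dict (hypothetical helper)."""
    # Remove /* ... */ comments first so their quoted text is not mistaken for a value.
    body = re.sub(r"/\*.*?\*/", "", text, flags=re.S)
    fields = {}
    for name, quoted, number in ATTRIBUTE.findall(body):
        # Integers are converted; quoted values are kept as strings.
        fields[name] = int(number) if number else quoted
    return fields


if __name__ == "__main__":
    record = parse_record(RECORD)
    print(record["Node"], record["NodeStatus"], record["StatusDetails"])
```

With the fields in a dictionary, the failing node (`Job418`), its status code and the `RetryCount` can be compared across jobs rather than read off by eye. The sketch does not handle nested records or expressions.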

ivan · Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist · Joined: 20 Jan 15 · Posts: 1156 · Credit: 8,453,729 · RAC: 298
Message 1229 - Posted: 12 Oct 2015, 14:11:18 UTC - in response to Message 1228.
Through the magic of admin privileges, I have identified the user...

Phil · Joined: 9 Apr 15 · Posts: 57 · Credit: 230,221 · RAC: 0
Message 1235 - Posted: 12 Oct 2015, 19:00:41 UTC - in response to Message 1229.
[quote]Through the magic of admin privileges, I have identified the user...[/quote]
Whew! Glad it's not me....

m · Volunteer tester · Joined: 20 Mar 15 · Posts: 243 · Credit: 901,716 · RAC: 0
Message 1236 - Posted: 12 Oct 2015, 19:10:33 UTC - in response to Message 1229.
[quote]Through the magic of admin privileges, I have identified the user...[/quote]
Bearing in mind previous examples, it might be as well to treat IPs with a degree of circumspection. Although there are many failures with that IP, it could be that it is just where the jobs happened to be first sent. As a further example, there is a failure in your previous task 151002_120226:ireid_crab_CMS_at_Home_TTbar18 job 2922 with my IP but timed when none of my hosts were running (they start between 0000 and 0010 GMT).

ivan · Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist · Joined: 20 Jan 15 · Posts: 1156 · Credit: 8,453,729 · RAC: 298
Message 1237 - Posted: 12 Oct 2015, 21:06:43 UTC - in response to Message 1236.
[quote][quote]Through the magic of admin privileges, I have identified the user...[/quote]
Bearing in mind previous examples, it might be as well to treat IPs with a degree of circumspection. Although there are many failures with that IP, it could be that it is just where the jobs happened to be first sent. As a further example, there is a failure in your previous task 151002_120226:ireid_crab_CMS_at_Home_TTbar18 job 2922 with my IP but timed when none of my hosts were running (they start between 0000 and 0010 GMT)[/quote]
Good point, and perhaps underlines the huge pinch of salt I take with Dashboard statistics. In any event, I'm waiting for a response from the user; the reported number of failures seems to have stabilised -- the graph on the Jobs page would seem to bear that out.

m · Volunteer tester · Joined: 20 Mar 15 · Posts: 243 · Credit: 901,716 · RAC: 0
Message 1238 - Posted: 13 Oct 2015, 10:55:03 UTC - in response to Message 1173.
[quote]We hope to make a large step towards that recognition when I give a presentation at CERN on the 15th[/quote]
Just to wish you well with this.
jp.

ivan · Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist · Joined: 20 Jan 15 · Posts: 1156 · Credit: 8,453,729 · RAC: 298
Message 1239 - Posted: 13 Oct 2015, 12:59:00 UTC - in response to Message 1238.
[quote][quote]We hope to make a large step towards that recognition when I give a presentation at CERN on the 15th[/quote]
Just to wish you well with this.
jp.[/quote]
Thanks.

Rasputin42 · Volunteer tester · Joined: 16 Aug 15 · Posts: 967 · Credit: 1,216,795 · RAC: 0
Message 1309 - Posted: 23 Oct 2015, 9:49:04 UTC
Jobs are running very low. 36 remaining.

Yeti · Joined: 29 May 15 · Posts: 162 · Credit: 3,373,109 · RAC: 10,387
Message 1311 - Posted: 23 Oct 2015, 10:53:31 UTC Last modified: 23 Oct 2015, 10:53:50 UTC
My box is sitting idle here and I just found this in http://localhost:54201/logs/run-1/glide_TZLLYf/StartdLog:
10/23/15 04:40:33 (pid:7593) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 8
10/23/15 04:40:33 (pid:7593) Buf::write(): condor_write() failed
10/23/15 04:50:57 (pid:7593) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 8
10/23/15 04:50:57 (pid:7593) Buf::write(): condor_write() failed
...
10/23/15 07:11:09 (pid:7593) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9619
10/23/15 07:11:09 (pid:7593) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9619 failed; will try to reconnect in 60 seconds.
10/23/15 07:12:10 (pid:7593) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9619 as ccbid 130.246.180.120:9619#101094
10/23/15 07:16:33 (pid:7593) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 8
10/23/15 07:16:33 (pid:7593) Buf::write(): condor_write() failed
...
10/23/15 12:38:57 (pid:7593) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 8
10/23/15 12:38:57 (pid:7593) Buf::write(): condor_write() failed
10/23/15 12:49:21 (pid:7593) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 8
10/23/15 12:49:21 (pid:7593) Buf::write(): condor_write() failed
So, what's going on?
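A minimal sketch, assuming every StartdLog line has the `MM/DD/YY HH:MM:SS (pid:N) message` shape seen in the excerpt above, of how the failures could be counted and timed instead of scanned by eye. The names `summarise`, `socket_closed` and `ccb_failure` are hypothetical labels chosen for this example; the sample lines are copied from the post.

```python
import re
from collections import Counter
from datetime import datetime

# A few lines copied from the StartdLog excerpt in the post above.
LOG = """\
10/23/15 04:40:33 (pid:7593) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 8
10/23/15 04:40:33 (pid:7593) Buf::write(): condor_write() failed
10/23/15 04:50:57 (pid:7593) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 8
10/23/15 04:50:57 (pid:7593) Buf::write(): condor_write() failed
...
10/23/15 07:11:09 (pid:7593) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9619
"""

# Timestamp, pid, then the free-text message.
LINE = re.compile(r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) \(pid:(\d+)\) (.*)$")


def summarise(log_text):
    """Count failure kinds and return (counts, first_timestamp, last_timestamp)."""
    counts = Counter()
    stamps = []
    for line in log_text.splitlines():
        match = LINE.match(line)
        if match is None:
            continue  # skips "..." elisions and blank lines
        stamp, _pid, message = match.groups()
        stamps.append(datetime.strptime(stamp, "%m/%d/%y %H:%M:%S"))
        if "Socket closed" in message:
            counts["socket_closed"] += 1
        elif message.startswith("CCBListener:") and "failed" in message:
            counts["ccb_failure"] += 1
    first = min(stamps) if stamps else None
    last = max(stamps) if stamps else None
    return counts, first, last


if __name__ == "__main__":
    counts, first, last = summarise(LOG)
    print(dict(counts), first, last)
```

In the excerpt, consecutive `Socket closed` entries are a little over ten minutes apart (04:40:33 to 04:50:57, and 12:38:57 to 12:49:21), which suggests a periodic update to the collector failing each time rather than a one-off glitch.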

ivan · Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist · Joined: 20 Jan 15 · Posts: 1156 · Credit: 8,453,729 · RAC: 298
Message 1313 - Posted: 23 Oct 2015, 11:57:30 UTC - in response to Message 1311.
I've been getting server errors all morning trying to submit a new batch of jobs. A request has been sent to the CRAB experts, I'm waiting on a reply. More jobs just as soon as I can submit them.

tullio · Joined: 17 Aug 15 · Posts: 62 · Credit: 296,695 · RAC: 0
Message 1314 - Posted: 23 Oct 2015, 13:04:45 UTC - in response to Message 1313. Last modified: 23 Oct 2015, 13:05:18 UTC
At least you are not CPDN. It's been out for a week.
Tullio

ivan · Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist · Joined: 20 Jan 15 · Posts: 1156 · Credit: 8,453,729 · RAC: 298
Message 1317 - Posted: 23 Oct 2015, 16:47:17 UTC - in response to Message 1313.
[quote]I've been getting server errors all morning trying to submit a new batch of jobs. A request has been sent to the CRAB experts, I'm waiting on a reply. More jobs just as soon as I can submit them.[/quote]
I found a better forum to ask my question. Turns out there was a new column added to the database but the preprod server we use wasn't updated. Have to wait for that, I guess; using the production server instead resulted in a communication failure to the Condor server at RAL.

ivan · Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist · Joined: 20 Jan 15 · Posts: 1156 · Credit: 8,453,729 · RAC: 298
Message 1320 - Posted: 24 Oct 2015, 10:19:47 UTC - in response to Message 1317.
It's still not fixed, I'm afraid. :-(

newman · Joined: 15 Feb 15 · Posts: 10 · Credit: 16,387 · RAC: 0
Message 1321 - Posted: 24 Oct 2015, 11:03:00 UTC
It seems my CMS WU is always idle and not getting a job. What does this log mean?
10/24/15 12:50:46 (pid:7822) FILETRANSFER: "/home/boinc/CMSRun/glide_ay7YHk/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/24/15 12:50:46 (pid:7822) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_ay7YHk/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_ay7YHk/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/24/15 12:50:46 (pid:7822) WARNING: Initializing plugins returned: FILETRANSFER:1:"/home/boinc/CMSRun/glide_ay7YHk/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring

ivan · Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist · Joined: 20 Jan 15 · Posts: 1156 · Credit: 8,453,729 · RAC: 298
Message 1323 - Posted: 24 Oct 2015, 12:13:26 UTC - in response to Message 1321. Last modified: 24 Oct 2015, 12:23:52 UTC
[quote]Seems my CMS WU always idle and not getting a job. what does this log mean?
10/24/15 12:50:46 (pid:7822) FILETRANSFER: "/home/boinc/CMSRun/glide_ay7YHk/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/24/15 12:50:46 (pid:7822) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_ay7YHk/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_ay7YHk/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/24/15 12:50:46 (pid:7822) WARNING: Initializing plugins returned: FILETRANSFER:1:"/home/boinc/CMSRun/glide_ay7YHk/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring[/quote]
I thought I'd responded to a similar query the other day, but I can't find it; maybe I forgot to click the Post button, I sometimes do... See p16 of http://research.cs.wisc.edu/htcondor/CondorWeek2011/presentations/zmiller-cw2011-data-placement.pdf for an explanation of what that command is supposed to do. Since there's no response, it's likely that the plugin wasn't (down?)loaded properly, so the reason why may appear earlier in the log file. It could be due to there being no Condor jobs available, though that seems unlikely - however, I'm no expert as to the inner (nor indeed the outer!) workings of Condor.
[Edit:] I just checked the logs we get back, and that message doesn't appear in anything we get sent back. Not surprising, really; if curl_plugin is non-functional then it can't fetch a Condor job to run, so there is no job log. So if you or someone else can capture one of these logs for us, we might be able to tell more then. [/Edit]
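The log message says that running the plugin with `-classad` "did not produce any output". A minimal sketch, assuming only that behaviour, of how a volunteer could repeat the same probe by hand inside the VM and see whether the plugin answers at all. The helper names `probe_plugin` and `describe` are hypothetical, and the two `FAKE_*` commands are stand-ins so the example runs anywhere; this is not how Condor itself performs the check.

```python
import subprocess
import sys


def probe_plugin(command, timeout=10):
    """Run `command -classad` and return whatever it printed on stdout, stripped."""
    result = subprocess.run(
        [*command, "-classad"],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout.strip()


def describe(command):
    """Return a one-line verdict for a plugin command (hypothetical helper)."""
    try:
        output = probe_plugin(command)
    except OSError as error:
        # Raised when the file is missing or not executable.
        return f"could not run: {error}"
    if not output:
        return "no output: matches the 'did not produce any output' log message"
    return "plugin responded"


# Stand-ins for a working and a silent plugin (assumptions for illustration only).
FAKE_OK = [sys.executable, "-c", "print('PluginType = \"FileTransfer\"')"]
FAKE_SILENT = [sys.executable, "-c", "pass"]

if __name__ == "__main__":
    print(describe(FAKE_OK))
    print(describe(FAKE_SILENT))
```

To try it on a real host, pass the path from the log, for example `describe(["/home/boinc/CMSRun/glide_ay7YHk/main/condor/libexec/curl_plugin"])`. A "could not run" verdict would point to a missing or broken download; "no output" would reproduce what the log reports.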

Development for LHC@home