Message boards : News : No new jobs
Author | Message |
---|---|
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Job queue is falling below 1000. We are at 718 now. I guess that, at the current rate, we will run out of jobs in 12 h or so? |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
"Job queue is falling below 1000." A bit more than that, I think. I'll put a new batch in tomorrow -- oops, this -- morning (it takes a while to get everything running, and I'd rather not do it just when I'd prefer to be in bed). Hope it won't spoil anyone's Sunday if we miss a slot or two. |
Joined: 13 Feb 15 Posts: 1188 Credit: 878,593 RAC: 27 |
"Job queue is falling below 1000." What's up, Ivan - Working on Sunday? ==> 151011_091025:ireid_crab_CMS_at_Home_TTbar19 Is it normal when >2500 Jobs are in running state directly after you submit them? |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
"What's up, Ivan - Working on Sunday? ==> 151011_091025:ireid_crab_CMS_at_Home_TTbar19" It's a tough job, but someone's got to do it! Luckily my broadband's not playing up at the moment, otherwise I'd have had to walk the 700 or 800 metres in to work. I caught it with 29 jobs still in the queue. :-) (That's GMT by the way, I don't get up at 0900 on Sundays!) "Is it normal when >2500 Jobs are in running state directly after you submit them?" That's one of the little mysteries of Dashboard. I'm sure Condor didn't tell it that! I half expect them all to go into Unknown like the last batch did. |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
"Is it normal when >2500 Jobs are in running state directly after you submit them?" "That's one of the little mysteries of Dashboard. I'm sure Condor didn't tell it that! I half expect them all to go into Unknown like the last batch did," ...but it looks as though many more are failing in this task than the last. Look at the IPs. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
"Is it normal when >2500 Jobs are in running state directly after you submit them?" "That's one of the little mysteries of Dashboard. I'm sure Condor didn't tell it that! I half expect them all to go into Unknown like the last batch did" Hmm, 5 of the first 6 I looked at were the same IP address, or-67-237-248-3.dhcp.embarqhsd.net, which seems to be from CenturyLink out of Monroe, LA: == CMSSW: Begin processing the 1st record. Run 1, Event 10426, LumiSection 418 at 11-Oct-2015 19:14:44.312 CEST == CMSSW: %MSG-e FatalSystemSignal: OscarProducer:g4SimHits 11-Oct-2015 19:14:44 CEST Run: 1 Event: 10426 == CMSSW: A fatal system signal has occurred: segmentation violation. A few others I looked at were also segfaults in the first event; it looks like that machine has memory problems or some such. Condor says: [ Type = "NodeStatus"; Node = "Job418"; NodeStatus = 6; /* "STATUS_ERROR" */ StatusDetails = "POST script failed with status 2"; RetryCount = 0; JobProcsQueued = 0; JobProcsHeld = 0; ] so I think it will be retried at a later time. |
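For anyone who wants to pull the fields out of a Condor node-status record like the one quoted above, here is a minimal Python sketch. It is not a real ClassAd parser and is not how Dashboard or Condor process these records; it only handles flat `Key = value;` pairs with quoted strings or integers, which is all the quoted record contains. The function name `parse_simple_classad` is made up for illustration.

```python
import re

# Record copied from the post above; everything else here is illustrative.
NODE_STATUS = '''[ Type = "NodeStatus"; Node = "Job418"; NodeStatus = 6; /* "STATUS_ERROR" */ StatusDetails = "POST script failed with status 2"; RetryCount = 0; JobProcsQueued = 0; JobProcsHeld = 0; ]'''


def parse_simple_classad(text):
    """Parse a flat 'Key = value;' record into a dict (hypothetical helper)."""
    # Drop /* ... */ comments, then the surrounding square brackets.
    body = re.sub(r"/\*.*?\*/", "", text).strip().strip("[]")
    record = {}
    # Each pair is either Key = "string"; or Key = integer;
    for match in re.finditer(r'(\w+)\s*=\s*("([^"]*)"|-?\d+)\s*;', body):
        key, raw, quoted = match.group(1), match.group(2), match.group(3)
        record[key] = quoted if raw.startswith('"') else int(raw)
    return record


record = parse_simple_classad(NODE_STATUS)
print(record["Node"], record["NodeStatus"], record["StatusDetails"])
```

The comment in the quoted record labels status 6 as STATUS_ERROR; the sketch keeps only the numeric value and leaves interpretation to the reader.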
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
Through the magic of admin privileges, I have identified the user... |
Joined: 9 Apr 15 Posts: 57 Credit: 230,221 RAC: 0 |
"Through the magic of admin privileges, I have identified the user..." Whew! Glad it's not me.... |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
"Through the magic of admin privileges, I have identified the user..." Bearing in mind previous examples, it might be as well to treat IPs with a degree of circumspection. Although there are many failures with that IP, it could be that it is just where the jobs happened to be first sent. As a further example, there is a failure in your previous task 151002_120226:ireid_crab_CMS_at_Home_TTbar18, job 2922, with my IP but timed when none of my hosts were running (they start between 0000 and 0010 GMT). |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
"Through the magic of admin privileges, I have identified the user..." Good point, and it perhaps underlines the huge pinch of salt I take with Dashboard statistics. In any event, I'm waiting for a response from the user; the reported number of failures seems to have stabilised -- the graph on the Jobs page would seem to bear that out. |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
Just to wish you well with this. jp. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
Thanks. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Jobs are running very low. 36 remaining. |
Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
My box is sitting idle here and I just found this in http://localhost:54201/logs/run-1/glide_TZLLYf/StartdLog: 10/23/15 04:40:33 (pid:7593) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 8 10/23/15 04:40:33 (pid:7593) Buf::write(): condor_write() failed 10/23/15 04:50:57 (pid:7593) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 8 10/23/15 04:50:57 (pid:7593) Buf::write(): condor_write() failed ... 10/23/15 07:11:09 (pid:7593) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9619 10/23/15 07:11:09 (pid:7593) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9619 failed; will try to reconnect in 60 seconds. 10/23/15 07:12:10 (pid:7593) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9619 as ccbid 130.246.180.120:9619#101094 10/23/15 07:16:33 (pid:7593) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 8 10/23/15 07:16:33 (pid:7593) Buf::write(): condor_write() failed ... 10/23/15 12:38:57 (pid:7593) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 8 10/23/15 12:38:57 (pid:7593) Buf::write(): condor_write() failed 10/23/15 12:49:21 (pid:7593) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 8 10/23/15 12:49:21 (pid:7593) Buf::write(): condor_write() failed So, what's going on? |
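A log like the one above is easier to read once the repeated failures are counted. The Python sketch below is only an illustration: it assumes every line follows the `MM/DD/YY HH:MM:SS (pid:N) message` layout seen in the excerpt, and the category names (`socket_closed`, `ccb_failed`, `ccb_registered`) are labels invented here, not Condor terminology. The sample lines are copied from the post.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout, taken from the excerpt: "10/23/15 04:40:33 (pid:7593) message"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\) (.*)$")


def summarise(lines):
    """Count message kinds and collect timestamps of collector write failures."""
    counts = Counter()
    failure_times = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip "..." gaps and anything in another format
        stamp = datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
        message = m.group(3)
        if "condor_write(): Socket closed" in message:
            counts["socket_closed"] += 1
            failure_times.append(stamp)
        elif "CCBListener: registered" in message:
            counts["ccb_registered"] += 1
        elif "CCBListener" in message and "failed" in message:
            counts["ccb_failed"] += 1
    return counts, failure_times


SAMPLE = [
    "10/23/15 04:40:33 (pid:7593) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 8",
    "10/23/15 04:40:33 (pid:7593) Buf::write(): condor_write() failed",
    "10/23/15 07:11:09 (pid:7593) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9619",
    "10/23/15 07:12:10 (pid:7593) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9619 as ccbid 130.246.180.120:9619#101094",
    "10/23/15 12:49:21 (pid:7593) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 8",
]

counts, failure_times = summarise(SAMPLE)
print(dict(counts))
print("write failures span", failure_times[-1] - failure_times[0])
```

Run against a complete StartdLog, this would show whether the write failures are continuous or confined to one window, which is useful to include when reporting the problem.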
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
I've been getting server errors all morning trying to submit a new batch of jobs. A request has been sent to the CRAB experts, and I'm waiting on a reply. More jobs just as soon as I can submit them. |
Joined: 17 Aug 15 Posts: 62 Credit: 296,695 RAC: 0 |
At least you are not CPDN. It's been out for a week. Tullio |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
"I've been getting server errors all morning trying to submit a new batch of jobs. A request has been sent to the CRAB experts, I'm waiting on a reply." I found a better forum to ask my question. It turns out a new column was added to the database, but the preprod server we use wasn't updated. We'll have to wait for that, I guess; using the production server resulted in a communication failure to the Condor server at RAL. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
It's still not fixed, I'm afraid. :-( |
Joined: 15 Feb 15 Posts: 10 Credit: 16,387 RAC: 0 |
Seems my CMS WU is always idle and not getting a job. What does this log mean? 10/24/15 12:50:46 (pid:7822) FILETRANSFER: "/home/boinc/CMSRun/glide_ay7YHk/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring 10/24/15 12:50:46 (pid:7822) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_ay7YHk/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_ay7YHk/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring 10/24/15 12:50:46 (pid:7822) WARNING: Initializing plugins returned: FILETRANSFER:1:"/home/boinc/CMSRun/glide_ay7YHk/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
"Seems my CMS WU is always idle and not getting a job. What does this log mean?" I thought I'd responded to a similar query the other day, but I can't find it; maybe I forgot to click the Post button, I sometimes do... See p. 16 of http://research.cs.wisc.edu/htcondor/CondorWeek2011/presentations/zmiller-cw2011-data-placement.pdf for an explanation of what that command is supposed to do. Since there's no response, it's likely that the plugin wasn't (down?)loaded properly, so the answer as to why may appear earlier in the log file*. It could be due to there being no Condor jobs available, though that seems unlikely; however, I'm no expert as to the inner (nor indeed the outer!) workings of Condor. *[Edit:] I just checked the logs we get back, and that message doesn't appear in any of them. Not surprising, really; if curl_plugin is non-functional then it can't fetch a Condor job to run => no job log. So if you or someone else can capture one of these logs for us, we might be able to tell more then. [/Edit] |
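Since the cause may appear earlier in the log file, one way to capture the relevant part is to keep the lines that immediately precede the first `curl_plugin` message. The Python sketch below is a hedged illustration: `context_before` and its `keep` parameter are hypothetical names, and how the log lines are obtained (file path, web page on the VM) is left to the reader because it is not stated in the thread.

```python
from collections import deque


def context_before(lines, needle="curl_plugin", keep=20):
    """Return up to `keep` lines preceding the first line that contains
    `needle`, followed by that line; return [] if `needle` never appears."""
    recent = deque(maxlen=keep)  # rolling window of the most recent lines
    for line in lines:
        if needle in line:
            return list(recent) + [line]
        recent.append(line)
    return []


# Tiny in-memory example; real input would be the lines of the captured log.
example = [
    "line 1",
    "line 2",
    "line 3",
    'FILETRANSFER: ".../curl_plugin -classad" did not produce any output, ignoring',
    "line 5",
]
for entry in context_before(example, keep=2):
    print(entry)
```

Posting the returned lines rather than only the warning itself gives whoever is debugging a chance to see what happened just before the plugin failed.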
©2025 CERN