Message boards : News : Out Of Jobs

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 2091 - Posted: 26 Feb 2016, 10:07:42 UTC

We are out of jobs and are fighting with a few other issues. I have stopped new tasks being sent for now. A feature to handle this situation more gracefully is on the work plan. I hope that we can be back running after the weekend.
ID: 2091 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 75
Message 2092 - Posted: 26 Feb 2016, 11:14:01 UTC - in response to Message 2091.  

Laurence, I'm back on deck and I guess our messages are crossing... Sent a new batch of 100 jobs at 1050 and about 36 have been picked up so far. Can you enable a few more new tasks and we'll see how they go? Thanks, ivan
ID: 2092 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 15
Message 2093 - Posted: 26 Feb 2016, 11:59:23 UTC

BOINC-tasks ready to send 0

After a work request, BOINC reports: CMS-dev 26 Feb 12:54:46 Project has no tasks available

No jobs can be picked up by new requests.
73 BOINC-tasks in progress should be enough to drain your tiny well of 100 jobs.

Have 1 job running in my standalone CMS-VM outside of BOINC.
ID: 2093 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 75
Message 2094 - Posted: 26 Feb 2016, 12:17:02 UTC - in response to Message 2093.  

So far, one successful job -- and three cvmfs errors from another user. PM sent...
ID: 2094 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2098 - Posted: 26 Feb 2016, 16:03:08 UTC

What is the status?
Are we "go" again?
What was wrong and what happens to the other batches?
ID: 2098 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 75
Message 2101 - Posted: 26 Feb 2016, 19:05:43 UTC - in response to Message 2098.  

What is the status?
Both our CRAB3 Condor server and I myself seem to be back on our feet.
Are we "go" again?
Well, we seem to have some users who have stuck around. Laurence has started tasks flowing again from this site; there are a couple of tasks still running from vLHC, but they don't seem to have started our jobs there again.
What was wrong and what happens to the other batches?
I'm not sure exactly what all the problems were. Time-outs were being adjusted to assist with suspend/resume, we did run out of disk space briefly, and it's likely that an older batch also ran out of proxy time. In the middle of all this we were all busy patching everything Linux for the glibc vulnerability, which CMS (and possibly the rest of CERN, WLCG, etc.) had set a deadline of February 24th to fix; the patching seems to have adversely affected some servers and VMs. We seem to have lost the spool directory for the 160219 batch; the 160216 batch is still there but not active -- I've given it a new proxy but it doesn't seem to be even twitching. Currently Condor only knows about the batches I submitted today and a couple of zombies from last week that don't want to die properly. Usual problem of too many things going on at once, I suppose.
Let's give it a few days and see if we keep recovering.

ID: 2101 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2102 - Posted: 26 Feb 2016, 19:10:55 UTC

Thanks, Ivan.
That is the kind of thing that we need to hear every once in a while, instead of dead silence.
ID: 2102 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 75
Message 2105 - Posted: 26 Feb 2016, 19:27:52 UTC - in response to Message 2102.  

Thanks, Ivan.
That is the kind of thing that we need to hear every once in a while, instead of dead silence.

Unfortunately, I didn't take my modem into my sick-bed with me. :-)
ID: 2105 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2106 - Posted: 26 Feb 2016, 19:43:19 UTC - in response to Message 2105.  

Unfortunately, I didn't take my modem into my sick-bed with me. :-)

I did not mean you. There are others, or not?
ID: 2106 · Rating: 0
m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 0
Message 2108 - Posted: 26 Feb 2016, 20:25:09 UTC - in response to Message 2101.  

What is the status?
Both our CRAB3 Condor server and I myself seem to be back on our feet.
Are we "go" again?

Started a box up, the "stuck" job resumed, the next glidein run picked up a job
(159 from the last batch) and all is sweetness and light once again.
ID: 2108 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 75
Message 2112 - Posted: 27 Feb 2016, 12:58:12 UTC

I do see a potential problem -- condor_q says we have 92 jobs running, condor_status says there are 49 running.
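
For anyone wanting to check this kind of mismatch themselves, here is a rough sketch of the comparison in Python. It is only an illustration, not what we actually run: the input lists are made-up stand-ins for already-parsed job and slot records, the helper names (`QueueVsPool`, `compare_views`) are hypothetical, and I am assuming that "running" on the pool side corresponds to slots whose `State` is `Claimed`. The one firm detail is that HTCondor uses `JobStatus` 2 for a running job.

```python
from dataclasses import dataclass

RUNNING = 2  # HTCondor JobStatus code for a running job


@dataclass
class QueueVsPool:
    """Counts from the two views of the system (hypothetical helper type)."""
    queue_running: int  # jobs the queue believes are running
    pool_claimed: int   # slots the pool reports as claimed

    @property
    def unmatched(self) -> int:
        # Positive: the queue lists more running jobs than the pool has claimed slots.
        return self.queue_running - self.pool_claimed


def compare_views(job_ads, slot_ads):
    """Count running jobs and claimed slots from already-parsed records."""
    queue_running = sum(1 for ad in job_ads if ad.get("JobStatus") == RUNNING)
    pool_claimed = sum(1 for ad in slot_ads if ad.get("State") == "Claimed")
    return QueueVsPool(queue_running, pool_claimed)


# Illustrative data only, sized to match the counts quoted above.
jobs = [{"JobStatus": RUNNING}] * 92
slots = [{"State": "Claimed"}] * 49

report = compare_views(jobs, slots)
print(f"queue says {report.queue_running} running, "
      f"pool says {report.pool_claimed}; "
      f"{report.unmatched} jobs have no matching slot")
```

A non-zero difference does not say which side is wrong; it only flags that the queue and the pool disagree and that the extra jobs are worth a closer look.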
ID: 2112 · Rating: 0

©2024 CERN