Out Of Jobs
Author | Message |
---|---|
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
We are out of jobs and are fighting with a few other issues. I have stopped new tasks being sent for now. A feature to handle this situation more gracefully is on the work plan. I hope that we can be back running after the weekend. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 75 |
Laurence, I'm back on deck and I guess our messages are crossing... Sent a new batch of 100 jobs at 1050 and about 36 have been picked up so far. Can you enable a few more new tasks and we'll see how they go? Thanks, ivan |
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 15 |
BOINC tasks ready to send: 0. After a work request, BOINC reports: "CMS-dev 26 Feb 12:54:46 Project has no tasks available". No jobs can be picked up by new requests. The 73 BOINC tasks in progress should be enough to drain your tiny well of 100 jobs. I have 1 job running in my standalone CMS-VM outside of BOINC. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 75 |
So far, one successful job -- and three cvmfs errors from another user. PM sent... |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
What is the status? Are we "go" again? What was wrong and what happens to the other batches? |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 75 |
"What is the status?" Both our CRAB3 Condor server and I myself seem to be back on our feet. "Are we 'go' again?" Well, we seem to have some users who have stuck around. Laurence has started tasks flowing again from this site; there are a couple of tasks still running from vLHC, but they don't seem to have started our jobs there again. "What was wrong and what happens to the other batches?" I'm not sure exactly what all the problems were. Time-outs were being adjusted to assist with suspend/resume, we did run out of disk space briefly, and it's likely that an older batch also ran out of proxy time. In the middle of all this we were all busy patching everything Linux for the glibc vulnerability, which CMS (and possibly the rest of CERN, WLCG, etc.) had set a deadline of February 24th to fix; that seems to have adversely affected some servers and VMs. We seem to have lost the spool directory for the 160219 batch; the 160216 batch is still there but not active -- I've given it a new proxy but it doesn't seem to be even twitching. Currently Condor only knows the batches I submitted today and a couple of zombies from last week that don't want to die properly. Usual problem of too many things going on at once, I suppose. Let's give it a few days and see if we keep recovering. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks, Ivan. That is the kind of thing that we need to hear every once in a while, instead of dead silence. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 75 |
"Thanks, Ivan." Unfortunately, I didn't take my modem into my sick-bed with me. :-) |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
"Unfortunately, I didn't take my modem into my sick-bed with me. :-)" I did not mean you. There are others, aren't there? |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
"'What is the status?' Both our CRAB3 Condor server and I myself seem to be back on our feet. 'Are we "go" again?'" Started a box up, the "stuck" job resumed, the next glidein run picked up a job (159 from the last batch) and all is sweetness and light once again. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 75 |
I do see a potential problem -- condor_q says we have 92 jobs running, condor_status says there are 49 running. |
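One way to narrow down a mismatch like the one above (92 jobs running according to condor_q, 49 according to condor_status) is to compare the two views job by job. The following is only a minimal sketch, assuming the job IDs have already been collected from each tool into plain lists. The function name `compare_running` and the sample IDs are hypothetical and do not come from the thread.

```python
def compare_running(queue_ids, slot_ids):
    """Compare job IDs the queue lists as running with IDs reported by execute slots."""
    queue = set(queue_ids)
    slots = set(slot_ids)
    return {
        # Running according to the queue, but no slot reports them
        "queue_only": sorted(queue - slots),
        # Reported by a slot, but not listed as running in the queue
        "slot_only": sorted(slots - queue),
        # Both views agree on these
        "both": sorted(queue & slots),
    }


# Hypothetical IDs, for illustration only
report = compare_running(
    ["101.0", "101.1", "101.2"],
    ["101.1", "101.2", "102.0"],
)
print(report["queue_only"])
```

Jobs that show up only on the queue side would be the first place to look for stale entries, such as the zombies mentioned earlier in the thread, though that reading is an assumption rather than a diagnosis.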
©2024 CERN