Thread 'Out Of Jobs'

Author	Message
Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1160 Credit: 342,328 RAC: 0	Message 2091 - Posted: 26 Feb 2016, 10:07:42 UTC We are out of jobs and are fighting with a few other issues. I have stopped new tasks being sent for now. A feature to handle this situation more gracefully is on the work plan. I hope that we can be back up and running after the weekend.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 28	Message 2092 - Posted: 26 Feb 2016, 11:14:01 UTC - in response to Message 2091. Laurence, I'm back on deck and I guess our messages are crossing... Sent a new batch of 100 jobs at 1050 and about 36 have been picked up so far. Can you enable a few more new tasks and we'll see how they go? Thanks, ivan

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1279 Credit: 1,047,416 RAC: 59	Message 2093 - Posted: 26 Feb 2016, 11:59:23 UTC BOINC tasks ready to send: 0. After a work request, BOINC reports: "CMS-dev 26 Feb 12:54:46 Project has no tasks available". No jobs can be picked up by new requests. The 73 BOINC tasks in progress should be enough to drain your tiny well of 100 jobs. I have 1 job running in my standalone CMS-VM outside of BOINC.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 28	Message 2094 - Posted: 26 Feb 2016, 12:17:02 UTC - in response to Message 2093. So far, one successful job -- and three cvmfs errors from another user. PM sent...

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2098 - Posted: 26 Feb 2016, 16:03:08 UTC What is the status? Are we "go" again? What was wrong and what happens to the other batches?

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 28	Message 2101 - Posted: 26 Feb 2016, 19:05:43 UTC - in response to Message 2098. [quote]What is the status?[/quote] Both our CRAB3 Condor server and I myself seem to be back on our feet. [quote]Are we "go" again?[/quote] Well, we seem to have some users who have stuck around. Laurence has started tasks flowing again from this site; there are a couple of tasks still running from vLHC, but they don't seem to have started our jobs there again. [quote]What was wrong and what happens to the other batches?[/quote] I'm not sure exactly what all the problems were. Time-outs were being adjusted to assist with suspend/resume, we did run out of disk space briefly, and it's likely that an older batch also ran out of proxy time. In the middle of all this we were all busy patching everything Linux for the glibc vulnerability that CMS (and possibly the rest of CERN, WLCG, etc.) had set a deadline of February 24th to fix; that seems to have adversely affected some servers and VMs. We seem to have lost the spool directory for the 160219 batch; the 160216 batch is still there but not active -- I've given it a new proxy but it doesn't seem to be even twitching. Currently Condor only knows the batches I submitted today and a couple of zombies from last week that don't want to die properly. Usual problem of too many things going on at once, I suppose. Let's give it a few days and see if we keep recovering.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2102 - Posted: 26 Feb 2016, 19:10:55 UTC Thanks, Ivan. That is the kind of thing that we need to hear every once in a while, instead of dead silence.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 28	Message 2105 - Posted: 26 Feb 2016, 19:27:52 UTC - in response to Message 2102. [quote]Thanks, Ivan. That is the kind of thing that we need to hear every once in a while, instead of dead silence.[/quote] Unfortunately, I didn't take my modem into my sick-bed with me. :-)

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2106 - Posted: 26 Feb 2016, 19:43:19 UTC - in response to Message 2105. [quote]Unfortunately, I didn't take my modem into my sick-bed with me. :-)[/quote] I did not mean you. There are others, or not?

m Volunteer tester Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 2108 - Posted: 26 Feb 2016, 20:25:09 UTC - in response to Message 2101. [quote]What is the status? Both our CRAB3 Condor server and I myself seem to be back on our feet. Are we "go" again?[/quote] Started a box up, the "stuck" job resumed, the next glidein run picked up a job (159 from the last batch) and all is sweetness and light once again.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 28	Message 2112 - Posted: 27 Feb 2016, 12:58:12 UTC I do see a potential problem -- condor_q says we have 92 jobs running, condor_status says there are 49 running.
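
One way to narrow down a mismatch like the 92 versus 49 above might be to diff the two views job by job rather than comparing totals. The sketch below is only an illustration, not anything the project is known to run: it assumes the output of condor_q and condor_status has already been reduced, by whatever means, to one job ID per line, and the function names and sample IDs are made up.

```python
def parse_ids(text):
    """Return the set of non-empty, stripped lines (assumed: one job ID per line)."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def compare_running(queue_text, slots_text):
    """Compare jobs the queue reports as running with jobs the slots report.

    Both arguments are hypothetical pre-filtered listings, one job ID per line.
    """
    queue_ids = parse_ids(queue_text)
    slot_ids = parse_ids(slots_text)
    return {
        "queue_running": len(queue_ids),
        "slots_running": len(slot_ids),
        # Jobs the schedd believes are running but no slot claims to run.
        "only_in_queue": sorted(queue_ids - slot_ids),
        # Jobs a slot claims to run that the schedd does not list as running.
        "only_in_slots": sorted(slot_ids - queue_ids),
    }


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    queue_listing = "10.0\n10.1\n10.2\n"
    slots_listing = "10.1\n"
    report = compare_running(queue_listing, slots_listing)
    for key, value in report.items():
        print(key, value)
```

Jobs that appear only on the queue side would be the first candidates to check for stale "running" states, such as the zombies mentioned earlier in the thread.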

Development for LHC@home