Dip?
Author | Message |
---|---|
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,251,944 RAC: 3,360 |
Sorry, just woke up to find this. Can't see what the problem is yet; it doesn't appear to be the WMAgent this time. It could take a while to fix: Thursday was a holiday at CERN, and many people turn that into a four-day weekend. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,251,944 RAC: 3,360 |
Suspicious! The number of running jobs started falling right at 0000 UTC... |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Some expired certificate, maybe? Although they are usually not aligned with the full-day mark. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,251,944 RAC: 3,360 |
Turned out that the queue filled up with merge jobs -- Laurence is away and his little cluster filled up. Cleared now. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Great, thanks a lot! I still have to wait until tomorrow because of the used-up quota. Please do something (or ask someone) about this. |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
"Please do something (or ask someone) about this." Highest priority for next week. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,251,944 RAC: 3,360 |
Warning: I've just spotted that our WMAgent has problems again, and the queue is depleting. Best set No New Tasks until we see if it gets fixed. There were another two down when I first noticed it, but they are back again so maybe someone is already working on it (my monitors only update every five minutes). |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,251,944 RAC: 3,360 |
OK, White Knight Alan to the rescue again. Jobs available now. I need to sleep soon... |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,251,944 RAC: 3,360 |
Right, we had an empty job queue from about 1939 UTC to 2039 UTC. Did anyone notice if Laurence's new task creation backoff worked in this period? |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
"Did anyone notice if Laurence's new task creation backoff worked in this period?" It didn't, but the issue has been identified and fixed. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
The running jobs graph dropped quite a bit. However, the error rate remains constant (maybe increased a little bit). How can that be? |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,251,944 RAC: 3,360 |
"The running jobs graph dropped quite a bit." Well, we could have lost some heavy hitter, tho' 400 cores seems a bit much for a non-institutional user (I don't think Laurence is running that many these days, just enough for the merge jobs). Or maybe Laurence's fix is working and BOINC is backing off the time delay between job requests. As far as I can tell from here all my hosts are still working nominally, but Laurence did have a job that failed on Friday because the cvmfs configuration had failed. |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
Most of the failures are due to data transfer errors. As my cluster is running on the CERN network, the transfers are quite reliable. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,251,944 RAC: 3,360 |
We seem to have recovered, starting around 2000 UTC last night. Strange... |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
No jobs, again. Quota is used up, again. I thought that should have been fixed. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,251,944 RAC: 3,360 |
"No jobs, again." Yes, something up with WMAgent again -- it must be Friday! I've messaged Alan and Laurence. |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
The mechanism is working but our buffer of jobs is too high for the dev project. I will lower it. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,251,944 RAC: 3,360 |
Yep, another WMAgent failure. Don't expect a fix until CERN starts to wake up in 8 or 9 hours. Usual suspects have been notified... Me? I'm going to bed soon, but it's way too hot to sleep well here at the moment. I've had to throttle my main machine here to 50%, otherwise it overheats running SETI@Home. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,251,944 RAC: 3,360 |
Weekend problem: Please see https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4393 |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,251,944 RAC: 3,360 |
"Weekend problem: Please see https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4393" OK, Alan was very quick to fix the problem, we are away again! |
©2024 CERN