Message boards :
CMS Application :
Dip?
Author | Message |
---|---|
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
Don't Panic! The current dip in the Job Activities graph appears to be a Dashboard problem. All my other monitors show business as usual. |
Send message Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 0 |
The -dev Server-feeder is not running at the moment, so we have to wait.... |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
"The -dev Server-feeder is not running at the moment, so we have to wait...." Yes, I just noticed that and messaged the CERN team. The Dashboard plots are back up again, though. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
"The -dev Server-feeder is not running at the moment, so we have to wait...." It seems there are still some ongoing problems with the transfer to a CentOS 7 host. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Still cannot report a finished BOINC task or get a new one. "Server error:Feeder not running" |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
"Still cannot report a finished BOINC task or get a new one." Try it now; I just got two new tasks. It seems to have been fixed in the last few minutes. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Appears to be working... Thanks! |
Send message Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 0 |
The wallclock is showing a lot of red at the moment for CMS-jobs. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
"The wallclock is showing a lot of red at the moment for CMS-jobs." Hmm, yes. I submitted a new proxy this morning, so it shouldn't be that, unless something went wrong. I did just notice that a new batch of jobs I submitted this morning didn't actually succeed; some server glitch by the looks of it. That would have been roughly around the time those jobs started failing. A resubmission appears to have been successful. According to WMStats there are still ~44 hours' worth of jobs to run. I'll see if I can work out what type of jobs are failing. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
task | status | site | exit code | jobs | error message
Production | jobfailed | T3_CH_Volunteer | 99109 | 4580 | LogArchiveFailure,Misc. StageOut error: 99109
Production | jobfailed | T3_CH_Volunteer | 99109 | 1492 | Misc. StageOut error: 99109
Maybe something did go wrong with the proxy creation. I'll do it again. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
Apparently there was a big upgrade today -- which might have been why my first attempt at a new batch of jobs failed -- and it disrupted the servers. Things have been restarted and look to be in the green again. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
|
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
I thought initially that it was more widespread, but it looks now like it's just the CMS project. [Edit] Hang on, we might be recovering! [/Edit] |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
...and now it's dropping away again. I need to go to bed soon, so I can't keep monitoring. :-( |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
Things came back about 90 mins later. Indications are that it was a network glitch: CMS jobs could not get through to the squid proxy to contact the various servers, so they quit with an exception. This explains the high number of job failures, I think, although there seemed to be Condor errors too. [Later] I found the performance graphs for all the squids used by CMS; most of them were completely inactive during the period that we were having difficulties, so it was a world-wide disruption. [/Later] |
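
For anyone who wants to see what "could not get through to the squid proxy, so they quit with an exception" looks like in practice, here is a minimal Python sketch. It is not the check the CMS jobs actually perform; `proxy_reachable` and `run_job` are hypothetical names made up for illustration, and the only thing tested is whether a plain TCP connection to the proxy's host and port can be opened.

```python
import socket


def proxy_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection resolves the name and connects in one call.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def run_job(host: str, port: int) -> str:
    """Hypothetical job wrapper: give up with an exception if the proxy is down."""
    if not proxy_reachable(host, port):
        raise ConnectionError(f"proxy {host}:{port} unreachable")
    return "proxy reachable, job can proceed"
```

Squid listens on port 3128 by default, so a call would look something like `run_job("squid.example.org", 3128)`, where the hostname is a placeholder and not a real CMS squid.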
©2025 CERN