Message boards : CMS Application : Dip?

Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,879,942
RAC: 562
Message 5080 - Posted: 16 Aug 2017, 8:08:04 UTC

Don't Panic! The current dip in the Job Activities graph appears to be a Dashboard problem. All my other monitors show business as usual.
maeax
Joined: 22 Apr 16
Posts: 668
Credit: 1,810,226
RAC: 2,276
Message 5081 - Posted: 16 Aug 2017, 9:10:36 UTC

The -dev server feeder is not running at the moment, so we have to wait...
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,879,942
RAC: 562
Message 5082 - Posted: 16 Aug 2017, 15:22:22 UTC - in response to Message 5081.  

The -dev server feeder is not running at the moment, so we have to wait...

Yes, I just noticed that and messaged the CERN team. The Dashboard plots are back up again, though.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,879,942
RAC: 562
Message 5083 - Posted: 16 Aug 2017, 15:48:44 UTC - in response to Message 5082.  

The -dev server feeder is not running at the moment, so we have to wait...

Yes, I just noticed that and messaged the CERN team. The Dashboard plots are back up again, though.

Seems there are still some ongoing problems with the transfer to a CentOS 7 host.
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 5084 - Posted: 16 Aug 2017, 16:03:50 UTC - in response to Message 5083.  

Still cannot report a finished BOINC task or get a new one.

"Server error:Feeder not running"
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,879,942
RAC: 562
Message 5089 - Posted: 17 Aug 2017, 6:53:38 UTC - in response to Message 5084.  
Last modified: 17 Aug 2017, 6:54:06 UTC

Still cannot report a finished BOINC task or get a new one.

"Server error:Feeder not running"

Try it now; I just got two new tasks. Seems it's been fixed in the last few minutes.
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 5091 - Posted: 17 Aug 2017, 11:00:37 UTC - in response to Message 5089.  

Appears to be working...

Thanks!
maeax
Joined: 22 Apr 16
Posts: 668
Credit: 1,810,226
RAC: 2,276
Message 5139 - Posted: 18 Sep 2017, 13:11:35 UTC
Last modified: 18 Sep 2017, 13:12:04 UTC

The wallclock is showing a lot of red at the moment for CMS jobs.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,879,942
RAC: 562
Message 5140 - Posted: 18 Sep 2017, 13:43:22 UTC - in response to Message 5139.  

The wallclock is showing a lot of red at the moment for CMS jobs.

Hmm, yes. I submitted a new proxy this morning, so it shouldn't be that, unless something went wrong. I did just notice that a new batch of jobs I submitted this morning didn't actually succeed; some server glitch, by the looks of it. That would have been roughly when those jobs started failing. A resubmission appears to have been successful.
According to WMStats there are still ~44 hours' worth of jobs to run. I'll see if I can work out what type of jobs are failing.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,879,942
RAC: 562
Message 5141 - Posted: 18 Sep 2017, 13:51:09 UTC - in response to Message 5140.  

task | status | site | exit code | jobs | error message
Production | jobfailed | T3_CH_Volunteer | 99109 | 4580 | LogArchiveFailure, Misc. StageOut error: 99109
Production | jobfailed | T3_CH_Volunteer | 99109 | 1492 | Misc. StageOut error: 99109


Maybe something did go wrong with the proxy creation. I'll do it again.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,879,942
RAC: 562
Message 5142 - Posted: 18 Sep 2017, 16:42:38 UTC - in response to Message 5141.  

Apparently there was a big upgrade today -- which might have been why my first attempt at a new batch of jobs failed -- and it disrupted the servers. Things have been restarted and look to be in the green again.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,879,942
RAC: 562
Message 5143 - Posted: 18 Sep 2017, 21:31:58 UTC - in response to Message 5142.  
Last modified: 18 Sep 2017, 21:40:42 UTC

Uh, oh, something's gone amiss again...

Something seems wrong with the squid proxy.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,879,942
RAC: 562
Message 5144 - Posted: 18 Sep 2017, 22:07:36 UTC - in response to Message 5143.  
Last modified: 18 Sep 2017, 22:08:49 UTC

I thought initially that it was more widespread, but it looks now like it's just the CMS project.
[Edit] Hang on, we might be recovering! [/Edit]
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,879,942
RAC: 562
Message 5145 - Posted: 18 Sep 2017, 22:18:05 UTC - in response to Message 5144.  

...and now it's dropping away again. I need to go to bed soon, so I can't keep monitoring. :-(
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,879,942
RAC: 562
Message 5146 - Posted: 19 Sep 2017, 8:20:15 UTC - in response to Message 5145.  
Last modified: 19 Sep 2017, 12:32:13 UTC

Things came back about 90 minutes later. Indications are that it was a network glitch that prevented CMS jobs from getting through to the squid proxy to contact the various servers, so they quit with an exception. This explains the high number of job failures, I think, although there seemed to be Condor errors too.
[Later] I found the performance graphs for all the squids used by CMS; most of them were completely inactive during the period when we were having difficulties, so it was a worldwide disruption. [/Later]

©2024 CERN