Dip?

ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist) Joined: 20 Jan 15 Posts: 1129 Credit: 7,879,942 RAC: 562
Message 5080 - Posted: 16 Aug 2017, 8:08:04 UTC

Don't Panic! The current dip in the Job Activities graph appears to be a Dashboard problem. All my other monitors show business as usual.

maeax Joined: 22 Apr 16 Posts: 668 Credit: 1,810,226 RAC: 2,276
Message 5081 - Posted: 16 Aug 2017, 9:10:36 UTC

The -dev Server-feeder is not running at the moment, so we have to wait....

ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist) Joined: 20 Jan 15 Posts: 1129 Credit: 7,879,942 RAC: 562
Message 5082 - Posted: 16 Aug 2017, 15:22:22 UTC - in response to Message 5081.

[quote]The -dev Server-feeder is not running at the moment, so we have to wait....[/quote]

Yes, I just noticed that and messaged the CERN team. The Dashboard plots are back up again, though.

ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist) Joined: 20 Jan 15 Posts: 1129 Credit: 7,879,942 RAC: 562
Message 5083 - Posted: 16 Aug 2017, 15:48:44 UTC - in response to Message 5082.

[quote][quote]The -dev Server-feeder is not running at the moment, so we have to wait....[/quote] Yes, I just noticed that and messaged the CERN team. The Dashboard plots are back up again, though.[/quote]

Seems there are still some ongoing problems with the transfer to a CentOS 7 host.

Rasputin42 (Volunteer tester) Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Message 5084 - Posted: 16 Aug 2017, 16:03:50 UTC - in response to Message 5083.

I still cannot report a finished BOINC task or get a new one. "Server error:Feeder not running"

ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist) Joined: 20 Jan 15 Posts: 1129 Credit: 7,879,942 RAC: 562
Message 5089 - Posted: 17 Aug 2017, 6:53:38 UTC - in response to Message 5084. Last modified: 17 Aug 2017, 6:54:06 UTC

[quote]Still cannot report finished boinc-task or get a new one. "Server error:Feeder not running"[/quote]

Try it now; I just got two new tasks. Seems it's been fixed in the last few minutes.

Rasputin42 (Volunteer tester) Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Message 5091 - Posted: 17 Aug 2017, 11:00:37 UTC - in response to Message 5089.

Appears to be working... Thanks!

maeax Joined: 22 Apr 16 Posts: 668 Credit: 1,810,226 RAC: 2,276
Message 5139 - Posted: 18 Sep 2017, 13:11:35 UTC Last modified: 18 Sep 2017, 13:12:04 UTC

The wallclock is showing a lot of red at the moment for CMS-jobs.

ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist) Joined: 20 Jan 15 Posts: 1129 Credit: 7,879,942 RAC: 562
Message 5140 - Posted: 18 Sep 2017, 13:43:22 UTC - in response to Message 5139.

[quote]The wallclock is showing a lot of red at the moment for CMS-jobs.[/quote]

Hmm, yes. I submitted a new proxy this morning so it shouldn't be that, unless something went wrong. I did just notice that a new batch of jobs I submitted this morning didn't actually succeed; some server glitch by the looks of it. That would have been roughly the time those jobs started failing. A resubmission appears to have been successful. According to WMStats there are still ~44 hours' worth of jobs to run. I'll see if I can work out what type of jobs are failing.

ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist) Joined: 20 Jan 15 Posts: 1129 Credit: 7,879,942 RAC: 562
Message 5141 - Posted: 18 Sep 2017, 13:51:09 UTC - in response to Message 5140.

task | status | site | exit code | jobs | error message
Production | jobfailed | T3_CH_Volunteer | 99109 | 4580 | LogArchiveFailure,Misc. StageOut error: 99109
Production | jobfailed | T3_CH_Volunteer | 99109 | 1492 | Misc. StageOut error: 99109

Maybe something did go wrong with the proxy creation. I'll do it again.
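The two failure rows in message 5141 can be totalled with a few lines of Python. This is only a sketch, not a tool used by the project: the function name `failed_jobs_by_exit_code` is hypothetical, and the rows are copied by hand from the post rather than fetched from WMStats.

```python
from collections import defaultdict

# Rows copied by hand from message 5141 (header typo "mesage" corrected).
REPORT = """\
task | status | site | exit code | jobs | error message
Production | jobfailed | T3_CH_Volunteer | 99109 | 4580 | LogArchiveFailure,Misc. StageOut error: 99109
Production | jobfailed | T3_CH_Volunteer | 99109 | 1492 | Misc. StageOut error: 99109
"""


def failed_jobs_by_exit_code(report: str) -> dict:
    """Sum the 'jobs' column per 'exit code' in a pipe-separated report."""
    lines = [ln for ln in report.splitlines() if ln.strip()]
    header = [h.strip() for h in lines[0].split("|")]
    code_col = header.index("exit code")
    jobs_col = header.index("jobs")

    totals = defaultdict(int)
    for ln in lines[1:]:
        cells = [c.strip() for c in ln.split("|")]
        totals[int(cells[code_col])] += int(cells[jobs_col])
    return dict(totals)


if __name__ == "__main__":
    print(failed_jobs_by_exit_code(REPORT))
```

Both rows carry the same exit code, so the tally collapses to a single entry covering all the failed jobs listed in the post.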

ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist) Joined: 20 Jan 15 Posts: 1129 Credit: 7,879,942 RAC: 562
Message 5142 - Posted: 18 Sep 2017, 16:42:38 UTC - in response to Message 5141.

Apparently there was a big upgrade today -- which might have been why my first attempt at a new batch of jobs failed -- and it disrupted the servers. Things have been restarted and look to be in the green again.

ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist) Joined: 20 Jan 15 Posts: 1129 Credit: 7,879,942 RAC: 562
Message 5143 - Posted: 18 Sep 2017, 21:31:58 UTC - in response to Message 5142. Last modified: 18 Sep 2017, 21:40:42 UTC

Uh, oh, something's gone amiss again... Something seems wrong with the squid proxy.

ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist) Joined: 20 Jan 15 Posts: 1129 Credit: 7,879,942 RAC: 562
Message 5144 - Posted: 18 Sep 2017, 22:07:36 UTC - in response to Message 5143. Last modified: 18 Sep 2017, 22:08:49 UTC

I thought initially that it was more widespread, but it looks now like it's just the CMS project.

[Edit] Hang on, we might be recovering! [/Edit]

ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist) Joined: 20 Jan 15 Posts: 1129 Credit: 7,879,942 RAC: 562
Message 5145 - Posted: 18 Sep 2017, 22:18:05 UTC - in response to Message 5144.

...and now it's dropping away again. I need to go to bed soon, so I can't keep monitoring. :-(

ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist) Joined: 20 Jan 15 Posts: 1129 Credit: 7,879,942 RAC: 562
Message 5146 - Posted: 19 Sep 2017, 8:20:15 UTC - in response to Message 5145. Last modified: 19 Sep 2017, 12:32:13 UTC

Things came back about 90 mins later. Indications are that it was a network glitch: CMS jobs could not get through to the squid proxy to contact the various servers, so they quit with an exception. This explains the high number of job failures, I think, although there seemed to be Condor errors too.

[Later] I found the performance graphs for all the squids used by CMS; most of them were completely inactive during the period that we were having difficulties, so it was a world-wide disruption. [/Later]

Development for LHC@home