Message boards : CMS Application : Dip?

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,908,754
RAC: 2,311
Message 4936 - Posted: 27 May 2017, 7:04:10 UTC - in response to Message 4935.  

Sorry, just woke up to find this. Can't see what the problem is yet, but it doesn't appear to be the WMAgent this time. It could take a while to fix: Thursday was a holiday at CERN, and many people turn that into a four-day weekend.
ivan
Message 4937 - Posted: 27 May 2017, 7:12:29 UTC - in response to Message 4936.  

Suspicious! The number of running jobs started falling right at 0000 UTC...
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4940 - Posted: 27 May 2017, 8:47:19 UTC - in response to Message 4937.  

Some expired certificate, maybe?
Although they are usually not aligned with the full-day mark.
ivan
Message 4941 - Posted: 27 May 2017, 12:08:36 UTC - in response to Message 4940.  

Turned out that the queue filled up with merge jobs -- Laurence is away and his little cluster filled up. Cleared now.
Rasputin42
Message 4942 - Posted: 27 May 2017, 12:36:05 UTC - in response to Message 4941.  

Great, thanks a lot!
I still have to wait until tomorrow, because of the used-up quota.
Please do something (or ask someone) about this.
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,531
RAC: 199
Message 4944 - Posted: 27 May 2017, 12:46:59 UTC - in response to Message 4942.  

Please do something (or ask someone) about this.


Highest priority for next week.
ivan
Message 4953 - Posted: 29 May 2017, 18:26:43 UTC

Warning: I've just spotted that our WMAgent has problems again, and the queue is depleting. Best set No New Tasks until we see if it gets fixed. There were another two down when I first noticed it, but they are back again so maybe someone is already working on it (my monitors only update every five minutes).
ivan
Message 4954 - Posted: 29 May 2017, 21:04:22 UTC - in response to Message 4953.  

OK, White Knight Alan to the rescue again. Jobs available now. I need to sleep soon...
ivan
Message 4955 - Posted: 29 May 2017, 21:32:20 UTC

Right, we had an empty job queue from about 1939 UTC to 2039 UTC. Did anyone notice if Laurence's new task creation backoff worked in this period?
Laurence
Message 4957 - Posted: 30 May 2017, 9:42:33 UTC - in response to Message 4955.  

Did anyone notice if Laurence's new task creation backoff worked in this period?


It didn't, but the issue has been identified and fixed.
Rasputin42
Message 4981 - Posted: 11 Jun 2017, 7:07:17 UTC

The running jobs graph dropped quite a bit.
However, the error rate remains constant (maybe increased a little bit).
How can that be?
ivan
Message 4982 - Posted: 11 Jun 2017, 9:48:33 UTC - in response to Message 4981.  

The running jobs graph dropped quite a bit.
However, the error rate remains constant (maybe increased a little bit).
How can that be?

Well, we could have lost some heavy hitter, tho' 400 cores seems a bit much for a non-institutional user (I don't think Laurence is running that many these days, just enough for the merge jobs). Or maybe Laurence's fix is working and BOINC is backing off the time delay between job requests. As far as I can tell from here all my hosts are still working nominally, but Laurence did have a job that failed on Friday because the cvmfs configuration had failed.
Laurence
Message 4983 - Posted: 12 Jun 2017, 7:27:15 UTC - in response to Message 4981.  

Most of the failures are due to data transfer errors. As my cluster is running on the CERN network, the transfers are quite reliable.
ivan
Message 4984 - Posted: 12 Jun 2017, 8:10:48 UTC

We seem to have recovered, starting around 2000 UTC last night. Strange...
Rasputin42
Message 5005 - Posted: 16 Jun 2017, 6:04:07 UTC

No jobs, again.
Quota is used up, again.
I thought that should have been fixed.
ivan
Message 5006 - Posted: 16 Jun 2017, 8:10:11 UTC - in response to Message 5005.  

No jobs, again.
Quota is used up, again.
I thought that should have been fixed.

Yes, something up with WMAgent again -- it must be Friday! I've messaged Alan and Laurence.
Laurence
Message 5008 - Posted: 16 Jun 2017, 8:30:57 UTC - in response to Message 5005.  

The mechanism is working but our buffer of jobs is too high for the dev project. I will lower it.
ivan
Message 5013 - Posted: 19 Jun 2017, 0:20:25 UTC
Last modified: 19 Jun 2017, 0:23:13 UTC

Yep, another WMAgent failure. Don't expect a fix until CERN starts to wake up in 8 or 9 hours. Usual suspects have been notified...

Me? I'm going to bed soon, but it's way too hot to sleep well here, at the moment. I've had to throttle my main machine here to 50%, otherwise it overheats running SETI@Home.
ivan
Message 5062 - Posted: 5 Aug 2017, 4:36:09 UTC

ivan
Message 5063 - Posted: 5 Aug 2017, 7:46:17 UTC - in response to Message 5062.  

Weekend problem: Please see https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4393

OK, Alan was very quick to fix the problem; we are away again!