Message boards : CMS Application : host backoff

m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 874,518
RAC: 399
Message 2598 - Posted: 7 Apr 2016, 10:41:21 UTC - in response to Message 2594.  

Ivan, does the above mean that this is your last day, or that you are hoping for a new contract in the post?

Yes. I've now had a .pdf by e-mail, but am still waiting for the paper copy to sign. So barring accidents, you've probably got me for another year...

Congratulations, hope the print isn't too small.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 2602 - Posted: 8 Apr 2016, 8:05:00 UTC - in response to Message 2598.  

Just the usual (but at least it's not in Comic Sans) -- and thankfully no mention of my passing National Retirement Age later this year (but that discrimination is outlawed here now :-).
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2697 - Posted: 13 Apr 2016, 1:08:39 UTC
Last modified: 13 Apr 2016, 1:12:56 UTC

Another runaway. 10+ fails.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 2709 - Posted: 13 Apr 2016, 10:08:30 UTC - in response to Message 2697.  

Unfortunately I won't have access to check until around midnight tonight...
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2710 - Posted: 13 Apr 2016, 10:16:00 UTC - in response to Message 2709.  

Thanks, Ivan.
We will have to live with it.
Numbers are still growing.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 2758 - Posted: 14 Apr 2016, 15:41:35 UTC - in response to Message 2697.  
Last modified: 14 Apr 2016, 15:41:48 UTC

Another runaway. 10+ fails.

The only host I could find with that many fails had quite a few successful jobs too. The one failure I looked at was an exception of some kind after 180 or so events -- it may be running close to the system memory limit or similar.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2889 - Posted: 21 Apr 2016, 11:01:57 UTC

There are two individual IP addresses that produce a lot of fails.
Mostly exit code 134.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 2895 - Posted: 21 Apr 2016, 13:35:23 UTC - in response to Message 2889.  

There are two individual IP addresses that produce a lot of fails.
Mostly exit code 134.

Memory problems, by the look of it; access exceptions, etc. Both from vLHC project. PMs sent; one is in Japan so may not respond immediately due to time-zone differences.
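For reference, an exit code of 134 is what a POSIX shell reports when a process is killed by signal 6 (128 + 6), which is SIGABRT on Linux. That would be consistent with a job aborting after a memory or access error. A minimal sketch of the decoding, assuming Linux signal numbering; the helper name is made up for illustration, and codes above 255 (such as job-level codes like 10031) are not interpreted here:

```python
import signal
from typing import Optional


def signal_from_exit_code(code: int) -> Optional[str]:
    """Return the signal name if `code` follows the shell's 128+N convention."""
    signum = code - 128
    if 0 < signum < signal.NSIG:
        try:
            return signal.Signals(signum).name
        except ValueError:
            return None  # no signal with that number on this platform
    return None


print(signal_from_exit_code(134))    # signal 6 under the 128+N convention
print(signal_from_exit_code(10031))  # outside the shell range, so not decoded
```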
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2896 - Posted: 21 Apr 2016, 13:39:52 UTC

Thanks, Ivan.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2918 - Posted: 21 Apr 2016, 21:40:00 UTC

We are getting a number of exit code 10031 failures, all from the same IP address.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 2930 - Posted: 22 Apr 2016, 10:13:40 UTC - in response to Message 2918.  

We are getting a number of exit code 10031 failures, all from the same IP address.

Some sort of glitch; one in the afternoon then 4 in quick succession in the night. -- Ah, same user I tried to contact yesterday; will send an e-mail...
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3253 - Posted: 4 May 2016, 20:28:37 UTC

There is, again, an increase in the error rate.
It is not that dramatic but is, again, caused mainly by one or two IP addresses.

If there were a system in place to eliminate the worst two or three hosts (IP addresses), the error rate could be improved by 50% or more.
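A rough sketch of the arithmetic behind that estimate, assuming the job records can be reduced to (IP address, succeeded) pairs. The function name and the sample numbers are invented for illustration and are not real dashboard data:

```python
from collections import Counter
from typing import Iterable, Tuple


def error_rate_without_worst(
    jobs: Iterable[Tuple[str, bool]], n_worst: int
) -> Tuple[float, float]:
    """Return (overall error rate, error rate with the n_worst hosts excluded)."""
    jobs = list(jobs)
    # Count failures per IP and pick the hosts with the most failures.
    fails = Counter(ip for ip, ok in jobs if not ok)
    worst = {ip for ip, _ in fails.most_common(n_worst)}
    kept = [(ip, ok) for ip, ok in jobs if ip not in worst]

    def rate(rows):
        return sum(1 for _, ok in rows if not ok) / len(rows) if rows else 0.0

    return rate(jobs), rate(kept)


# Invented example: 1000 jobs, 10 failures, 6 of them from two hosts.
sample = (
    [("10.0.0.1", False)] * 4
    + [("10.0.0.2", False)] * 2
    + [(f"10.0.1.{i}", False) for i in range(4)]
    + [(f"10.0.2.{i % 200}", True) for i in range(990)]
)
before, after = error_rate_without_worst(sample, n_worst=2)
print(f"{before:.2%} -> {after:.2%}")
```

In this made-up case, dropping the two worst hosts removes 6 of the 10 failures, which is the kind of improvement suggested above.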
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 3256 - Posted: 5 May 2016, 9:06:12 UTC - in response to Message 3253.  
Last modified: 5 May 2016, 9:08:09 UTC

There is, again, an increase in the error rate.
It is not that dramatic but is, again, caused mainly by one or two IP addresses.
Are they WMAgent jobs? I don't see many failures in my CRAB3 jobs. There's one user failing with WMAgent because of an apparently misconfigured firewall but I don't think anyone's managed to get a response from him yet.

If there were a system in place to eliminate the worst two or three hosts (IP addresses), the error rate could be improved by 50% or more.
I know it's on Laurence's to-do list, but so are 101 other things. :-/
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3257 - Posted: 5 May 2016, 9:24:11 UTC - in response to Message 3256.  
Last modified: 5 May 2016, 9:27:08 UTC

It seems to have stabilized.
I always get a bit nervous if there are 5 or more fails from the same IP in a short period of time.
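That rule of thumb could be checked automatically along these lines. This is only a sketch: the one-hour window is an assumed value for "a short period", and the function name and the input format (IP address, time of failure) are hypothetical:

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta
from typing import Iterable, Set, Tuple


def bursty_hosts(
    failures: Iterable[Tuple[str, datetime]],
    threshold: int = 5,
    window: timedelta = timedelta(hours=1),  # assumed meaning of "short period"
) -> Set[str]:
    """Return IPs with at least `threshold` failures inside any `window`."""
    recent = defaultdict(deque)  # ip -> failure times still inside the window
    flagged = set()
    for ip, when in sorted(failures, key=lambda f: f[1]):
        times = recent[ip]
        times.append(when)
        # Drop failures that have fallen out of the window ending at `when`.
        while when - times[0] > window:
            times.popleft()
        if len(times) >= threshold:
            flagged.add(ip)
    return flagged


# Invented data: one host fails 5 times in 20 minutes, another 5 times over 12 hours.
start = datetime(2016, 5, 4, 20, 0)
failures = [("10.0.0.1", start + timedelta(minutes=5 * i)) for i in range(5)]
failures += [("10.0.0.2", start + timedelta(hours=3 * i)) for i in range(5)]
print(bursty_hosts(failures))
```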

I think a 0.5% error rate is about normal. Up to about 1%, there are usually some dodgy hosts, and anything going over 1.5% indicates server issues.
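Expressed as code, those bands look roughly like this. The thresholds are the ones quoted above; the label for the gap between 1% and 1.5%, which is not classified above, is an assumption, as is the function name:

```python
def classify_error_rate(rate_percent: float) -> str:
    """Map an overall job error rate (in percent) to a rough diagnosis."""
    if rate_percent > 1.5:
        return "possible server issues"
    if rate_percent > 1.0:
        return "unclear"  # assumption: the 1% to 1.5% band is not classified
    if rate_percent > 0.5:
        return "probably some dodgy hosts"
    return "about normal"


for pct in (0.4, 0.8, 1.2, 2.0):
    print(pct, classify_error_rate(pct))
```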

I guess that, as you have an unlimited supply of jobs and a failed job gets reassigned, the only real issue is wasted computing time.

Maybe I just want the error stats to look as good as possible.

EDIT:
Are they WMAgent jobs?


I don't think so.
http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/dailysummary#button=jobstatus&sites[]=T3_CH_Volunteer&sitesSort=3&start=null&end=null&timerange=lastWeek&granularity=Hourly&generic=0&sortby=5&series=All&submissions[]=crab3&submissions[]=wmagent
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 3262 - Posted: 5 May 2016, 12:01:42 UTC - in response to Message 3257.  

Are they WMAgent jobs?


I don't think so.
http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/dailysummary#button=jobstatus&sites[]=T3_CH_Volunteer&sitesSort=3&start=null&end=null&timerange=lastWeek&granularity=Hourly&generic=0&sortby=5&series=All&submissions[]=crab3&submissions[]=wmagent

There's a ticket out at the moment (or at least there was yesterday) saying that WMAgent jobs weren't being reported properly. I can see 98 WMAgent Condor jobs running at RAL at present (alongside 102 of my CRAB jobs), and there are 10,000 in the queue.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3263 - Posted: 5 May 2016, 12:15:13 UTC

Thanks, Ivan.
Even if there is (was) a problem, it would not make individual IP addresses produce more fails, would it?
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 3266 - Posted: 5 May 2016, 13:10:09 UTC - in response to Message 3263.  

Thanks, Ivan.
Even if there is (was) a problem, it would not make individual IP addresses produce more fails, would it?

No, but we don't seem to be getting the full picture of what is running.


©2024 CERN