host backoff
Author | Message |
---|---|
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 268 |
Ivan, does the above mean that this is your last day, or that you are hoping for a new contract in the post? Congratulations, hope the print isn't too small. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,541 RAC: 270 |
Just the usual (but at least it's not in Comic Sans) -- and thankfully no mention of my passing National Retirement Age later this year (but that discrimination is outlawed here now :-). |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Another runaway. 10+ fails. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,541 RAC: 270 |
Unfortunately I won't have access to check until around midnight tonight... |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks, Ivan. We will have to live with it. Numbers are still growing. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,541 RAC: 270 |
"Another runaway. 10+ fails." The only host I could find with that many fails had quite a few successful jobs too. The one failure I looked at was an exception of some kind after 180 or so events -- it may be running close to the system memory limit or similar. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
There are two individual IP addresses that produce a lot of fails. Mostly exit code 134. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,541 RAC: 270 |
"There are two individual IP addresses that produce a lot of fails." Memory problems, by the look of it; access exceptions, etc. Both from vLHC project. PMs sent; one is in Japan so may not respond immediately due to time-zone differences. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks, Ivan. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
We are getting a number of 10031 exit codes, all from the same IP address. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,541 RAC: 270 |
"We are getting a number of 10031 exit codes, all from the same IP address." Some sort of glitch; one in the afternoon then 4 in quick succession in the night. -- Ah, same user I tried to contact yesterday; will send an e-mail... |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
There is, again, an increase in error rate. It is not that dramatic but, again, it is caused mainly by one or two IP addresses. If there were a system in place to eliminate the worst two or three hosts (IP addresses), the error rate could be improved by 50% or more. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,541 RAC: 270 |
"There is, again, an increase in error rate." Are they WMAgent jobs? I don't see many failures in my CRAB3 jobs. There's one user failing with WMAgent because of an apparently misconfigured firewall but I don't think anyone's managed to get a response from him yet. "If there was a system in place to eliminate the worst two or three hosts (IP addresses), the error rate could be improved by 50% or more." I know it's on Laurence's to-do list, but so are 101 other things. :-/ |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
It seems to have stabilized. I always get a bit nervous if there are 5 or more fails from the same IP in a short period of time. I think a 0.5% error rate is about normal. Up to about 1%, there are usually some dodgy hosts, and anything going over 1.5% indicates server issues. I guess, as you have an unlimited supply of jobs and a failed job gets reassigned, the only real issue is wasted computing time. Maybe I just want the error stats to look as good as possible. EDIT: "Are they WMAgent jobs?" I don't think so. http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/dailysummary#button=jobstatus&sites[]=T3_CH_Volunteer&sitesSort=3&start=null&end=null&timerange=lastWeek&granularity=Hourly&generic=0&sortby=5&series=All&submissions[]=crab3&submissions[]=wmagent |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,541 RAC: 270 |
"Are they WMAgent jobs?" There's a ticket out at the moment (or at least was yesterday) that WMAgent jobs weren't being reported properly. I can see 98 WMAgent Condor jobs running at RAL at present (alongside 102 of my CRAB jobs), and there are 10,000 in the queue. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks, Ivan. Even if there is (was) a problem, it would not cause individual IP addresses to cause more fails, would it? |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,541 RAC: 270 |
"Thanks, Ivan." No, but we don't seem to be getting the full picture of what is running. |
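
The point raised in the thread, that one or two IP addresses can account for most of the failures, can be illustrated with a rough Python sketch. Everything in it is an assumption made for illustration: the job records are made-up (host IP, succeeded) pairs using documentation address ranges, the function names are hypothetical, and the "5 or more fails" threshold is only the rule of thumb quoted above, applied to whatever period the records cover. It is not how the project's dashboard or any actual host backoff works.

```python
from collections import Counter


def error_rate(jobs):
    """Fraction of failed jobs in a list of (host_ip, succeeded) pairs."""
    if not jobs:
        return 0.0
    return sum(1 for _, ok in jobs if not ok) / len(jobs)


def fails_per_host(jobs):
    """Count failed jobs per host IP."""
    return Counter(ip for ip, ok in jobs if not ok)


def flagged_hosts(jobs, threshold=5):
    """Hosts with at least `threshold` fails in the period the records cover."""
    return sorted(ip for ip, n in fails_per_host(jobs).items() if n >= threshold)


def rate_without_worst(jobs, n):
    """Error rate after dropping every job from the n hosts with most fails."""
    worst = {ip for ip, _ in fails_per_host(jobs).most_common(n)}
    return error_rate([job for job in jobs if job[0] not in worst])


def sample_jobs():
    """Made-up data: two bad hosts plus 98 mostly healthy ones (1000 jobs)."""
    jobs = [("192.0.2.1", False)] * 10 + [("192.0.2.1", True)] * 2
    jobs += [("192.0.2.2", False)] * 6 + [("192.0.2.2", True)] * 2
    for i in range(98):
        ip = f"198.51.100.{i}"
        # The first four healthy hosts each have a single failed job.
        jobs += [(ip, i >= 4)] + [(ip, True)] * 9
    return jobs


if __name__ == "__main__":
    jobs = sample_jobs()
    print(f"overall error rate:      {error_rate(jobs):.2%}")
    print(f"hosts with 5+ fails:     {flagged_hosts(jobs)}")
    print(f"rate without worst two:  {rate_without_worst(jobs, 2):.2%}")
```

With this invented data the two worst hosts contribute 16 of the 20 failures, so leaving them out moves the overall rate from the "server issues" range back towards what was described as normal. Real numbers would of course depend on the actual job records.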
©2024 CERN