Message boards : Number crunching : Expect errors eventually
Joined: 20 May 15 Posts: 217 Credit: 5,968,227 RAC: 0

> Shame on you ;-)

Did anyone look into the logs for that upload? I sat and watched and saw no retry attempts during the upload.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,155,005 RAC: 1,386

> Shame on you ;-)

Ah, sorry, I've been a bit distracted since then. It's still in my in-box; I said I'd look at it eventually!
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0

I am getting a lot of fails on my machine. Maybe the image got corrupted? What exit code do the failed jobs caused by the outage show up with? Can they be re-run in the current batch?
Joined: 20 Jan 15 Posts: 1139 Credit: 8,155,005 RAC: 1,386

> I am getting a lot of fails on my machine.

I think it's something more than that (or did you mean non-151 errors? A project reset will download a fresh VM for you).

I just did some figuring; I'd previously estimated the average upload speed for each of our jobs at 200 kbps*. We've been seeing around 350 concurrent jobs today, so that's a low-ball guess of 70 Mbps required for our returns. I'd hope that all of CERN's connectivity was at least 1 Gbps these days, but if there's a 100 Mbps bottleneck to the data-bridge...

* From Dashboard, the median job time for the current batch is ~46 minutes. From the data-bridge, the average result is ~45 MB. 45 MB is 450 Mb, figuring in parity bits, etc. 450 Mb in 2760 secs is 163 kbps.
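As a sanity check, the footnote's arithmetic can be reproduced in a few lines. All figures are the post's own estimates (45 MB per result, ~46 min median job time, 10 bits/byte to allow for parity/framing overhead, 350 concurrent jobs); nothing here is measured independently.

```python
# Reproduce the back-of-envelope bandwidth estimate from the post above.
result_mb = 45            # average result size, MB (from the data-bridge)
job_secs = 46 * 60        # median job time, seconds (2760 s, from Dashboard)
bits_per_byte = 10        # 8 data bits plus parity/overhead, as in the post

per_job_kbps = result_mb * 1e6 * bits_per_byte / job_secs / 1e3
print(f"per-job upload: {per_job_kbps:.0f} kbps")   # ~163 kbps

concurrent = 350
rounded_kbps = 200        # the post's rounded-up per-job figure
total_mbps = concurrent * rounded_kbps / 1e3
print(f"aggregate: {total_mbps:.0f} Mbps")          # 70 Mbps
```

So a 100 Mbps link to the data-bridge would indeed sit uncomfortably close to the estimated 70 Mbps aggregate.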
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0

Would the upload not take a lot longer if the connection to the server was choked? My uploads were as fast as they have been.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,155,005 RAC: 1,386

> Would the upload not take a lot longer, if the connection to the server was choked?

Well, I haven't had the time to analyse it in that detail, and in truth I won't really have it for another week -- the events today seriously compromised my other responsibilities, and I'm going to have to work over the weekend to try to make up time (one reason I'm not going to submit another batch of jobs yet). I try to avoid that, having burnt myself out before as a staff scientist at a large lab with too many responsibilities and not enough sleep.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0

That is perfectly OK. Time for bed. Good night.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0

Another high-failure host. Failed at least 8 out of the current 35. EDIT: make that 13+ out of 36.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,155,005 RAC: 1,386

> Another high-failure host.

I guess you mean

    == JOB AD: StartdPrincipal = "execute-side@matchsession/176.115.19.17"

That's a vLHC host, so it's difficult for me to identify him for a PM, and harder still to send him a kneemail.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0

Hi Ivan, after my post the total failures dropped from 36 to 18. I guess someone just reassigned some of the jobs back to the queue? Sending a PM.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,155,005 RAC: 1,386

> Hi Ivan, after my post the total failures dropped from 36 to 18.

Yes, what Dashboard flags as a failure may well be treated as a retry by Condor; when it's sent again, the Dashboard failure counter drops. At least that's my non-expert observation.

> Sending a PM.

Thanks, I'm about to go to bed, so don't expect an answer tonight. :-)
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0

Another runaway: 20+ fails from the same IP address in the last 3 h. THIS HAS TO STOP!
Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0

Since the beginning of February, our efficiency -- measured as wall-clock time of successful jobs over total wall-clock time -- is 85.5%. For comparison, we would expect a Grid site to have an efficiency of 95%. In the second link below you can see a pie chart of the errors. One by one we will have to go through the most common failure modes and make improvements where we can. I will add addressing the runaway (blackhole) issue to the workplan. If you could e-mail me the details, it would help me get started.

Thanks, Laurence
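For clarity, the efficiency metric Laurence describes can be sketched as follows. The job records below are made-up illustrations, not real Dashboard data; only the formula (successful wall-clock over total wall-clock) comes from the post.

```python
# Toy illustration of the efficiency metric: wall-clock time of
# successful jobs divided by total wall-clock time of all jobs.
jobs = [
    {"wall_secs": 2700, "ok": True},
    {"wall_secs": 2800, "ok": True},
    {"wall_secs": 450,  "ok": False},  # a failed job still burns wall time
]

total = sum(j["wall_secs"] for j in jobs)
good = sum(j["wall_secs"] for j in jobs if j["ok"])
print(f"efficiency: {100 * good / total:.1f}%")  # 92.4% for this toy data
```

The point of the metric is that early failures still consume volunteer wall-clock time, which is why runaway (blackhole) hosts drag the figure down so effectively.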
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0

I do not have your e-mail address. I could PM it.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0

I suggest that if any cms-boinc-task has more than 6(?) failed jobs, it should error out. Comments?
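The suggested cut-off could look something like the sketch below: count failed jobs per task and stop it once a threshold is passed. The function name, the task identifier, and the threshold of 6 are all illustrative assumptions, not an actual CMS@Home mechanism.

```python
# Sketch of a per-task failure cap: after MAX_FAILURES failed jobs,
# the task is told to error out instead of eating more work.
from collections import Counter

MAX_FAILURES = 6          # the tentative "more than 6?" threshold above

failures = Counter()      # failed-job count per task

def record_result(task_id: str, exit_ok: bool) -> bool:
    """Record a job result; return False once the task should error out."""
    if not exit_ok:
        failures[task_id] += 1
    return failures[task_id] <= MAX_FAILURES

# A runaway task failing repeatedly trips the limit on the 7th failure:
for _ in range(7):
    alive = record_result("task-42", exit_ok=False)
print(alive)  # False
```

A server-side version of this would stop a blackhole host from burning through an entire batch.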
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0

We appear to have a problem: large numbers of jobs are stuck in WNPostProc on the server. Some jobs have 2 WNPostProc running simultaneously.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,155,005 RAC: 1,386

> We appear to have a problem.

That could be because we've increased the Condor JobLeaseDuration from two hours to six days(!) to facilitate the job-resume capability (up to two days; apparently it waits for some parameter or other, or JobLeaseDuration/3, whichever is least, before closing the job). From past experience, I'd expected this to greatly increase the jobs-in-progress on the Jobs page, as jobs don't time out in a wait state. Though I'm not sure why that would put them into PostProc... I don't believe I can look at the Condor logs until they are out of PostProc. If you happen to get one yourself, please report what you see in _condor_stdout.
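To make the timeout interaction above concrete: the post says Condor waits for min(some parameter, JobLeaseDuration/3) before closing a disconnected job. The parameter is not named in the post, so the sketch below only checks the arithmetic with the "up to two days" figure as the assumed other bound.

```python
# Arithmetic of the described min(parameter, JobLeaseDuration/3) wait.
JOB_LEASE_DURATION = 6 * 24 * 3600   # six days, in seconds (the new value)
RECONNECT_WAIT = 2 * 24 * 3600       # "up to two days", per the post

effective_wait = min(RECONNECT_WAIT, JOB_LEASE_DURATION / 3)
print(effective_wait / 3600, "hours")  # 48.0 hours
```

With the old two-hour lease, JobLeaseDuration/3 was only 40 minutes, which is consistent with jobs previously timing out of the queue much faster.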
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 81

I got a new task and now I'm not able to get jobs. Run 1 lasted 22 minutes and run 2 gets the same error in glidein-stderr:

    Setting X509_USER_PROXY to canonical path /tmp/x509up_u500
    Wed Feb 17 10:55:03 CET 2016 Failed to load file 'description.g2h8Sm.cfg' from 'http://lcggwms01.gridpp.rl.ac.uk:8319/factory/stage/glidein_v3_2_7'.
    Wed Feb 17 10:55:03 CET 2016 Sleeping 259
    Wed Feb 17 10:59:22 CET 2016 Sleeping 313
    Wed Feb 17 11:04:36 CET 2016 Sleeping 291
    Wed Feb 17 11:09:27 CET 2016 Sleeping 329
```
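The sleeps in that log (259, 313, 291, 329 s) cluster around five minutes with random jitter, a common pattern for avoiding synchronized retry storms. The sketch below illustrates that pattern only; it is an assumption about the behaviour, not the glidein factory's actual retry code, and the base/jitter values are guesses from the four samples.

```python
# Illustrative randomized retry delay: a base interval with +/- jitter,
# so many failing clients don't all hammer the factory at once.
import random

def retry_sleep(base: int = 300, jitter: float = 0.15) -> int:
    """Return a sleep interval drawn uniformly from base +/- jitter*base."""
    low, high = int(base * (1 - jitter)), int(base * (1 + jitter))
    return random.randint(low, high)

delays = [retry_sleep() for _ in range(4)]
print(delays)  # four values in the 255..345 s range; varies per run
```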
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0

Same here. There are a number of processes that run each other, and all have different timeouts. I think it is very dangerous to drastically alter one timeout value.
Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0

I think there is an issue with the glidein at that URL.
©2024 CERN