Message boards : Number crunching : Expect errors eventually
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,823,741
RAC: 17,217
Message 1721 - Posted: 29 Jan 2016, 22:13:48 UTC - in response to Message 1720.  

Shame on you ;-)

Did anyone look into the logs for the upload I sat and watched and saw no attempts at a retry during the upload ?
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1722 - Posted: 29 Jan 2016, 23:19:23 UTC - in response to Message 1721.  

> Shame on you ;-)
>
> Did anyone look into the logs for the upload I sat and watched and saw no attempts at a retry during the upload ?

Ah, sorry, I've been a bit distracted since then. It's still in my in-box, I said I'd look at it eventually!
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1723 - Posted: 29 Jan 2016, 23:20:27 UTC
Last modified: 29 Jan 2016, 23:28:37 UTC

I am getting a lot of fails on my machine.
Maybe the image got corrupted?

What exit code do the failed jobs caused by the outage show up with?
Can they be re-run in the current batch?
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1724 - Posted: 29 Jan 2016, 23:37:30 UTC - in response to Message 1723.  

> I am getting a lot of fails on my machine.
> Maybe the image got corrupted?

I think it's something more than that (or did you mean non-151 errors? A project reset will download a fresh VM for you.). I just did some figuring; I'd previously estimated the average upload speed for each of our jobs at 200 kbps*. Now we've been seeing around 350 concurrent jobs today, so that's a low-ball guess of 70 Mbps required for our returns. I'd hope that all of CERN's connectivity was at least 1 Gbps these days, but if there's a 100 Mbps bottleneck to the data-bridge...

* From Dashboard, the median job time for the current batch is ~46 minutes. From the data-bridge, the average result is ~45 MB. 45 MB is 450 Mb, figuring in parity bits, etc. 450 Mb in 2760 secs is 163 kbps.
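
The back-of-envelope figures above can be reproduced in a few lines of Python. This is only a sketch: the factor of 10 bits per byte is the post's own allowance for parity bits and other overhead, and the function and constant names are made up for illustration.

```python
# Rough check of the upload-bandwidth estimate quoted above.
# All names are illustrative; the inputs are the figures from the post.

BITS_PER_BYTE_ON_WIRE = 10  # 8 data bits plus an allowance for parity/overhead


def per_job_kbps(result_mb: float, job_minutes: float) -> float:
    """Average upload rate of one job in kbps, spread over its run time."""
    megabits = result_mb * BITS_PER_BYTE_ON_WIRE
    seconds = job_minutes * 60
    return megabits * 1000 / seconds


def aggregate_mbps(rate_kbps: float, concurrent_jobs: int) -> float:
    """Total bandwidth in Mbps needed when many jobs upload at that rate."""
    return rate_kbps * concurrent_jobs / 1000


measured = per_job_kbps(result_mb=45, job_minutes=46)
print(f"per job: {measured:.0f} kbps")
print(f"350 jobs at 200 kbps: {aggregate_mbps(200, 350):.0f} Mbps")
print(f"350 jobs at the measured rate: {aggregate_mbps(measured, 350):.0f} Mbps")
```

With the measured 163 kbps rather than the rounded-up 200 kbps, the aggregate comes out somewhat below 70 Mbps, so the 70 Mbps figure is on the cautious side of the same estimate.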
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1726 - Posted: 29 Jan 2016, 23:46:08 UTC - in response to Message 1724.  

Would the upload not take a lot longer, if the connection to the server was choked?
My uploads were as fast as they have been.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1727 - Posted: 30 Jan 2016, 0:45:32 UTC - in response to Message 1726.  

> Would the upload not take a lot longer, if the connection to the server was choked?
> My uploads were as fast as they have been.

Tja, I haven't had the time to analyse in that detail, and in truth I won't really have it for another week. The events today seriously compromised my other responsibilities, so I'm going to have to work over the weekend to try to make up time (one reason I'm not going to submit another batch of jobs yet). I try to avoid that, having burnt myself out before as a staff scientist at a large lab with too many responsibilities and not enough sleep.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1728 - Posted: 30 Jan 2016, 0:49:05 UTC - in response to Message 1727.  

That is perfectly OK.
Time for bed.
Good night.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1993 - Posted: 13 Feb 2016, 21:28:13 UTC
Last modified: 13 Feb 2016, 21:34:08 UTC

Another high failure host.
Failed at least 8 out of the current 35.

EDIT: make that 13+ out of 36.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1997 - Posted: 13 Feb 2016, 22:43:26 UTC - in response to Message 1993.  

> Another high failure host.
> Failed at least 8 out of the current 35.
>
> EDIT: make that 13+ out of 36.

I guess you mean
== JOB AD: StartdPrincipal = "execute-side@matchsession/176.115.19.17"
That's a vLHC host, so it's difficult for me to identify him for a PM, and harder still to send him a kneemail.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1999 - Posted: 13 Feb 2016, 22:49:43 UTC - in response to Message 1997.  

Hi Ivan,
after my post the total failures dropped from 36 to 18.
I guess someone just reassigned some of the jobs back to the queue?
Sending a PM.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 2000 - Posted: 13 Feb 2016, 22:56:20 UTC - in response to Message 1999.  

> Hi Ivan,
> after my post the total failures dropped from 36 to 18.
> I guess someone just reassigned some of the jobs back to the queue?

Yes, what Dashboard flags as a failure may well be treated as a retry by Condor; when it's sent again the Dashboard failure counter drops. At least that's my non-expert observation.

> Sending a PM.

Thanks, I'm about to go to bed, so don't expect an answer tonight. :-)
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2008 - Posted: 15 Feb 2016, 18:28:04 UTC
Last modified: 15 Feb 2016, 18:31:43 UTC

Another runaway.

20+ failures from the same IP address in the last 3 hours.

THIS HAS TO STOP!
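
For what it's worth, spotting this kind of runaway host could in principle be automated. The following is a minimal sketch, not anything the project is known to run: it assumes a list of (IP address, failure time) pairs is available from somewhere, and the function name, the 3-hour window and the threshold of 20 are simply taken from the numbers in the post above.

```python
from collections import Counter
from datetime import datetime, timedelta


def runaway_hosts(failures, now, window=timedelta(hours=3), threshold=20):
    """Return {ip: failure_count} for hosts at or above the threshold.

    `failures` is an iterable of (ip, timestamp) pairs for failed jobs.
    Only failures inside the last `window` before `now` are counted.
    """
    cutoff = now - window
    counts = Counter(ip for ip, when in failures if when >= cutoff)
    return {ip: n for ip, n in counts.items() if n >= threshold}


# Made-up example data using documentation-range IP addresses.
now = datetime(2016, 2, 15, 18, 0)
events = [("192.0.2.1", now - timedelta(minutes=5 * i)) for i in range(25)]
events += [("192.0.2.2", now - timedelta(minutes=30))] * 3
print(runaway_hosts(events, now))
```

Where the failure records would come from, and what to do with a flagged host, are left open here.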
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 129
Message 2009 - Posted: 15 Feb 2016, 20:47:31 UTC - in response to Message 2008.  
Last modified: 15 Feb 2016, 20:48:08 UTC

Since the beginning of February, our efficiency, measured as the wall-clock time of successful jobs over the total wall-clock time, is 85.5%. For comparison, we would expect a Grid site to have an efficiency of 95%. In the second link below you can see a pie chart with the errors. One by one, we will have to go through the most common failure modes and make improvements where we can. I will add addressing the runaway (blackhole) issue to the workplan. If you could email me the details, it would help me to get started.

Thanks,

Laurence
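
The efficiency definition in the post above (wall-clock time of successful jobs over total wall-clock time) can be written out as a short sketch. The `Job` record and the sample numbers are invented for illustration and are not project data.

```python
from dataclasses import dataclass


@dataclass
class Job:
    wall_seconds: float  # wall-clock time the job occupied a slot
    succeeded: bool      # True if the job finished successfully


def efficiency(jobs):
    """Wall-clock time of successful jobs divided by total wall-clock time."""
    total = sum(job.wall_seconds for job in jobs)
    if total == 0:
        return 0.0  # no jobs (or no time recorded): report zero, avoid dividing by zero
    good = sum(job.wall_seconds for job in jobs if job.succeeded)
    return good / total


# Hypothetical sample: two successful jobs and one failed job.
sample = [Job(3600, True), Job(2400, True), Job(1000, False)]
print(f"efficiency: {efficiency(sample):.1%}")
```

Note that under this definition a failed job hurts in proportion to the time it consumed, so a host that fails quickly costs less efficiency per failure than one that fails late.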

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2011 - Posted: 15 Feb 2016, 21:43:22 UTC - in response to Message 2009.  

I do not have your e-mail address. I could PM it.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2012 - Posted: 15 Feb 2016, 22:08:43 UTC

I suggest that if any cms-boinc-task has more than 6(?) failed jobs, it should error out.

Comments?
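
As a rough illustration of that suggestion, the check itself could be as simple as the sketch below. It assumes the task can see the exit code of each job it has run and that 0 means success; both are assumptions, as are the names, and the limit of 6 is just the tentative number from the post.

```python
MAX_FAILED_JOBS = 6  # tentative limit from the suggestion; the right value is open


def should_abort_task(job_exit_codes, max_failed=MAX_FAILED_JOBS):
    """Return True once a task has seen more than `max_failed` failed jobs.

    A job counts as failed if its exit code is non-zero (an assumption).
    """
    failed = sum(1 for code in job_exit_codes if code != 0)
    return failed > max_failed


print(should_abort_task([0, 0, 1, 0]))   # one failure: keep going
print(should_abort_task([1] * 7))        # seven failures: error out
```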
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2030 - Posted: 17 Feb 2016, 3:36:25 UTC

We appear to have a problem.
Large numbers of jobs stuck in WNPostProc on server.
Some jobs have 2 WNPostProc running simultaneously.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 2031 - Posted: 17 Feb 2016, 8:32:00 UTC - in response to Message 2030.  

> We appear to have a problem.
> Large numbers of jobs stuck in WNPostProc on server.
> Some jobs have 2 WNPostProc running simultaneously.

That could be because we've increased the Condor JobLeaseDuration from two hours to six days(!) to facilitate the job-resume capability (up to two days; apparently it waits for some parameter or other, or JLD/3, whichever is least, before closing the job). From past experience, I'd expected this to greatly increase the jobs-in-progress count on the Jobs page, as jobs don't time out in a wait state. Though I'm not sure why that would put them into PostProc... I don't believe I can look at the Condor logs until they are out of PP. If you happen to get one yourself, please report what you see in _condor_stdout.
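
The "JLD/3, whichever is least" rule described above can be sketched as follows. This only restates the post's description and is not taken from Condor itself; `other_limit` is a hypothetical stand-in for the unnamed "some parameter or other".

```python
from datetime import timedelta


def close_delay(job_lease_duration, other_limit):
    """Wait before closing a job: the smaller of `other_limit` and JLD/3.

    `other_limit` is a placeholder for the unnamed parameter in the post.
    """
    return min(other_limit, job_lease_duration / 3)


big = timedelta(days=30)  # pretend the other parameter never wins
print(close_delay(timedelta(hours=2), big))  # old setting: 0:40:00
print(close_delay(timedelta(days=6), big))   # new setting: 2 days, 0:00:00
```

With the six-day lease, JLD/3 gives the "up to two days" mentioned in the post, compared with 40 minutes under the old two-hour lease.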
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 2033 - Posted: 17 Feb 2016, 10:13:20 UTC
Last modified: 17 Feb 2016, 10:17:29 UTC

I got a new task and am now unable to get jobs.
Run 1 lasted 22 minutes and run 2 gets the same error in glidein-stderr:

Setting X509_USER_PROXY to canonical path /tmp/x509up_u500
Wed Feb 17 10:55:03 CET 2016 Failed to load file 'description.g2h8Sm.cfg' from 'http://lcggwms01.gridpp.rl.ac.uk:8319/factory/stage/glidein_v3_2_7'.
Wed Feb 17 10:55:03 CET 2016 Sleeping 259
Wed Feb 17 10:59:22 CET 2016 Sleeping 313
Wed Feb 17 11:04:36 CET 2016 Sleeping 291
Wed Feb 17 11:09:27 CET 2016 Sleeping 329
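
For anyone wanting to summarise a longer glidein-stderr, a small sketch like the one below pulls out the load failures and the sleep intervals. It only relies on the line shapes visible in the excerpt above; treating the number after "Sleeping" as seconds is an assumption, though it matches the gaps between the timestamps.

```python
import re

# The excerpt from the post above, used as sample input.
LOG = """\
Setting X509_USER_PROXY to canonical path /tmp/x509up_u500
Wed Feb 17 10:55:03 CET 2016 Failed to load file 'description.g2h8Sm.cfg' from 'http://lcggwms01.gridpp.rl.ac.uk:8319/factory/stage/glidein_v3_2_7'.
Wed Feb 17 10:55:03 CET 2016 Sleeping 259
Wed Feb 17 10:59:22 CET 2016 Sleeping 313
Wed Feb 17 11:04:36 CET 2016 Sleeping 291
Wed Feb 17 11:09:27 CET 2016 Sleeping 329
"""

SLEEP_RE = re.compile(r"Sleeping (\d+)$")


def summarise(log_text):
    """Return (number of 'Failed to load file' lines, list of sleep values)."""
    failures = 0
    sleeps = []
    for line in log_text.splitlines():
        match = SLEEP_RE.search(line)
        if match:
            sleeps.append(int(match.group(1)))
        elif "Failed to load file" in line:
            failures += 1
    return failures, sleeps


failures, sleeps = summarise(LOG)
print(f"{failures} load failure(s), {len(sleeps)} sleeps, {sum(sleeps)} s asleep in total")
```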
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2034 - Posted: 17 Feb 2016, 10:21:13 UTC - in response to Message 2033.  
Last modified: 17 Feb 2016, 10:26:21 UTC

Same here.

There are a number of processes that launch one another. All have different timeouts.

I think it is very dangerous to drastically alter a single timeout value.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 129
Message 2035 - Posted: 17 Feb 2016, 10:49:00 UTC - in response to Message 2033.  

I think there is an issue with the glidein at that URL.