Thread 'Expect errors eventually'

Author	Message
PDW Send message Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 0	Message 1721 - Posted: 29 Jan 2016, 22:13:48 UTC - in response to Message 1720. Shame on you ;-) Did anyone look into the logs for the upload I sat and watched and saw no attempts at a retry during the upload ? ID: 1721 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1141 Credit: 8,310,612 RAC: 0	Message 1722 - Posted: 29 Jan 2016, 23:19:23 UTC - in response to Message 1721. Shame on you ;-) Did anyone look into the logs for the upload I sat and watched and saw no attempts at a retry during the upload ? Ah, sorry, I've been a bit distracted since then. It's still in my in-box, I said I'd look at it eventually! ID: 1722 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,852 RAC: 0	Message 1723 - Posted: 29 Jan 2016, 23:20:27 UTC Last modified: 29 Jan 2016, 23:28:37 UTC I am getting a lot of fails on my machine. Maybe the image got corrupted? As what do all the failed jobs, caused by the outage, show up in terms of exit code? Can the be re-run in the current batch? ID: 1723 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1141 Credit: 8,310,612 RAC: 0	Message 1724 - Posted: 29 Jan 2016, 23:37:30 UTC - in response to Message 1723. I am getting a lot of fails on my machine. Maybe the image got corrupted? I think it's something more than that (or did you mean non-151 errors? A project reset will download a fresh VM for you.). I just did some figuring; I'd previously estimated the average upload speed for each of our jobs at 200 kbps. Now we've been seeing around 350 concurrent jobs today, so that's a low-ball guess of 70 Mbps required for our returns. I'd hope that all of CERN's connectivity was at least 1 Gbps these days, but if there's a 100 Mbps bottleneck to the data-bridge... From Dashboard, the median job time for the current batch is ~46 minutes. From the data-bridge, the average result is ~45 MB. 45 MB is 450 Mb, figuring in parity bits, etc. 450 Mb in 2760 secs is 163 kbps. ID: 1724 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,852 RAC: 0	Message 1726 - Posted: 29 Jan 2016, 23:46:08 UTC - in response to Message 1724. Would the upload not take a lot longer, if the connection to the server was choked? My uploads were as fast as they have been. ID: 1726 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1141 Credit: 8,310,612 RAC: 0	Message 1727 - Posted: 30 Jan 2016, 0:45:32 UTC - in response to Message 1726. Would the upload not take a lot longer, if the connection to the server was choked? My uploads were as fast as they have been. Tja, I haven't had the time to analyse in that detail, and in truth I won't really have it for another week -- the events today seriously compromised my other responsibilities, I'm going to have to work over the weekend to try to make up time (one reason I'm not going to submit another batch of jobs yet). I try to avoid that, having burnt myself out before as a staff scientist at a large lab with too many responsibilities and not enough sleep. ID: 1727 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,852 RAC: 0	Message 1728 - Posted: 30 Jan 2016, 0:49:05 UTC - in response to Message 1727. That is perfectly OK. Time for bed. Good night. ID: 1728 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,852 RAC: 0	Message 1993 - Posted: 13 Feb 2016, 21:28:13 UTC Last modified: 13 Feb 2016, 21:34:08 UTC Another high failure host. Failed at least 8 out of the current 35. EDIT: make that 13+ out of 36. ID: 1993 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1141 Credit: 8,310,612 RAC: 0	Message 1997 - Posted: 13 Feb 2016, 22:43:26 UTC - in response to Message 1993. Another high failure host. Failed at least 8 out of the current 35. EDIT: make that 13+ out of 36. I guess you mean == JOB AD: StartdPrincipal = "execute-side@matchsession/176.115.19.17" That's a vLHC host, so it's difficult for me to identify him for a PM, and harder still to send him a kneemail. ID: 1997 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,852 RAC: 0	Message 1999 - Posted: 13 Feb 2016, 22:49:43 UTC - in response to Message 1997. Hi Ivan, after my post the total failures dropped from 36 to 18. I guess, someone just reassigned some of jobs back to the queue? Sending a PM. ID: 1999 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1141 Credit: 8,310,612 RAC: 0	Message 2000 - Posted: 13 Feb 2016, 22:56:20 UTC - in response to Message 1999. Hi Ivan, after my post the total failures dropped from 36 to 18. I guess, someone just reassigned some of jobs back to the queue? Yes, what Dashboard flags as a failure may well be treated as a retry by Condor; when it's sent again the Dashboard failure counter drops. At least that's my non-expert observation. Sending a PM. Thanks, I'm about to go to bed, so don't expect an answer tonight. :-) ID: 2000 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,852 RAC: 0	Message 2008 - Posted: 15 Feb 2016, 18:28:04 UTC Last modified: 15 Feb 2016, 18:31:43 UTC Another runaway. 20 + Fails from the same IP address in the last 3h. THIS HAS TO STOP! ID: 2008 · Rating: 0 · rate: / Reply Quote

Laurence CERN Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1134 Credit: 339,231 RAC: 0	Message 2009 - Posted: 15 Feb 2016, 20:47:31 UTC - in response to Message 2008. Last modified: 15 Feb 2016, 20:48:08 UTC Since the beginning of February, our efficiency measured as wall clock time of successful jobs over total wall clock time is 85.5%. For comparison we would expect a Grid site have an efficiency of 95%. In the second link bellow you can see a pie chart with the errors. One by one we will have go through the most common failure modes and make improvements where we can. I will add addressing the runaway (blackhole) issue to the workplan. If you could email me the details, it would help me to get started. Thanks, Laurence http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/terminatedjobsstatus_individual?sites=T3_CH_Volunteer&sitesSort=3&start=2016-02-01&end=2016-02-15&timeRange=daily&sortBy=0&granularity=Daily&generic=0&series=All&type=nwcp http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/terminatedjobsstatus_individual?sites=T3_CH_Volunteer&sitesSort=3&start=2016-02-01&end=2016-02-15&timeRange=daily&sortBy=0&granularity=Daily&generic=0&series=All&type=aaap ID: 2009 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,852 RAC: 0	Message 2011 - Posted: 15 Feb 2016, 21:43:22 UTC - in response to Message 2009. I do not have your e-mail address. I could PM it. ID: 2011 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,852 RAC: 0	Message 2012 - Posted: 15 Feb 2016, 22:08:43 UTC I suggest, if any cms-boinc-task has more than 6? failed jobs,it should error out. Comments? ID: 2012 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,852 RAC: 0	Message 2030 - Posted: 17 Feb 2016, 3:36:25 UTC We appear to have a problem. Large numbers of jobs stuck in WNPostProc on server. Some jobs have 2 WNPostProc running simultaneously. ID: 2030 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1141 Credit: 8,310,612 RAC: 0	Message 2031 - Posted: 17 Feb 2016, 8:32:00 UTC - in response to Message 2030. We appear to have a problem. Large numbers of jobs stuck in WNPostProc on server. Some jobs have 2 WNPostProc running simultaneously. That could be because we've increased the Condor JobLeaseDuration from two hours to six days(!) to facilitate the job-resume capability (up to two days; apparently it waits for some parameter or other, or JLD/3, whichever is least, before closing the job). From past experience, I'd expected this to greatly increase the jobs-in-progress in the Jobs page as job's don't time-out in a wait state. Tho', I'm not sure why that would put them into PostProc... I don't believe I can look at the Condor logs until they are out of PP. If you happen to get one yourself, please report what you see in _condor_stdout. ID: 2031 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1230 Credit: 947,179 RAC: 437	Message 2033 - Posted: 17 Feb 2016, 10:13:20 UTC Last modified: 17 Feb 2016, 10:17:29 UTC I got a new task and now not able to get jobs. Run 1 lasted 22 minutes and run 2 gets the same error in glidein-stderr: Setting X509_USER_PROXY to canonical path /tmp/x509up_u500 Wed Feb 17 10:55:03 CET 2016 Failed to load file 'description.g2h8Sm.cfg' from 'http://lcggwms01.gridpp.rl.ac.uk:8319/factory/stage/glidein_v3_2_7'. Wed Feb 17 10:55:03 CET 2016 Sleeping 259 Wed Feb 17 10:59:22 CET 2016 Sleeping 313 Wed Feb 17 11:04:36 CET 2016 Sleeping 291 Wed Feb 17 11:09:27 CET 2016 Sleeping 329 ID: 2033 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,852 RAC: 0	Message 2034 - Posted: 17 Feb 2016, 10:21:13 UTC - in response to Message 2033. Last modified: 17 Feb 2016, 10:26:21 UTC Same here. There are a number of processes that run each other.All have different timeouts. I think, it is very dangerous to drastically alter one timeout value. ID: 2034 · Rating: 0 · rate: / Reply Quote

Laurence CERN Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1134 Credit: 339,231 RAC: 0	Message 2035 - Posted: 17 Feb 2016, 10:49:00 UTC - in response to Message 2033. I think there is an issue with the glidein at that URL. ID: 2035 · Rating: 0 · rate: / Reply Quote

Development for LHC@home