host backoff
Author | Message |
---|---|
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
There is one host (IP address) that is responsible for almost all of the current fails (18). Apparently this issue has not been addressed, as it keeps happening again and again. |
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 87 |
Thanks for the notification. Yes, this needs improving. As we are moving to 12-hour glideins, we need to do the checks between each job rather than once per glidein. Added to the workplan: http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=125&postid=2478 The positive thing is that we now have a good handle for debugging failures: http://lhcathome2.cern.ch/vLHCathome/result.php?resultid=5507273 Short exit status 66 seems to indicate a problem contacting the Squid proxy/Frontier. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Are there any plans to error out the BOINC task if more than, let's say, 50% of the jobs in one BOINC task end with an error? I think this needs to be implemented. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,947,328 RAC: 2,949 |
There is one host (IP address) that is responsible for almost all of the current fails (18). Yes, I PM'ed him. The problem went away, but not sure if that was spontaneous or not. :-) |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Yes, I PM'ed him. The problem went away, but not sure if that was spontaneous or not. :-) Well, he is still happily producing fails. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,947,328 RAC: 2,949 |
Yes, I PM'ed him. The problem went away, but not sure if that was spontaneous or not. :-) We might be looking at a different user, then. Today I had a PM back from mine saying that he'd stopped to investigate and clean out dustbunnies. Send me a name, ID, or IP address. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Done. Thanks for looking into it, but there has to be a better way. |
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 87 |
It is this host: http://lhcathome2.cern.ch/vLHCathome/results.php?hostid=91467 We just have to chase these things up and try to improve the error detection so that handling can be automated. But they will be addressed according to the priorities. The plots on the job stats page can help here. Here is the to-do list: http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=125&postid=2478#2478 |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,947,328 RAC: 2,949 |
Ah, OK, I just e-mailed you... I'll send him a PM. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks, Ivan. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Yes, I PM'ed him. The problem went away, but not sure if that was spontaneous or not. :-) And continues and continues...... |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,947,328 RAC: 2,949 |
I got a reply, he's "working on it". |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,947,328 RAC: 2,949 |
I got a reply, he's "working on it". Sorry guys, I'm having trouble reconciling four different sources of information here: a) Dashboard's logging of failures (publicly available) b) vLHCathome's task list (which you can access since his co-ordinates were revealed last night, not sure if you can find that info by yourselves) c) the Condor server's logs (which you can't access) d) private e-mail communications (which you also can't access). Particularly there are differences between a) and b) which suggest something's changed, but not for the better. I'm not, at the moment, prepared to take an executive decision to ban the Volunteer (in fact my privileges don't extend that far), nor am I feeling particularly empowered to offer him "stern words" to cease. I feel that has to come from someone "above my pay-grade*", which may be difficult to arrange over the Easter break. * The contract for which only legally exists until the end of this month, anyway... I'm currently asking Mr Postman to "d'liver d'letter, d'sooner d'better!" |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks, Ivan. This is why erroring out the BOINC tasks is very important. This is the ONLY reliable way to notify the user. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,947,328 RAC: 2,949 |
I think I agree with you on that, but don't quote me... :-/ |
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 87 |
Agreed, CVMFS operation needs to be checked before the glidein even starts and the task aborted if it fails. EDIT: Added to the workplan |
Joined: 20 May 15 Posts: 217 Credit: 5,876,910 RAC: 16,233 |
... I feel that has to come from someone "above my pay-grade*", which may be difficult to arrange over the Easter break. Ivan, does the above mean that this is your last day, or that you are hoping for a new contract in the post? |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Are there any plans to error out the BOINC task if more than, let's say, 50% of the jobs in one BOINC task end with an error? However, if the first job fails, that does not mean that the task needs to be shut down. A minimum of 3 jobs should be considered. (WMAgent job failed with exit code 0?) http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=136182 EDIT: No failed job reported for WMAgent jobs! |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,947,328 RAC: 2,949 |
... I feel that has to come from someone "above my pay-grade*", which may be difficult to arrange over the Easter break. Yes. I've now had a .pdf by e-mail, but am still waiting for the paper copy to sign. So barring accidents, you've probably got me for another year... |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,947,328 RAC: 2,949 |
(WMAgent job failed with exit code 0?) I believe I've seen that sort of thing before; possibly some other condition doesn't make it into the final report (maybe stage-out failed, but the wrapper is reporting the actual job completion code, for example), but that's for the WMAgent people to explain, I guess. |
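
For anyone who wants to see the rule proposed in this thread spelled out, here is a minimal sketch in Python. It is only an illustration of the idea (error out the task when more than roughly 50% of its jobs fail, but never judge on fewer than 3 jobs), not the project's actual implementation. The function name `should_abort_task`, the two constants, and the assumption that a non-zero exit code means a failed job are all hypothetical.

```python
# Hypothetical thresholds taken from the discussion in this thread,
# not from any real BOINC or vLHCathome configuration.
MIN_JOBS = 3             # do not judge a task on fewer finished jobs than this
MAX_FAIL_FRACTION = 0.5  # "more than, let's say, 50%"


def should_abort_task(exit_codes, min_jobs=MIN_JOBS,
                      max_fail_fraction=MAX_FAIL_FRACTION):
    """Return True if the task should be errored out.

    exit_codes: exit statuses of the jobs finished so far in this task.
    Assumption: 0 means success and anything else is a failure.
    """
    total = len(exit_codes)
    if total < min_jobs:
        # Too few jobs to decide; a single early failure is not enough.
        return False
    failed = sum(1 for code in exit_codes if code != 0)
    return failed / total > max_fail_fraction


print(should_abort_task([66]))            # one failed job only
print(should_abort_task([66, 66, 0]))     # two of three jobs failed
print(should_abort_task([0, 66, 0, 66]))  # exactly half failed
```

With these thresholds the three calls print `False`, `True` and `False`: one failure is below the minimum job count, two out of three is over the limit, and exactly half is not "more than" 50%. As noted above, a job can apparently fail while still reporting exit code 0, so a real check would need a more reliable failure signal than the exit code alone.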
©2024 CERN