Message boards : CMS Application : host backoff

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,201,500
RAC: 0
Message 2469 - Posted: 21 Mar 2016, 11:20:29 UTC

There is one host (IP address) that is responsible for almost all of the current fails (18).
Apparently, this issue has not been addressed, as it keeps happening again and again.

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 2479 - Posted: 21 Mar 2016, 14:48:32 UTC - in response to Message 2469.  

Thanks for the notification. Yes, this needs improving. As we are moving to 12-hour glideins, we need to do the checks between jobs rather than once per glidein. Added to the workplan:

http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=125&postid=2478

The positive thing is that we have a good handle to debug failures now.

http://lhcathome2.cern.ch/vLHCathome/result.php?resultid=5507273

Short exit status: 66 seems to be an issue contacting the squid proxy/frontier.
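
The idea of checking between jobs rather than once per glidein could be sketched roughly as below. This is only an illustration: `run_jobs_with_checks`, `jobs` and `host_is_healthy` are hypothetical names, not part of the project's actual glidein code, and the real checks are not specified in this thread.

```python
from typing import Callable, Iterable, List


def run_jobs_with_checks(
    jobs: Iterable[Callable[[], int]],
    host_is_healthy: Callable[[], bool],
) -> List[int]:
    """Run jobs in order, re-checking the host before every job.

    Returns the exit codes of the jobs that actually ran. The loop stops
    as soon as the (hypothetical) health check fails, so a broken host
    does not keep pulling and failing jobs for the rest of a long glidein.
    """
    exit_codes: List[int] = []
    for job in jobs:
        if not host_is_healthy():
            break  # stop taking work instead of failing job after job
        exit_codes.append(job())
    return exit_codes
```

With a 12-hour glidein, a check placed inside the loop runs many times per glidein, whereas a check placed before the loop would run only once.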

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,201,500
RAC: 0
Message 2482 - Posted: 21 Mar 2016, 15:04:19 UTC - in response to Message 2479.  

Are there any plans to error out the BOINC task if more than, let's say, 50% of the jobs in one BOINC task fail with an error?
I think this needs to be implemented.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1093
Credit: 6,893,316
RAC: 0
Message 2498 - Posted: 22 Mar 2016, 11:53:08 UTC - in response to Message 2469.  

> There is one host (IP address) that is responsible for almost all of the current fails (18).
> Apparently, this issue has not been addressed, as it keeps happening again and again.

Yes, I PM'ed him. The problem went away, but not sure if that was spontaneous or not. :-)

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,201,500
RAC: 0
Message 2508 - Posted: 22 Mar 2016, 22:10:49 UTC - in response to Message 2498.  

> Yes, I PM'ed him. The problem went away, but not sure if that was spontaneous or not. :-)


Well, he is still happily producing fails.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1093
Credit: 6,893,316
RAC: 0
Message 2509 - Posted: 22 Mar 2016, 22:21:06 UTC - in response to Message 2508.  

>> Yes, I PM'ed him. The problem went away, but not sure if that was spontaneous or not. :-)
>
> Well, he is still happily producing fails.

We might be looking at a different user, then. Today I had a PM back from mine saying that he'd stopped to investigate and clean out dustbunnies. Send me a name, ID, or IP address.

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,201,500
RAC: 0
Message 2510 - Posted: 22 Mar 2016, 22:34:45 UTC - in response to Message 2509.  

Done.
Thanks for looking into it, but there has to be a better way.

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 2516 - Posted: 22 Mar 2016, 22:59:16 UTC - in response to Message 2510.  

It is this host.

http://lhcathome2.cern.ch/vLHCathome/results.php?hostid=91467

We just have to chase these things up and try to make the error detection better so that handling can be automated. But they will be addressed according to the priorities. The plots on the job stats page can help here. Here is the to-do list:

http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=125&postid=2478#2478

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1093
Credit: 6,893,316
RAC: 0
Message 2517 - Posted: 22 Mar 2016, 23:36:42 UTC - in response to Message 2516.  

Ah, OK, I just e-mailed you... I'll send him a PM.

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,201,500
RAC: 0
Message 2518 - Posted: 22 Mar 2016, 23:43:58 UTC - in response to Message 2517.  

Thanks, Ivan.

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,201,500
RAC: 0
Message 2556 - Posted: 24 Mar 2016, 15:51:55 UTC
Last modified: 24 Mar 2016, 15:52:19 UTC

>> Yes, I PM'ed him. The problem went away, but not sure if that was spontaneous or not. :-)
>
> Well, he is still happily producing fails.


And continues and continues......

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1093
Credit: 6,893,316
RAC: 0
Message 2557 - Posted: 24 Mar 2016, 18:22:53 UTC - in response to Message 2556.  

I got a reply, he's "working on it".

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1093
Credit: 6,893,316
RAC: 0
Message 2560 - Posted: 24 Mar 2016, 22:04:33 UTC - in response to Message 2557.  
Last modified: 24 Mar 2016, 22:06:25 UTC

> I got a reply, he's "working on it".

Sorry guys, I'm having trouble reconciling four different sources of information here:
a) Dashboard's logging of failures (publicly available)
b) vLHCathome's task list (which you can access since his co-ordinates were revealed last night, not sure if you can find that info by yourselves)
c) the Condor server's logs (which you can't access)
d) private e-mail communications (which you also can't access).

Particularly there are differences between a) and b) which suggest something's changed, but not for the better.

I'm not, at the moment, prepared to take an executive decision to ban the Volunteer (in fact my privileges don't extend that far), nor am I feeling particularly empowered to offer him "stern words" to cease. I feel that has to come from someone "above my pay-grade*", which may be difficult to arrange over the Easter break.

* The contract for which only legally exists until the end of this month, anyway... I'm currently asking Mr Postman to "d'liver d'letter, d'sooner d'better!"

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,201,500
RAC: 0
Message 2561 - Posted: 24 Mar 2016, 22:36:05 UTC

Thanks, Ivan.

This is why erroring out the BOINC tasks is very important.
This is the ONLY reliable way to notify the user.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1093
Credit: 6,893,316
RAC: 0
Message 2562 - Posted: 24 Mar 2016, 23:28:39 UTC - in response to Message 2561.  

I think I agree with you on that, but don't quote me... :-/

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 2563 - Posted: 24 Mar 2016, 23:44:38 UTC - in response to Message 2562.  
Last modified: 24 Mar 2016, 23:52:07 UTC

Agreed, CVMFS operation needs to be checked before the glidein even starts and the task aborted if it fails.

EDIT: Added to the workplan
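
A pre-start check of this kind might look something like the sketch below. It is a minimal illustration under stated assumptions, not the project's implementation: the function names are hypothetical, the default path `/cvmfs/cms.cern.ch` is assumed to be the repository of interest, and "usable" is reduced to "the directory can be listed and is not empty". A real check would more likely rely on the CVMFS client's own tools.

```python
import os
import sys


def cvmfs_repo_usable(repo_path: str = "/cvmfs/cms.cern.ch") -> bool:
    """Return True if the repository directory can be listed and is not empty.

    The default path is an assumption made for this sketch.
    """
    try:
        return len(os.listdir(repo_path)) > 0
    except OSError:
        # Missing mount point, permission problem, I/O error, ...
        return False


def precheck_or_abort(repo_path: str = "/cvmfs/cms.cern.ch") -> None:
    """Exit with a non-zero status before any glidein is started if the check fails."""
    if not cvmfs_repo_usable(repo_path):
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(f"CVMFS check failed for {repo_path}; aborting task")
```

Failing this early means the task ends with an error the volunteer can see, rather than a glidein that starts and then fails every job it receives.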

PDW

Joined: 20 May 15
Posts: 215
Credit: 2,113,702
RAC: 52
Message 2579 - Posted: 31 Mar 2016, 10:52:18 UTC - in response to Message 2560.  

> ... I feel that has to come from someone "above my pay-grade*", which may be difficult to arrange over the Easter break.
>
> * The contract for which only legally exists until the end of this month, anyway... I'm currently asking Mr Postman to "d'liver d'letter, d'sooner d'better!"


Ivan, does the above mean that this is your last day or that you are hoping for a new contract in the post?

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,201,500
RAC: 0
Message 2593 - Posted: 6 Apr 2016, 20:19:23 UTC
Last modified: 6 Apr 2016, 20:27:11 UTC

> Are there any plans to error out the BOINC task if more than, let's say, 50% of the jobs in one BOINC task fail with an error?
> I think this needs to be implemented.



However, if the first job fails, that does not mean that the task needs to be shut down. A minimum of 3 jobs should be considered.

(WMAgent job failed with exit code 0?)

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=136182

EDIT: No failed job reported for wma jobs!
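
The rule proposed in this thread (error out the task when more than about 50% of its jobs fail, but only once at least 3 jobs have been considered) could be expressed as in the sketch below. The function name and parameter names are hypothetical, and the 50% and 3-job values are just the figures suggested above, not settings taken from the project.

```python
def should_error_out(
    failed: int,
    total: int,
    max_fail_fraction: float = 0.5,
    min_jobs: int = 3,
) -> bool:
    """Decide whether a task should be errored out based on its job results.

    failed: number of jobs in this task that ended with an error
    total:  number of jobs in this task that have finished so far
    """
    if failed < 0 or total < 0 or failed > total:
        raise ValueError("need 0 <= failed <= total")
    if total < min_jobs:
        # Too few jobs to judge: a single early failure should not kill the task.
        return False
    # Strictly "more than" the threshold, as in the proposal.
    return failed / total > max_fail_fraction
```

For example, one failure out of one job does not trigger the rule, while two failures out of three do. Note that such a rule only works if failures are actually reported: a job that fails but shows exit code 0 would be counted as a success.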

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1093
Credit: 6,893,316
RAC: 0
Message 2594 - Posted: 6 Apr 2016, 21:39:23 UTC - in response to Message 2579.  

>> ... I feel that has to come from someone "above my pay-grade*", which may be difficult to arrange over the Easter break.
>>
>> * The contract for which only legally exists until the end of this month, anyway... I'm currently asking Mr Postman to "d'liver d'letter, d'sooner d'better!"
>
> Ivan, does the above mean that this is your last day or that you are hoping for a new contract in the post?

Yes. I've now had a .pdf by e-mail, but am still waiting for the paper copy to sign. So barring accidents, you've probably got me for another year...

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1093
Credit: 6,893,316
RAC: 0
Message 2595 - Posted: 6 Apr 2016, 21:44:06 UTC - in response to Message 2593.  

> (WMAgent job failed with exit code 0?)

I believe I've seen that sort of thing before; possibly some other condition doesn't make it into the final report (maybe stage-out failed, but the wrapper is reporting the actual job completion code, for example), but that's for the WMAgent people to explain, I guess.
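
To illustrate the kind of mismatch guessed at here, the sketch below shows one way a wrapper could combine per-step exit codes so that a failure in any step is not hidden by a successful job step. This is purely hypothetical: the step names and the function are invented for illustration and say nothing about how WMAgent or its wrapper actually behave.

```python
from typing import Mapping


def overall_exit_code(step_codes: Mapping[str, int]) -> int:
    """Return the first non-zero exit code among the steps, or 0 if all succeeded.

    step_codes maps a (hypothetical) step name such as "job" or "stage_out"
    to its exit code, in the order the steps ran.
    """
    for code in step_codes.values():
        if code != 0:
            return code
    return 0
```

If only the "job" entry were reported, a run with `{"job": 0, "stage_out": 1}` would look like exit code 0 even though it failed overall, which is the sort of situation that could produce a "failed with exit code 0" message.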


©2020 CERN