Message boards : Number crunching : Expect errors eventually
Author | Message |
---|---|
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,215,383 RAC: 6 ![]() |
FYI: the last 4 failures (out of 6) were produced by by far the slowest machine in the field. 2 of the 3 "unknown" from the last batch are also from this host. |
Send message Joined: 13 Feb 15 Posts: 1252 Credit: 995,923 RAC: 71 ![]() ![]() |
"FYI: the last 4 failures (out of 6) were produced by by far the slowest machine in the field." That host has done 179 jobs from the current batch so far, so it can't be that slow ;) |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,215,383 RAC: 6 ![]() |
Maybe he has several machines, one of which is the slowest. Job 19 walltime >7h. |
Send message Joined: 13 Feb 15 Posts: 1252 Credit: 995,923 RAC: 71 ![]() ![]() |
"Maybe he has several machines, one of which is the slowest. Job 19 walltime >7h." You're right. It must be several machines on the same IP. The 6 slowest wall times (in minutes) from that IP are: 217.88, 228.48, 278.67, 295.05, 296.43, 348.57. You have to consider that wall time keeps counting even when a task is suspended or throttled. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,215,383 RAC: 6 ![]() |
"You have to consider that wall time keeps counting even when a task is suspended or throttled." True, but being that slow is causing some kind of problem (timeout?), whatever the reason. |
![]() ![]() Send message Joined: 20 Jan 15 Posts: 1152 Credit: 8,310,612 RAC: 0 ![]() |
"Maybe he has several machines, one of which is the slowest. Job 19 walltime >7h." He has, but these failures were all from the same machine and only from a task that started early this morning. There were 9 good runs earlier in this batch (~2 hrs each). There haven't been any failures since 05:34, so maybe it was transient. PM sent. |
![]() ![]() Send message Joined: 20 Jan 15 Posts: 1152 Credit: 8,310,612 RAC: 0 ![]() |
"You have to consider that wall time keeps counting even when a task is suspended or throttled." No, it's a code/disk/(network?) problem: == CMSSW: Begin processing the 9th record. Run 1, Event 211259, LumiSection 2536 at 16-Mar-2016 06:20:30.796 CET == CMSSW: R__unzip: error -5 in inflate (zlib) == CMSSW: ----- Begin Fatal Exception 16-Mar-2016 06:20:37 CET----------------------- == CMSSW: An exception of category 'FatalRootError' occurred while == CMSSW: [0] Processing run: 1 lumi: 2536 event: 211259 == CMSSW: [1] Running path 'simulation_step' == CMSSW: [2] Calling event method for module OscarProducer/'g4SimHits' == CMSSW: Additional Info: == CMSSW: [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers == CMSSW: fNbytes = 7216683, fKeylen = 109, fObjlen = 10165998, noutot = 0, nout=0, nin=7216574, nbuf=10165998 == CMSSW: == CMSSW: ----- End Fatal Exception -------------------------------------------- I've sent a PM suggesting a project reset. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,215,383 RAC: 6 ![]() |
Failures have shot up dramatically! Since about 15:20 UTC. |
Send message Joined: 13 Feb 15 Posts: 1252 Credit: 995,923 RAC: 71 ![]() ![]() |
"Failures have shot up dramatically!" From several IP addresses. Most have error code 8001 and state pending about 20 minutes after start time. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,215,383 RAC: 6 ![]() |
It seems to be a server issue. My jobs are stuck in WNPostProc. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,215,383 RAC: 6 ![]() |
Did the proxy lease expire? Or something like that? Maybe the credentials have expired? EDIT: At exactly 14:35 UTC everything went wrong. All jobs are in postproc or failing. It was pretty much exactly 2.5 days (60 h) from the start of the batch until this happened. Does that help? |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,215,383 RAC: 6 ![]() |
CLAIM_WORKLIFE = -1 (it was ifThenElse(DynamicSlot =?= true,3600,-1)). This was changed 2 days ago. I suggest setting it back to 1200 sec (the default), at least for a little while, to see if things change. Unless you have a better idea. |
![]() ![]() Send message Joined: 20 Jan 15 Posts: 1152 Credit: 8,310,612 RAC: 0 ![]() |
"Did the proxy lease expire? Or something like that? Maybe the credentials have expired?" The proxy is from 15/3 and should be good for nearly another 5 days, or the jobs wouldn't start, I think. Do your stuck jobs have normal log files at your end? [Edit] Ah! I missed that the VM was rebooted again today, at 14:47 UTC, because of the nss bug. So it might be the proxy; ISTR we had proxy problems the last time it rebooted. I've copied the proxy from tonight's batch into the older one's area. [/Edit] |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,215,383 RAC: 6 ![]() |
"Do your stuck jobs have normal log files at your end?" Sorry, can't help you there. Maybe one of the others can. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,215,383 RAC: 6 ![]() |
One successful job in the new batch! |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,215,383 RAC: 6 ![]() |
I started a new task about 20 min ago. I am still concerned about CLAIM_WORKLIFE. Has there been any noticeable benefit since it was changed? It may cause some undesired effects if set to -1 (indefinite). If there is an unlimited claim on one batch, how is that treated when a new batch is started? Can one host have multiple claims open? Is there some higher-order timeout that counteracts an indefinite-duration claim? |
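
For readers following the CLAIM_WORKLIFE discussion, here is a toy Python model of the two settings mentioned in the thread. It is only a sketch under stated assumptions: that CLAIM_WORKLIFE is the number of seconds after which a claim stops accepting additional jobs, that a negative value means no limit, and that 1200 seconds is the default (as stated in the thread). The function names are hypothetical and are not part of HTCondor; the sketch does not answer whether some higher-order timeout exists.

```python
from typing import Optional

# Default quoted in the thread (seconds).
DEFAULT_CLAIM_WORKLIFE = 1200


def old_claim_worklife(dynamic_slot: Optional[bool]) -> int:
    """Mimic the previous setting: ifThenElse(DynamicSlot =?= true, 3600, -1).

    '=?=' only matches when the attribute is actually true, so an
    undefined DynamicSlot (None here) falls through to -1.
    """
    return 3600 if dynamic_slot is True else -1


def claim_accepts_new_job(claim_age_s: float, worklife_s: int) -> bool:
    """Assumed rule: a claim takes further jobs while it is younger than
    its worklife; a negative worklife is treated as unlimited."""
    if worklife_s < 0:
        return True
    return claim_age_s < worklife_s


if __name__ == "__main__":
    sixty_hours = 60 * 3600  # roughly the batch age mentioned in the thread
    for label, worklife in [
        ("current (-1)", -1),
        ("old, dynamic slot", old_claim_worklife(True)),
        ("default", DEFAULT_CLAIM_WORKLIFE),
    ]:
        print(label, claim_accepts_new_job(sixty_hours, worklife))
```

Under these assumptions, only the -1 setting lets a 60-hour-old claim keep taking jobs, which is the behaviour the last post asks about.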
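
On the "R__unzip: error -5 in inflate (zlib)" line in the log quoted earlier in the thread: in zlib, -5 is Z_BUF_ERROR. The Python sketch below shows one way to provoke that code, by inflating a truncated stream. It is an illustration only and does not establish what actually happened on that host; the payload is made up. The final lines check the arithmetic in the logged sizes; reading nin as the compressed size minus the key length is an assumption drawn from the numbers themselves.

```python
import zlib


def try_inflate(compressed: bytes) -> str:
    """Return 'ok' or the zlib error text for a compressed buffer."""
    try:
        zlib.decompress(compressed)
        return "ok"
    except zlib.error as exc:
        return str(exc)


# Made-up payload standing in for a data buffer read from disk or network.
payload = b"simulated event data " * 1000
good = zlib.compress(payload)
truncated = good[: len(good) // 2]  # pretend half the bytes never arrived

print(try_inflate(good))
print(try_inflate(truncated))

# Sizes copied from the log in the thread.
f_nbytes, f_keylen, f_objlen = 7216683, 109, 10165998
nin, nbuf = 7216574, 10165998
print(f_nbytes - f_keylen == nin)  # input size matches fNbytes minus fKeylen
print(f_objlen == nbuf)            # output buffer matches fObjlen
```

Since the logged sizes are self-consistent while noutot is 0, the bytes handed to inflate appear to have been bad rather than the bookkeeping, which fits the code/disk/network suspicion voiced in the thread.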
©2025 CERN