Message boards : Number crunching : Expect errors eventually
Author | Message |
---|---|
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
[quote]FYI: the last 4 fails (out of 6) were produced by what is by far the slowest machine in the field.[/quote] "Fatal ROOT error". Must remember to check the logs when I get to work in a few minutes. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
[quote]FYI: the last 4 fails (out of 6) were produced by what is by far the slowest machine in the field.[/quote] 2 of the 3 "unknown" from the last batch are also from this host. |
Joined: 13 Feb 15 Posts: 1217 Credit: 908,011 RAC: 1,487 |
[quote]FYI: the last 4 fails (out of 6) were produced by what is by far the slowest machine in the field.[/quote] That host has done 179 jobs from the current batch so far, so it can't be that slow ;) |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Maybe he has several machines, and one of them is the slowest. Job 19 walltime >7h. |
Joined: 13 Feb 15 Posts: 1217 Credit: 908,011 RAC: 1,487 |
[quote]Maybe he has several machines, and one of them is the slowest. Job 19 walltime >7h.[/quote] You're right. It must be several machines on the same IP. The 6 slowest wall times from that IP, in minutes, are: 217.88, 228.48, 278.67, 295.05, 296.43, 348.57. You have to consider that wall time keeps counting even when a task is suspended or throttled. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
[quote]You have to consider that wall time keeps counting even when a task is suspended or throttled.[/quote] True, but being that slow is causing some kind of problem (timeout?), no matter what the reason is. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
[quote]Maybe he has several machines, and one of them is the slowest. Job 19 walltime >7h.[/quote] He has... but these failures were all from the same machine and only from a task that started early this morning. There were 9 good runs earlier in this batch (~2 hrs each). There haven't been any failures since 0534 so maybe it was transient. PM sent. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
[quote]You have to consider that wall time keeps counting even when a task is suspended or throttled.[/quote] No, it's a code/disk/(network?) problem:<br>`== CMSSW: Begin processing the 9th record. Run 1, Event 211259, LumiSection 2536 at 16-Mar-2016 06:20:30.796 CET`<br>`== CMSSW: R__unzip: error -5 in inflate (zlib)`<br>`== CMSSW: ----- Begin Fatal Exception 16-Mar-2016 06:20:37 CET-----------------------`<br>`== CMSSW: An exception of category 'FatalRootError' occurred while`<br>`== CMSSW: [0] Processing run: 1 lumi: 2536 event: 211259`<br>`== CMSSW: [1] Running path 'simulation_step'`<br>`== CMSSW: [2] Calling event method for module OscarProducer/'g4SimHits'`<br>`== CMSSW: Additional Info:`<br>`== CMSSW: [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers`<br>`== CMSSW: fNbytes = 7216683, fKeylen = 109, fObjlen = 10165998, noutot = 0, nout=0, nin=7216574, nbuf=10165998`<br>`== CMSSW:`<br>`== CMSSW: ----- End Fatal Exception --------------------------------------------`<br>I've sent a PM suggesting a project reset. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Failures have shot up dramatically! Since about 15:20 UTC. |
Joined: 13 Feb 15 Posts: 1217 Credit: 908,011 RAC: 1,487 |
[quote]Failures have shot up dramatically![/quote] From several IP addresses. Most with error code 8001 and state pending about 20 minutes after start time. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
It seems to be a server issue. My jobs are stuck in WNPostProc. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
Sorry, I was busy with my real work and only just noticed. I'll check (I have my suspicions...). I've submitted another batch of 2,500 jobs but CRAB is apparently slow today. I'll have to check later from home if they went through. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
[quote]It seems to be a server issue. My jobs are stuck in WNPostProc.[/quote] I don't see anything obvious -- proxy OK, disk space OK, machine idle. They aren't showing up in the logs but do seem to be in post-proc. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Did the proxy lease expire? Or something like that? Maybe the credentials have expired? EDIT: at exactly 14:35 UTC everything went wrong. All jobs are in postproc or failing. It was pretty much exactly 2.5 days (60 h) from the start of the batch until this happened. Does that help? |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
`CLAIM_WORKLIFE = -1` (it was `ifThenElse(DynamicSlot =?= true,3600,-1)`). This was changed 2 days ago. I suggest setting it back to 1200 sec (the default), at least for a little while, to see if things change. Unless you have a better idea. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
[quote]Did the proxy lease expire? Or something like that? Maybe the credentials have expired?[/quote] Proxy is from 15/3, should be good for nearly another 5 days or the jobs wouldn't start, I think. Do your stuck jobs have normal log files at your end? [Edit] Ah! I missed that the VM was rebooted again today because of the nss bug -- at 14:47 UTC. So it might be the proxy; ISTR we had proxy problems last time it rebooted. I've copied the proxy from tonight's batch into the older one's area. [/Edit] |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
[quote]Do your stuck jobs have normal log files at your end?[/quote] Sorry, can't help you there. Maybe one of the others can. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
One successful job in the new batch! |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
[quote]One successful job in the new batch![/quote] Three now. The last failure was 1918, these have all finished since then... [Edit: Oops, no, that was the last batch, current batch still failing.] ...but no job logs for the previous batch since 1444, and the server booted at 1447. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I started a new task about 20 min ago. I am still concerned about CLAIM_WORKLIFE. Has there been any noticeable beneficial effect since it was changed? Setting it to -1 (indefinite) may cause some undesired effects. If there is an unlimited claim on one batch, how is that treated when a new batch is started? Can one host have multiple claims open? Is there some higher-order timeout that counteracts a claim of indefinite duration? |
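
A side note on the `R__unzip: error -5 in inflate (zlib)` line in the log posted earlier in this thread: in zlib's own return codes, -5 is `Z_BUF_ERROR`, which inflate reports when it cannot make progress, for example because the compressed input ends too early. Assuming ROOT simply passes zlib's return code through, the Python sketch below reproduces the same code by decompressing a deliberately truncated buffer. The payload and the `try_inflate` helper are made up for illustration; this does not show what actually happened on that host, only that incomplete input (a damaged file on disk or an interrupted download would be two possible sources) is enough to produce error -5.

```python
import zlib

# Hypothetical payload standing in for one compressed ROOT basket.
payload = b"simulated event data " * 1000
compressed = zlib.compress(payload)


def try_inflate(blob: bytes) -> str:
    """Return 'ok' if blob inflates cleanly, otherwise the zlib error text."""
    try:
        zlib.decompress(blob)
        return "ok"
    except zlib.error as exc:
        return str(exc)


# The complete buffer inflates without complaint.
intact = try_inflate(compressed)

# Cutting the buffer in half simulates input that ends too early.
truncated = try_inflate(compressed[: len(compressed) // 2])

print("intact:   ", intact)
print("truncated:", truncated)
```

The truncated case reports zlib error -5, while data whose bytes are altered rather than missing would normally surface as a different zlib code, so the -5 here fits the "disk or network" reading better than a logic bug in the simulation step.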
©2025 CERN