Message boards : Number crunching : Expect errors eventually

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 2401 - Posted: 16 Mar 2016, 9:33:29 UTC - in response to Message 2400.  

FYI: the last 4 failures (out of 6) were produced by by far the slowest machine in the field.
Coincidence?

"Fatal ROOT error". Must remember to check the logs when I get to work in a few minutes.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2402 - Posted: 16 Mar 2016, 9:39:01 UTC

FYI: the last 4 failures (out of 6) were produced by by far the slowest machine in the field.
Coincidence?


2 of the 3 "unknown" from the last batch are also from this host.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 238
Message 2408 - Posted: 16 Mar 2016, 10:10:37 UTC - in response to Message 2402.  

FYI: the last 4 failures (out of 6) were produced by by far the slowest machine in the field.
Coincidence?


2 of the 3 "unknown" from the last batch are also from this host.

That host has done 179 jobs from the current batch so far, so it can't be that slow ;)

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2409 - Posted: 16 Mar 2016, 10:17:31 UTC - in response to Message 2408.  

Maybe he has several machines, and one is the slowest. Job 19 walltime >7h.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 238
Message 2410 - Posted: 16 Mar 2016, 10:36:30 UTC - in response to Message 2409.  

Maybe he has several machines, and one is the slowest. Job 19 walltime >7h.

You're right. It must be several machines on the same IP.
The 6 slowest Wall Times in minutes from that IP are:

217.88m
228.48m
278.67m
295.05m
296.43m
348.57m

Bear in mind that wall time keeps counting even when a task is suspended or throttled.
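
For scale, the wall times above can be converted to hours. A minimal sketch, assuming the six values are minutes as listed (the variable names are only illustrative):

```python
# Wall times in minutes, copied from the list above.
wall_minutes = [217.88, 228.48, 278.67, 295.05, 296.43, 348.57]

# Convert each value to hours (kept unrounded; rounding happens on print).
wall_hours = [m / 60 for m in wall_minutes]

for minutes, hours in zip(wall_minutes, wall_hours):
    print(f"{minutes:7.2f} min = {hours:.2f} h")
```

The largest value, 348.57 minutes, comes out a little under 6 hours.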

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2412 - Posted: 16 Mar 2016, 10:44:56 UTC

Bear in mind that wall time keeps counting even when a task is suspended or throttled.


True, but being that slow is causing some kind of problem (timeout?),
no matter what the reason is.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 2413 - Posted: 16 Mar 2016, 10:46:58 UTC - in response to Message 2409.  

Maybe he has several machines, and one is the slowest. Job 19 walltime >7h.

He has... but these failures were all from the same machine and only from a task that started early this morning. There were 9 good runs earlier in this batch (~2 hrs each). There haven't been any failures since 0534 so maybe it was transient. PM sent.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 2415 - Posted: 16 Mar 2016, 10:50:10 UTC - in response to Message 2412.  

Bear in mind that wall time keeps counting even when a task is suspended or throttled.


True, but being that slow is causing some kind of problem (timeout?),
no matter what the reason is.

No, it's a code/disk/(network?) problem:
== CMSSW: Begin processing the 9th record. Run 1, Event 211259, LumiSection 2536 at 16-Mar-2016 06:20:30.796 CET
== CMSSW: R__unzip: error -5 in inflate (zlib)
== CMSSW: ----- Begin Fatal Exception 16-Mar-2016 06:20:37 CET-----------------------
== CMSSW: An exception of category 'FatalRootError' occurred while
== CMSSW: [0] Processing run: 1 lumi: 2536 event: 211259
== CMSSW: [1] Running path 'simulation_step'
== CMSSW: [2] Calling event method for module OscarProducer/'g4SimHits'
== CMSSW: Additional Info:
== CMSSW: [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
== CMSSW: fNbytes = 7216683, fKeylen = 109, fObjlen = 10165998, noutot = 0, nout=0, nin=7216574, nbuf=10165998
== CMSSW:
== CMSSW: ----- End Fatal Exception --------------------------------------------

I've sent a PM suggesting a project reset.
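
In zlib, error code -5 is Z_BUF_ERROR, which inflate returns when it cannot make progress, for example because the compressed input ends early. The log above does not say why the buffer was bad, so the following is only a sketch with made-up data, using Python's standard zlib module, of how a truncated stream leads to the same code:

```python
import zlib

# Made-up payload standing in for a compressed basket; not real ROOT data.
payload = b"simulated basket contents " * 200
compressed = zlib.compress(payload)


def try_inflate(data: bytes) -> str:
    """Return 'ok' on success, otherwise the zlib error text."""
    try:
        zlib.decompress(data)
        return "ok"
    except zlib.error as exc:
        return str(exc)


intact = try_inflate(compressed)
# Drop the tail of the stream to mimic an incomplete read or write.
truncated = try_inflate(compressed[: len(compressed) // 2])

print("intact:   ", intact)
print("truncated:", truncated)
```

A corrupted (rather than truncated) stream can fail with a different code, so this illustrates one possible cause, not a diagnosis of that host.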

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2428 - Posted: 17 Mar 2016, 15:51:13 UTC
Last modified: 17 Mar 2016, 15:52:58 UTC

Failures have shot up dramatically!
Since about 15:20 UTC.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 238
Message 2429 - Posted: 17 Mar 2016, 17:38:46 UTC - in response to Message 2428.  

Failures have shot up dramatically!
Since about 15:20 UTC.

From several IP addresses. Most have error code 8001 and state pending, about 20 minutes after the start time.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2430 - Posted: 17 Mar 2016, 17:48:39 UTC - in response to Message 2429.  

It seems to be a server issue. My jobs are stuck in WNPostProc.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 2431 - Posted: 17 Mar 2016, 17:49:22 UTC

Sorry, I was busy with my real work and only just noticed. I'll check (I have my suspicions...).

I've submitted another batch of 2,500 jobs but CRAB is apparently slow today. I'll have to check later from home if they went through.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 2432 - Posted: 17 Mar 2016, 17:57:55 UTC - in response to Message 2430.  

It seems to be a server issue. My jobs are stuck in WNPostProc.

I don't see anything obvious -- proxy OK, disk space OK, machine idle. They aren't showing up in the logs but do seem to be in post-proc.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2433 - Posted: 17 Mar 2016, 18:05:52 UTC
Last modified: 17 Mar 2016, 18:26:55 UTC

Did the proxy lease expire? Or something like that? Maybe the credentials have expired?

EDIT: At exactly 14:35 UTC it all went wrong. All jobs are in postproc or failing.


It was pretty much exactly 2.5 days (60 h) from the start of the batch until this happened. Does that help?

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2434 - Posted: 17 Mar 2016, 19:04:45 UTC

CLAIM_WORKLIFE = -1 was ifThenElse(DynamicSlot =?= true,3600,-1)


This was changed 2 days ago.

I suggest setting it back to 1200 sec (the default),
at least for a little while, to see if things change.

Unless you have a better idea.
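
To make the concern concrete, here is a toy model of how a claim's age might be compared against CLAIM_WORKLIFE. It only encodes what is said in this thread (1200 s as the default, -1 meaning indefinite) plus the assumption that a claim stops accepting new jobs once it is older than the limit. It is not HTCondor code, and the function name and parameters are hypothetical:

```python
def claim_accepts_new_job(claim_age_s: float, worklife_s: float) -> bool:
    """Toy rule (assumption): a negative worklife never expires; otherwise
    the claim accepts new jobs only while it is younger than the limit."""
    if worklife_s < 0:
        return True
    return claim_age_s < worklife_s


# Values mentioned in the thread: 1200 s, 3600 s, and -1 (indefinite),
# checked against a claim that is 60 hours old.
sixty_hours = 60 * 3600
for worklife in (1200, 3600, -1):
    print(worklife, claim_accepts_new_job(sixty_hours, worklife))
```

Under this model only the -1 setting would let a 60-hour-old claim keep taking new jobs.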

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 2435 - Posted: 17 Mar 2016, 19:09:50 UTC - in response to Message 2433.  
Last modified: 17 Mar 2016, 19:21:38 UTC

Did the proxy lease expire? Or something like that? Maybe the credentials have expired?

EDIT: At exactly 14:35 UTC it all went wrong. All jobs are in postproc or failing.


It was pretty much exactly 2.5 days (60 h) from the start of the batch until this happened. Does that help?

Proxy is from 15/3, should be good for nearly another 5 days or the jobs wouldn't start, I think. Do your stuck jobs have normal log files at your end?

[Edit] Ah! I missed that the VM was rebooted again today because of the nss bug -- at 14:47 UTC. So it might be the proxy; ISTR we had proxy problems the last time it rebooted. I've copied the proxy from tonight's batch into the older one's area. [/Edit]

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2436 - Posted: 17 Mar 2016, 19:15:45 UTC

Do your stuck jobs have normal log files at your end?

Sorry, can't help you there.
Maybe one of the others can.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2437 - Posted: 17 Mar 2016, 19:35:45 UTC

One successful job in the new batch!

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 2438 - Posted: 17 Mar 2016, 19:55:29 UTC - in response to Message 2437.  
Last modified: 17 Mar 2016, 19:57:53 UTC

One successful job in the new batch!

Three now. The last failure was 1918, these have all finished since then...
[Edit: Oops, no, that was the last batch; the current batch is still failing.]
...but no job logs for the previous batch since 1444, and the server booted at 1447.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2439 - Posted: 17 Mar 2016, 20:08:01 UTC - in response to Message 2438.  

I started a new task about 20 min ago.

I am still concerned about CLAIM_WORKLIFE.
There has been no noticeable benefit since it was changed, or has there?

It may cause some undesired effects if set to -1 (indefinite).
If there is an unlimited claim on one batch, how is that treated when a new batch is started? Can one host have multiple claims open?
Is there some higher-order timeout that counteracts an indefinite-duration claim?

©2024 CERN