Message boards : Number crunching : Expect errors eventually
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2109 - Posted: 26 Feb 2016, 21:00:07 UTC

I think the host that produces an invalid result every 20 min or so has to be stopped.

This host is responsible for ALL failures in the last two batches.

Even if this host eventually responds to a PM, measures need to be taken to stop this kind of continuous failure.
ID: 2109 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2168 - Posted: 2 Mar 2016, 19:46:16 UTC

Since about 15:50 UTC the failures have increased dramatically. Code 8001.
ID: 2168 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 129
Message 2170 - Posted: 2 Mar 2016, 19:53:36 UTC - in response to Message 2168.  

We removed a temporary fix to try to speed up the boot time, but it looks like it may still be needed. We have reverted the change and should see the result in a few hours.
ID: 2170 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2197 - Posted: 3 Mar 2016, 21:08:12 UTC

In the "CMS Jobs" plot, the number of running jobs does not agree with the plot below.
ID: 2197 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 2198 - Posted: 3 Mar 2016, 21:16:22 UTC - in response to Message 2197.  

Rasputin42 wrote:
In the "CMS Jobs" plot, the number of running jobs does not agree with the plot below.

They come from two different sources. The headline numbers come directly from the Condor server (Andrew runs a cron job every minute to make a file Laurence picks up from the Web server) -- these currently contain a number of "lost" jobs from the recent kerfuffles. The plots come from Dashboard based on reports it receives from the server as jobs finish or are declared failures (AIUI).
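
As a rough illustration of why the two numbers can disagree, here is a minimal Python sketch. The record layouts, job names, and both function names are hypothetical and are not taken from the Condor server or from Dashboard; the sketch only assumes that one count is a snapshot of the queue and the other is derived from per-job reports.

```python
def headline_running(queue_snapshot):
    """Count jobs that the queue snapshot still lists as running."""
    return sum(1 for job in queue_snapshot if job["state"] == "running")


def reported_running(reports):
    """Count jobs that reported a start but no finish or failure yet."""
    started = {r["job"] for r in reports if r["event"] == "started"}
    ended = {r["job"] for r in reports if r["event"] in ("finished", "failed")}
    return len(started - ended)


# Hypothetical data: job "c" is "lost". The queue still lists it as
# running, but the report stream has already declared it a failure.
queue_snapshot = [
    {"job": "a", "state": "running"},
    {"job": "b", "state": "running"},
    {"job": "c", "state": "running"},
]
reports = [
    {"job": "a", "event": "started"},
    {"job": "b", "event": "started"},
    {"job": "c", "event": "started"},
    {"job": "c", "event": "failed"},
]

print(headline_running(queue_snapshot))  # 3
print(reported_running(reports))         # 2
```

In this toy model the headline number stays one higher than the report-based number until the lost job is cleared from the queue.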
ID: 2198 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2204 - Posted: 3 Mar 2016, 23:12:58 UTC

On a positive note:
The number of "unknown" and "WNPstProc" state jobs is falling, slowly.
ID: 2204 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 2205 - Posted: 3 Mar 2016, 23:45:19 UTC - in response to Message 2204.  

Rasputin42 wrote:
On a positive note:
The number of "unknown" and "WNPstProc" state jobs is falling, slowly.

Thanks. I'm still perplexed by the number of failures to find the site-config-local file more than 24 hours after we reversed our little "Oops!". I emailed one of the more prolific victims (72 failures on one machine) but haven't seen a response yet.
ID: 2205 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2207 - Posted: 3 Mar 2016, 23:55:37 UTC - in response to Message 2205.  
Last modified: 4 Mar 2016, 0:09:38 UTC

This is why we need a "clean run".
Maybe a smaller batch (2500), as we do not have that many volunteers (machines).

BTW: I saw someone's cms-task finished yesterday, but it is still "Waiting for validation".
How is that possible?
ID: 2207 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 2209 - Posted: 4 Mar 2016, 8:22:57 UTC - in response to Message 2207.  

Rasputin42 wrote:
BTW: I saw someone's cms-task finished yesterday, but it is still "Waiting for validation".
How is that possible?

Program: cms_bitwise_validator (CMS) ------ State: Not Running
ID: 2209 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 2212 - Posted: 4 Mar 2016, 11:01:05 UTC - in response to Message 2207.  

Rasputin42 wrote:
This is why we need a "clean run".
Maybe a smaller batch (2500), as we do not have that many volunteers (machines).

Yeah, we've got a while to go with this batch to let things clear through the system. BTW, thanks for tickling my memory, I just renewed the proxy for the current batch, it would have expired this afternoon.
ID: 2212 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 129
Message 2213 - Posted: 4 Mar 2016, 11:09:47 UTC - in response to Message 2209.  

Thanks, it is running again.
ID: 2213 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2223 - Posted: 4 Mar 2016, 13:47:43 UTC

Rasputin42 wrote:
On a positive note:
The number of "unknown" and "WNPstProc" state jobs is falling, slowly.

They are nearly down to zero. Excellent!
ID: 2223 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 2270 - Posted: 8 Mar 2016, 14:18:04 UTC

Don't be alarmed at the current number of failures. A test batch of jobs was submitted with the wrong destination. They should clear up soon.
ID: 2270 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 2272 - Posted: 8 Mar 2016, 15:51:14 UTC - in response to Message 2270.  

ivan wrote:
Don't be alarmed at the current number of failures. A test batch of jobs was submitted with the wrong destination. They should clear up soon.

Two other batches of 100 jobs from Leonardo Cristella were added.
Short jobs of 25 events, but all are failing, mostly due to "Stage Out Failure in ProdAgent job".
ID: 2272 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 2273 - Posted: 8 Mar 2016, 20:29:17 UTC - in response to Message 2272.  
Last modified: 8 Mar 2016, 20:38:17 UTC

Crystal Pellet wrote:
Two other batches of 100 jobs from Leonardo Cristella were added.
Short jobs of 25 events, but all are failing, mostly due to "Stage Out Failure in ProdAgent job".

Not 100% true. From the last batch most jobs finished successfully:

http://dashb-cms-job.cern.ch/dashboard/templates/task-analysis/#user=Leonardo+Cristella&refresh=0&table=Jobs&p=1&records=200&activemenu=0&status=&site=&tid=160308_134103%3Alecriste_crab_CMS_at_Home_Phedex1

Interesting: these 25-event jobs are short, but they could have finished even sooner, because a lot of sleep is imposed on us.

Job Running time in seconds: 607
Job runtime is less than 20minutes. Sleeping 593
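
The sleep figure in the log is consistent with padding each job up to a 20-minute minimum: 1200 - 607 = 593 seconds. A minimal Python sketch of that arithmetic follows; the function name and the constant are made up for illustration, and the actual wrapper script may compute it differently.

```python
# Assumed minimum wall-clock time per job, taken from the log message
# "Job runtime is less than 20minutes".
MIN_RUNTIME_S = 20 * 60


def padding_sleep(runtime_s, min_runtime_s=MIN_RUNTIME_S):
    """Return how many seconds to sleep so the job lasts at least min_runtime_s."""
    return max(0, min_runtime_s - runtime_s)


print(padding_sleep(607))   # 593, matching the "Sleeping 593" log line
print(padding_sleep(1500))  # 0, a job longer than 20 minutes is not padded
```

On these numbers, almost half of the 20 minutes is spent sleeping rather than computing.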
ID: 2273 · Rating: 0
Leonardo Cristella
Joined: 4 Mar 16
Posts: 31
Credit: 44,320
RAC: 0
Message 2274 - Posted: 8 Mar 2016, 23:12:11 UTC

Dear all,
I just started to work on the CMS@Home project and I recently submitted a bunch of test CRAB tasks; please ignore them.
I will reduce the number of jobs in my next tasks so that performance will not be affected.

Best regards,
Leonardo
ID: 2274 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 2298 - Posted: 10 Mar 2016, 9:01:50 UTC - in response to Message 2212.  
Last modified: 10 Mar 2016, 9:02:28 UTC

ivan wrote:
... tickling my memory, I just renewed the proxy for the current batch, it would have expired this afternoon.

Unless you renewed the proxy on the quiet . . . the above message is almost 6 days old and we still have over 2000 jobs from that batch to go.
ID: 2298 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 2299 - Posted: 10 Mar 2016, 9:12:06 UTC - in response to Message 2298.  

ivan wrote:
... tickling my memory, I just renewed the proxy for the current batch, it would have expired this afternoon.

Crystal Pellet wrote:
Unless you renewed the proxy on the quiet . . . the above message is almost 6 days old and we still have over 2000 jobs from that batch to go.

Right, thanks. It's the initial one that's 7 days by default; I renewed it with the maximum I can get, 8 days -- but I do need to do it again soon.
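
For a rough idea of the timeline, the sketch below works out when an 8-day proxy would expire. It assumes the renewal happened at about the time of Message 2212 (4 Mar 2016, 11:01:05 UTC) and that the lifetime is exactly 8 days; both are assumptions, and the helper name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def proxy_expiry(issued_at, lifetime_days):
    """Return the expiry time of a proxy issued at issued_at."""
    return issued_at + timedelta(days=lifetime_days)


# Assumed renewal time: the timestamp of Message 2212.
renewed = datetime(2016, 3, 4, 11, 1, 5, tzinfo=timezone.utc)
expiry = proxy_expiry(renewed, lifetime_days=8)

# Time of this reply (Message 2299).
now = datetime(2016, 3, 10, 9, 12, 6, tzinfo=timezone.utc)
remaining = expiry - now

print(expiry)     # 2016-03-12 11:01:05+00:00
print(remaining)  # 2 days, 1:48:59
```

Under those assumptions there were only about two days of proxy lifetime left at the time of this reply.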
ID: 2299 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2400 - Posted: 16 Mar 2016, 9:07:50 UTC

FYI: the last 4 failures (out of 6) were produced by what is by far the slowest machine in the field.
Coincidence?
ID: 2400 · Rating: 0


©2024 CERN