Expect errors eventually
Author | Message |
---|---|
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I think the host that produces an invalid result every 20 min or so has to be stopped. This host is responsible for ALL failures in the last two batches. Even if this host eventually responds to a PM, measures need to be taken to stop this kind of continuous failure. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Since about 15:50 UTC the failures have increased dramatically. Code 8001. |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 129 |
We removed a temporary fix to try to speed up the boot time but it looks like it may still be needed. Have reverted back and we should see the result in a few hours. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
In the "CMS Jobs" plot, the number of running jobs does not agree with the plot below. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,937,121 RAC: 3,148 |
In the "CMS Jobs" plot, the number of running jobs does not agree with the plot below. They come from two different sources. The headline numbers come directly from the Condor server (Andrew runs a cron job every minute to make a file Laurence picks up from the Web server) -- these currently contain a number of "lost" jobs from the recent kerfuffles. The plots come from Dashboard based on reports it receives from the server as jobs finish or are declared failures (AIUI). |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
On a positive note: The numbers of jobs in the "unknown" and "WNPstProc" states are falling, slowly. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,937,121 RAC: 3,148 |
On a positive note: Thanks. I'm still perplexed by the number of failures to find the site-config-local file more than 24 hours after we reversed our little "Oops!". I emailed one of the more prolific victims (72 failures on one machine) but haven't seen a response yet. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
This is why we need a "clean run". Maybe a smaller batch (2500), as we do not have that many volunteers (machines). BTW: I saw someone's CMS task finished yesterday, but it is "Waiting for validation". How is that possible? |
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466 |
BTW: I saw someone's CMS task finished yesterday, but it is "Waiting for validation". Program: cms_bitwise_validator (CMS) ------ State: Not Running |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,937,121 RAC: 3,148 |
This is why we need a "clean run". Yeah, we've got a while to go with this batch to let things clear through the system. BTW, thanks for tickling my memory, I just renewed the proxy for the current batch, it would have expired this afternoon. |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 129 |
Thanks, it is running again. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
On a positive note: They are nearly down to zero. Excellent! |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,937,121 RAC: 3,148 |
Don't be alarmed at the current number of failures. A test batch of jobs was submitted with the wrong destination. They should clear up soon. |
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466 |
Don't be alarmed at the current number of failures. A test batch of jobs was submitted with the wrong destination. They should clear up soon. 2 other batches of 100 jobs from Leonardo Cristella were added. Short jobs of 25 events, but all are failing, most due to "Stage Out Failure in ProdAgent job" |
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466 |
2 other batches of 100 jobs from Leonardo Cristella were added. Not 100% true. From the last batch most jobs finished successfully: http://dashb-cms-job.cern.ch/dashboard/templates/task-analysis/#user=Leonardo+Cristella&refresh=0&table=Jobs&p=1&records=200&activemenu=0&status=&site=&tid=160308_134103%3Alecriste_crab_CMS_at_Home_Phedex1 Interesting: these jobs with 25 events are short, but could have been faster, because a lot of sleep is permitted to us. From the job log: "Job Running time in seconds: 607", "Job runtime is less than 20minutes. Sleeping 593" |
Send message Joined: 4 Mar 16 Posts: 31 Credit: 44,320 RAC: 0 |
Dear all, I just started to work on the CMS@Home project and I recently submitted a bunch of test CRAB tasks; please ignore them. I will reduce the number of jobs in my next tasks so that performance will not be affected. Best regards, Leonardo |
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466 |
ivan wrote: ... tickling my memory, I just renewed the proxy for the current batch, it would have expired this afternoon. If you did not renew the proxy on the quiet . . . the above message is almost 6 days old and we still have over 2000 jobs from that batch to go. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,937,121 RAC: 3,148 |
ivan wrote:... tickling my memory, I just renewed the proxy for the current batch, it would have expired this afternoon. Right, thanks. It's the initial one that's 7 days by default; I renewed it with the maximum I can get, 8 days -- but I do need to do it again soon. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
FYI: the last 4 failures (out of 6) were produced by by far the slowest machine in the field. Coincidence? |
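
The job log quoted earlier in the thread (a 607-second job followed by "Sleeping 593") is consistent with short jobs being padded up to a 20-minute minimum, since 607 + 593 = 1200 seconds. A minimal sketch of that arithmetic follows. It assumes a fixed 1200-second floor inferred from the log message; the function name and constant are illustrative and are not taken from the actual CMS@Home job wrapper.

```python
# Hypothetical illustration of the padding implied by the quoted log lines.
# MIN_RUNTIME_S and padding_sleep are made-up names, not the real wrapper's.
MIN_RUNTIME_S = 20 * 60  # assumed 20-minute floor ("less than 20minutes" in the log)


def padding_sleep(runtime_s: int, min_runtime_s: int = MIN_RUNTIME_S) -> int:
    """Return the seconds to sleep so the job lasts at least min_runtime_s."""
    return max(0, min_runtime_s - runtime_s)


print(padding_sleep(607))   # 593, matching "Sleeping 593" in the log
print(padding_sleep(1500))  # 0, a job longer than the floor needs no padding
```

Under this assumption, a 25-event job that finishes in about 10 minutes spends roughly as long sleeping as it does computing.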