Message boards : CMS Application : Stageout failures
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 46
Message 4467 - Posted: 6 Dec 2016, 20:20:53 UTC
Last modified: 6 Dec 2016, 20:28:22 UTC

We have just started having a large number of jobs fail stage-out. The error messages suggest that a host is down, or possibly that a certificate has expired. I suggest suspending the project, setting hosts to "No New Tasks", or deselecting CMS, to avoid wasting cycles until the problem is identified and fixed.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 4468 - Posted: 6 Dec 2016, 21:16:21 UTC - in response to Message 4467.  

An auto update of the host upgraded about 271 packages and some configuration files were altered. The machine has now been reconfigured and the problem is fixed. Puppet, our configuration management tool, has been enabled on this host so this should be automatically done after any upgrade in the future.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 46
Message 4469 - Posted: 6 Dec 2016, 21:43:18 UTC - in response to Message 4468.  

> An auto update of the host upgraded about 271 packages and some configuration files were altered. The machine has now been reconfigured and the problem is fixed. Puppet, our configuration management tool, has been enabled on this host so this should be automatically done after any upgrade in the future.

Thanks, Laurence, the log files are showing normal stage-outs again, so everyone can resume with CMS tasks.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4482 - Posted: 14 Dec 2016, 8:01:04 UTC

All red on the Graphs---Dashboard?

Or is there a real problem?
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 46
Message 4483 - Posted: 14 Dec 2016, 8:33:11 UTC - in response to Message 4482.  

> All red on the Graphs---Dashboard?
>
> Or is there a real problem?

No, there's a real problem. I didn't spot it before I went to bed last night, but I couldn't have done anything anyway... Apparently an automatic update on the Data Bridge, or thereabouts, broke our storage configuration. I expect it will be remedied Real Soon Now.
That's the 4th or 5th time in recent weeks that an update has killed part of our processing chain. Yet people wonder why I turn off automatic updates as a matter of course whenever I can!
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4484 - Posted: 14 Dec 2016, 8:56:25 UTC - in response to Message 4483.  
Last modified: 14 Dec 2016, 9:02:46 UTC

Thanks Ivan.

Has there been any progress in finding the cause of the stage-out problems?
If that were fixed, the error rate would be excellent.

It seems that there is not much communication going on. People upgrading servers without notification is not what one expects from an organization like CERN.

EDIT: You said it. Automatic updates are only something for the average home user.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 4486 - Posted: 14 Dec 2016, 9:43:01 UTC - in response to Message 4484.  

Yesterday some packages were pushed from the EPEL test repository to EPEL stable. One required package was not pushed, and this led to a binary incompatibility due to an API change.

Auto updates are important to ensure that systems are secure, especially when operating a data center with 25822 machines, as updating manually doesn't scale. Procedures can be put in place to reduce the risk of problems but not necessarily avoid them altogether. Once the consolidation has been done, the focus will shift towards higher reliability.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 46
Message 4487 - Posted: 14 Dec 2016, 9:46:42 UTC - in response to Message 4484.  
Last modified: 14 Dec 2016, 9:47:43 UTC

OK, problem is said to be understood and the fix is under way. One component had a changed API but not everything was updated to take account of it. :-(
Speaking of automatic updates, I've found that if you can tell Windows 10 that you are on a metered connection (easy for WiFi, a bit harder to do for wired networks) then it won't automatically download and install updates.
[Edit] Ah, Laurence got in ahead of me. [/Edit]
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4488 - Posted: 14 Dec 2016, 9:59:54 UTC

Thanks for the replies, guys.

It just seems that the system has been having an increasing number of problems over the past few weeks.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 46
Message 4489 - Posted: 14 Dec 2016, 10:33:55 UTC - in response to Message 4488.  
Last modified: 14 Dec 2016, 10:35:29 UTC

> Thanks for the replies, guys.
>
> It just seems that the system has been having an increasing number of problems over the past few weeks.

I believe you are right. On the other hand, we have been making a lot of changes to try to get things more unified and stable. Hopefully those efforts will start to bear fruit soon. We (i.e. T3_CH_Volunteer) are running more jobs than ever, and as I discovered last night, are now the 3rd or 4th largest Tier-3 site in CMS (depending on metric), and one of the most efficient. (Apologies if you can't see that graph, I'm not sure if you need CMS credentials or not...)
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4490 - Posted: 14 Dec 2016, 10:43:19 UTC - in response to Message 4489.  

Yes, I can see it.
Quite impressive.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 4491 - Posted: 14 Dec 2016, 12:42:41 UTC - in response to Message 4490.  

It impresses me too! Maybe those other sites should just install BOINC :) There are discussions internally about light-weight sites and this approach seems to be a good model.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4501 - Posted: 16 Dec 2016, 21:29:38 UTC

Is there a possibility that the stage-out failures occur when two jobs within a task finish at (or nearly at) the same time?
I noticed that whenever a 151 error is shown, another job finished at the same time.
I will check single-core tasks, and if I am correct, there should be far fewer 151 errors (unless they happen to finish at the same time as well).

It would be nice to lower the memory requirement so that I can run 4 tasks with 8GB of memory.

Comments?
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 4502 - Posted: 16 Dec 2016, 22:35:45 UTC - in response to Message 4501.  

> Is there a possibility that the stage-out failures occur when two jobs within a task finish at (or nearly at) the same time?
> I noticed that whenever a 151 error is shown, another job finished at the same time.
> I will check single-core tasks, and if I am correct, there should be far fewer 151 errors (unless they happen to finish at the same time as well).

Could be the case, depending on the bandwidth available.

> It would be nice to lower the memory requirement so that I can run 4 tasks with 8GB of memory.

Just faced this myself today. I have 4 cores but less than 8GB of RAM on the VMs I started. I could only get jobs to run if I configured 3 job slots. They have now relaxed the memory requirements for the WMAgent jobs to 1GB, so hopefully I can use all the cores. These jobs should eventually be available to you sometime next month, I hope.
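
As a rough sketch of the slot arithmetic described here: the number of job slots is capped both by the core count and by how many per-job memory allocations fit in RAM. The 7.5 GB of RAM and the 2 GB earlier per-job figure below are assumptions chosen only to reproduce the "4 cores, 3 slots" situation; only the 1 GB figure comes from the post. `usable_slots` is a hypothetical helper, not part of BOINC or WMAgent.

```python
def usable_slots(cores, ram_gb, per_job_gb):
    """Return how many jobs can run at once.

    The count is limited by the number of cores and by how many
    per-job memory allocations fit into the available RAM.
    """
    if per_job_gb <= 0:
        raise ValueError("per_job_gb must be positive")
    return min(cores, int(ram_gb // per_job_gb))


# Assumed figures: a VM with 7.5 GB of RAM and 4 cores.
print(usable_slots(cores=4, ram_gb=7.5, per_job_gb=2.0))  # assumed old requirement
print(usable_slots(cores=4, ram_gb=7.5, per_job_gb=1.0))  # relaxed 1 GB requirement
```

With the assumed 2 GB per job the RAM is the limit; with 1 GB per job the core count becomes the limit instead.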
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4503 - Posted: 16 Dec 2016, 22:48:06 UTC - in response to Message 4502.  

Thanks, Laurence.

I am starting the single-core tasks with the best offset (45 min in my case).
Then I will see if I get any 151 errors.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 0
Message 4505 - Posted: 17 Dec 2016, 8:22:00 UTC - in response to Message 4501.  

> Is there a possibility that the stage-out failures occur when two jobs within a task finish at (or nearly at) the same time?
> I noticed that whenever a 151 error is shown, another job finished at the same time.
> I will check single-core tasks, and if I am correct, there should be far fewer 151 errors (unless they happen to finish at the same time as well).
>
> It would be nice to lower the memory requirement so that I can run 4 tasks with 8GB of memory.
>
> Comments?

I had a lot of 151 errors after the Condor server recovered from the last outage, and I still see a few in the most recent WUs.

At the moment I run only single-core WUs from the non-dev project, and only 1 WU at a time. It therefore seems very unlikely that 2 of my uploads interfere with each other, at least from the perspective of a single client host.
The situation may be different if you compare the arrival times from different clients from the server's perspective.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 9
Message 4506 - Posted: 17 Dec 2016, 8:58:25 UTC

Same here - 1 single CMS task running:

2016-12-16 18:10:00 (2888): Guest Log: [INFO] New Job Starting in slot1
2016-12-16 18:10:00 (2888): Guest Log: [INFO] Condor JobID: 1813673 in slot1
2016-12-16 18:10:20 (2888): Guest Log: [INFO] CRAB ID: 6490 in slot1
2016-12-17 07:36:09 (4156): Guest Log: [INFO] Job finished in slot1 with 151.
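
To test the coincidence hypothesis against logs like the one above, one could pull out every "Job finished" line and check whether a 151 exit has another finish close by. This is a minimal Python sketch, assuming the line format shown above; the function names and the 5-minute window are illustrative choices, not anything from the project.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
# 2016-12-17 07:36:09 (4156): Guest Log: [INFO] Job finished in slot1 with 151.
FINISH_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): Guest Log: \[INFO\] "
    r"Job finished in (slot\d+) with (\d+)\.?$"
)


def parse_finishes(lines):
    """Return a list of (timestamp, slot, exit_code) for 'Job finished' lines."""
    finishes = []
    for line in lines:
        match = FINISH_RE.match(line.strip())
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            finishes.append((ts, match.group(2), int(match.group(3))))
    return finishes


def overlapping_151(finishes, window=timedelta(minutes=5)):
    """For each 151 finish, note whether another job finished within `window`."""
    report = []
    for i, (ts, slot, code) in enumerate(finishes):
        if code != 151:
            continue
        near = any(
            j != i and abs(other_ts - ts) <= window
            for j, (other_ts, _, _) in enumerate(finishes)
        )
        report.append((ts, slot, near))
    return report


SAMPLE = [
    "2016-12-16 18:10:00 (2888): Guest Log: [INFO] New Job Starting in slot1",
    "2016-12-16 18:10:00 (2888): Guest Log: [INFO] Condor JobID: 1813673 in slot1",
    "2016-12-16 18:10:20 (2888): Guest Log: [INFO] CRAB ID: 6490 in slot1",
    "2016-12-17 07:36:09 (4156): Guest Log: [INFO] Job finished in slot1 with 151.",
]

for ts, slot, near in overlapping_151(parse_finishes(SAMPLE)):
    print(ts, slot, "another finish nearby" if near else "no other finish nearby")
```

For the single-job log above, the one 151 exit has no other finish nearby, which matches the observation that the error also shows up with a single CMS task running.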
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4507 - Posted: 17 Dec 2016, 10:14:09 UTC - in response to Message 4506.  

Was that the only task running, or did you have others running?
Just making sure.
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 9
Message 4508 - Posted: 17 Dec 2016, 12:56:33 UTC - in response to Message 4507.  

> Was that the only task running, or did you have others running?
> Just making sure.

At about the same time, I also had 2 Theory tasks from the main LHC@home project running.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 46
Message 4509 - Posted: 17 Dec 2016, 12:57:57 UTC - in response to Message 4505.  

That's likely the "echo" problem I've mentioned elsewhere. If a job has successfully staged-out its results, but for whatever reason Condor doesn't recognise that, the retry will fail because the result file is already on the data-bridge and a 151 error is raised. We need to address that, and a few other similar problems, once we've stabilised ongoing improvements. Please bear with us, have a happy Festive Season, and if you'd rather concentrate on the latter for the next few weeks we shall not mind!
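
One way to make a retry tolerate the "echo" case is to treat "the same file is already there" as success rather than as an error. The sketch below is only an illustration, using a local directory as a stand-in for the data bridge; `stage_out`, the checksum comparison and the `.part` temporary name are assumptions, not the project's actual stage-out code or its eventual fix.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_out(src, dest):
    """Copy `src` to `dest` in a way that is safe to retry.

    Returns "uploaded" on a fresh copy and "already-staged" when an
    identical file is already at the destination. Raises FileExistsError
    only when the destination holds different content.
    """
    src, dest = Path(src), Path(dest)
    if dest.exists():
        if sha256_of(src) == sha256_of(dest):
            return "already-staged"  # an earlier attempt succeeded
        raise FileExistsError(f"{dest} exists with different content")
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    shutil.copyfile(src, tmp)  # write under a temporary name first
    tmp.replace(dest)          # then rename into place
    return "uploaded"
```

Called twice with the same source file, the second call returns "already-staged" instead of failing, which is the behaviour a retried job would need in order to avoid a spurious 151.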


©2024 CERN