Message boards : CMS Application : Stageout failures
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 46
We have just started seeing a large number of jobs fail at stage-out. The error messages suggest that a host is down, or possibly that a certificate has expired. I suggest suspending the project, setting hosts to "No New Tasks", or deselecting CMS, to avoid wasting cycles until the problem is identified and fixed.
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
An auto update of the host upgraded about 271 packages, and some configuration files were altered. The machine has now been reconfigured and the problem is fixed. Puppet, our configuration management tool, has been enabled on this host, so the reconfiguration should happen automatically after any future upgrade.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 46
> The machine has now been reconfigured and the problem is fixed.

Thanks, Laurence, the log files are showing normal stage-outs again, so everyone can resume CMS tasks.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
All red on the Graphs---Dashboard? Or is there a real problem?
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 46
> All red on the Graphs---Dashboard?

No, there's a real problem. I didn't spot it before I went to bed last night, but I couldn't have done anything anyway... Apparently an automatic update on the Data Bridge, or thereabouts, broke our storage configuration. I expect it will be remedied Real Soon Now. That's the 4th or 5th time in recent weeks that an update has killed part of our processing chain. Yet people wonder why I turn off automatic updates as a matter of course whenever I can!
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Thanks, Ivan. Has there been any progress in finding the cause of the stage-out problems? If that were fixed, the error rate would be excellent. It seems that there is not much communication going on. People upgrading servers without notification is not what you would expect in an organization like CERN. EDIT: You said it. Automatic updates are only something for the average home user.
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
Yesterday some packages were pushed from the EPEL test repository to EPEL stable. One package that was needed was not pushed, and this led to a binary incompatibility due to an API change. Auto updates are important to ensure that systems are secure, especially when operating a data center with 25822 machines, as updating manually doesn't scale. Procedures can be put in place to reduce the risk of problems, but not necessarily to avoid them altogether. Once the consolidation has been done, the focus will shift towards higher reliability.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 46
OK, the problem is said to be understood and the fix is under way. One component had a changed API but not everything was updated to take account of it. :-( Speaking of automatic updates, I've found that if you can tell Windows 10 that you are on a metered connection (easy for WiFi, a bit harder to do for wired networks) then it won't automatically download and install updates. [Edit] Ah, Laurence got in ahead of me. [/Edit]
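Marking a wired connection as metered comes down to the DefaultMediaCost registry value. A minimal sketch, assuming Python 3 on Windows run as Administrator; note that Windows protects this key, so you may first need to take ownership of it (the function name here is made up):

```python
# Sketch: mark the wired (Ethernet) adapter as a metered connection by
# setting the DefaultMediaCost registry value (1 = unrestricted, 2 = metered).
# Assumes an elevated prompt and ownership of the key, which Windows
# protects by default. Hypothetical helper, not part of any CMS/BOINC tool.
import winreg

KEY_PATH = (r"SOFTWARE\Microsoft\Windows NT\CurrentVersion"
            r"\NetworkList\DefaultMediaCost")

def set_ethernet_metered(metered: bool = True) -> None:
    cost = 2 if metered else 1
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH,
                        access=winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, "Ethernet", 0, winreg.REG_DWORD, cost)

if __name__ == "__main__":
    set_ethernet_metered(True)
```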
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Thanks for the replies, guys. It just seems that the system has been running into an increasing number of problems over the past few weeks.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 46
> Thanks for the replies, guys.

I believe you are right. On the other hand, we have been making a lot of changes to try to get things more unified and stable. Hopefully those efforts will start to bear fruit soon. We (i.e. T3_CH_Volunteer) are running more jobs than ever and, as I discovered last night, are now the 3rd or 4th largest Tier-3 site in CMS (depending on the metric), and one of the most efficient. (Apologies if you can't see that graph, I'm not sure if you need CMS credentials or not...)
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Yes, I can see it. Quite impressive.
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
It impresses me too! Maybe those other sites should just install BOINC :) There are internal discussions about lightweight sites, and this approach seems to be a good model.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Is there a possibility that the stage-out failures occur when two jobs within a task finish at (or nearly) the same time? I noticed that whenever a 151 error is shown, another job finished at the same time. I will check single-core tasks; if I am correct, there should be a lot fewer 151 errors (unless they happen to finish at the same time as well). It would also be nice to lower the memory requirement so that I can run 4 tasks with 8 GB of memory. Comments?
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
> Is there a possibility that the stage-out failures occur when two jobs within a task finish at (or nearly) the same time?

Could be the case, depending on the bandwidth available. Just faced this myself today: I have 4 cores but less than 8 GB of RAM on the VMs I started, and could only get jobs to run if I configured 3 job slots. They have now relaxed the memory requirement for the WMAgent jobs to 1 GB, so hopefully I can use all the cores. These jobs should eventually be available for you sometime next month, I hope.
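The slot arithmetic behind this is straightforward: the usable slot count is bounded both by the number of cores and by how many per-job memory shares fit into RAM. A toy sketch; the 2 GB figure for the old requirement is an assumption, not stated in the thread, and the function name is made up:

```python
# Toy model of the job-slot limit described above: slots are bounded by
# both the core count and how many per-job memory shares fit in RAM.
def usable_slots(cores: int, ram_gb: float, mem_per_job_gb: float) -> int:
    return min(cores, int(ram_gb // mem_per_job_gb))

# 4 cores, just under 8 GB of RAM, assumed 2 GB/job: only 3 slots fit.
print(usable_slots(cores=4, ram_gb=7.5, mem_per_job_gb=2.0))  # 3
# With the requirement relaxed to 1 GB/job, all 4 cores can be used.
print(usable_slots(cores=4, ram_gb=7.5, mem_per_job_gb=1.0))  # 4
```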
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Thanks, Laurence. I am starting the single-core tasks with the best offset (45 min in my case). Then I will see if I get any 151 errors.
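The offset idea amounts to spreading the task starts evenly across the typical job duration so that the stage-outs don't coincide. A toy sketch, purely illustrative; the 3-hour duration is an assumption chosen so that 4 tasks reproduce the 45-minute offset mentioned above:

```python
# Toy sketch of staggering n task starts over a typical job duration T:
# spacing starts by T/n keeps the finishes (and stage-outs) maximally
# separated. Names and numbers are illustrative, not from BOINC or CMS.
from datetime import timedelta

def start_offsets(n_tasks: int, job_duration: timedelta) -> list[timedelta]:
    step = job_duration / n_tasks
    return [i * step for i in range(n_tasks)]

# e.g. 4 tasks and an assumed 3-hour job give 45-minute offsets.
for offset in start_offsets(4, timedelta(hours=3)):
    print(offset)  # 0:00:00, 0:45:00, 1:30:00, 2:15:00
```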
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 0
> Is there a possibility that the stage-out failures occur when two jobs within a task finish at (or nearly) the same time?

I had a lot of 151 errors after the Condor server recovered from the last outage, but only very few in the most recent WUs. At the moment I run only single-core WUs from the non-dev project, and only 1 WU at a time, so it seems very unlikely that 2 of my uploads interfere with each other, at least from the perspective of a single client host. It may be a different situation if you compare the arrival times from different clients from the server's perspective.
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9
Same here - 1 single CMS task running:

2016-12-16 18:10:00 (2888): Guest Log: [INFO] New Job Starting in slot1
2016-12-16 18:10:00 (2888): Guest Log: [INFO] Condor JobID: 1813673 in slot1
2016-12-16 18:10:20 (2888): Guest Log: [INFO] CRAB ID: 6490 in slot1
2016-12-17 07:36:09 (4156): Guest Log: [INFO] Job finished in slot1 with 151.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Was that the only task running, or did you have others as well? Just making sure.
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9
> Was that the only task running, or did you have others as well?

At about the same time, I also had 2 Theory tasks from the main LHC@home project running.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 46
That's likely the "echo" problem I've mentioned elsewhere. If a job has successfully staged out its results but, for whatever reason, Condor doesn't recognise that, the retry will fail because the result file is already on the Data Bridge, and a 151 error is raised. We need to address that, and a few other similar problems, once we've stabilised the ongoing improvements. Please bear with us, have a happy Festive Season, and if you'd rather concentrate on the latter for the next few weeks, we shall not mind!
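One plausible way to handle that failure mode, sketched here purely as an illustration (none of these names come from the actual CMS or Condor stage-out code), is to make the retry idempotent: if the file already exists on the Data Bridge with matching content, count the stage-out as a success instead of raising the equivalent of error 151.

```python
# Hypothetical sketch of an idempotent stage-out retry for the "echo"
# failure mode described above. remote_exists / remote_checksum / upload
# stand in for whatever the Data Bridge client actually provides; they
# are assumptions, not real CMS APIs.
import hashlib
from pathlib import Path

def stage_out(local: Path, remote_exists, remote_checksum, upload) -> str:
    digest = hashlib.sha256(local.read_bytes()).hexdigest()
    if remote_exists(local.name):
        # A previous attempt already succeeded but was never recorded:
        # accept it rather than failing the retry.
        if remote_checksum(local.name) == digest:
            return "already-staged"
        raise RuntimeError("remote file exists with different content")
    upload(local)
    return "staged"
```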