Thread 'Stageout failures'

Author	Message
ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 4467 - Posted: 6 Dec 2016, 20:20:53 UTC Last modified: 6 Dec 2016, 20:28:22 UTC We have just started having a large number of jobs fail stage-out. The error messages suggest that a host is down, or possibly that a certificate has expired. I suggest suspending the project, setting hosts to "No New Tasks", or deselecting CMS, to avoid wasting cycles until the problem is identified and fixed.

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 4468 - Posted: 6 Dec 2016, 21:16:21 UTC - in response to Message 4467. An auto update of the host upgraded about 271 packages and some configuration files were altered. The machine has now been reconfigured and the problem is fixed. Puppet, our configuration management tool, has been enabled on this host, so the reconfiguration should be done automatically after any future upgrade.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 4469 - Posted: 6 Dec 2016, 21:43:18 UTC - in response to Message 4468. "An auto update of the host upgraded about 271 packages and some configuration files were altered. The machine has now been reconfigured and the problem is fixed. Puppet, our configuration management tool, has been enabled on this host, so the reconfiguration should be done automatically after any future upgrade." Thanks, Laurence, the log files are showing normal stage-outs again, so everyone can resume with CMS tasks.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 4482 - Posted: 14 Dec 2016, 8:01:04 UTC All red on the Graphs---Dashboard? Or is there a real problem?

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 4483 - Posted: 14 Dec 2016, 8:33:11 UTC - in response to Message 4482. "All red on the Graphs---Dashboard? Or is there a real problem?" No, there's a real problem. I didn't spot it before I went to bed last night, but I couldn't have done anything anyway... Apparently an automatic update on the Data Bridge, or thereabouts, broke our storage configuration. I expect it will be remedied Real Soon Now. That's the 4th or 5th time in recent weeks that an update has killed part of our processing chain. Yet people wonder why I turn off automatic updates as a matter of course whenever I can!

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 4484 - Posted: 14 Dec 2016, 8:56:25 UTC - in response to Message 4483. Last modified: 14 Dec 2016, 9:02:46 UTC Thanks Ivan. Has there been any progress in finding the cause of the stage-out problems? If that were fixed, the error rate would be excellent. It seems that there is not much communication going on. People just upgrading servers without notification is not what one expects in an organization like CERN. EDIT: You said it. Automatic updates are only something for the average home user.

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 4486 - Posted: 14 Dec 2016, 9:43:01 UTC - in response to Message 4484. Yesterday some packages were pushed from the EPEL test repository to EPEL stable. One required package was not pushed, and this led to a binary incompatibility due to an API change. Auto updates are important to ensure that systems are secure, especially when operating a data center with 25822 machines, as updating manually doesn't scale. Procedures can be put in place to reduce the risk of problems but not necessarily avoid them altogether. Once the consolidation has been done, the focus will shift towards higher reliability.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 4487 - Posted: 14 Dec 2016, 9:46:42 UTC - in response to Message 4484. Last modified: 14 Dec 2016, 9:47:43 UTC OK, problem is said to be understood and the fix is under way. One component had a changed API but not everything was updated to take account of it. :-( Speaking of automatic updates, I've found that if you can tell Windows 10 that you are on a metered connection (easy for WiFi, a bit harder to do for wired networks) then it won't automatically download and install updates. [Edit] Ah, Laurence got in ahead of me. [/Edit]

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 4488 - Posted: 14 Dec 2016, 9:59:54 UTC Thanks for the replies, guys. It just seems that the system has been having an increasing number of problems in the past few weeks.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 4489 - Posted: 14 Dec 2016, 10:33:55 UTC - in response to Message 4488. Last modified: 14 Dec 2016, 10:35:29 UTC "Thanks for the replies, guys. It just seems that the system has been having an increasing number of problems in the past few weeks." I believe you are right. On the other hand, we have been making a lot of changes to try to get things more unified and stable. Hopefully those efforts will start to bear fruit soon. We (i.e. T3_CH_Volunteer) are running more jobs than ever and, as I discovered last night, are now the 3rd or 4th largest Tier-3 site in CMS (depending on metric), and one of the most efficient. (Apologies if you can't see that graph, I'm not sure if you need CMS credentials or not...)

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 4490 - Posted: 14 Dec 2016, 10:43:19 UTC - in response to Message 4489. Yes, I can see it. Quite impressive.

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 4491 - Posted: 14 Dec 2016, 12:42:41 UTC - in response to Message 4490. It impresses me too! Maybe those other sites should just install BOINC :) There are discussions internally about light-weight sites and this approach seems to be a good model.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 4501 - Posted: 16 Dec 2016, 21:29:38 UTC Is there a possibility that the stage-out failures occur when two jobs within a task finish at (or nearly at) the same time? I noticed that whenever a 151 error is shown, another job finished at the same time. I will check single-core tasks, and if I am correct, there should be far fewer 151 errors (unless they happen to finish at the same time as well). It would be nice to lower the memory requirement so that I can run 4 tasks with 8GB of memory. Comments?

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 4502 - Posted: 16 Dec 2016, 22:35:45 UTC - in response to Message 4501. "Is there a possibility that the stage-out failures occur when two jobs within a task finish at (or nearly at) the same time? I noticed that whenever a 151 error is shown, another job finished at the same time. I will check single-core tasks, and if I am correct, there should be far fewer 151 errors (unless they happen to finish at the same time as well)." Could be the case depending on the bandwidth available. "It would be nice to lower the memory requirement so that I can run 4 tasks with 8GB of memory." Just faced this myself today. I have 4 cores but less than 8GB of RAM on the VMs I started. I could only get jobs to run if I configured 3 job slots. They have now relaxed the memory requirements for the WMAgent jobs to 1GB, so hopefully I can use all the cores. I hope these jobs will be available to you sometime next month.
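The relationship between cores, RAM and job slots described in the post above can be sketched as a simple calculation. This is only an illustration, not the project's scheduling logic: the function name `usable_job_slots`, the 7.5 GB figure for "less than 8GB", and the 2 GB per-job requirement before the relaxation are all assumptions chosen to reproduce the "3 slots on 4 cores" observation.

```python
def usable_job_slots(cores, memory_gb, memory_per_job_gb):
    """Return how many job slots fit, limited by both cores and memory.

    Hypothetical helper: the real slot configuration is done by the
    project's VM setup, not by this function.
    """
    if memory_per_job_gb <= 0:
        raise ValueError("memory_per_job_gb must be positive")
    slots_by_memory = int(memory_gb // memory_per_job_gb)
    return max(0, min(cores, slots_by_memory))


# Assumed figures: 4 cores, 7.5 GB of RAM, 2 GB required per job.
before = usable_job_slots(cores=4, memory_gb=7.5, memory_per_job_gb=2.0)

# Same machine once the per-job requirement is relaxed to 1 GB.
after = usable_job_slots(cores=4, memory_gb=7.5, memory_per_job_gb=1.0)

print(before, after)
```

Under these assumed numbers memory is the limiting factor before the change and the core count is the limiting factor after it.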

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 4503 - Posted: 16 Dec 2016, 22:48:06 UTC - in response to Message 4502. Thanks, Laurence. I am starting the single-core tasks with the best offset (45 min in my case). Then I will see whether I get any 151 errors.

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 544 Credit: 400,710 RAC: 0	Message 4505 - Posted: 17 Dec 2016, 8:22:00 UTC - in response to Message 4501. "Is there a possibility that the stage-out failures occur when two jobs within a task finish at (or nearly at) the same time? I noticed that whenever a 151 error is shown, another job finished at the same time. I will check single-core tasks, and if I am correct, there should be far fewer 151 errors (unless they happen to finish at the same time as well). It would be nice to lower the memory requirement so that I can run 4 tasks with 8GB of memory. Comments?" I had a lot of 151 errors after the condor server recovered from the last outage, and I still see a very few in the most recent WUs. At the moment I run only single-core WUs from the non-dev project, and only 1 WU at a time. It therefore seems very unlikely that 2 of my uploads interfere with each other, at least from the perspective of a single client host. The situation may be different if you compare the arrival times from different clients from the server's perspective.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,863 RAC: 78	Message 4506 - Posted: 17 Dec 2016, 8:58:25 UTC Same here - 1 single CMS task running:
2016-12-16 18:10:00 (2888): Guest Log: [INFO] New Job Starting in slot1
2016-12-16 18:10:00 (2888): Guest Log: [INFO] Condor JobID: 1813673 in slot1
2016-12-16 18:10:20 (2888): Guest Log: [INFO] CRAB ID: 6490 in slot1
2016-12-17 07:36:09 (4156): Guest Log: [INFO] Job finished in slot1 with 151.
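For anyone wanting to count 151 results in their own task logs, here is a minimal Python sketch that picks out the "Job finished" lines. It assumes the lines have exactly the layout of the excerpt above; the function names `finished_jobs` and `stageout_failures` are made up for this example and are not part of BOINC or the CMS application.

```python
import re

# Matches lines such as:
# 2016-12-17 07:36:09 (4156): Guest Log: [INFO] Job finished in slot1 with 151.
FINISHED = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): "
    r"Guest Log: \[INFO\] Job finished in (?P<slot>slot\d+) "
    r"with (?P<code>\d+)\.?\s*$"
)


def finished_jobs(lines):
    """Return (timestamp, slot, exit_code) for each 'Job finished' line."""
    results = []
    for line in lines:
        match = FINISHED.match(line.strip())
        if match:
            results.append((match["stamp"], match["slot"], int(match["code"])))
    return results


def stageout_failures(lines, code=151):
    """Keep only the finished jobs that ended with the given exit code."""
    return [job for job in finished_jobs(lines) if job[2] == code]


SAMPLE = [
    "2016-12-16 18:10:00 (2888): Guest Log: [INFO] New Job Starting in slot1",
    "2016-12-16 18:10:00 (2888): Guest Log: [INFO] Condor JobID: 1813673 in slot1",
    "2016-12-16 18:10:20 (2888): Guest Log: [INFO] CRAB ID: 6490 in slot1",
    "2016-12-17 07:36:09 (4156): Guest Log: [INFO] Job finished in slot1 with 151.",
]

for stamp, slot, exit_code in stageout_failures(SAMPLE):
    print(stamp, slot, exit_code)
```

Comparing the timestamps returned by `finished_jobs` across slots would be one way to test the idea that failures coincide with another job finishing at the same moment.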

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 4507 - Posted: 17 Dec 2016, 10:14:09 UTC - in response to Message 4506. Was that the only task running, or did you have others running? Just making sure.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,863 RAC: 78	Message 4508 - Posted: 17 Dec 2016, 12:56:33 UTC - in response to Message 4507. "Was that the only task running, or did you have others running? Just making sure." At about the same time, I also had 2 Theory tasks from the main LHC@home project running.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 4509 - Posted: 17 Dec 2016, 12:57:57 UTC - in response to Message 4505. That's likely the "echo" problem I've mentioned elsewhere. If a job has successfully staged out its results, but for whatever reason Condor doesn't recognise that, the retry will fail because the result file is already on the data-bridge, and a 151 error is raised. We need to address that, and a few other similar problems, once we've stabilised ongoing improvements. Please bear with us, have a happy Festive Season, and if you'd rather concentrate on the latter for the next few weeks we shall not mind!
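One common way to make a retry tolerate the "echo" situation described above is to treat "file already present with identical content" as success. The Python sketch below illustrates that idea only; it is not the project's fix. `FakeBridge`, `AlreadyExists` and `stage_out` are hypothetical names, and the in-memory class is a stand-in rather than the real data-bridge interface.

```python
import hashlib


class AlreadyExists(Exception):
    """Raised by the stand-in store when the target name is already taken."""


class FakeBridge:
    """In-memory stand-in for a storage endpoint (assumption, not a real API)."""

    def __init__(self):
        self._objects = {}

    def put(self, name, data):
        if name in self._objects:
            raise AlreadyExists(name)
        self._objects[name] = data

    def checksum(self, name):
        return hashlib.sha256(self._objects[name]).hexdigest()


def stage_out(bridge, name, data):
    """Upload data; accept an identical existing copy instead of failing."""
    try:
        bridge.put(name, data)
        return "uploaded"
    except AlreadyExists:
        # A retry after an unacknowledged success lands here.
        if bridge.checksum(name) == hashlib.sha256(data).hexdigest():
            return "already-present"
        raise  # Different content under the same name is still an error.


bridge = FakeBridge()
first = stage_out(bridge, "result.root", b"payload")
retry = stage_out(bridge, "result.root", b"payload")
print(first, retry)
```

Comparing checksums rather than just names keeps a genuine conflict (same file name, different content) visible as an error.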

Development for LHC@home