Message boards : Number crunching : Expect errors eventually
Joined: 20 Jan 15 | Posts: 1139 | Credit: 8,310,612 | RAC: 0
No-one's commented yet, but some of you may have noticed that this batch of jobs is longer than the last. That's because I want to explore whether we do have problems with longer jobs or larger result files. There has been some suggestion in the past that bigger files may lead to transfer time-outs. So, I'm increasing the number of events in each job to try to provoke failures, to get some idea of where our limits are. I'll probably push on until we get 5-10% failures (i.e. approaching Atlas's failure rate), but at present this batch is looking hopeful.
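For a sense of how sharply a single batch pins down its failure rate, here is a minimal sketch (the 1,000-job batch size is taken from later in this thread; the 5% rate is just an example figure, not a measurement):

```python
import math

def failure_rate_margin(jobs, rate):
    """Approximate 95% margin of error on an observed failure rate,
    using the normal approximation to the binomial distribution."""
    return 1.96 * math.sqrt(rate * (1 - rate) / jobs)

# Example: a batch of 1,000 jobs with an assumed 5% true failure rate.
jobs, rate = 1000, 0.05
margin = failure_rate_margin(jobs, rate)
print(f"Expected failures: {jobs * rate:.0f} +/- {jobs * margin:.0f}")
print(f"Observed rate would be {rate:.1%} +/- {margin:.1%}")
# ~50 +/- 14 failures, i.e. 5.0% +/- 1.4%, so a 5-10% failure rate in a
# 1,000-job batch stands out clearly from a few-percent baseline.
```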
Joined: 20 Mar 15 | Posts: 243 | Credit: 886,442 | RAC: 0
Thanks for the warning. Could this explain the unexplained "validate errors" that are causing some disquiet over there, one wonders. Atlas themselves don't say much.
Joined: 20 Jan 15 | Posts: 1139 | Credit: 8,310,612 | RAC: 0
> Thanks for the warning. Could this explain the unexplained "validate errors" that...

No, I can't comment. I know next-to-nothing about the Atlas project.
Joined: 13 Feb 15 | Posts: 1221 | Credit: 920,540 | RAC: 1,548
> No-one's commented yet, but some of you may have noticed that this batch of jobs is longer than the last...

Already noticed the increase from 25 to 50 events, but on average an event is processed a bit faster in this batch, so the total runtime per job is about 160% of a job from the previous batch. Is the result file about 66 MB and transferred to you without compression? No problem here - it was transferred within ~32 seconds.
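A quick sanity check on those numbers, as a back-of-the-envelope sketch using only the figures quoted in this thread (treated as exact for the arithmetic):

```python
# Rough transfer-rate check from the figures quoted above.
result_mb = 66          # result file size for a 50-event job, MB
transfer_s = 32         # observed upload time, seconds

rate = result_mb / transfer_s          # effective upload rate, MB/s
doubled_job_s = 2 * result_mb / rate   # expected time for a ~132 MB file

print(f"Effective rate: {rate:.1f} MB/s")
print(f"A 100-event (~{2 * result_mb} MB) result at the same rate: "
      f"~{doubled_job_s:.0f} s")
# ~2.1 MB/s and ~64 s, which is in line with the ~75 s reported later
# in this thread for a 129.92 MB upload.
```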
Joined: 29 May 15 | Posts: 147 | Credit: 2,842,484 | RAC: 0
If necessary, you could make it selectable for the user, so each of us can choose whether to process only the small jobs or the larger ones as well.
Joined: 20 Jan 15 | Posts: 1139 | Credit: 8,310,612 | RAC: 0
> Already noticed the increase from 25 to 50 events, but on average an event is processed a bit faster in this batch, so the total runtime per job is about 160% of a job from the previous batch.

Yes to the first; not sure about the second. I've not seen anything explicit, but it may be a default (not much point in compressing though, the ROOT files are already compressed).
Joined: 16 Aug 15 | Posts: 966 | Credit: 1,211,816 | RAC: 0
Is there a chance to make it possible to use 2 or more cores? I have tried through the app_info.xml, but it really only uses one, even though the VirtualBox VM has two assigned.
Joined: 16 Aug 15 | Posts: 966 | Credit: 1,211,816 | RAC: 0
I had a very long delay between jobs: one hour between the end of event 50 and event 1 of the next job, where about 20 minutes is normal. Which log might contain any info on that? None of the console windows showed anything unusual.
Joined: 20 Jan 15 | Posts: 1139 | Credit: 8,310,612 | RAC: 0
> I had a very long delay between jobs.

There's a glide-in log, isn't there? I can't run jobs at home, due to my poor ADSL, so I have to rely on memory. Check to see if it was within one glide-in instance (which calls several successive jobs) or between glide-ins. The logs I get back are some combination of Condor and application logs, on a per-job basis, so I don't think they have much inter-job information. We seem to be running smoothly at the moment, so it might have been a network blockage somewhere.
Joined: 16 Aug 15 | Posts: 966 | Credit: 1,211,816 | RAC: 0
The glidein download started 6 minutes before event 1 of the following run, which would explain 6 of the extra ~40 minutes it took to start the next job. If there was a network outage of some kind, wouldn't the retries be logged?
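One way to localise such a gap is simply to scan a log for long pauses between timestamped lines. A minimal sketch, assuming the log lines start with a timestamp like the one quoted later in this thread (2015-10-27 21:46:47); the file name, threshold, and timestamp pattern are placeholders to adjust to whatever your glide-in or master.log actually uses:

```python
import re
from datetime import datetime

LOGFILE = "master.log"   # hypothetical log file name
GAP_MINUTES = 10         # report any silence longer than this
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

prev = None
with open(LOGFILE, errors="replace") as f:
    for line in f:
        m = TS.match(line)
        if not m:
            continue  # skip lines without a leading timestamp
        t = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        if prev and (t - prev).total_seconds() > GAP_MINUTES * 60:
            print(f"Gap of {t - prev} before: {line.rstrip()}")
        prev = t
```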
Joined: 20 Jan 15 | Posts: 1139 | Credit: 8,310,612 | RAC: 0
> If there was a network outage of some kind, wouldn't the retries be logged?

No idea, really, I'm not a network expert... but I have seen connections hang for some time and come back up when the physical link is restored. As I recall, Linux is (was) far more resilient than Windows in this.
Joined: 16 Aug 15 | Posts: 966 | Credit: 1,211,816 | RAC: 0
The delay happened before 17:12. If it helps, look at the bold section (or show it to someone). I will keep an eye on it to see if this happens again. It is not a serious issue, as it keeps running, but if you (or someone) has the time... Run-1 master.log
Joined: 20 Jan 15 | Posts: 1139 | Credit: 8,310,612 | RAC: 0
OK, I've submitted another 1,000 jobs, twice as large again (i.e. 100 events per job).
Joined: 20 Jan 15 | Posts: 1139 | Credit: 8,310,612 | RAC: 0
> OK, I've submitted another 1,000 jobs, twice as large again (i.e. 100 events per job).

Looks like ~128 MB per job in <= 2 hours for most machines. Does that sound right?
Joined: 20 Mar 15 | Posts: 243 | Credit: 886,442 | RAC: 0
> OK, I've submitted another 1,000 jobs, twice as large again (i.e. 100 events per job).

Only done one so far (74). I think 136 MB and 2h 50min wall time from Dashboard, but it took nearly an hour to get the job. Don't know how to find out how long the upload actually took, but my speed is normally ca. 900k, so say 2.5 mins.

Edit: The following job (20) has already failed "stageout 60311" twice, so it will be interesting to see if it works here... but I'll be asleep by then.
Joined: 20 Mar 15 | Posts: 243 | Credit: 886,442 | RAC: 0
> OK, I've submitted another 1,000 jobs, twice as large again (i.e. 100 events per job).

Job 20 finished here OK. Each try is shown on Dashboard with a different IP, presumably the public IP of the machine on which it ran.
Joined: 13 Feb 15 | Posts: 1221 | Credit: 920,540 | RAC: 1,548
> OK, I've submitted another 1,000 jobs, twice as large again (i.e. 100 events per job).

The first job in a newly created VM always takes a bit longer: on my i7 (not the fastest) it took 3 hours and 17 minutes. At the completion of the job, 129.92 MB was uploaded within 75 seconds.
Joined: 20 Jan 15 | Posts: 1139 | Credit: 8,310,612 | RAC: 0
We are getting a lot of stage-out errors. The precise reason isn't evident in the logs to my not-very-practised eye. There were some network problems at CERN last night, but it seems the higher error rate is continuing.
Joined: 20 Jan 15 | Posts: 1139 | Credit: 8,310,612 | RAC: 0
> We are getting a lot of stage-out errors. The precise reason isn't evident in the logs to my not-very-practised eye. There were some network problems at CERN last night, but it seems the higher error rate is continuing.

I think it's obvious we've hit a limit at this point, even before any careful analysis (Laurence has a new student to look at things like this; perhaps he can do the actual culling of statistics, they are all there in Dashboard). Next batch I'll step back to 75 ev/job to fill in an intermediate point -- the apparent step-up in errors going from 50 to 100 events was a bit unexpected.
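Whoever ends up doing that culling, the tally itself is simple. A minimal sketch, assuming the Dashboard data can be exported to a CSV with columns like "events_per_job" and "status" -- the file name and column names are assumptions for illustration, not the actual Dashboard export format:

```python
import csv
from collections import defaultdict

# Hypothetical per-job CSV export; adjust names to the real export.
totals = defaultdict(int)
failures = defaultdict(int)

with open("dashboard_jobs.csv", newline="") as f:
    for row in csv.DictReader(f):
        batch = row["events_per_job"]
        totals[batch] += 1
        if row["status"].lower() != "success":
            failures[batch] += 1

# Failure fraction per events-per-job setting (50, 75, 100, ...).
for batch in sorted(totals, key=int):
    rate = failures[batch] / totals[batch]
    print(f"{batch} events/job: {failures[batch]}/{totals[batch]} failed "
          f"({rate:.1%})")
```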
Joined: 21 Sep 15 | Posts: 89 | Credit: 383,017 | RAC: 0
My task 68516, work unit 63658, took 3 days to complete - it spent most of its time in the "Waiting to run" state, in spite of CMS-Dev being my highest-priority BOINC project. Looking at the log, it is mostly:

2015-10-27 21:46:47 (608): Status Report: virtualbox.exe/vboxheadless.exe is no longer running.

Looks like when this happens the task attempts a retry MANY hours after the failure; I don't understand VBox at all (can't get it to run on my Mac, where I do know what I'm doing, unlike Windows...). It would be nice if I were either notified that something was wrong so it could be fixed on my end, or if the task would just abort with an error, rather than keeping that CPU from running CMS indefinitely (until reboot?). That machine (Phoenix) was overheating with a GPU problem, fixed yesterday, or it might still be cycling the same task...