issue of the day
Author | Message |
---|---|
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
Is it still a forlorn hope that we might gain access to "our" log files? |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 270 |
Is it still a forlorn hope that we might gain access to "our" log files? No. Keep bugging me about it from time to time. We have a CMS UK meeting in three weeks' time (14-15/1), I can discuss it with Andrew then if I remember... Season's Greetings! |
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
Is it still a forlorn hope that we might gain access to "our" log files? OK, thanks. I'll try to remember to ask you about VM job "saving", too. I've just watched one job lose 358 events and another lose 160. T4T (the older one) manages to do this flawlessly, as far as I can tell; so, at least in principle, it is doable. I don't know how close I'm getting to not completing the 500 events in the 6 hours or so that the machines run in one session; if that happens, I'll give up. Season's Greetings! And, rather belatedly, Greetings to you and to all the other CMS participants. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I looked at the IP addresses for the first 20 or so successful jobs that passed on the 3rd attempt and noticed that, in nearly all of them, one particular IP address keeps coming up in a failed attempt. That seems strange. In addition, 2 other IP addresses show up more than once. I guess that is already known. I also checked whether a job I processed actually has my IP on Dashboard. In about 30% of cases, the IP address is different from mine, and there seems to be no correlation with the attempt number (1st, 2nd or 3rd). Dashboard is not all that accurate, but it still indicates to me that the majority of failed attempts are caused by very few hosts or their internet connections. If a host (volunteer) is unaware of the outcome of its results, it will keep crunching and maybe producing "bad" results. The larger the project, the bigger the problem. I consider it essential that the volunteer gets some quick (possibly immediate) feedback about the quality of the results (pass/fail) he produces. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
It is now clear that measures need to be taken to stop individual hosts from failing large numbers of jobs. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 270 |
It is now clear that measures need to be taken to stop individual hosts from failing large numbers of jobs. Yes, you are right. I've known for a few days now that one particular user is producing most of the failed jobs. Unfortunately, it's in stage-out where I have little information on the reason for failure. There is analysis underway, but since this was identified on Christmas Eve there hasn't been much chance to identify the root cause. I've let it run until now to gather as many logs as possible. I've been proposing for a while that a non-zero exit code should terminate a task before its 24 hours is up, both as a way of warning the volunteer that something is wrong and also to limit the number of jobs they can process in a given time. I'd also like to see a task continue until the end of its first job after (say) 24 hours to cut down on "wasted" jobs that get cut short at the current time limit. I'll be pushing both of these ideas in the New Year. I see we're starting to wind down this batch quite quickly. I'll submit some more soon, but first I have the unenviable task of asking a volunteer to desist. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks for the reply, Ivan. It is also unfortunate that even jobs that failed after the 3rd attempt might actually be OK. There was a burst of over 280 failed jobs from another specific IP address (server issue?). Can you stop the server from issuing the same job to the same host (IP address) again? There were a number of jobs that had the same IP address listed for two separate failed attempts. I have been running 96h tasks with no failures (so far). I'd also like to see a task continue until the end of its first job after (say) 24 hours Did you mean LAST? |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 61 |
The IPs from the job detail information are unreliable. Several times, the IP shown was not the real one. As an example, my current job is not pointing to my IP address in the Netherlands, but to Washington DC. Ivan, you could write a finish file into BOINC's shared slot directory after a job has finished, let's say after the VM has more than 18 hours of run time. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 270 |
I'd also like to see a task continue until the end of its first job after (say) 24 hours Perhaps didn't word that as well as I might have. I meant that the tasks could run for 24 hrs as now, but only stop when the then-current job ends. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 270 |
The IPs from the job detail information are unreliable. Yes, e.g. sometimes a broadband user's IP changes at the whim of his ISP. As an example, my current job is not pointing to my IP address in the Netherlands, but to Washington DC. And sometimes there are other reasons. Ivan, you could write a finish file into BOINC's shared slot directory after a job has finished, let's say after the VM has more than 18 hours of run time. That's one way of doing it. Unfortunately I'm not the implementer so I can only make suggestions. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 270 |
OK, we've wound down this batch and I'm all out of Roy Orbison on BBC4 -- and beer. So here's a new batch of smaller jobs. We've managed to temporarily remove some problematic hosts, so hopefully the overall failure rate will go down. Guess we'll find out soon! Thanks everybody. |
Send message Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0 |
Had a CMS job that had been running for 72 hours on my seldom-checked iMac - appeared "stuck" at 98.946% complete. Aborted it. User: 306; Task: 74500; WU: 69642; Host: 873; Sent: 28 Dec 2015, 11:30:34 UTC; Reported: 31 Dec 2015, 22:56:12 UTC; Aborted by user; Run Time: 260,929.39; CPU Time: 62,486.51 |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I have noticed that when a job is interrupted briefly (CMS task suspended for 5 minutes), Dashboard does not show an error for that job, and on resume a new job is started. Instead, the interrupted job is run by a different computer, without the IP address being updated or a new attempt being issued. On Dashboard it looks as if the original host completed the job. This is one reason why a lot of IP addresses are wrong. About 10% of all jobs started for the first time have initial errors and require a second attempt. I do not know how long "briefly" is, but if the interruption is too long, the job is actually listed as "failed" and will be re-issued. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Accidentally sent |
Send message Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0 |
I have a CMS task that has been running for 30:47 and is showing 50.091% complete. Again, this is on a seldom-used iMac where this has happened before. Should this just be aborted? Lucky I even noticed it, almost didn't check that host today... User: 306 Host: 873 Task: 75318 Work Unit: 52667 In case you need log files or w/e, I'll leave it running for now. |
Send message Joined: 8 Apr 15 Posts: 781 Credit: 12,420,934 RAC: 6,834 |
I have a CMS task that has been running for 30:47 and is showing 50.091% complete. Again, this is on a seldom-used iMac where this has happened before. Should this just be aborted? Lucky I even noticed it, almost didn't check that host today... In your BOINC Manager, click on the running CMS task in the Tasks tab, then open Properties on the left side of the page and look at the *CPU time at last checkpoint*. If it is not close to the actual *Elapsed time*, you should just abort the task, since it will never finish or be Valid. Then, before you load a new task, check your VB Manager and see if you need to remove that task (also look at the Virtual Media Manager in the *File* tab of the VB Manager). |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Another "runaway". Fail rate increasing. Code 8001. |
Send message Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0 |
CPU time was right at 8:00:01. Reset project, downloading now, we'll see what happens! |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 270 |
Another "runaway". I was afraid of that when I saw the red increasing. Been busy with CMS UK meeting |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Get well soon. I thought that the sooner it is caught, the less the damage. Up to that point, the fail rate was well below 1%. |
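
For anyone following the task-length discussion above, here is a minimal Python sketch of the two ideas Ivan describes: a non-zero exit code ends the task early, and a task past its 24 hours still finishes the job it is running but starts no new one. This is only an illustration of the policy, not the project's actual wrapper code. The names `should_start_next_job`, `simulate_task` and `SOFT_LIMIT_S` are hypothetical, and the job durations in the example are made up.

```python
from typing import Iterable, Optional, Tuple

SOFT_LIMIT_S = 24 * 3600  # the 24-hour task limit discussed in the thread


def should_start_next_job(elapsed_s: float, last_exit_code: Optional[int],
                          soft_limit_s: float = SOFT_LIMIT_S) -> bool:
    """Decide whether the task may start another job (hypothetical helper)."""
    if last_exit_code is not None and last_exit_code != 0:
        return False  # a failed job ends the task early, which alerts the volunteer
    return elapsed_s < soft_limit_s  # past the limit: start nothing new


def simulate_task(jobs: Iterable[Tuple[float, int]],
                  soft_limit_s: float = SOFT_LIMIT_S) -> Tuple[float, int, str]:
    """Run (duration_s, exit_code) jobs in order.

    Returns (elapsed_s, jobs_ok, stop_reason).
    """
    elapsed_s = 0.0
    jobs_ok = 0
    last_exit_code: Optional[int] = None
    for duration_s, exit_code in jobs:
        if not should_start_next_job(elapsed_s, last_exit_code, soft_limit_s):
            break
        # A started job always runs to its end, even across the soft limit,
        # so no partly processed job is thrown away.
        elapsed_s += duration_s
        last_exit_code = exit_code
        if exit_code == 0:
            jobs_ok += 1
    if last_exit_code not in (None, 0):
        reason = "non-zero exit code"
    elif elapsed_s >= soft_limit_s:
        reason = "soft limit reached after job end"
    else:
        reason = "no more jobs"
    return elapsed_s, jobs_ok, reason


if __name__ == "__main__":
    ten_hours = 10 * 3600
    # Made-up workload: five successful 10-hour jobs queued for one task.
    print(simulate_task([(ten_hours, 0)] * 5))
    # Made-up workload: the second job fails with code 8001.
    print(simulate_task([(ten_hours, 0), (ten_hours, 8001), (ten_hours, 0)]))
```

In the first example the third job starts at 20 hours and is allowed to finish at 30 hours, after which nothing new is started. In the second, the task stops right after the failing job instead of running on to the time limit.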
©2024 CERN