Thread 'issue of the day'

Author	Message
m Volunteer tester Send message Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 1519 - Posted: 24 Dec 2015, 13:59:43 UTC - in response to Message 1518. I have actually dug the user and machine names from the end of the log-file (the line that says "FINISHING on user-machine-pid with status X") but for each "interesting" one I have to use my BOINC admin account to find what IP that machine last used, to report to the crew chasing down stage-out problems. Is it still a forlorn hope that we might gain access to "our" log files? ID: 1519 · Rating: 0 · rate: / Reply Quote
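As an illustration of the lookup described in this post, here is a minimal Python sketch that pulls the identifier and exit status out of the last "FINISHING on ..." line of a log. Only the quoted fragment comes from the thread; the exact line layout, the assumption that the status is an integer, and the function names are hypothetical. User and machine names may themselves contain hyphens, so the sketch keeps them joined and splits off only the trailing numeric pid.

```python
import re

# Assumed line shape, based only on the fragment quoted in the post:
#   "... FINISHING on <user>-<machine>-<pid> with status <X>"
# The status is assumed to be an integer; the real logs may differ.
FINISHING_RE = re.compile(
    r"FINISHING on (?P<ident>\S+)-(?P<pid>\d+) with status (?P<status>-?\d+)"
)


def parse_finishing_line(line):
    """Return ident, pid and status from one log line, or None if it does not match."""
    m = FINISHING_RE.search(line)
    if m is None:
        return None
    return {
        "ident": m.group("ident"),  # "<user>-<machine>", left unsplit on purpose
        "pid": int(m.group("pid")),
        "status": int(m.group("status")),
    }


def last_finishing_record(log_text):
    """Scan a log from the end and return the last FINISHING record found."""
    for line in reversed(log_text.splitlines()):
        record = parse_finishing_line(line)
        if record is not None:
            return record
    return None
```

Mapping the identifier to the IP a machine last used would still need the BOINC admin lookup mentioned in the post; that step is not sketched here.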

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 222	Message 1520 - Posted: 26 Dec 2015, 7:01:26 UTC - in response to Message 1519. Last modified: 26 Dec 2015, 7:04:31 UTC Is it still a forlorn hope that we might gain access to "our" log files? No. Keep bugging me about it from time to time. We have a CMS UK meeting in three weeks' time (14-15/1), I can discuss it with Andrew then if I remember... Season's Greetings! ID: 1520 · Rating: 0 · rate: / Reply Quote

m Volunteer tester Send message Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 1521 - Posted: 26 Dec 2015, 23:38:07 UTC - in response to Message 1520. Last modified: 26 Dec 2015, 23:47:05 UTC Is it still a forlorn hope that we might gain access to "our" log files? No. Keep bugging me about it from time to time. We have a CMS UK meeting in three weeks' time (14-15/1), I can discuss it with Andrew then if I remember... OK, thanks. I'll try to remember to ask you about VM job "saving", too. I've just watched one job lose 358 events and another lose 160. T4T (the older one) manages to do this flawlessly, as far as I can tell; so, at least in principle, it is doable. I don't know how near I'm getting to not completing the 500 events in the 6 hours or so that the machines run in one session; if that happens, I'll give up. Season's Greetings! And, rather belatedly, Greetings to you and to all the other CMS participants. ID: 1521 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1522 - Posted: 27 Dec 2015, 11:54:23 UTC Last modified: 27 Dec 2015, 11:55:22 UTC I looked at the IP addresses for the first 20 or so successful jobs that passed on the 3rd attempt and noticed that, in nearly all of them, one particular IP address keeps coming up in a failed attempt. That seems strange. In addition, 2 other IP addresses show up more than once. I guess that is already known. I also checked whether a job I processed actually has my IP on Dashboard. In about 30% of cases, the IP address is different from mine, and there seems to be no correlation with the attempt number (1st, 2nd or 3rd). Dashboard is not all that accurate, but it still indicates to me that the majority of failed attempts are caused by very few hosts or their internet connections. If a host (volunteer) is unaware of the outcome of its results, it will keep crunching and maybe producing "bad" results. The larger the project, the bigger the problem. I consider it essential that the volunteer has some quick (possibly immediate) feedback about the quality (pass/fail) of the results he produces. ID: 1522 · Rating: 0 · rate: / Reply Quote
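The tally described in this post can be sketched in a few lines of Python. This is only an illustration: the record layout (`job_id`, `attempt`, `ip`, `outcome`) and the function names are hypothetical, since the thread does not say how Dashboard data is exported, and the addresses in the usage example are documentation placeholders. As other posts in this thread point out, the reported IP is not always the real one, so the counts are a hint, not proof.

```python
from collections import Counter


def failures_by_ip(attempts):
    """Count failed attempts per reported IP address."""
    return Counter(a["ip"] for a in attempts if a["outcome"] == "failed")


def failure_share(attempts, top_n=3):
    """Return (ip, failed_count, share_of_all_failures) for the worst top_n IPs."""
    counts = failures_by_ip(attempts)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(ip, n, n / total) for ip, n in counts.most_common(top_n)]


# Hypothetical sample records; 192.0.2.x is a documentation-only range.
sample = [
    {"job_id": 1, "attempt": 1, "ip": "192.0.2.1", "outcome": "failed"},
    {"job_id": 1, "attempt": 2, "ip": "192.0.2.2", "outcome": "failed"},
    {"job_id": 1, "attempt": 3, "ip": "192.0.2.3", "outcome": "ok"},
    {"job_id": 2, "attempt": 1, "ip": "192.0.2.1", "outcome": "failed"},
    {"job_id": 2, "attempt": 2, "ip": "192.0.2.3", "outcome": "ok"},
]

for ip, n, share in failure_share(sample):
    print(f"{ip}: {n} failed attempts ({share:.0%} of all failures)")
```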

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1523 - Posted: 28 Dec 2015, 19:15:13 UTC It is clear now, that measures need to be taken to stop individual hosts from failing large amounts of jobs. ID: 1523 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 222	Message 1524 - Posted: 29 Dec 2015, 0:12:28 UTC - in response to Message 1523. It is clear now, that measures need to be taken to stop individual hosts from failing large amounts of jobs. Yes, you are right. I've known for a few days now that one particular user is producing most of the failed jobs. Unfortunately, it's in stage-out where I have little information on the reason for failure. There is analysis underway, but since this was identified on Christmas Eve there hasn't been much chance to identify the root cause. I've let it run until now to gather as many logs as possible. I've been proposing for a while that a non-zero exit code should terminate a task before its 24 hours is up, both as a way of warning the volunteer that something is wrong and also to limit the number of jobs they can process in a given time. I'd also like to see a task continue until the end of its first job after (say) 24 hours to cut down on "wasted" jobs that get cut short at the current time limit. I'll be pushing both of these ideas in the New Year. I see we're starting to wind down this batch quite quickly. I'll submit some more soon, but first I have the unenviable task of asking a volunteer to desist. ID: 1524 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1525 - Posted: 29 Dec 2015, 0:52:31 UTC - in response to Message 1524. Thanks for the reply, Ivan. It is also unfortunate that even jobs that failed after the 3rd attempt might actually be OK. There was a burst of over 280 failed jobs from another specific IP address (server issue?). Can you stop the server from issuing the same job to the same host (IP address) again? There were a number of jobs that had the same IP address listed for two separate failed attempts. I have been running 96h tasks with no failures (so far). I'd also like to see a task continue until the end of its first job after (say) 24 hours Did you mean LAST? ID: 1525 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,826 RAC: 109	Message 1526 - Posted: 29 Dec 2015, 8:49:05 UTC The IP's from the Job's detail information are unreliable. Several times not the real IP is shown. As an example my current job is not pointing to my IP-address in the Netherlands, but to Washington DC. Ivan, you could write a finish file into BOINC's shared slot directory after a job has finished let's say after the VM has more than 18 hours run time. ID: 1526 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 222	Message 1527 - Posted: 29 Dec 2015, 10:09:04 UTC - in response to Message 1525. I'd also like to see a task continue until the end of its first job after (say) 24 hours Did you mean LAST? Perhaps didn't word that as well as I might have. I meant that the tasks could run for 24 hrs as now, but only stop when the then-current job ends. ID: 1527 · Rating: 0 · rate: / Reply Quote
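The two proposals discussed here (end a task early on a non-zero job exit code, and let a task past its time limit finish the then-current job before stopping) can be written as a single decision taken each time a job ends. The sketch below only illustrates that policy as stated in the thread; the function name, the return labels and the check-at-job-end design are assumptions, not the project's implementation.

```python
TASK_LIMIT_S = 24 * 3600  # the "(say) 24 hours" from the proposal


def after_job_decision(elapsed_s, exit_code, limit_s=TASK_LIMIT_S):
    """Decide what the task should do when a job has just ended.

    elapsed_s: task run time so far, in seconds
    exit_code: exit code of the job that just ended
    """
    if exit_code != 0:
        # Proposal 1: a failing job ends the task early, which warns the
        # volunteer and limits how many jobs a bad host can process.
        return "stop_error"
    if elapsed_s >= limit_s:
        # Proposal 2: the limit is only acted on between jobs, so the
        # then-current job is never cut short.
        return "stop_ok"
    return "next_job"
```

Because the decision is only taken at job boundaries, a task may overrun the nominal limit by up to one job's duration; that is the trade-off against "wasted" jobs.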

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 222	Message 1528 - Posted: 29 Dec 2015, 10:13:32 UTC - in response to Message 1526. The IP's from the Job's detail information are unreliable. Several times not the real IP is shown. Yes, e.g. sometimes a broadband user's IP changes at the whim of his ISP. As an example my current job is not pointing to my IP-address in the Netherlands, but to Washington DC. And sometimes there are other reasons. Ivan, you could write a finish file into BOINC's shared slot directory after a job has finished let's say after the VM has more than 18 hours run time. That's one way of doing it. Unfortunately I'm not the implementer so I can only make suggestions. ID: 1528 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 222	Message 1530 - Posted: 30 Dec 2015, 2:29:58 UTC - in response to Message 1524. OK, we've wound down this batch and I'm all out of Roy Orbison on BBC4 -- and beer. So here's a new batch of smaller jobs. We've managed to temporarily remove some problematic hosts, so hopefully the overall failure rate will go down. Guess we'll find out soon! Thanks everybody. ID: 1530 · Rating: 0 · rate: / Reply Quote

Tern Send message Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0	Message 1532 - Posted: 1 Jan 2016, 0:50:05 UTC Had a CMS job that had been running for 72 hours on my seldom-checked iMac - appeared "stuck" at 98.946% complete. Aborted it. User: 306; Task: 74500; WU: 69642; Host: 873; Sent: 28 Dec 2015, 11:30:34 UTC; Reported: 31 Dec 2015, 22:56:12 UTC; Aborted by user; Run Time: 260,929.39; CPU Time: 62,486.51 ID: 1532 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1534 - Posted: 1 Jan 2016, 11:19:31 UTC I have noticed that when a job is interrupted briefly (CMS task suspended for 5 minutes), on resume Dashboard does not show an error for that job, and a new job is started. The original job is instead run by a different computer, without updating the IP address or issuing a new attempt. On Dashboard it looks as if the original host completed the task. This is one reason why a lot of IP addresses are wrong. About 10% of all jobs that were started for the first time have initial errors and require a second attempt. I do not know how long "briefly" is, but if the interruption is too long, the job is actually listed as "failed" and will be re-issued. ID: 1534 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1535 - Posted: 1 Jan 2016, 11:19:34 UTC Last modified: 1 Jan 2016, 11:21:48 UTC Accidentally sent ID: 1535 · Rating: 0 · rate: / Reply Quote

Tern Send message Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0	Message 1598 - Posted: 13 Jan 2016, 3:08:53 UTC I have a CMS task that has been running for 30:47 and is showing 50.091% complete. Again, this is on a seldom-used iMac where this has happened before. Should this just be aborted? Lucky I even noticed it, almost didn't check that host today... User: 306 Host: 873 Task: 75318 Work Unit: 52667 In case you need log files or w/e, I'll leave it running for now. ID: 1598 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 993 Credit: 17,768,104 RAC: 18,475	Message 1599 - Posted: 13 Jan 2016, 3:54:11 UTC - in response to Message 1598. Last modified: 13 Jan 2016, 3:56:55 UTC I have a CMS task that has been running for 30:47 and is showing 50.091% complete. Again, this is on a seldom-used iMac where this has happened before. Should this just be aborted? Lucky I even noticed it, almost didn't check that host today... User: 306 Host: 873 Task: 75318 Work Unit: 52667 In case you need log files or w/e, I'll leave it running for now. In your BOINC Manager, click on that running CMS task on the Tasks tab, then open Properties on the left side of the page and look at the CPU time at last checkpoint. If it is not close to the actual Elapsed time, you should just abort the task, since it will never finish or be Valid. Then, before you load a new task, check your VB Manager and see if you need to remove that task (also look at the Virtual Media Manager under File in the VB Manager). ID: 1599 · Rating: 0 · rate: / Reply Quote
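The rule of thumb in this reply (compare "CPU time at last checkpoint" with the elapsed time) can be expressed as a small check. This is a rough sketch only: the post does not say how close is "close", so the one-hour threshold below is an arbitrary assumption, and the function name and the numbers in the usage example are hypothetical.

```python
def looks_stalled(elapsed_s, checkpoint_cpu_s, max_gap_s=3600.0):
    """Return True if the last checkpoint lags the elapsed time by more than max_gap_s.

    elapsed_s:        elapsed wall-clock time of the task, in seconds
    checkpoint_cpu_s: "CPU time at last checkpoint" from the task's Properties
    max_gap_s:        assumed tolerance; not a value given in the thread
    """
    return (elapsed_s - checkpoint_cpu_s) > max_gap_s


# Hypothetical readings: 30 h elapsed, last checkpoint at 8 h of CPU time.
print(looks_stalled(30 * 3600, 8 * 3600))    # True
# Hypothetical healthy task: checkpoint only 10 minutes behind.
print(looks_stalled(5 * 3600, 5 * 3600 - 600))  # False
```

A host that is often suspended or heavily loaded will show a gap for innocent reasons, so a flag from this check is a prompt to look at the task, not an automatic reason to abort it.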

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1600 - Posted: 14 Jan 2016, 9:26:12 UTC Another "runaway". Fail rate increasing. Code 8001. ID: 1600 · Rating: 0 · rate: / Reply Quote

Tern Send message Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0	Message 1601 - Posted: 14 Jan 2016, 23:06:35 UTC CPU time was right at 8:00:01. Reset project, downloading now, we'll see what happens! ID: 1601 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 222	Message 1602 - Posted: 15 Jan 2016, 0:01:35 UTC - in response to Message 1600. Last modified: 15 Jan 2016, 1:00:00 UTC Another "runaway". Fail rate increasing. Code 8001. I was afraid of that when I saw the red increasing. Been busy with CMS UK meeting ~~today~~ yesterday (and ~~tomorrow~~ today). Not too well at the moment either; food poisoning or norovirus? I'll take a look. ID: 1602 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1603 - Posted: 15 Jan 2016, 0:35:09 UTC - in response to Message 1602. Get well soon. I thought that the sooner it is caught, the less the damage. Up to that point, the fail rate was well below 1%. ID: 1603 · Rating: 0 · rate: / Reply Quote

Development for LHC@home