Message boards : Number crunching : issue of the day
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 0
Message 1519 - Posted: 24 Dec 2015, 13:59:43 UTC - in response to Message 1518.  


I have actually dug the user and machine names from the end of the log-file (the line that says "FINISHING on user-machine-pid with status X") but for each "interesting" one I have to use my BOINC admin account to find what IP that machine last used, to report to the crew chasing down stage-out problems.


Is it still a forlorn hope that we might gain access to "our" log files?
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 270
Message 1520 - Posted: 26 Dec 2015, 7:01:26 UTC - in response to Message 1519.  
Last modified: 26 Dec 2015, 7:04:31 UTC

Is it still a forlorn hope that we might gain access to "our" log files?

No. Keep bugging me about it from time to time. We have a CMS UK meeting in three weeks' time (14-15/1), I can discuss it with Andrew then if I remember...

Season's Greetings!
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 0
Message 1521 - Posted: 26 Dec 2015, 23:38:07 UTC - in response to Message 1520.  
Last modified: 26 Dec 2015, 23:47:05 UTC

Is it still a forlorn hope that we might gain access to "our" log files?

No. Keep bugging me about it from time to time. We have a CMS UK meeting in three weeks' time (14-15/1), I can discuss it with Andrew then if I remember...

OK, thanks. I'll try to remember to ask you about VM job "saving", too. I've just watched one job lose 358 events and another lose 160. T4T (the older one) manages to do this flawlessly, as far as I can tell; so, at least in principle, it is doable.
I don't know how close I'm getting to not completing the 500 events in the 6 hours or so that the machines run in one session. If that happens, I'll give up.

Season's Greetings!

And, rather belatedly, Greetings to you and to all the other CMS participants.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1522 - Posted: 27 Dec 2015, 11:54:23 UTC
Last modified: 27 Dec 2015, 11:55:22 UTC

I looked at the IP addresses for the first 20 or so successful jobs that passed on the 3rd attempt and noticed that, in nearly all of them, one particular IP address keeps coming up in a failed attempt.
That seems strange.
In addition, 2 other IP addresses show up more than once.

I guess that is already known.

I also checked whether a job I processed actually has my IP on Dashboard.
In about 30% of cases the IP address is different from mine, and there seems to be no correlation with the attempt number (1st, 2nd or 3rd).

Dashboard is not all that accurate, but it still indicates to me that the majority of failed attempts are caused by very few hosts or their internet connections.

If a host (volunteer) is unaware of the outcome of its results, it will keep crunching and maybe producing "bad" results.

The larger the project, the bigger the problem.

I consider it essential that the volunteer gets some quick (possibly immediate) feedback about the quality (pass/fail) of the results he produces.
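
The tally described above can be sketched in a few lines of Python. This is only an illustration: the record layout (job id, IP address, outcome) and the addresses are made up, not taken from Dashboard, and the reported IPs are themselves not fully reliable.

```python
from collections import Counter


def failing_ip_counts(attempts):
    """Count, per IP, the number of distinct jobs with at least one failed attempt."""
    seen = set()
    for job_id, ip, status in attempts:
        if status == "failed":
            seen.add((job_id, ip))  # a set, so repeated failures on one job count once
    return Counter(ip for _, ip in seen)


# Made-up records: (job id, IP address, outcome of that attempt)
attempts = [
    (1, "192.0.2.10", "failed"),
    (1, "192.0.2.10", "failed"),
    (1, "198.51.100.7", "finished"),
    (2, "192.0.2.10", "failed"),
    (2, "203.0.113.5", "failed"),
    (2, "198.51.100.7", "finished"),
    (3, "192.0.2.10", "failed"),
    (3, "198.51.100.9", "finished"),
]

counts = failing_ip_counts(attempts)
for ip, n in counts.most_common():
    print(ip, n)
```

An address that tops this list across many jobs is a candidate for a closer look, nothing more.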
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1523 - Posted: 28 Dec 2015, 19:15:13 UTC

It is clear now that measures need to be taken to stop individual hosts from failing large numbers of jobs.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 270
Message 1524 - Posted: 29 Dec 2015, 0:12:28 UTC - in response to Message 1523.  

It is clear now that measures need to be taken to stop individual hosts from failing large numbers of jobs.

Yes, you are right. I've known for a few days now that one particular user is producing most of the failed jobs. Unfortunately, it's in stage-out where I have little information on the reason for failure. There is analysis underway, but since this was identified on Christmas Eve there hasn't been much chance to identify the root cause. I've let it run until now to gather as many logs as possible.
I've been proposing for a while that a non-zero exit code should terminate a task before its 24 hours is up, both as a way of warning the volunteer that something is wrong and also to limit the number of jobs they can process in a given time. I'd also like to see a task continue until the end of its first job after (say) 24 hours to cut down on "wasted" jobs that get cut short at the current time limit. I'll be pushing both of these ideas in the New Year.
I see we're starting to wind down this batch quite quickly. I'll submit some more soon, but first I have the unenviable task of asking a volunteer to desist.
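
The two proposals above (end the task on a non-zero exit code, and let the job in progress finish once the time limit has passed) can be sketched roughly as follows. This is a toy model, not the actual wrapper logic: `run_task`, `run_job` and the simulated durations are all hypothetical, and time is passed in as numbers rather than read from a clock.

```python
def run_task(jobs, run_job, soft_limit_s=24 * 3600):
    """Run jobs back to back and return (jobs_completed, exit_code).

    run_job(job) must return (exit_code, seconds_used).
    The time limit is checked only between jobs, so a job that is
    already running when the limit passes is allowed to finish.
    """
    elapsed = 0.0
    completed = 0
    for job in jobs:
        if elapsed >= soft_limit_s:
            break  # limit reached: do not start another job
        exit_code, seconds = run_job(job)
        elapsed += seconds
        if exit_code != 0:
            return completed, exit_code  # stop the task early on failure
        completed += 1
    return completed, 0


def ten_hour_job(job):
    """Simulated job: 10 hours each, exit code 1 for the job named 'bad'."""
    return (1 if job == "bad" else 0), 10 * 3600


ok_run = run_task(["a", "b", "c", "d"], ten_hour_job)
bad_run = run_task(["a", "bad", "c"], ten_hour_job)
print(ok_run, bad_run)
```

In the first run the third job starts at hour 20 and is allowed to finish at hour 30 instead of being cut off at hour 24. In the second run the task ends as soon as a job fails.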
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1525 - Posted: 29 Dec 2015, 0:52:31 UTC - in response to Message 1524.  

Thanks for the reply, Ivan.
It is also unfortunate that even jobs that failed after the 3rd attempt might actually be OK.

There was a burst of over 280 failed jobs from another specific IP address.
(server issue?)

Can you stop the server from issuing the same job to the same host (IP address) again? There were a number of jobs that had the same IP address listed for two separate failed attempts.
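
A minimal sketch of that request, assuming the server keeps a record of which hosts have already failed each job. Whether the CMS job server can apply such a filter is not known from this thread. The function and data layout are hypothetical.

```python
def pick_host(job_id, candidates, failed_on):
    """Return the first candidate host that has not already failed this job.

    failed_on maps a job id to the set of hosts that failed it.
    Returns None if every candidate has already failed the job.
    """
    tried = failed_on.get(job_id, set())
    for host in candidates:
        if host not in tried:
            return host
    return None


# Made-up example: host-A has already failed job 42.
failed_on = {42: {"host-A"}}
retry_host = pick_host(42, ["host-A", "host-B"], failed_on)
no_host = pick_host(42, ["host-A"], failed_on)
print(retry_host, no_host)
```

Keying the record on a stable host identifier rather than the IP address would probably be safer, since an address can change between attempts.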

I have been running 96h tasks with no failures (so far).

I'd also like to see a task continue until the end of its first job after (say) 24 hours


Did you mean LAST?
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 61
Message 1526 - Posted: 29 Dec 2015, 8:49:05 UTC

The IPs in the job detail information are unreliable.
Several times the IP shown has not been the real one.
As an example, my current job is not pointing to my IP address in the Netherlands, but to Washington DC.

Ivan, you could write a finish file into BOINC's shared slot directory after a job has finished once, let's say, the VM has more than 18 hours of run time.
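
Roughly, the VM-side half of that idea might look like the sketch below. The file name `finish_marker`, the function and the way the uptime is supplied are all assumptions for illustration. How the wrapper on the BOINC side notices the file and ends the task is not shown and depends on how the wrapper is configured.

```python
import tempfile
from pathlib import Path

MIN_UPTIME_S = 18 * 3600  # "more than 18 hours", as suggested above


def maybe_write_finish_file(shared_dir, vm_uptime_s, job_just_finished,
                            name="finish_marker"):
    """Create an empty marker file once a job ends after the uptime threshold.

    Returns True if the marker was written, False otherwise.
    """
    if not job_just_finished or vm_uptime_s <= MIN_UPTIME_S:
        return False
    (Path(shared_dir) / name).touch()
    return True


# Demo in a temporary directory standing in for the shared slot directory.
with tempfile.TemporaryDirectory() as shared:
    early = maybe_write_finish_file(shared, 17 * 3600, True)
    late = maybe_write_finish_file(shared, 19 * 3600, True)
    marker_exists = (Path(shared) / "finish_marker").exists()
    print(early, late, marker_exists)
```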
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 270
Message 1527 - Posted: 29 Dec 2015, 10:09:04 UTC - in response to Message 1525.  

I'd also like to see a task continue until the end of its first job after (say) 24 hours


Did you mean LAST?

Perhaps I didn't word that as well as I might have. I meant that the tasks could run for 24 hrs as now, but only stop when the then-current job ends.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 270
Message 1528 - Posted: 29 Dec 2015, 10:13:32 UTC - in response to Message 1526.  

The IPs in the job detail information are unreliable.
Several times the IP shown has not been the real one.
Yes, e.g. sometimes a broadband user's IP changes at the whim of his ISP.

As an example, my current job is not pointing to my IP address in the Netherlands, but to Washington DC.
And sometimes there are other reasons.

Ivan, you could write a finish file into BOINC's shared slot directory after a job has finished once, let's say, the VM has more than 18 hours of run time.
That's one way of doing it. Unfortunately I'm not the implementer so I can only make suggestions.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 270
Message 1530 - Posted: 30 Dec 2015, 2:29:58 UTC - in response to Message 1524.  

OK, we've wound down this batch and I'm all out of Roy Orbison on BBC4 -- and beer. So here's a new batch of smaller jobs. We've managed to temporarily remove some problematic hosts, so hopefully the overall failure rate will go down. Guess we'll find out soon!

Thanks everybody.
Profile Tern

Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1532 - Posted: 1 Jan 2016, 0:50:05 UTC

Had a CMS job that had been running for 72 hours on my seldom-checked iMac - appeared "stuck" at 98.946% complete. Aborted it.

User 306
Task 74500
WU 69642
Host 873
Sent 28 Dec 2015, 11:30:34 UTC
Reported 31 Dec 2015, 22:56:12 UTC Aborted by user
Run Time 260,929.39
CPU Time 62,486.51
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1534 - Posted: 1 Jan 2016, 11:19:31 UTC

I have noticed that when a job is interrupted briefly (CMS task suspended for 5 minutes), on resume Dashboard does not show an error for that job, and a new job is started.

Instead, the interrupted job is run by a different computer, without the IP address being updated or a new attempt being issued.

On Dashboard it looks as if the original host completed the task.

This is one reason why a lot of IP addresses are wrong.

About 10% of all jobs started for the first time have initial errors and require a second attempt.

I do not know how long "briefly" is, but if the interruption is too long, the job is actually listed as "failed" and will be re-issued.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1535 - Posted: 1 Jan 2016, 11:19:34 UTC
Last modified: 1 Jan 2016, 11:21:48 UTC

Accidentally sent
Profile Tern

Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1598 - Posted: 13 Jan 2016, 3:08:53 UTC

I have a CMS task that has been running for 30:47 and is showing 50.091% complete. Again, this is on a seldom-used iMac where this has happened before. Should this just be aborted? Lucky I even noticed it, almost didn't check that host today...

User: 306 Host: 873 Task: 75318 Work Unit: 52667

In case you need log files or w/e, I'll leave it running for now.
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 781
Credit: 12,420,934
RAC: 6,834
Message 1599 - Posted: 13 Jan 2016, 3:54:11 UTC - in response to Message 1598.  
Last modified: 13 Jan 2016, 3:56:55 UTC

I have a CMS task that has been running for 30:47 and is showing 50.091% complete. Again, this is on a seldom-used iMac where this has happened before. Should this just be aborted? Lucky I even noticed it, almost didn't check that host today...

User: 306 Host: 873 Task: 75318 Work Unit: 52667

In case you need log files or w/e, I'll leave it running for now.


In your BOINC Manager, click on that running CMS task on the Tasks tab, then click Properties on the left side of the page
and look at the *CPU time at last checkpoint*.
If it is not close to the actual *Elapsed time*, you should just abort the task, since it will never finish or be valid.

Then, before you load a new task, check your VB Manager and see if you need to remove that task (also look at the Virtual Media Manager
in the *File* menu of the VB Manager).
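
The rule of thumb above can be written down as a small check. This is only a sketch: the one-hour tolerance is an arbitrary assumption (the advice says "close", not how close), the function names are hypothetical, and the times below are illustrative values typed in by hand, not read from BOINC.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' string, as shown in the task properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def looks_stalled(elapsed, cpu_at_checkpoint, max_gap_s=3600):
    """True if the last checkpoint lags the elapsed time by more than max_gap_s."""
    return to_seconds(elapsed) - to_seconds(cpu_at_checkpoint) > max_gap_s


stalled = looks_stalled("30:47:00", "8:00:01")   # checkpoint far behind
healthy = looks_stalled("10:00:00", "9:50:00")   # checkpoint 10 minutes behind
print(stalled, healthy)
```

A host that is often suspended or heavily shared may legitimately show a larger gap, so the tolerance would need adjusting per machine.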
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1600 - Posted: 14 Jan 2016, 9:26:12 UTC

Another "runaway".
Fail rate increasing. Code 8001.
Profile Tern

Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1601 - Posted: 14 Jan 2016, 23:06:35 UTC

CPU time was right at 8:00:01. Reset project, downloading now, we'll see what happens!
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 270
Message 1602 - Posted: 15 Jan 2016, 0:01:35 UTC - in response to Message 1600.  
Last modified: 15 Jan 2016, 1:00:00 UTC

Another "runaway".
Fail rate increasing. Code 8001.

I was afraid of that when I saw the red increasing. I've been busy with the CMS UK meeting yesterday (and today). Not too well at the moment either; food poisoning or norovirus? I'll take a look.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1603 - Posted: 15 Jan 2016, 0:35:09 UTC - in response to Message 1602.  

Get well soon.

I thought the sooner it is caught, the less the damage.
Up to that point, the fail rate was well below 1%.


©2024 CERN