Message boards : Number crunching : issue of the day
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,718
RAC: 266
Message 1604 - Posted: 15 Jan 2016, 0:57:39 UTC - in response to Message 1603.  
Last modified: 15 Jan 2016, 0:58:22 UTC

Get well, soon.
Thanks, it hasn't recurred. ...yet!

I thought the sooner it is caught, the less the damage.
Up to that point, the fail rate was well below 1%.

Sure, it was looking good. Anyway, PM and e-mail sent. Looks like another corrupt VM, problems loading files from the cvmfs cached (read-only) file system.
And so to bed, meeting resumes at 0930 today.
ID: 1604 · Rating: 0
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 751
Credit: 11,610,299
RAC: 1,436
Message 1605 - Posted: 15 Jan 2016, 20:41:01 UTC

Good luck Ivan (yeah I said that before)

It will probably help if you can still get the hosts that left back again in the future.

-Samson
Mad Scientist For Life
ID: 1605 · Rating: 0
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,718
RAC: 266
Message 1606 - Posted: 15 Jan 2016, 21:17:23 UTC - in response to Message 1605.  

It will probably help if you can still get the hosts that left back again in the future.

I know. I think we've at least identified, and in most cases "fixed", a lot of the failure modes. But we can't do much about people who won't read their mail.
There's a new method of job submission coming Real Soon Now which should reduce my workload considerably, and should lead to "more interesting" work-flows. I've been promised increased attention from the CERN crew, let's see what happens in the next few weeks.
ID: 1606 · Rating: 0
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 751
Credit: 11,610,299
RAC: 1,436
Message 1607 - Posted: 16 Jan 2016, 7:48:15 UTC - in response to Message 1606.  

It will probably help if you can still get the hosts that left back again in the future.

I know. I think we've at least identified, and in most cases "fixed", a lot of the failure modes. But we can't do much about people who won't read their mail.
There's a new method of job submission coming Real Soon Now which should reduce my workload considerably, and should lead to "more interesting" work-flows. I've been promised increased attention from the CERN crew, let's see what happens in the next few weeks.



Sounds good, Ivan, and you know I always check my emails and PMs and reply.
ID: 1607 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1648 - Posted: 27 Jan 2016, 12:24:04 UTC

Another "runaway".
Responsible for the last 6 out of 7 total failures.
ID: 1648 · Rating: 0
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,718
RAC: 266
Message 1649 - Posted: 27 Jan 2016, 13:11:09 UTC - in response to Message 1648.  

Another "runaway".
Responsible for the last 6 out of 7 total failures.

:-( PM and e-mail sent...
ID: 1649 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1650 - Posted: 27 Jan 2016, 13:23:32 UTC - in response to Message 1649.  
Last modified: 27 Jan 2016, 13:25:04 UTC

Is there a way for the server (or vbox wrapper) to error out the CMS tasks of a host if it produces too many errors?
ID: 1650 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 238
Message 1651 - Posted: 27 Jan 2016, 14:17:39 UTC - in response to Message 1650.  

Is there a way for the server (or vbox wrapper) to error out the CMS tasks of a host if it produces too many errors?

There is a way for BOINC.

There is a maximum set in the server's database for how many tasks you can get per day (see your host's application details).
For CMS this is still the default of 500 (much too high for this project); the project administrator can modify it.
This 500 is raised by 1 for every valid task returned and reduced by 1 for every error.
But BOINC has to be aware of an error for that to work, and that is not the case here. For BOINC it doesn't matter what's happening inside the VM.
Only single-job distribution, with the exit code passed via the wrapper to BOINC, will make it possible to eliminate hosts which only produce errors.
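
A minimal sketch of the quota rule as described in this post, to show why a default of 500 barely reacts to a failing host. The function name, the outcome labels and the floor of 1 are assumptions made for illustration; this is not BOINC's server code, and it only counts outcomes that BOINC actually gets to see.

```python
def update_quota(quota, outcomes, floor=1):
    """Apply the +1 / -1 rule from the post to a host's daily task quota.

    `outcomes` is a sequence of "valid" or "error" strings, one per task
    result reported to BOINC. The floor of 1 is an assumption, so that a
    host can always fetch at least one task.
    """
    for outcome in outcomes:
        if outcome == "valid":
            quota += 1
        elif outcome == "error":
            quota = max(floor, quota - 1)
    return quota


# Six errors and one valid task, starting from the default of 500.
print(update_quota(500, ["error"] * 6 + ["valid"]))
```

With a step of 1 in each direction, a host would need hundreds of reported errors before the quota limits it, and errors that stay inside the VM never reach this counter at all.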
ID: 1651 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1652 - Posted: 27 Jan 2016, 14:48:54 UTC - in response to Message 1651.  

Sounds good.
I am just tired of single failing hosts ruining the success/failure ratio.
ID: 1652 · Rating: 0
Profile Tern

Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1661 - Posted: 27 Jan 2016, 17:05:26 UTC - in response to Message 1651.  

Is there a way for the server (or vbox wrapper) to error out the CMS tasks of a host if it produces too many errors?

There is a way for BOINC. <snip> This 500 is raised by 1 for every valid task returned and reduced by 1 for every error.
But BOINC has to be aware of an error for that to work, and that is not the case here. For BOINC it doesn't matter what's happening inside the VM.
Only single-job distribution, with the exit code passed via the wrapper to BOINC, will make it possible to eliminate hosts which only produce errors.


Actually, unless they've changed (simplified) it, the bad-host handling is even better than that. It reduces the number of tasks when there are errors VERY quickly, down to one. Then as tasks succeed from that host, the number allowed increases slowly back up to the limit. This was put in place just for situations like this, but of course, as you say, CMS is not using the BOINC methods.
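
A rough sketch of the "drop fast, recover slowly" behaviour described here. The post does not give the exact rule, so halving on each error and adding 1 on each success are assumptions chosen only to illustrate the shape; the function names and the limit of 500 are likewise hypothetical, not BOINC's implementation.

```python
def adjust_limit(limit, succeeded, max_limit=500):
    """Return the new per-day task limit after one reported result.

    Assumed rule (not taken from BOINC): halve on error, never below 1;
    add 1 on success, never above max_limit.
    """
    if succeeded:
        return min(max_limit, limit + 1)
    return max(1, limit // 2)


def replay(outcomes, start=500, max_limit=500):
    """Apply adjust_limit over a sequence of True/False results."""
    limit = start
    for ok in outcomes:
        limit = adjust_limit(limit, ok, max_limit)
    return limit


# Nine consecutive errors, then three successes.
print(replay([False] * 9))
print(replay([False] * 9 + [True] * 3))
```

Under these assumed numbers a persistently failing host is throttled to one task per day after a handful of errors, while a recovered host climbs back only one task at a time.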
ID: 1661 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1662 - Posted: 27 Jan 2016, 17:50:08 UTC

Another "runaway"
ID: 1662 · Rating: 0
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,718
RAC: 266
Message 1664 - Posted: 27 Jan 2016, 19:55:14 UTC - in response to Message 1662.  

Another "runaway"

PM sent. The one from mid-day seems to have fixed it.
ID: 1664 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1665 - Posted: 27 Jan 2016, 20:20:07 UTC

Thanks, Ivan.
ID: 1665 · Rating: 0
Profile Tern

Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1735 - Posted: 30 Jan 2016, 11:13:56 UTC

New problem?... Mac OS 10.11.2, BOINC 7.6.22. "Show VM Console" button gives error message "Missing application. Please download and install the CoRD application from http://cord.sourceforge.net".

If that's something we need to have, that information needs to be posted with "system requirements" when the user signs up, along with what the heck it is, and why we need it. (Ideally with a link...)

Had a runaway CMS task (33 hours with 8 left to go) and was trying the vLHC fix of editing the checkpoint file, part of which requires checking the VM, which of course on a Mac can't be done from VBox itself... Sigh. And no, the checkpoint-edit fix didn't work, tried repeatedly. Finally aborted the task. No biggie at this point!
ID: 1735 · Rating: 0
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,718
RAC: 266
Message 1737 - Posted: 30 Jan 2016, 12:33:11 UTC - in response to Message 1735.  

Beyond my ken. I've yet to work up the courage to ask The Professor to buy me a Mac to investigate problems like these. :-) He does keep me well-supplied with duplex dual-10-core Xeons, though!
ID: 1737 · Rating: 0
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,718
RAC: 266
Message 1738 - Posted: 30 Jan 2016, 12:35:07 UTC

OK, transient overnight problem seems to have evaporated. New batch of 250-event Minimum Bias simulation jobs submitted. Hopefully that'll mean I can get at other stuff now...
ID: 1738 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1739 - Posted: 30 Jan 2016, 14:26:58 UTC

Is this project now purely CMS-dev, again?
The CMS-app has disappeared from the vLHC site without any explanation.
ID: 1739 · Rating: 0
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,633,940
RAC: 15,831
Message 1740 - Posted: 30 Jan 2016, 14:31:17 UTC - in response to Message 1739.  

Laurence did post on the vLHC forum that he had disabled it there.
ID: 1740 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1741 - Posted: 30 Jan 2016, 14:39:35 UTC - in response to Message 1740.  
Last modified: 30 Jan 2016, 14:45:07 UTC

Yes, I found it.
Not much of an explanation.

How are users supposed to know what is going on, whether it will be resumed, whether they should abort the CMS task, etc.?

This should be at the top of the main page, not buried in some thread somewhere.
ID: 1741 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 238
Message 1751 - Posted: 31 Jan 2016, 8:20:57 UTC

The numbers of running and successful jobs are falling, probably because BOINC is ending vLHC-CMS tasks as they reach their 24-hour runtime limit.

However maybe also because of the issue I have.
This morning I fetched a new task and a new VM was created and booted.
That was all! Only a boot.log is created. No boinc user jobs, only user root.
With top I sometimes see CERN's virtual machine file system process - cvmfs2 - so it seems the VM tries to make a connection.

My internet is OK, I can post here ;). I let it run for 1.5 hours. Nothing happened.

Last lines of boot.log:
Sun Jan 31 09:04:10 2016: cms.cern.ch: Activating Fuse module
Sun Jan 31 09:04:12 2016: grid.cern.ch: Restoring inode tracker...  done
Sun Jan 31 09:04:12 2016: grid.cern.ch: Restoring chunk tables...  done
Sun Jan 31 09:04:12 2016: grid.cern.ch: Restoring inode generation...  done
Sun Jan 31 09:04:12 2016: grid.cern.ch: Restoring open files counter...  done
Sun Jan 31 09:04:12 2016: grid.cern.ch: Releasing saved glue buffer
Sun Jan 31 09:04:12 2016: grid.cern.ch: Releasing chunk tables
Sun Jan 31 09:04:12 2016: grid.cern.ch: Releasing saved inode generation info
Sun Jan 31 09:04:12 2016: grid.cern.ch: Releasing open files counter
Sun Jan 31 09:04:12 2016: grid.cern.ch: Activating Fuse module
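
For anyone wanting to check for this kind of stall without watching the console, here is a small sketch that reads the timestamp of the last boot.log line and compares it with the current time. It assumes the line format shown above; the function names and the 30-minute threshold are hypothetical, and a quiet boot.log is only a hint, not proof that the VM is stuck.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as "Sun Jan 31 09:04:12 2016: grid.cern.ch: Activating Fuse module".
LINE_RE = re.compile(r"^(\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}): (.*)$")


def last_activity(lines):
    """Return (timestamp, message) for the last parseable line, or None."""
    for line in reversed(lines):
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
            return stamp, match.group(2)
    return None


def looks_stalled(lines, now, threshold=timedelta(minutes=30)):
    """True if the newest log entry is older than the (assumed) threshold."""
    last = last_activity(lines)
    return last is not None and now - last[0] > threshold


sample = [
    "Sun Jan 31 09:04:10 2016: cms.cern.ch: Activating Fuse module",
    "Sun Jan 31 09:04:12 2016: grid.cern.ch: Activating Fuse module",
]
# 1.5 hours after the last entry, as in the post.
print(looks_stalled(sample, datetime(2016, 1, 31, 10, 34, 12)))
```

In practice the lines would come from reading boot.log inside the VM, and `now` from `datetime.now()`.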
ID: 1751 · Rating: 0
©2024 CERN