Message boards : Number crunching : issue of the day
Author | Message |
---|---|
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 75 |
*Get well, soon.* Thanks, it hasn't recurred. ...yet! *I thought, the sooner it will be caught, the lesser the damage.* Sure, it was looking good. Anyway, PM and e-mail sent. Looks like another corrupt VM, problems loading files from the cvmfs cached (read-only) file system. And so to bed, meeting resumes at 0930 today. |
Send message Joined: 8 Apr 15 Posts: 781 Credit: 12,422,653 RAC: 2,032 |
Good luck, Ivan (yeah, I said that before). It will probably help if you can still get the hosts that left back again in the future. -Samson Mad Scientist For Life |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 75 |
*It will probably help if you can still get the hosts that left back again in the future.* I know. I think we've at least identified, and in most cases "fixed", a lot of the failure modes. But we can't do much about people who won't read their mail. There's a new method of job submission coming Real Soon Now which should reduce my workload considerably, and should lead to "more interesting" work-flows. I've been promised increased attention from the CERN crew, let's see what happens in the next few weeks. |
Send message Joined: 8 Apr 15 Posts: 781 Credit: 12,422,653 RAC: 2,032 |
*It will probably help if you can still get the hosts that left back again in the future.* Sounds good, Ivan, and you know I always check my e-mails and PMs and reply. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Another "runaway". Responsible for the last 6 out of 7 total failures. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 75 |
*Another "runaway".* :-( PM and e-mail sent... |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Is there a way for the server (or vbox wrapper) to error out the CMS tasks of a host if it produces too many errors? |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 15 |
*Is there a way for the server (or vbox wrapper) to error out the CMS tasks of a host if it produces too many errors?* There is a way for BOINC. There is a maximum set in the server's database for how many tasks you can get per day; see your host's application details. For CMS this is still the default of 500 (much too high for this project), which can be modified by the project administrator. This 500 is raised by 1 for every valid task returned and reduced by 1 for every error. For that to work, BOINC has to be aware of the error, and that is not the case here: for BOINC it doesn't matter what's happening inside the VM. Only single-job distribution, with the exit code passed via the wrapper to BOINC, will make it possible to eliminate hosts which only produce errors. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Sounds good. I am just tired of single failing hosts ruining the success/failure ratio. |
Send message Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0 |
*Is there a way for the server (or vbox wrapper) to error out the CMS tasks of a host if it produces too many errors?* Actually, unless they've changed (simplified) it, the bad-host handling is even better than that. It reduces the number of tasks when there are errors VERY quickly, down to one. Then as tasks succeed from that host, the number allowed increases slowly back up to the limit. This was put in place just for situations like this, but of course, as you say, CMS is not using the BOINC methods. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Another "runaway" |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 75 |
*Another "runaway"* PM sent. The one from mid-day seems to have fixed it. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks, Ivan. |
Send message Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0 |
New problem?... Mac OS 10.11.2, BOINC 7.6.22. "Show VM Console" button gives error message "Missing application. Please download and install the CoRD application from http://cord.sourceforge.net". If that's something we need to have, that information needs to be posted with "system requirements" when the user signs up, along with what the heck it is, and why we need it. (Ideally with a link...) Had a runaway CMS task (33 hours with 8 left to go) and was trying the vLHC fix of editing the checkpoint file, part of which requires checking the VM, which of course on a Mac can't be done from VBox itself... Sigh. And no, the checkpoint-edit fix didn't work, tried repeatedly. Finally aborted the task. No biggie at this point! |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 75 |
Beyond my ken. I've yet to work up the courage to ask The Professor to buy me a Mac to investigate problems like these. :-) He does keep me well-supplied with duplex dual-10-core Xeons, though! |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 75 |
OK, transient overnight problem seems to have evaporated. New batch of 250-event Minimum Bias simulation jobs submitted. Hopefully that'll mean I can get at other stuff now... |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Is this project now purely CMS-dev, again? The CMS-app has disappeared from the vLHC site without any explanation. |
Send message Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 975 |
Laurence did post on the vLHC forum that he had disabled it there. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Yes, I found it. Not much of an explanation. How are the users supposed to know what is going on, whether it will be resumed, whether they should abort the CMS task, etc.? This should be at the top of the main page, not buried in some thread somewhere. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 15 |
The numbers of running and successful jobs are falling, probably because BOINC is ending vLHC-CMS tasks as they reach their 24-hour runtime limit. However, it may also be because of the issue I have. This morning I fetched a new task and a new VM was created and booted. That was all! Only a boot.log was created. No boinc user jobs, only user root. With top I sometimes see CERN's virtual machine file system process, cvmfs2, so it seems the VM tries to make a connection. My internet is OK, I can post here ;) . I let it run for 1.5 hours. Nothing happened. Last lines of boot.log: Sun Jan 31 09:04:10 2016: cms.cern.ch: Activating Fuse module Sun Jan 31 09:04:12 2016: grid.cern.ch: Restoring inode tracker... done Sun Jan 31 09:04:12 2016: grid.cern.ch: Restoring chunk tables... done Sun Jan 31 09:04:12 2016: grid.cern.ch: Restoring inode generation... done Sun Jan 31 09:04:12 2016: grid.cern.ch: Restoring open files counter... done Sun Jan 31 09:04:12 2016: grid.cern.ch: Releasing saved glue buffer Sun Jan 31 09:04:12 2016: grid.cern.ch: Releasing chunk tables Sun Jan 31 09:04:12 2016: grid.cern.ch: Releasing saved inode generation info Sun Jan 31 09:04:12 2016: grid.cern.ch: Releasing open files counter Sun Jan 31 09:04:12 2016: grid.cern.ch: Activating Fuse module |
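
For anyone who wants to check a stalled VM the same way, here is a minimal Python sketch of reading boot.log entries like the ones quoted above. It assumes every line has the shape `<timestamp>: <repository>: <message>` seen in the excerpt, that entries are in chronological order, and that day and month names are in English. The function names are made up for illustration; none of this is part of BOINC, the vbox wrapper, or CernVM.

```python
from datetime import datetime

# Timestamp layout seen in the excerpt, e.g. "Sun Jan 31 09:04:10 2016".
# Parsing the day and month names assumes an English (C) locale.
LOG_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_boot_log_line(line):
    """Split '<timestamp>: <repository>: <message>' into its three parts."""
    stamp, repository, message = line.strip().split(": ", 2)
    return datetime.strptime(stamp, LOG_TIME_FORMAT), repository, message


def last_entry_per_repository(lines):
    """Map each repository to its final (time, message) pair in the log."""
    latest = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        when, repository, message = parse_boot_log_line(line)
        latest[repository] = (when, message)  # later lines overwrite earlier ones
    return latest


def silence_since_last_entry(lines, now):
    """Return how long the log has been quiet, measured up to `now`."""
    entries = last_entry_per_repository(lines).values()
    newest = max(when for when, _ in entries)
    return now - newest


# Three lines copied from the excerpt above.
SAMPLE = [
    "Sun Jan 31 09:04:10 2016: cms.cern.ch: Activating Fuse module",
    "Sun Jan 31 09:04:12 2016: grid.cern.ch: Restoring inode tracker... done",
    "Sun Jan 31 09:04:12 2016: grid.cern.ch: Activating Fuse module",
]

if __name__ == "__main__":
    for repo, (when, message) in last_entry_per_repository(SAMPLE).items():
        print(repo, when.isoformat(), message)
    print("quiet for", silence_since_last_entry(SAMPLE, datetime.now()))
```

If both repositories end on "Activating Fuse module" and the quiet period keeps growing, that matches the symptom described: the file systems mount, but no job ever starts.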
©2024 CERN