Thread 'issue of the day'

Author	Message
ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 245	Message 1604 - Posted: 15 Jan 2016, 0:57:39 UTC - in response to Message 1603. Last modified: 15 Jan 2016, 0:58:22 UTC
Get well, soon. Thanks, it hasn't recurred. ...yet! I thought, the sooner it will be caught, the lesser the damage. Up to that point, the fail rate was well below 1%. Sure, it was looking good. Anyway, PM and e-mail sent. Looks like another corrupt VM, problems loading files from the cvmfs cached (read-only) file system. And so to bed, meeting resumes at 0930 today.

Magic Quantum Mechanic Joined: 8 Apr 15 Posts: 993 Credit: 17,766,434 RAC: 18,670	Message 1605 - Posted: 15 Jan 2016, 20:41:01 UTC
Good luck Ivan (yeah, I said that before). It will probably help if you can still get the hosts that left back again in the future.
-Samson
Mad Scientist For Life

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 245	Message 1606 - Posted: 15 Jan 2016, 21:17:23 UTC - in response to Message 1605.
[quote]It will probably help if you can still get the hosts that left back again in the future.[/quote]
I know. I think we've at least identified, and in most cases "fixed", a lot of the failure modes. But we can't do much about people who won't read their mail. There's a new method of job submission coming Real Soon Now which should reduce my workload considerably, and should lead to "more interesting" work-flows. I've been promised increased attention from the CERN crew, let's see what happens in the next few weeks.

Magic Quantum Mechanic Joined: 8 Apr 15 Posts: 993 Credit: 17,766,434 RAC: 18,670	Message 1607 - Posted: 16 Jan 2016, 7:48:15 UTC - in response to Message 1606.
[quote]
[quote]It will probably help if you can still get the hosts that left back again in the future.[/quote]
I know. I think we've at least identified, and in most cases "fixed", a lot of the failure modes. But we can't do much about people who won't read their mail. There's a new method of job submission coming Real Soon Now which should reduce my workload considerably, and should lead to "more interesting" work-flows. I've been promised increased attention from the CERN crew, let's see what happens in the next few weeks.
[/quote]
Sounds good, Ivan, and you know I always check my emails and PMs and reply.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1648 - Posted: 27 Jan 2016, 12:24:04 UTC
Another "runaway". Responsible for the last 6 out of 7 total failures.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 245	Message 1649 - Posted: 27 Jan 2016, 13:11:09 UTC - in response to Message 1648.
[quote]Another "runaway". Responsible for the last 6 out of 7 total failures.[/quote]
:-( PM and e-mail sent...

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1650 - Posted: 27 Jan 2016, 13:23:32 UTC - in response to Message 1649. Last modified: 27 Jan 2016, 13:25:04 UTC
Is there a way for the server (or vbox wrapper) to error out the CMS tasks of a host if it produces too many errors?

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,826 RAC: 120	Message 1651 - Posted: 27 Jan 2016, 14:17:39 UTC - in response to Message 1650.
[quote]Is there a way for the server(or vbox wrapper) to error out the cms-tasks of a host, if it produces too many errors?[/quote]
There is a way for BOINC. There is a maximum set in the server's database for how many tasks you can get per day (see your host's application details). For CMS this is still the default of 500 (much too high for this project); it can be modified by the project administrator. This 500 is raised by 1 for every valid task returned and reduced by 1 for every error.
But BOINC has to be aware of an error, and that is not the case here: for BOINC it doesn't matter what's happening inside the VM. Only single-job distribution, with the exit code passed via the wrapper to BOINC, will make it possible to eliminate hosts which only produce errors.
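The quota rule described in the post above can be sketched in a few lines of Python. This is only a toy model of the rule as stated there (start at 500, plus 1 per valid task, minus 1 per error). The function names are made up for illustration, the floor of 1 is an assumption, and none of this is taken from the BOINC server code.

```python
def update_quota(quota, outcome, floor=1):
    """Return the new daily task quota after one reported task.

    outcome is "valid" or "error". The floor is an assumption made
    for this sketch; the post does not state a lower bound.
    """
    if outcome == "valid":
        return quota + 1
    if outcome == "error":
        return max(floor, quota - 1)
    raise ValueError(f"unknown outcome: {outcome!r}")


def errors_until_floor(start=500, floor=1):
    """Count the consecutive reported errors needed to reach the floor."""
    quota, errors = start, 0
    while quota > floor:
        quota = update_quota(quota, "error", floor)
        errors += 1
    return errors


if __name__ == "__main__":
    # Starting from the default of 500 mentioned in the post.
    print(errors_until_floor())
```

Under these assumptions a host that does nothing but fail would need hundreds of reported errors before its quota got anywhere near one, and errors that stay inside the VM never reach this counter at all.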

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1652 - Posted: 27 Jan 2016, 14:48:54 UTC - in response to Message 1651.
Sounds good. I am just tired of single failing hosts ruining the success/failure ratio.

Tern Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0	Message 1661 - Posted: 27 Jan 2016, 17:05:26 UTC - in response to Message 1651.
[quote]
[quote]Is there a way for the server(or vbox wrapper) to error out the cms-tasks of a host, if it produces too many errors?[/quote]
There is a way for BOINC. <snip> This 500 is raised with 1 for every valid task returned and reduced with 1 for every error. BOINC should be aware of an error and this is not the case here. For BOINC it doesn't matter what's happening inside the VM. Only single job distribution with exit code transferring via wrapper to BOINC will make it possible to eliminate hosts which only produce errors.
[/quote]
Actually, unless they've changed (simplified) it, the bad-host handling is even better than that. It reduces the number of tasks when there are errors VERY quickly, down to one. Then as tasks succeed from that host, the number allowed increases slowly back up to the limit. This was put in place just for situations like this, but of course, as you say, CMS is not using the BOINC methods.
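The "fast down, slow up" behaviour described in the post above could look roughly like the sketch below. The post only says the quota drops "VERY quickly" to one and climbs "slowly" back to the limit, so the halving on error and the step of 1 on success are both assumptions chosen for illustration, and the names are hypothetical. It is not the actual BOINC scheduler logic.

```python
def next_quota(quota, succeeded, limit=500):
    """One step of an assumed fast-decrease, slow-increase quota rule."""
    if succeeded:
        # Assumed slow recovery: one extra task per success, capped at the limit.
        return min(limit, quota + 1)
    # Assumed fast back-off: halve on every error, never below one.
    return max(1, quota // 2)


def simulate(outcomes, limit=500):
    """Return the quota after each outcome (True = success, False = error)."""
    quota = limit
    history = []
    for succeeded in outcomes:
        quota = next_quota(quota, succeeded, limit)
        history.append(quota)
    return history


if __name__ == "__main__":
    # Eight errors in a row, then three successes.
    print(simulate([False] * 8 + [True] * 3))
```

With these assumed numbers a failing host is throttled to a single task after a handful of errors rather than hundreds, which is the point the post makes about bad-host handling.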

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1662 - Posted: 27 Jan 2016, 17:50:08 UTC
Another "runaway"

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 245	Message 1664 - Posted: 27 Jan 2016, 19:55:14 UTC - in response to Message 1662.
[quote]Another "runaway"[/quote]
PM sent. The one from mid-day seems to have fixed it.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1665 - Posted: 27 Jan 2016, 20:20:07 UTC
Thanks, Ivan.

Tern Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0	Message 1735 - Posted: 30 Jan 2016, 11:13:56 UTC
New problem?... Mac OS 10.11.2, BOINC 7.6.22. "Show VM Console" button gives error message "Missing application. Please download and install the CoRD application from http://cord.sourceforge.net". If that's something we need to have, that information needs to be posted with "system requirements" when the user signs up, along with what the heck it is, and why we need it. (Ideally with a link...)
Had a runaway CMS task (33 hours with 8 left to go) and was trying the vLHC fix of editing the checkpoint file, part of which requires checking the VM, which of course on a Mac can't be done from VBox itself... Sigh. And no, the checkpoint-edit fix didn't work, tried repeatedly. Finally aborted the task. No biggie at this point!

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 245	Message 1737 - Posted: 30 Jan 2016, 12:33:11 UTC - in response to Message 1735.
Beyond my ken. I've yet to work up the courage to ask The Professor to buy me a Mac to investigate problems like these. :-) He does keep me well-supplied with duplex dual-10-core Xeons, though!

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 245	Message 1738 - Posted: 30 Jan 2016, 12:35:07 UTC
OK, transient overnight problem seems to have evaporated. New batch of 250-event Minimum Bias simulation jobs submitted. Hopefully that'll mean I can get at other stuff now...

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1739 - Posted: 30 Jan 2016, 14:26:58 UTC
Is this project now purely CMS-dev, again? The CMS-app has disappeared from the vLHC site without any explanation.

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 1740 - Posted: 30 Jan 2016, 14:31:17 UTC - in response to Message 1739.
Laurence did post on the vLHC forum that he had disabled it there.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1741 - Posted: 30 Jan 2016, 14:39:35 UTC - in response to Message 1740. Last modified: 30 Jan 2016, 14:45:07 UTC
Yes, I found it. Not much of an explanation. How are users supposed to know what is going on, whether it will be resumed, whether they should abort the CMS task, etc.? This should be at the top of the main page, not buried in some thread somewhere.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,826 RAC: 120	Message 1751 - Posted: 31 Jan 2016, 8:20:57 UTC
Numbers of running and successful jobs are falling, probably because BOINC is ending vLHC-CMS tasks as they reach their 24-hour runtime limit. However, it may also be because of the issue I have.
This morning I fetched a new task and a new VM was created and booted. That was all! Only a boot.log is created. No boinc user jobs, only user root. With top I sometimes see CERN's virtual machine file system process - cvmfs2 - so it seems the VM tries to make a connection. My internet is OK, I can post here ;) . I let it run for 1.5 hours. Nothing happened.
Last lines of boot.log:
[pre]
Sun Jan 31 09:04:10 2016: cms.cern.ch: Activating Fuse module
Sun Jan 31 09:04:12 2016: grid.cern.ch: Restoring inode tracker... done
Sun Jan 31 09:04:12 2016: grid.cern.ch: Restoring chunk tables... done
Sun Jan 31 09:04:12 2016: grid.cern.ch: Restoring inode generation... done
Sun Jan 31 09:04:12 2016: grid.cern.ch: Restoring open files counter... done
Sun Jan 31 09:04:12 2016: grid.cern.ch: Releasing saved glue buffer
Sun Jan 31 09:04:12 2016: grid.cern.ch: Releasing chunk tables
Sun Jan 31 09:04:12 2016: grid.cern.ch: Releasing saved inode generation info
Sun Jan 31 09:04:12 2016: grid.cern.ch: Releasing open files counter
Sun Jan 31 09:04:12 2016: grid.cern.ch: Activating Fuse module
[/pre]

Development for LHC@home