Message boards : Number crunching : issue of the day
Author | Message |
---|---|
Joined: 20 May 15 Posts: 217 Credit: 6,191,196 RAC: 3,170 |
I'll go and have a check, but I think it happens all the time, i.e. once an hour on every machine. I can't check all of them, as the console doesn't work on some and the boot log is the only file showing through the 'Graphics' button. I can check CPU usage though. I'll be back... |
Joined: 20 May 15 Posts: 217 Credit: 6,191,196 RAC: 3,170 |
Yes, all machines running CMS are showing this error. They are all happily running vLHC though. Haven't received a CMS job through vLHC yet. Any reason the LHC@home check doesn't say vLHC@home? |
Joined: 20 May 15 Posts: 217 Credit: 6,191,196 RAC: 3,170 |
I tried a re-boot but still the same. The CMS request takes a few seconds (5-10?) before it tries LHC; that takes no time at all before it puts out a curl command and then the 'Cloud not get' message. |
Joined: 20 May 15 Posts: 217 Credit: 6,191,196 RAC: 3,170 |
Still getting the 'Cloud not get' error. After the last line in the boot.log file ('Activating Fuse module') the console display flashes this up quickly: 'Starting httpd: httpd: Could not reliably determine the server's fully qualified domain name, using 127.0.0.1 for ServerName [OK] Starting vmcontext_epilog ... bootlogd: no process killed'. It says OK, so is it? |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 270 |
Still getting the 'Cloud not get' error. I believe so; certainly I've seen those messages with no obvious detrimental effect. |
Joined: 20 May 15 Posts: 217 Credit: 6,191,196 RAC: 3,170 |
That's what I thought I'd seen as well, but after suffering withdrawal symptoms for so long you start to grasp at straws! |
Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
Yes, a normally running WU throws up so many errors that no normal cruncher has a chance of finding the real problem. |
Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0 |
Linux host (Ubuntu) - after today's OS update and restart: CMS "Waiting to run (Scheduler wait: Please update/recompile VirtualBox Kernel Drivers.)" Downloading VBox 5.0.10 now (was on 5.0.8). Sigh. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 270 |
Linux host (Ubuntu) - after today's OS update and restart: Well, that's quite usual for Linux. I have to re-run the Nvidia driver installer on my Linux machines with their GPUs every time there's a kernel update -- you're supposed to be able to set it up to do it automagically, but I've had problems with that. I also have to recompile the Xeon Phi drivers after a kernel update, and that's a whole other kettle of fish! |
Joined: 17 Aug 15 Posts: 62 Credit: 296,695 RAC: 0 |
I've installed Leap 42.1, the latest SuSE release, as a virtual machine on this Windows host. I installed the BOINC client and manager from SuSE and also VirtualBox. Nothing works. Then I downloaded gcc, make and the kernel sources from SuSE and am trying to rebuild all that works on OpenSuSE 13.1 and 13.2 on my Linux boxes. Tullio |
Joined: 18 Aug 15 Posts: 14 Credit: 125,335 RAC: 0 |
Bill Michael said: CMS "Waiting to run (Scheduler wait: Please update/recompile VirtualBox Kernel Drivers.)" If you install DKMS (Dynamic Kernel Module Support) before you install VirtualBox, you shouldn't need to recompile VirtualBox after a kernel update. Ivan, the same principle applies for Nvidia drivers. If you install DKMS before you install the Nvidia drivers, you shouldn't have to re-install the Nvidia drivers after a kernel update. The only time I have to re-install Nvidia drivers after a kernel update is when I am running a pre-release (alpha or beta) version of the Ubuntu operating system. I don't have any experience with Xeon Phi drivers, but the same principle might apply. It would certainly be worth a test. Hope that helps. |
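A quick way to check whether DKMS has actually picked up the VirtualBox module for the kernel you are running is to compare the `dkms status` output against `uname -r`. The sketch below is only illustrative Python: the module name (vboxdrv) and the "module, version, kernel, arch: status" line format are assumptions and can vary between distributions and DKMS versions.

```python
# Rough check: is the VirtualBox kernel module (assumed to be named
# "vboxdrv") registered with DKMS for the currently running kernel?
# The `dkms status` line format is an assumption and may differ.
import subprocess

def vbox_module_built_for_running_kernel(module="vboxdrv"):
    running_kernel = subprocess.check_output(["uname", "-r"], text=True).strip()
    status = subprocess.check_output(["dkms", "status"], text=True)
    for line in status.splitlines():
        if module in line and running_kernel in line and "installed" in line:
            return True
    return False

if __name__ == "__main__":
    if vbox_module_built_for_running_kernel():
        print("vboxdrv appears to be built for the running kernel.")
    else:
        print("vboxdrv not registered for this kernel; a rebuild may be needed.")
```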
Joined: 17 Aug 15 Posts: 62 Credit: 296,695 RAC: 0 |
I have DKMS but I need gcc, make and the kernel sources to recompile the kernel modules of VirtualBox. Tullio |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 270 |
Ivan, the same principle applies for Nvidia drivers. If you install DKMS before you install the Nvidia drivers, you shouldn't have to re-install the Nvidia drivers after a kernel update. The only time I have to re-install Nvidia drivers after a kernel update is when I am running a pre-release (alpha or beta) version of the Ubuntu operating system. Could well be so; I relied on the Nvidia installer the one or two times I tried it, and it may not have got everything right. Not sure about Xeon Phi -- see section 2.7 of this manual. You weren't a Eurodance star in the 1990s, were you? (Not really SFW; the Germans never really grasped the offensiveness of some English swear-words.) |
Joined: 20 May 15 Posts: 217 Credit: 6,191,196 RAC: 3,170 |
I tried a re-boot but still the same. I see Yeti is getting this error now (he has reported it on the vLHC forum). Any updates on it being fixed? |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
"Cloud not get proxy..." (sic) I'm sure I've seen a post about this before, but can't find it. All my machines running CMS are stuck with this.:- It looks as though they've done nothing useful for a couple of days, so I've set NNW for now. Is this something I can do anything about? |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 270 |
Only thing I can suggest at the moment is to reset the project on that machine, to get a fresh VM image. Sometimes things get corrupted, mainly I guess by network glitches as the cvmfs file-system is being updated. FWIW, in the latest batch (since 13/12/15) that machine has returned 7 success exits and one 151 (stage-out error). We're trying to chase the stage-out errors; there is some correlation with distance from CERN apparently. I've just got the exit status from 3300-odd jobs in the current batch and counted the different statuses for each machine -- I might try to tie that in with IP but it involves a laborious manual look-up. |
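The per-machine tally itself is straightforward once the (machine, exit status) pairs have been pulled out of the job records; a minimal Python sketch follows, with the input list being purely an illustrative assumption rather than the real Dashboard output:

```python
# Count how many of each exit status every machine returned.
# `job_records` is an assumed, illustrative input: one
# (machine, exit_status) pair per job in the batch.
from collections import Counter, defaultdict

job_records = [
    ("machine-A", 0),
    ("machine-A", 151),   # stage-out error
    ("machine-B", 0),
]

per_machine = defaultdict(Counter)
for machine, status in job_records:
    per_machine[machine][status] += 1

for machine, counts in sorted(per_machine.items()):
    summary = ", ".join(f"{status}: {n}" for status, n in sorted(counts.items()))
    print(f"{machine}  ->  {summary}")
```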
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
All of a sudden, all jobs that have not been run turned status to "unknown" on Dashboard. Previously they were listed as status "Pending". Any idea why? Shouldn't they have had the status "Unknown" to begin with? |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 270 |
All of a sudden, all jobs that have not been run turned status to "unknown" on Dashboard. Previously they were listed as status "Pending". Well, if it's not been run it should be pending. Note that some of the "unknown" jobs have been run already. However, I've given up expecting Dashboard to give more than an approximation to reality while a batch is "live". There seem to be too many uncertainties for it to accurately interpret all the return codes. Those that have run but are marked as unknown appear to have timed out or some such; there is no job log, just the placeholder "Job output has not been processed by post-job." -- the Dashboard details give "N/A / Error return without specification". |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
Thanks, Ivan. Only thing I can suggest at the moment is to reset the project on that machine, to get a fresh VM image. Sometimes things get corrupted, mainly I guess by network glitches as the cvmfs file-system is being updated. They've all got work from other projects at the moment, but I forced one to get a new CMS BOINC task. This started OK without having to reset the project. The others should do this on their own eventually. Presumably they would have recovered on their own when the 24hr task time expired and they started afresh. This could take up to 4 days here. We're trying to chase the stage-out errors; there is some correlation with distance from CERN apparently. I've just got the exit status from 3300-odd jobs in the current batch and counted the different statuses for each machine -- I might try to tie that in with IP but it involves a laborious manual look-up. Maybe treat Dashboard-reported IPs for "jobs with retries" and IP-to-location with some suspicion, too. My ISP once bought some v4 IPs from some outfit in the USA; as I remember it took quite a while before users could access "UK only" content. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 270 |
We're trying to chase the stage-out errors; there is some correlation with distance from CERN apparently. I've just got the exit status from 3300-odd jobs in the current batch and counted the different statuses for each machine -- I might try to tie that in with IP but it involves a laborious manual look-up. I have actually dug the user and machine names from the end of the log-file (the line that says "FINISHING on user-machine-pid with status X") but for each "interesting" one I have to use my BOINC admin account to find what IP that machine last used, to report to the crew chasing down stage-out problems. |
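Extracting the user and machine names from that final log line can be scripted; here is a minimal Python sketch, assuming the line really does follow the 'FINISHING on user-machine-pid with status X' pattern and that the pid is the last hyphen-separated field (a machine name containing hyphens would still confuse it):

```python
# Extract (user, machine, pid, status) from the last line of a job log,
# assuming the "FINISHING on user-machine-pid with status X" format.
import re

FINISH_RE = re.compile(r"FINISHING on (?P<ident>\S+) with status (?P<status>-?\d+)")

def parse_finish_line(line):
    m = FINISH_RE.search(line)
    if not m:
        return None
    ident = m.group("ident")
    # Split from the right so a hyphen inside the user name does not
    # swallow the machine and pid fields.
    user, machine, pid = ident.rsplit("-", 2)
    return user, machine, pid, int(m.group("status"))

print(parse_finish_line("FINISHING on alice-myhost-12345 with status 151"))
# -> ('alice', 'myhost', '12345', 151)
```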