Message boards : Number crunching : issue of the day


Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 6,191,450
RAC: 3,176
Message 1450 - Posted: 13 Nov 2015, 12:28:56 UTC - in response to Message 1449.  

I'll go and have a check, but I think it happens all the time, i.e. once an hour on every machine. I can't check all of them, as the console doesn't work on some and the boot log is the only file showing through the 'Graphics' button. I can check CPU usage though. I'll be back...
ID: 1450
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 6,191,450
RAC: 3,176
Message 1451 - Posted: 13 Nov 2015, 12:40:50 UTC - in response to Message 1450.  

Yes, all machines running CMS are showing this error.
They are all happily running vLHC though.

Haven't received a CMS job through vLHC yet.
Any reason the LHC@home check doesn't say vLHC@home?
ID: 1451
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 6,191,450
RAC: 3,176
Message 1452 - Posted: 13 Nov 2015, 16:02:10 UTC - in response to Message 1451.  

I tried a reboot, but it's still the same.

The CMS request takes a few seconds (5-10?) before it tries LHC; that takes no time at all before it puts out a curl command and then the 'Cloud not get' message.
ID: 1452
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 6,191,450
RAC: 3,176
Message 1469 - Posted: 16 Nov 2015, 15:31:55 UTC - in response to Message 1452.  

Still getting the 'Cloud not get' error.

After the last line in the boot.log file ('Activating Fuse module'), the console display quickly flashes this up:

Starting httpd: httpd: Could not reliably determine the server's fully qualified domain name, using 127.0.0.1 for ServerName [OK]

Starting vmcontext_epilog ...
bootlogd: no process killed


It says OK, so is it?
ID: 1469
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 270
Message 1470 - Posted: 16 Nov 2015, 16:24:29 UTC - in response to Message 1469.  

Still getting the 'Cloud not get' error.

After the last line in the boot.log file ('Activating Fuse module'), the console display quickly flashes this up:

Starting httpd: httpd: Could not reliably determine the server's fully qualified domain name, using 127.0.0.1 for ServerName [OK]

Starting vmcontext_epilog ...
bootlogd: no process killed


It says OK, so is it?

I believe so; certainly I've seen those messages with no obvious detrimental effect.
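That httpd warning is indeed generally harmless; if one wanted to silence it inside the VM, giving Apache an explicit server name would do it. A minimal sketch, assuming the Red Hat-style config path used by typical CERN VM images (the path is an assumption, not taken from the posts above):

```apache
# Assumed path inside the VM image: /etc/httpd/conf/httpd.conf
# An explicit ServerName stops the "Could not reliably determine the
# server's fully qualified domain name" warning at httpd startup.
ServerName localhost
```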
ID: 1470
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 6,191,450
RAC: 3,176
Message 1471 - Posted: 16 Nov 2015, 16:38:29 UTC - in response to Message 1470.  

That's what I thought I'd seen as well, but after suffering withdrawal symptoms for so long you start to grasp at straws!
ID: 1471
Yeti

Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 1472 - Posted: 16 Nov 2015, 16:41:36 UTC

Yes, a normally running WU throws up so many errors that no normal cruncher has a chance to find the real problem.
ID: 1472
Profile Tern

Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1479 - Posted: 18 Nov 2015, 8:51:17 UTC

Linux host (Ubuntu) - after today's OS update and restart:

CMS "Waiting to run (Scheduler wait: Please update/recompile VirtualBox Kernal Drivers.)"

Downloading VBox 5.0.10 now (was on 5.0.8). Sigh.
ID: 1479
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 270
Message 1480 - Posted: 18 Nov 2015, 9:06:50 UTC - in response to Message 1479.  

Linux host (Ubuntu) - after today's OS update and restart:

CMS "Waiting to run (Scheduler wait: Please update/recompile VirtualBox Kernal Drivers.)"

Downloading VBox 5.0.10 now (was on 5.0.8). Sigh.

Well, that's quite usual for Linux. I have to re-run the Nvidia driver installer on my Linux machines with GPUs every time there's a kernel update -- you're supposed to be able to set it up to do it automagically, but I've had problems with that. I also have to recompile the Xeon Phi drivers after a kernel update, and that's a whole other kettle of fish!
ID: 1480
Profile tullio

Joined: 17 Aug 15
Posts: 62
Credit: 296,695
RAC: 0
Message 1481 - Posted: 18 Nov 2015, 13:19:11 UTC - in response to Message 1480.  

I've installed Leap 42.1, the latest SuSE release, as a virtual machine on this Windows host. I installed the BOINC client and manager from SuSE, and also VirtualBox. Nothing works. Then I downloaded gcc, make and the kernel sources from SuSE and am trying to rebuild everything that works on openSuSE 13.1 and 13.2 on my Linux boxes.
Tullio
ID: 1481
captainjack

Joined: 18 Aug 15
Posts: 14
Credit: 125,335
RAC: 0
Message 1482 - Posted: 18 Nov 2015, 15:22:16 UTC

Bill Michael said:

CMS "Waiting to run (Scheduler wait: Please update/recompile VirtualBox Kernal Drivers.)"


If you install DKMS (Dynamic Kernel Module Support) before you install VirtualBox, you shouldn't need to recompile VirtualBox after a kernel update.

Ivan, the same principle applies for Nvidia drivers. If you install DKMS before you install the Nvidia drivers, you shouldn't have to re-install the Nvidia drivers after a kernel update. The only time I have to re-install Nvidia drivers after a kernel update is when I am running a pre-release (alpha or beta) version of the Ubuntu operating system.

I don't have any experience with XeonPhi drivers, but the same principle might apply. It would certainly be worth a test.

Hope that helps.
ID: 1482
Profile tullio

Joined: 17 Aug 15
Posts: 62
Credit: 296,695
RAC: 0
Message 1483 - Posted: 18 Nov 2015, 18:55:54 UTC - in response to Message 1482.  

I have DKMS, but I need gcc, make and the kernel sources to recompile the kernel modules of VirtualBox.
Tullio
ID: 1483
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 270
Message 1484 - Posted: 18 Nov 2015, 20:57:23 UTC - in response to Message 1482.  

Ivan, the same principle applies for Nvidia drivers. If you install DKMS before you install the Nvidia drivers, you shouldn't have to re-install the Nvidia drivers after a kernel update. The only time I have to re-install Nvidia drivers after a kernel update is when I am running a pre-release (alpha or beta) version of the Ubuntu operating system.
Hope that helps.

Could well be so; I relied on the Nvidia installer the one or two times I tried it, and it may not have got everything right. Not sure about Xeon Phi -- see section 2.7 of this manual.

You weren't a Eurodance star in the 1990s, were you? (Not really SFW; the Germans never really grasped the offensiveness of some English swear-words.)
ID: 1484
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 6,191,450
RAC: 3,176
Message 1486 - Posted: 19 Nov 2015, 16:38:32 UTC - in response to Message 1452.  

I tried a re-boot but still the same.

The CMS request takes a few seconds (5-10 ?) before it tries LHC, that takes no time at all before it puts out a curl command and then the 'Cloud not get' message.

I see Yeti is getting this error now (he has reported it on the vLHC forum), any updates on it being fixed ?
ID: 1486
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 0
Message 1513 - Posted: 23 Dec 2015, 1:04:57 UTC
Last modified: 23 Dec 2015, 1:41:13 UTC

"Cloud not get proxy..." (sic)

I'm sure I've seen a post about this before, but can't find it.
All my machines running CMS are stuck with this:

It looks as though they've done nothing useful for a couple of days, so I've set NNW (No New Work) for now.
Is this something I can do anything about?
ID: 1513
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 270
Message 1514 - Posted: 23 Dec 2015, 16:25:19 UTC - in response to Message 1513.  

Only thing I can suggest at the moment is to reset the project on that machine, to get a fresh VM image. Sometimes things get corrupted, mainly I guess by network glitches as the cvmfs file-system is being updated.
FWIW, in the latest batch (since 13/12/15) that machine has returned 7 success exits and one 151 (stage-out error). We're trying to chase the stage-out errors; there is some correlation with distance from CERN apparently. I've just got the exit status from 3300-odd jobs in the current batch and counted the different statuses for each machine -- I might try to tie that in with IP but it involves a laborious manual look-up.
ID: 1514
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1515 - Posted: 23 Dec 2015, 20:41:38 UTC

All of a sudden, all the jobs that have not been run turned status to "unknown" on Dashboard.
Previously they were listed with status "Pending".

Any idea why?
Shouldn't they have had the status "Unknown" to begin with?
ID: 1515
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 270
Message 1516 - Posted: 23 Dec 2015, 23:10:20 UTC - in response to Message 1515.  

All of a sudden, all the jobs that have not been run turned status to "unknown" on Dashboard. Previously they were listed with status "Pending".

Any idea why? Shouldn't they have had the status "Unknown" to begin with?

Well, if it's not been run it should be pending. Note that some of the "unknown" jobs have been run already. However, I've given up expecting Dashboard to give more than an approximation to reality while a batch is "live". There seem to be too many uncertainties for it to accurately interpret all the return codes.
Those that have run but are marked as unknown appear to have timed out or some such; there is no job log, just the placeholder "Job output has not been processed by post-job." -- the Dashboard details give "N/A / Error return without specification".
ID: 1516
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 0
Message 1517 - Posted: 24 Dec 2015, 2:20:12 UTC - in response to Message 1514.  

Thanks, Ivan.

Only thing I can suggest at the moment is to reset the project on that machine, to get a fresh VM image. Sometimes things get corrupted, mainly I guess by network glitches as the cvmfs file-system is being updated.

They've all got work from other projects at the moment, but I forced one to get a new CMS BOINC task. It started OK without having to reset the project, and the others should do this on their own eventually. Presumably they would have recovered on their own anyway when the 24-hour task time expired and they started afresh, but that could take up to 4 days here.

We're trying to chase the stage-out errors; there is some correlation with distance from CERN apparently. I've just got the exit status from 3300-odd jobs in the current batch and counted the different statuses for each machine -- I might try to tie that in with IP but it involves a laborious manual look-up.

Maybe treat Dashboard-reported IPs for "jobs with retries" and IP-to-location with some suspicion, too. Once my ISP bought some v4 IPs from some outfit in the USA; as I remember it took quite a while before users could access "UK only" content.
ID: 1517
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 270
Message 1518 - Posted: 24 Dec 2015, 13:42:36 UTC - in response to Message 1517.  

We're trying to chase the stage-out errors; there is some correlation with distance from CERN apparently. I've just got the exit status from 3300-odd jobs in the current batch and counted the different statuses for each machine -- I might try to tie that in with IP but it involves a laborious manual look-up.

Maybe treat Dashboard-reported IPs for "jobs with retries" and IP-to-location with some suspicion, too. Once my ISP bought some v4 IPs from some outfit in the USA; as I remember it took quite a while before users could access "UK only" content.

I have actually dug the user and machine names out of the end of the log file (the line that says "FINISHING on user-machine-pid with status X"), but for each "interesting" one I have to use my BOINC admin account to find which IP that machine last used, to report to the crew chasing down stage-out problems.
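That per-machine tally can be sketched in a few lines. The log-line format is as quoted above; the sample lines, machine names, and function name are illustrative assumptions, not real project data:

```python
import re
from collections import Counter

# Assumed format, quoted in the post:
#   "FINISHING on user-machine-pid with status X"
FINISH_RE = re.compile(r"FINISHING on (\S+)-(\d+) with status (\d+)")

def count_statuses(log_lines):
    """Tally (machine, exit status) pairs from job-log lines."""
    counts = Counter()
    for line in log_lines:
        m = FINISH_RE.search(line)
        if m:
            machine, _pid, status = m.groups()
            counts[(machine, int(status))] += 1
    return counts

# Illustrative sample lines (not real data):
sample = [
    "... FINISHING on alice-host1-4242 with status 0 ...",
    "... FINISHING on alice-host1-4243 with status 151 ...",
    "... FINISHING on bob-host2-9999 with status 0 ...",
]
print(count_statuses(sample))
```

Tying each machine back to an IP would still need the manual BOINC-admin look-up described above.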
ID: 1518



©2024 CERN