Message boards : Number crunching : Expect errors eventually
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,972,027
RAC: 2,920
Message 1384 - Posted: 30 Oct 2015, 20:20:10 UTC - in response to Message 1382.  

My task 68516, work unit 63658, took 3 days to complete - spent most of its time in "Waiting to run" state in spite of CMS-Dev being my highest BOINC priority project. Looking at the log, it is mostly

2015-10-27 21:46:47 (608): Status Report: virtualbox.exe/vboxheadless.exe is no longer running.

Looks like when this happens the task attempts a retry MANY hours after the failure; don't understand VBox at all (can't get it to run on my Mac, where I do know what I'm doing, unlike Windows...).

Would be nice if either I was notified that something was wrong so it could be fixed on my end, or if the task would just abort with an error, rather than keeping that CPU from running CMS indefinitely. (Until reboot?) That machine (Phoenix) was overheating with a GPU problem, fixed yesterday, or it might still be cycling the same task...

I can't comment explicitly on the VBox error, I'm not an expert (though some here are). There are probably ways of monitoring this (I see your machine is on Windows 10): looking at the tasks in boincmgr (it would probably have gone into a wait state; BTW, I sort my mgr task display by % Progress, so jobs that have had at least some running are at the top of the list); or in Task Manager details looking to see if VBoxHeadless has been consuming CPU -- but that depends on whether you're a set-'n'-forget or a micro-managing type.

I don't see any jobs for that machine in the 100-event batch, but it did snag a few of the 50-event tasks (status 0 is good):
151026_105149:ireid_crab_CMS_at_Home_TTbar_50ev/job_out.17.1.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Oct 27 04:28:47 GMT 2015 on 306-749-11603 with (short) status 0 ========
151026_105149:ireid_crab_CMS_at_Home_TTbar_50ev/job_out.283.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Oct 27 16:51:43 GMT 2015 on 306-749-11603 with (short) status 151 ========
151026_105149:ireid_crab_CMS_at_Home_TTbar_50ev/job_out.315.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Oct 27 08:00:52 GMT 2015 on 306-749-11603 with (short) status 0 ========
151026_105149:ireid_crab_CMS_at_Home_TTbar_50ev/job_out.347.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Oct 27 09:55:04 GMT 2015 on 306-749-11603 with (short) status 0 ========
151026_105149:ireid_crab_CMS_at_Home_TTbar_50ev/job_out.369.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Oct 27 11:40:18 GMT 2015 on 306-749-11603 with (short) status 0 ========
151026_105149:ireid_crab_CMS_at_Home_TTbar_50ev/job_out.395.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Oct 27 13:20:30 GMT 2015 on 306-749-11603 with (short) status 0 ========
151026_105149:ireid_crab_CMS_at_Home_TTbar_50ev/job_out.421.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Oct 27 15:15:27 GMT 2015 on 306-749-11603 with (short) status 0 ========
151026_105149:ireid_crab_CMS_at_Home_TTbar_50ev/job_out.93.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Oct 27 06:18:32 GMT 2015 on 306-749-11603 with (short) status 0 ========


As an aside to everyone, if you're reporting issues like this and you have multiple hosts, it would help if you gave the host number and not its name, to save us having to use privileges to look this up. The machine designation given in the above lines is userno-hostno-something so if I know the user and host numbers I can look for the machine in our logs without having to do the extra step of searching databases as well.
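For anyone who wants to pull the user and host numbers out of log lines like the ones above, here is a minimal sketch. It assumes the line layout shown in this post (job_out.<job>.<retry>.txt, then a designation of the form userno-hostno-something, then the short status). The regular expression and the field names are illustrative labels of mine, not anything defined by the project.

```python
import re
from collections import Counter

# Pattern for the FINISHING lines quoted above; group names are my own labels.
FINISH_RE = re.compile(
    r"job_out\.(?P<job>\d+)\.(?P<retry>\d+)\.txt:.*FINISHING at (?P<when>.+?) on "
    r"(?P<user>\d+)-(?P<host>\d+)-(?P<suffix>\d+) with \(short\) status (?P<status>\d+)"
)


def parse_finish_line(line):
    """Return a dict of fields from one FINISHING line, or None if it does not match."""
    m = FINISH_RE.search(line)
    if m is None:
        return None
    return {
        "job": int(m["job"]),
        "retry": int(m["retry"]),
        "when": m["when"],
        "user": int(m["user"]),      # first number of the designation
        "host": int(m["host"]),      # second number of the designation
        "status": int(m["status"]),  # 0 is good, per the post above
    }


# Two of the lines quoted above, copied verbatim.
lines = [
    "151026_105149:ireid_crab_CMS_at_Home_TTbar_50ev/job_out.17.1.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Oct 27 04:28:47 GMT 2015 on 306-749-11603 with (short) status 0 ========",
    "151026_105149:ireid_crab_CMS_at_Home_TTbar_50ev/job_out.283.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Oct 27 16:51:43 GMT 2015 on 306-749-11603 with (short) status 151 ========",
]

records = [r for r in map(parse_finish_line, lines) if r is not None]
status_counts = Counter(r["status"] for r in records)

for r in records:
    print(f"user {r['user']} host {r['host']} job {r['job']}.{r['retry']} status {r['status']}")
```

Tallying by status (status_counts here) gives a quick per-host count of good and failed jobs without needing the host name.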
Profile Tern
Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1385 - Posted: 30 Oct 2015, 21:46:12 UTC - in response to Message 1384.  

BOINC manager Task pane showed "waiting to run" - which is not unusual for any given task as I run a lot of projects and BOINC switches things out based on its own priorities. No indication in Tasks or Projects pane of any problem. Messages pane did show that the task was 'timed out' for 860000-something seconds - sorry I didn't record the wording, I know better! When I went back to look, the message was already gone. I decided to let it sit as the alternative was to abort it. I tried suspending/resuming task and project to get it to restart, nothing changed. At some point (possibly reboot due to GPU problem, when I wasn't looking at BOINC, timing seems right) it went back to 'running' and finished up fine.

Sorry about not giving computer # - when I look at a work unit on the website, the computer # is right there, I assumed it was the same at your end.

Please advise where to look in "Task Manager Details" for vbox usage. I use TThrottle on Windows machines, which shows a very low CPU percentage for the 'parent' task but shows the actual CPU usage for the 'child' task on CMS (unlike BOINC, which just shows the meaningless 'parent' value...) - but this only applies to _running_ tasks, not "waiting to run". I'm not sure there is anywhere TO look to determine why something is NOT running, other than the BOINC message file? Or how to know that looking is even necessary?

BTW, just successfully attached Linux box (782, very low-horsepower) to project, downloading work now, we'll see how it runs. No problem with VBox, unlike Mac (711) which is STILL not working. (Q&A section posts). I'm definitely a "micro-managing" type, but I'm in a "BOINC Addicts Anonymous" 12-step program to try to get to "set and forget"... :-) Seriously, I USE the Mac - my other hosts are "under construction", running BOINC for testing and burnin, and only get checked every day or two. When they are 'finished' and go into daily use, BOINC will still be there but obviously not running 24/7. They _may_ not be suitable for CMS at that point, we'll see.

When looking at my tasks (user 306), oddly, hosts 748 and 749 (AMD 8350's) show CPU seconds in the 8000-60000 range for each task. Host 775 (i7-5820K!) shows around _5_ seconds? Is this something to be concerned about? All the tasks are validated, and actually grant MORE credit per task for these 5-second runs... ??? Maybe these were running when there weren't any 'jobs' available for the tasks? When you're ready to discuss credit issues, I was around for the Sztaki, Predictor, and Rosetta startups (as well as BOINC itself) and have been project manager on several non-BOINC IT projects, and definitely have some ideas... although I _still_ feel like a BOINC "noob" at times... and am barely Windows and Linux literate...
Profile ivan
Message 1386 - Posted: 30 Oct 2015, 22:41:28 UTC - in response to Message 1385.  
Last modified: 30 Oct 2015, 22:53:27 UTC

BOINC manager Task pane showed "waiting to run" - which is not unusual for any given task as I run a lot of projects and BOINC switches things out based on its own priorities. No indication in Tasks or Projects pane of any problem. Messages pane did show that the task was 'timed out' for 860000-something seconds - sorry I didn't record the wording, I know better! When I went back to look, the message was already gone. I decided to let it sit as the alternative was to abort it. I tried suspending/resuming task and project to get it to restart, nothing changed. At some point (possibly reboot due to GPU problem, when I wasn't looking at BOINC, timing seems right) it went back to 'running' and finished up fine.
Was there any extra info when you moved the mouse over the task? I have a Windows machine at work that refuses, for some reason, to start up the VM properly. So it appears to run, for 10 minutes, and then goes into "waiting to run" but when I mouse-over the entry I get a tool-tip box saying something like "Task failed to run in a timely manner. Waiting." The giveaway in that case is recorded running time of ~10 minutes.

Sorry about not giving computer # - when I look at a work unit on the website, the computer # is right there, I assumed it was the same at your end.
Tja, but non-privileged users can't see the name or the IP address of anyone else's computers. I have to log-in to a privileged part of the site to see that info, and even then I have to search a bit.

Please advise where to look in "Task Manager Details" for vbox usage. I use TThrottle on Windows machines, which shows a very low CPU percentage for the 'parent' task but shows the actual CPU usage for the 'child' task on CMS (unlike BOINC, which just shows the meaningless 'parent' value...) - but this only applies to _running_ tasks, not "waiting to run". I'm not sure there is anywhere TO look to determine why something is NOT running, other than the BOINC message file? Or how to know that looking is even necessary?
To be honest, that's one of my quibbles with Task Manager on Win10 -- I've not found any way to get it to display accumulated CPU time, only current %age (there's no menu item for selecting what statistics columns are shown, that I've ever found).

BTW, just successfully attached Linux box (782, very low-horsepower) to project, downloading work now, we'll see how it runs. No problem with VBox, unlike Mac (711) which is STILL not working. (Q&A section posts). I'm definitely a "micro-managing" type, but I'm in a "BOINC Addicts Anonymous" 12-step program to try to get to "set and forget"... :-) Seriously, I USE the Mac - my other hosts are "under construction", running BOINC for testing and burnin, and only get checked every day or two. When they are 'finished' and go into daily use, BOINC will still be there but obviously not running 24/7. They _may_ not be suitable for CMS at that point, we'll see.
Yes, that's a downside to this project. We need -- almost demand -- computers be on and available 24/7 (one reason we only send one VM per host), plus we need good bandwidth to avoid the apparent time-outs we've been seeing lately with 100-event jobs returning 128 MB result files. The jury's still out on this one as there have been glitches at CERN in the last 48 hours, but 75-event jobs with ~100 MB result files seem to be faring much better at the moment.

When looking at my tasks (user 306), oddly, hosts 748 and 749 (AMD 8350's) show CPU seconds in the 8000-60000 range for each task. Host 775 (i7-5820K!) shows around _5_ seconds? Is this something to be concerned about? All the tasks are validated, and actually grant MORE credit per task for these 5-second runs... ??? Maybe these were running when there weren't any 'jobs' available for the tasks? When you're ready to discuss credit issues, I was around for the Sztaki, Predictor, and Rosetta startups (as well as BOINC itself) and have been project manager on several non-BOINC IT projects, and definitely have some ideas... although I _still_ feel like a BOINC "noob" at times... and am barely Windows and Linux literate...
Ah, that's something about the way CERN IT has decided to run their BOINC tasks. Each task runs for 24 hours, and gets a fixed amount of credit for that. Within the task, jobs are run -- or not as the case may be. For our project (caution, here be jargon!) a glide-in request is sent to the Condor server to which I have submitted a batch of jobs (using CRAB). If the request is successful a job is returned and crunched, and the results hopefully sent ("staged out") to the data bridge machine that stores them pending transfer into the internal CMS storage system. If no job is found for some reason (like the several hiatuses I've had lately when things have broken on the CERN side) then IIRC after a while the glide-in request times out, and a new one is submitted. Each glide-in job can retrieve several jobs in its lifetime, as you may have noticed if you've followed the logging tree from the curiously misnamed "Show Graphs" button in boincmgr. So it's perfectly possible that you never ran any jobs in that 24-hour task, so didn't rack up much CPU usage, but still got credit for it (because your CPU was not available for other BOINC jobs).
There's a downside to this that has been pointed out, and will need addressing at some point soon, that when the 24-hour period is up, the task terminates and any running job is lost, leading to inefficiency.
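To put a rough number on the inefficiency just described, here is a toy model of a fixed 24-hour task running jobs back to back. It is only a sketch under stated assumptions: the per-job run time (3 h 10 min) is a made-up figure, and glide-in requests, time-outs and stage-out are ignored entirely. The function name and its parameters are hypothetical.

```python
TASK_SECONDS = 24 * 60 * 60  # the fixed task length described above


def run_task(job_durations, task_seconds=TASK_SECONDS):
    """Return (jobs_completed, seconds_lost) for one task.

    job_durations: hypothetical per-job run times in seconds, in the order
    the jobs would be fetched. seconds_lost is the time spent on a job that
    was still running when the task hit its limit.
    """
    elapsed = 0
    completed = 0
    for duration in job_durations:
        if elapsed + duration > task_seconds:
            # This job is cut off when the task terminates; its work is discarded.
            return completed, task_seconds - elapsed
        elapsed += duration
        completed += 1
    # Ran out of jobs before the limit: nothing was lost, the rest is idle time.
    return completed, 0


# Assumed job length of 3 h 10 min (11400 s); not a measured value.
completed, lost = run_task([11400] * 10)
print(completed, lost, f"{lost / TASK_SECONDS:.1%}")
```

With these assumed numbers, seven jobs finish and the eighth is cut off after 6600 seconds, so under 8% of the task is wasted. Longer jobs make the loss proportionally worse.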
Profile Tern
Message 1387 - Posted: 31 Oct 2015, 1:16:42 UTC - in response to Message 1386.  

May have helpful new information. Just rebooted Phoenix (749) for Windows update at 19:35 CDT. Before reboot all was running normally. At 19:46 CMS-Dev task CMS_10671_1427806902.405909_0 went to "waiting to run" state in BoincTasks. Message file shows "Task postponed 86400.000000 sec: VM Hypervisor failed to enter an online state in a timely fashion."

Launched BOINC Manager, instead of "waiting to run", _it_ says "Postponed: VM Hypervisor failed to enter an online state in a timely fashion." (Bit more helpful, that!) Properties says same thing, with 10+ hrs CPU and 15+ hrs elapsed time.

Website info on task; #68770, work unit 63912, user 306, host 749 (Win10 AMD 8350).

Apparently have 24 hours before anything changes - if you want me to look at something on my end, let me know, happy to help!
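For reference, the 86400 seconds in that message is exactly 24 hours. A small sketch that extracts the delay and reason from a message of this shape follows. The regular expression is an assumption based solely on the single message quoted above, not on any documented BOINC message format.

```python
import re
from datetime import timedelta

# Assumed shape, taken from the one message quoted in this post.
POSTPONED_RE = re.compile(r"Task postponed (?P<seconds>\d+(?:\.\d+)?) sec: (?P<reason>.+)")


def parse_postponed(message):
    """Return (delay, reason) for a 'Task postponed' message, or None if it does not match."""
    m = POSTPONED_RE.search(message)
    if m is None:
        return None
    return timedelta(seconds=float(m["seconds"])), m["reason"]


msg = ("Task postponed 86400.000000 sec: VM Hypervisor failed to enter "
       "an online state in a timely fashion.")
delay, reason = parse_postponed(msg)
print(delay)   # 1 day, 0:00:00
print(reason)
```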
Profile ivan
Message 1392 - Posted: 31 Oct 2015, 10:30:01 UTC - in response to Message 1387.  
Last modified: 31 Oct 2015, 10:35:13 UTC

That is the exact same thing that happens to my Windows box at work. If you ever manage to find out what causes it, do let me know, because I've looked hard and long without success. What version of VirtualBox are you running? The version that BOINC serves out works, but it's old (4.3.12 IIRC).
I notice that you had got some work done before it locked up [http://boincai05.cern.ch/CMS-dev/result.php?resultid=68770]. Mine always bombs at start-up. There may be clues in that log for anyone more au fait with the workings of VirtualBox.
Profile PDW
Joined: 20 May 15
Posts: 217
Credit: 5,963,281
RAC: 12,846
Message 1393 - Posted: 31 Oct 2015, 11:35:32 UTC - in response to Message 1392.  

Have just found this in the bottom of a condor_stdout file on RemoteHost = "glidein_3942@246-485-10399" ...


==== CMSSW JOB Execution started at Sat Oct 31 11:00:05 2015 ====
2015-10-31 11:00:06,577:INFO:CMSSW:User files are
2015-10-31 11:00:06,577:INFO:CMSSW:User sandboxes are sandbox.tar.gz
2015-10-31 11:00:06,577:INFO:CMSSW:CMSSW configured for 1 cores
2015-10-31 11:00:06,577:INFO:CMSSW:Executing CMSSW step
2015-10-31 11:00:06,577:INFO:CMSSW:Runing SCRAM
2015-10-31 11:00:09,215:INFO:CMSSW:Running PRE scripts
2015-10-31 11:00:09,216:INFO:CMSSW:RUNNING SCRAM SCRIPTS
2015-10-31 11:00:09,216:INFO:CMSSW: Invoking command: python /home/boinc/CMSRun/glide_WH3qGZ/execute/dir_13481/TweakPSet.py --location=/home/boinc/CMSRun/glide_WH3qGZ/execute/dir_13481 --inputFile='job_input_file_list_418.txt' --runAndLumis='job_lumis_418.json' --firstEvent=31276 --lastEvent=31351 --firstLumi=418 --firstRun=1 --seeding=AutomaticSeeding --lheInputFiles=False --oneEventMode=0 --eventsPerLumi=100

2015-10-31 11:00:10,316:INFO:CMSSW:Executing CMSSW. args: ['/bin/bash', '/home/boinc/CMSRun/glide_WH3qGZ/execute/dir_13481/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', '']
2015-10-31 11:02:03,999:CRITICAL:CMSSW:Error running cmsRun
{'arguments': ['/bin/bash', '/home/boinc/CMSRun/glide_WH3qGZ/execute/dir_13481/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', '']}
Return code: -9

/home/boinc/CMSRun/glide_WH3qGZ/execute/dir_13481/condor_exec.exe: line 12: 13624 Killed sh ./CMSRunAnalysis.sh "$@" --oneEventMode=$CRAB_oneEventMode


It started another job and is happily processing now.
Let me know if you need anything else.
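As a side note on the --firstEvent/--lastEvent arguments in the log above: for this 75-event batch the numbers are consistent with each job taking a consecutive block of events, starting at 1 for job 1. The sketch below reproduces them. The formula is inferred from the logged values only, and the function name is hypothetical. The fact that lastEvent equals firstEvent plus 75 suggests it is an exclusive bound, but that is a guess.

```python
def event_range(job_number, events_per_job):
    """Return (first_event, last_event) as they appear in the job arguments.

    Inferred from the logged values, not from the CRAB documentation:
    job 1 starts at event 1 and each job covers events_per_job events.
    """
    first = (job_number - 1) * events_per_job + 1
    last = first + events_per_job
    return first, last


# Job 418 of the 75-event batch, as in the TweakPSet.py command line above.
print(event_range(418, 75))  # (31276, 31351)
```

The same formula also gives 14326 and 14401 for job 192, matching the arguments logged for that job elsewhere in this thread.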
m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 37
Message 1394 - Posted: 31 Oct 2015, 11:45:08 UTC - in response to Message 1392.  
Last modified: 31 Oct 2015, 12:19:17 UTC

First off, I'm no expert; on anything. Casting the mind back to the early days of t4t (as vLHC then was), there were sometimes failures when starting BOINC on boot due to the other tasks, generally HD intensive, going on at the time. For me it was AV updates and the associated downloads; Windows updates would make things even worse. Things were much improved by delaying BOINC startup after boot and by delaying the start of computation after starting BOINC. This was (still is) with slow old PCs, presumably similarly slow HDDs, normal UK indifferent ADSL speeds and older software, so it might not work for you.
BOINC is started after 15 secs (from a batch file) and computation (and hence VBox) after a further 10 seconds using the "start_delay" option in the cc_config file. All the hosts here (I think...) are set up similarly: winxp, win7 and linuxes (don't have win10).
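For anyone wanting to try the start_delay setting mentioned above, here is a minimal sketch that adds it to the text of a cc_config.xml. It assumes the usual layout of that file, with the setting placed inside an <options> element under <cc_config>. The 10-second value is the one from this post. The helper name is hypothetical, and the function works on a string so it does not touch any real file.

```python
import xml.etree.ElementTree as ET


def set_start_delay(xml_text, seconds):
    """Return cc_config XML text with <start_delay> set to the given number of seconds.

    Creates the <options> and <start_delay> elements if they are missing.
    """
    root = ET.fromstring(xml_text)
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")
    node = options.find("start_delay")
    if node is None:
        node = ET.SubElement(options, "start_delay")
    node.text = str(seconds)
    return ET.tostring(root, encoding="unicode")


# A bare-bones config used only for illustration.
existing = "<cc_config><options></options></cc_config>"
updated = set_start_delay(existing, 10)
print(updated)
```

In practice you would read the existing cc_config.xml from the BOINC data directory, pass its contents through a function like this, write it back, and restart the client (or tell it to re-read its config files). Back the file up first.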
Profile PDW
Message 1395 - Posted: 31 Oct 2015, 13:25:58 UTC - in response to Message 1394.  

Job 192 of the 75-event run got started and then finished very quickly; I can't see an error, apart from the job only lasting 2 seconds!

Start:
======== gWMS-CMSRunAnalysis.sh STARTING at Fri Oct 30 20:20:00 GMT 2015 on 246-472-2311 ========
Local time : Fri Oct 30 20:20:00 GMT 2015
Current system : Linux 246-472-2311 3.10.64-85.cernvm.x86_64 #1 SMP Fri Jan 9 09:53:29 CET 2015 x86_64 x86_64 x86_64 GNU/Linux
Arguments are -a sandbox.tar.gz --sourceURL=https://cmsweb-testbed.cern.ch/crabcache --jobNumber=192 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_192.txt --runAndLumis=job_lumis_192.json --lheInputFiles=False --firstEvent=14326 --firstLumi=192 --lastEvent=14401 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {}
SCRAM_ARCH=slc6_amd64_gcc472
======== HTCONDOR JOB SUMMARY at Fri Oct 30 20:20:00 GMT 2015 START ========
CRAB ID: 192
Execution site: Volunteer
Current hostname: 246-472-2311
Destination temp dir: /store/temp/user/ireid.17433470b3faad006e8120ad843d39e3666b08f0/CMS_at_Home/CRAB3_TTbar/151030_103550
Output files: step1.root=step1_192.root
==== HTCONDOR JOB AD CONTENTS START ====
== JOB AD: CRAB_ASOTimeout = 86400
== JOB AD: PreJobPrio1 = 1
== JOB AD: DAGNodeName = "Job192"


...End of condor_stdout:
==== HTCONDOR JOB AD CONTENTS FINISH ====
======== HTCONDOR JOB SUMMARY at Fri Oct 30 20:20:01 GMT 2015 FINISH ========
======== PROXY INFORMATION START at Fri Oct 30 20:20:01 GMT 2015 ========
Job Running time in seconds: 2
Job runtime is less than 20minutes. Sleeping 1198
subject : /O=Volunteer Computing/O=CERN/CN=PDW 246/CN=1084901593
issuer : /O=Volunteer Computing/O=CERN/CN=PDW 246
identity : /O=Volunteer Computing/O=CERN/CN=PDW 246
type : RFC3820 compliant impersonation proxy
strength : 1024
path : /home/boinc/CMSRun/glide_5Q0ADa/ticket/myproxy
timeleft : 125:11:03
key usage :
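The "Sleeping 1198" figure in that log looks like the remainder of a 20-minute minimum: 20 x 60 = 1200 seconds, minus the 2 seconds the job actually ran. A one-function sketch of that arithmetic follows. The name is hypothetical, and the idea that the wrapper pads short jobs up to the threshold is inferred from these two log lines only.

```python
MIN_RUNTIME_SECONDS = 20 * 60  # the "20minutes" threshold named in the log line


def padding_sleep(runtime_seconds, minimum=MIN_RUNTIME_SECONDS):
    """Seconds left to sleep so that the total reaches the minimum runtime."""
    return max(0, minimum - runtime_seconds)


print(padding_sleep(2))  # 1198
```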
Profile tullio
Joined: 17 Aug 15
Posts: 62
Credit: 296,695
RAC: 0
Message 1396 - Posted: 31 Oct 2015, 16:17:01 UTC

Every time Windows 10 reboots the percentage goes back to 1% after 20 hours of crunching.
Tullio
Profile ivan
Message 1397 - Posted: 31 Oct 2015, 16:35:13 UTC - in response to Message 1395.  

That's "interesting". In fact that job returned an empty log file:
[cms005@lcggwms02:~] > ls -l 151030_103550:ireid_crab_CMS_at_Home_TTbar_75ev/job_out.192*
-rw-r--r-- 1 cms005 cms005 0 Oct 30 20:32 151030_103550:ireid_crab_CMS_at_Home_TTbar_75ev/job_out.192.0.txt
-rw-r--r-- 1 cms005 cms005 101760 Oct 30 22:30 151030_103550:ireid_crab_CMS_at_Home_TTbar_75ev/job_out.192.1.txt

but the retry succeeded. Also interesting is that that machine has had a couple of errors:
151030_103550:ireid_crab_CMS_at_Home_TTbar_75ev/job_out.264.0.txt:======== gWMS-CMSRunAnalysis.sh STARTING at Fri Oct 30 23:26:45 GMT 2015 on 246-472-2311 ========
151030_103550:ireid_crab_CMS_at_Home_TTbar_75ev/job_out.264.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Sat Oct 31 02:16:56 GMT 2015 on 246-472-2311 with (short) status 0 ========
151030_103550:ireid_crab_CMS_at_Home_TTbar_75ev/job_out.342.0.txt:======== gWMS-CMSRunAnalysis.sh STARTING at Sat Oct 31 02:17:02 GMT 2015 on 246-472-2311 ========
151030_103550:ireid_crab_CMS_at_Home_TTbar_75ev/job_out.342.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Sat Oct 31 06:04:38 GMT 2015 on 246-472-2311 with (short) status 151 ========
151030_103550:ireid_crab_CMS_at_Home_TTbar_75ev/job_out.402.0.txt:======== gWMS-CMSRunAnalysis.sh STARTING at Sat Oct 31 06:09:44 GMT 2015 on 246-472-2311 ========
151030_103550:ireid_crab_CMS_at_Home_TTbar_75ev/job_out.402.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Sat Oct 31 09:02:29 GMT 2015 on 246-472-2311 with (short) status 0 ========
151030_103550:ireid_crab_CMS_at_Home_TTbar_75ev/job_out.456.0.txt:======== gWMS-CMSRunAnalysis.sh STARTING at Sat Oct 31 09:02:34 GMT 2015 on 246-472-2311 ========
151030_103550:ireid_crab_CMS_at_Home_TTbar_75ev/job_out.456.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Sat Oct 31 11:54:29 GMT 2015 on 246-472-2311 with (short) status 0 ========
151030_103550:ireid_crab_CMS_at_Home_TTbar_75ev/job_out.518.0.txt:======== gWMS-CMSRunAnalysis.sh STARTING at Sat Oct 31 12:00:58 GMT 2015 on 246-472-2311 ========
151030_103550:ireid_crab_CMS_at_Home_TTbar_75ev/job_out.518.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Sat Oct 31 15:17:45 GMT 2015 on 246-472-2311 with (short) status 0 ========
151030_103550:ireid_crab_CMS_at_Home_TTbar_75ev/job_out.76.0.txt:======== gWMS-CMSRunAnalysis.sh STARTING at Fri Oct 30 16:23:53 GMT 2015 on 246-472-2311 ========
151030_103550:ireid_crab_CMS_at_Home_TTbar_75ev/job_out.76.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Oct 30 20:19:45 GMT 2015 on 246-472-2311 with (short) status 151 ========
151030_103550:ireid_crab_CMS_at_Home_TTbar_75ev/job_out.76.1.txt:======== gWMS-CMSRunAnalysis.sh STARTING at Fri Oct 30 20:22:17 GMT 2015 on 246-472-2311 ========
151030_103550:ireid_crab_CMS_at_Home_TTbar_75ev/job_out.76.1.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Oct 30 23:20:59 GMT 2015 on 246-472-2311 with (short) status 0 ========

but look at the last two there: it successfully retried a job it had already failed. There is something suspicious about the failure, as its timing overlaps your job 192 failure. It took nearly three hours to finish:
real 174m28.773s
user 134m45.239s
sys 1m4.047s
CMSRunAnalysis.sh complete at Fri Oct 30 19:18:44 GMT 2015 with (short) exit status 0

It looks like it timed out on file transfer twice before giving up. That's starting to suggest that the result files are still too big for your ADSL connection.
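To see how long each attempt took end to end, the STARTING and FINISHING lines above can be paired up per job and retry. The sketch below does that for the two job 76 attempts. The regular expression is an assumption based on the line layout quoted in this post. The month lookup avoids locale-dependent date parsing.

```python
import re
from datetime import datetime, timedelta

MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Assumed layout, taken from the grep output quoted above.
LINE_RE = re.compile(
    r"job_out\.(?P<job>\d+)\.(?P<retry>\d+)\.txt:=+ gWMS-CMSRunAnalysis\.sh "
    r"(?P<event>STARTING|FINISHING) at \w{3} (?P<mon>\w{3}) +(?P<day>\d+) "
    r"(?P<h>\d+):(?P<m>\d+):(?P<s>\d+) GMT (?P<year>\d{4})"
)


def job_durations(lines):
    """Map (job, retry) to the wall-clock time between its STARTING and FINISHING lines."""
    starts, durations = {}, {}
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue
        key = (int(match["job"]), int(match["retry"]))
        when = datetime(int(match["year"]), MONTHS[match["mon"]], int(match["day"]),
                        int(match["h"]), int(match["m"]), int(match["s"]))
        if match["event"] == "STARTING":
            starts[key] = when
        elif key in starts:
            durations[key] = when - starts[key]
    return durations


prefix = "151030_103550:ireid_crab_CMS_at_Home_TTbar_75ev/"
lines = [
    prefix + "job_out.76.0.txt:======== gWMS-CMSRunAnalysis.sh STARTING at Fri Oct 30 16:23:53 GMT 2015 on 246-472-2311 ========",
    prefix + "job_out.76.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Oct 30 20:19:45 GMT 2015 on 246-472-2311 with (short) status 151 ========",
    prefix + "job_out.76.1.txt:======== gWMS-CMSRunAnalysis.sh STARTING at Fri Oct 30 20:22:17 GMT 2015 on 246-472-2311 ========",
    prefix + "job_out.76.1.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Oct 30 23:20:59 GMT 2015 on 246-472-2311 with (short) status 0 ========",
]

durations = job_durations(lines)
# Attempts that took noticeably longer than the roughly three hours of actual running.
slow = {key: d for key, d in durations.items() if d > timedelta(hours=3, minutes=30)}
for key, d in sorted(durations.items()):
    print(key, d)
```

For the failed first attempt this gives 3:55:52 against about 174 minutes of actual running, so roughly an hour went somewhere else. That would fit with the file-transfer time-outs suggested above, though the log lines alone do not prove it.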
Profile ivan
Message 1400 - Posted: 31 Oct 2015, 22:49:39 UTC
Last modified: 31 Oct 2015, 22:53:25 UTC

I want to run the queue down as far as possible before submitting a new batch -- not sure exactly when that will be, with a bit of luck after I get up on Sunday (we currently have 643 75-event results returned, which corresponds well with Dashboard's outstanding-job count). The 100-event batch is a mess, but I'm still not sure if that's because the output got too big for network timeouts or because of the recent network/authentication glitches at CERN. Given that it's continuing, the former seems more likely.
Profile ivan
Message 1401 - Posted: 1 Nov 2015, 3:56:43 UTC
Last modified: 1 Nov 2015, 3:59:28 UTC

New batch of 50-event jobs submitted, to compare with the last and see if recent problems were due to job size or network problems. Didn't quite drain the queue, but after staying up to watch qualifying for the Mexican GP on BBC, I'd better get some sleep now... (Currently local time == GMT.)
Profile Tern
Message 1405 - Posted: 1 Nov 2015, 18:17:36 UTC - in response to Message 1392.  

That is the exact same thing that happens to my Windows box at work. If you ever manage to find out what causes it, do let me know, because I've looked hard and long without success. What version of VirtualBox are you running? The version that BOINC serves out works, but it's old (4.3.12 IIRC).
I notice that you had got some work done before it locked up [http://boincai05.cern.ch/CMS-dev/result.php?resultid=68770]. Mine always bombs at start-up. There may be clues in that log for anyone more au fait with the workings of VirtualBox.


No new clues here. LHC message boards show quite a few of the same issue, seems to be all with Windows 10. (No resolutions, and the one "patch" recommended just killed one of my CMS tasks instead of resurrecting it.) My guess is that VBox (at least the version that comes with BOINC, which is what I'm running) and Win10 don't get along very well. Yet another reason why using VBox is, in spite of some advantages, not a great idea IMHO. The main problem of course is blocking yourselves off from using GPU computing... I realize that decision was made at a much higher CERNish level...

My Linux box meanwhile just (slowly) returned its first CMS task, no problem.

Maybe silly question, maybe I misunderstand the project (there's not a lot of info to be found, which is expected at this stage). You have "tasks" that run on the host for 24 hours. That "task" runs "jobs", which are much shorter in duration (loaded with the task? retrieved as available?). At the end of each "job", nothing is sent back to you. At the end of the "task", you get all the data.

Meanwhile, BOINC runs "BOINCmgr", that runs on the host 24/7. That program runs "tasks", which are much shorter in duration. These are retrieved from the project as available. At the end of each "task", the data is sent back to the project. Unless I'm missing something, your "task/job" split is simply replicating what BOINC already does. I don't understand why each "job" is not a single "task". The only advantage I see is in getting your results in blocks of 50/75/100/whatever instead of one at a time. And you create the must-run-non-stop issue, and the what-about-credit issue, and the must-remain-reliable-for-24-hours issue, and... Would it not make a lot more sense to bundle "x" jobs (whether that be 1 or 100) into a task and have the task exit when the jobs are finished? If there is some reasoning behind your approach that I'm missing, great - but that reasoning is exactly what should be "front and center" on the project website! Otherwise, when you go "live", you're going to have a whole lot of people just as confused as I am.

I personally use "credits" just to balance my project workload. Bitcoin has .01% of my current CPU time (funding BOINCStats) while CMS has 15%. This may just reward projects that don't give much credit, but there are plenty of BOINC "credit chasers" who do things just the other way around. I realize it's too early to be worried VERY much about how you'll handle credits, but doing away with the 24-hour tasks would sure simplify things. I'm afraid that to be "competitive" in the credit arena, you'll have to give a bazillion credits per task and just become another BitcoinUtopia, in order to get the number of hosts you want. The average "set and forget" BOINC user is not going to return very many good results with your current setup, will get very little credit, and will quickly move on to other projects.

I'm sure all of these things have been considered - but since I can't easily locate the data, I figured I'd throw in my 2 credits worth! :-)
Profile Tern
Message 1406 - Posted: 1 Nov 2015, 18:24:38 UTC

One other thought - much shorter "tasks" would allow you to lift the one-task-per-host limit and get you a lot more cores executing at any given time. BOINC is already swapping out your task on my hosts as other projects get higher priority, so there's no gain to one-task since you can't enforce it remaining in a running state... I'm playing games with my hosts, suspending projects and such, to try to keep CMS running and sending returns as close to every-24 as I can. The "average BOINCer" is more likely to return one task a week, if you're lucky.
ID: 1406
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,972,027
RAC: 2,920
Message 1407 - Posted: 1 Nov 2015, 19:50:16 UTC - in response to Message 1405.  

Laurence and Ben could give you the best answers here; I came relatively late to the project, when most of the decisions had been made (for better or for worse). AFAIK, the decision to have long tasks running smaller jobs was mainly made to simplify accounting -- you get so much per task whatever jobs get run. There is also the consideration of trying to unify the CMS experiments; we will eventually move to a centralised system where you can choose to run jobs for vLHC, CMS, perhaps ATLAS and maybe LHCb and ALICE. In fact vLHC users can already choose to run CMS too.
I believe it would in fact be easier to have our own run management scheme, but the decision was made to tap into the BOINC community, so that came with its own constraints.
Bringing all the issues you raise onto one project or Twiki page is on my "to do" list, but in all honesty my involvement was only supposed to be "one or two hours a week", but it's turned into more like 24/7. I do have other work to do, and will have to give some of that priority in the near future. Meanwhile I'll try to keep the jobs flowing, and answer what questions I can, when I can.
ID: 1407
Profile Tern
Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1408 - Posted: 1 Nov 2015, 21:17:35 UTC - in response to Message 1407.  

Bringing all the issues you raise onto one project or Twiki page is on my "to do" list, but in all honesty my involvement was only supposed to be "one or two hours a week", but it's turned into more like 24/7.


Understand completely. "Hey Bill, the family business needs to upgrade our computers. Your background is in IT..." (Mainframes and Macs, NOT Windows or Linux!) And here I sit with a half-dozen partially built systems a month later... building racks today! :-/ The fact that I've never dealt with GPUs before and half of ours keep overheating isn't helping. At least BOINC is pointing out the problem before it gets to a user who will never look at the temperatures. The "low end" R9-280s seem just as bad as the GTX 780Ti's too. Nobody wants to pay for _new_ GPUs. Or water cooling, although I insisted on AIOs for the CPUs at least. I'm still in the "swap that GPU into that case and see if it runs any better" stage, (and throwing in more case fans) as any of the GPUs would be fine for our purposes in any of the systems. All that matters is having the right CPU horsepower in the right place. If I wind up with a 280 in the 5820K box, and a 780Ti in an AMD, it won't matter. They're all overpowered for what we do, which may make all this worrying about temperatures unnecessary. Suspect somebody bought the 780s to play games after hours...

I'm sure _you_ get what I'm saying about projects and credits - just look at your own sig and compare your SETI/Einstein to all the CERN stuff. !!! I'm happy to help with whatever CPU time and message board time I can give; as I've said, I remember when Rosetta, Predictor, and Sztaki were starting up. And when SETI moved to BOINC, for that matter. Talk about confusion due to poor communication! The communication to the volunteers is where MOST projects fall down badly. Rosetta is probably the best-communicating project out there (or used to be, haven't looked lately, haven't needed to). It's a tossup on who is the worst - figuring out how to actually do _anything_ with Bitcoin is almost impossible because the "miners" speak their own jargon, while the CERN projects have great LOOKING websites that you can't find anything on... and when you do, it's a link out to Oracle's VBox support site, which is even worse. Sigh.

I'll keep crunching CMS tasks in the hopes that they'll help out somehow. I signed on with you instead of any of the other CERN projects _because_ you were a "startup". If anybody has a functioning Mac on CMS and could go to the Q&A section to give me a hand, I'll add a Mac or two to the farmlet.
ID: 1408
Profile Tern
Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1410 - Posted: 2 Nov 2015, 10:16:51 UTC

NEW problem! Task 68947 (user 306, work unit 64089, host 748) ran its 22 hours and completed, 15 hrs CPU time... but is showing on the website as "error while computing" and "194 (0xc2) EXIT_ABORTED_BY_CLIENT". There are some strange lock file errors in the log, but they're a half-hour before it called boinc_finish and then for some reason fired VBox up again?!?
ID: 1410
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 330,145
RAC: 94
Message 1411 - Posted: 2 Nov 2015, 10:39:35 UTC - in response to Message 1408.  

Hi Bill,

The LHC@home projects use virtualization as a way to avoid the need to port the application software to Windows, which provides around 85% of the volunteer resources. BOINC is used as it is trusted by the volunteers and already has an established community. We have worked closely with the BOINC team over the past few years to ensure that BOINC can support projects that require virtualization.

One of the issues is that BOINC credit is only assigned at the end of a task, and as volunteers who care about credit like to see this increase regularly, the VM is restarted every 24 hours so that the credit can be assigned. We could increase this time, but it is also good to restart the VM periodically, and it is necessary in the case of updates. We realize that the use of VMs is an extra barrier to entry and will result in fewer resources, but it is our only option. We really hope that we can make this easier in the future. For example, check out the CERN Public Computing Challenge, which enables you to control that VM via a Web browser.

https://test4theory.cern.ch/challenge/

As you pointed out, each task runs multiple jobs but the data is uploaded after each job. The reason why a task does not run a single job is to improve efficiency, as there is an overhead of booting the OS. Also, as we use a custom HTTP-based caching file system called CVMFS, there is a network overhead associated with the first job and the subsequent ones benefit. This approach is identical to how CMS is running on standard cloud resources such as OpenStack. In fact we have now started calling this project 'the volunteer cloud'.
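
To put rough numbers on the overhead argument above, here is a minimal Python sketch. It assumes the OS boot and CVMFS cache warm-up costs are paid once per task and that every job takes the same time. The function name and all the figures are hypothetical, chosen only for illustration; they are not measurements from this project.

```python
def task_efficiency(n_jobs, job_minutes, boot_minutes, cache_warmup_minutes):
    """Fraction of slot time spent running jobs rather than on overhead."""
    useful = n_jobs * job_minutes
    # Assumption: boot and cache warm-up are paid once per task, not per job.
    overhead = boot_minutes + cache_warmup_minutes
    return useful / (useful + overhead)


# Hypothetical figures: 60-minute jobs, 10 minutes each of boot and warm-up.
for n in (1, 5, 20):
    eff = task_efficiency(n, job_minutes=60, boot_minutes=10,
                          cache_warmup_minutes=10)
    print(f"{n:2d} jobs per task: {eff:.1%} of slot time is useful work")
```

Under these assumptions the useful fraction rises with the number of jobs per task, which is the trade-off against shorter tasks discussed in this thread.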

GPU computing is on our radar and something that may be important for us in the future. The future support for GPUs in VBox is unclear and something that we would need to clarify if we head in this direction.

We understand that the most important aspect of volunteer computing is the volunteers themselves who are donating their resources and hence the importance of good communication. This project is currently in development and hence not at the level of maturity of projects such as Seti/Einstein/Rosetta etc. but we hope to get there. If there are aspects that you think we can improve, please let us know.

Thanks,

Laurence
ID: 1411
Profile Tern
Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1418 - Posted: 2 Nov 2015, 18:54:32 UTC - in response to Message 1411.  

One of the issues is that BOINC credit is only assigned at the end of a task, and as volunteers who care about credit like to see this increase regularly, the VM is restarted every 24 hours so that the credit can be assigned. We could increase this time, but it is also good to restart the VM periodically, and it is necessary in the case of updates.


I think the biggest problem you're going to have when you go live is that 24-hour choice. Your reply DID clear up some of my questions, thanks for that! IMHO, if you dropped from 24 to 12, or better yet, 6 or 4 hours, you'd solve a lot of problems - probably worth the overhead cost. On "normal" PCs, it is very likely that your task is going to be swapped out, often, and take a whole lot more than 24 hours "wall time". That vastly increases the chances of something going wrong - reboots, power failures, user error, network outages, systems turned off for a few days... Failed tasks are not a big deal for the project - just resend them. But for the user, they are very frustrating. Realize that when you say "24 hours", you aren't talking wall time, but "slot time" executing on the host - with CPU time within that being dependent on the number of jobs executed. 24 hours is a VERY long time compared to other BOINC projects. I'm guessing, but I'd say the average is much less than one hour per task! Yoyo is the only other one _I'm_ running that regularly ties up my slower systems for day after day. It's quite annoying.

We realize that the use of VMs is an extra barrier to entry and will result in fewer resources, but it is our only option. We really hope that we can make this easier in the future. For example, check out the CERN Public Computing Challenge, which enables you to control that VM via a Web browser.

https://test4theory.cern.ch/challenge/


Looked, downloaded, it's running VBox on my Mac no problem. Unlike BOINC! :-/ I wouldn't say VM is your _only_ option... I realize the manpower involved, but there are development environments that can create cross-platform applications with relative ease - you don't have to rewrite everything for each OS, especially since you have no UI to worry about with BOINC. I'd hate to lose the Mac option, but realistically BOINC is large-majority Windows; I don't think you're going to get much power from phones and tablets and such. So, if VM turns out to not work well, you're only looking at two compilations for each source change to get Linux and Windows. If you do go GPU, you're looking at multiple versions anyway...


As you pointed out, each task runs multiple jobs but the data is uploaded after each job. The reason why a task does not run a single job is to improve efficiency, as there is an overhead of booting the OS. Also, as we use a custom HTTP-based caching file system called CVMFS, there is a network overhead associated with the first job and the subsequent ones benefit. This approach is identical to how CMS is running on standard cloud resources such as OpenStack. In fact we have now started calling this project 'the volunteer cloud'.


If data is uploaded after each job, you have a REAL credit issue. Simply that you don't award partial credit for tasks that do "n" jobs before failing! You've used up the host's CPU time without "payment"... The way most volunteers (I think) look at BOINC is that you're "renting" their CPU time and paying in "credits". Sure, some will donate their time to a project of interest without caring about credit, but many (most?) do like to see those credits climb, at a _reasonable_ rate (ahem, Bitcoin, ahem...). You're going to have to come up with some way to give a "base" amount for "slot rent", plus an amount for each job done. (Wall time plus CPU time somehow.) The more jobs/task, the larger this issue is going to become. Credit though is not the important point right now, getting tasks to run successfully to completion is, and I'm having problems with that even though I'm micromanaging everything right now.
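
The "slot rent plus per-job" idea could be sketched roughly as below. This is only an illustration of the suggestion, not how BOINC or this project assigns credit; the class, the function and both rates are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class TaskReport:
    slot_hours: float    # time the task occupied a BOINC slot on the host
    jobs_completed: int  # jobs whose output was uploaded before the task ended


def credit_for(report, rent_per_hour=2.0, per_job=10.0):
    """Base 'slot rent' plus a per-job amount (both rates are hypothetical).

    The task's exit status is deliberately not an input, so a task that
    fails after uploading some jobs still earns credit for them.
    """
    return report.slot_hours * rent_per_hour + report.jobs_completed * per_job


# A task that died after 12 slot hours with 7 jobs already uploaded.
failed_task = TaskReport(slot_hours=12.0, jobs_completed=7)
print(credit_for(failed_task))
```

The point of the design is that credit depends only on what the host actually delivered, so a late failure no longer wipes out the whole task.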

Again I'm unclear on whether a task has all its "jobs" at initiation or not... if it does, then you gain nothing by running on and on after the jobs are done? Simply exiting when complete might solve a lot of issues. Forget "24 hours" and just track "how many jobs per task" give the best result, balancing overhead with returned-valids. Would eliminate the "last job cut off" issue as well, when the time is up. I would think it would make things easier on your end as well!
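
A small sketch of the difference between the two policies, under simplifying assumptions: all jobs take the same time, and a job still running when the deadline hits is discarded. The function names and the numbers are hypothetical.

```python
def fixed_deadline(deadline_min, boot_min, job_min):
    """Jobs finished and minutes lost when a task is cut off at a deadline."""
    available = deadline_min - boot_min
    done = available // job_min
    lost = available - done * job_min  # time spent on the cut-off last job
    return done, lost


def fixed_batch(n_jobs, boot_min, job_min):
    """Slot minutes used when the task exits as soon as n_jobs are finished."""
    return boot_min + n_jobs * job_min


# Hypothetical figures: 24-hour deadline, 20-minute boot, 90-minute jobs.
done, lost = fixed_deadline(deadline_min=24 * 60, boot_min=20, job_min=90)
print(f"deadline policy: {done} jobs done, {lost} minutes lost")
print(f"batch policy: {fixed_batch(done, 20, 90)} slot minutes for {done} jobs")
```

With these made-up figures the batch policy finishes the same number of jobs in less slot time, because nothing is spent on a job that gets cut off.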

And, you could offer say, three "applications" - "short", "medium", and "long", that the users could pick from under preferences. That'd be something like 50/75/100 jobs/task. Always nice to give the user options.

GPU computing is on our radar and something that may be important for us in the future. The future support for GPUs in VBox is unclear and something that we would need to clarify if we head in this direction.


GPU computing is of course optional for any project. I don't know the structure of your algorithms, so I don't know if it would give you a big boost or not. Thus this is something that only you can decide. The point of the project is, at bottom, to produce useful work for CMS, not just to keep credit-chasers happy.

We understand that the most important aspect of volunteer computing is the volunteers themselves who are donating their resources and hence the importance of good communication. This project is currently in development and hence not at the level of maturity of projects such as Seti/Einstein/Rosetta etc. but we hope to get there. If there are aspects that you think we can improve, please let us know.


Website updates! Take what has come up in the forums as issues (VBox, credit, task-vs-jobs, etc.) and explain those things thoroughly, on or near the home page. Users shouldn't have to wade through the forums to find things like your post above to learn what is going on. This will VASTLY reduce the amount of time you have to spend answering repetitive questions in the forums.

Thanks!!
ID: 1418
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 330,145
RAC: 94
Message 1420 - Posted: 2 Nov 2015, 22:17:54 UTC - in response to Message 1418.  

Hi Bill,

Ideally we would like to assign credit in terms of jobs processed, or calculate it ourselves using a standardized metric based on normalized CPU time, but we use the BOINC credit system as it is what the volunteers are familiar with.
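
As an illustration of what a metric based on normalized CPU time could look like, here is a minimal sketch. It is not BOINC's credit formula and not something this project has implemented; the benchmark scores, the reference value and the rate are all hypothetical.

```python
def normalized_credit(cpu_hours, host_score,
                      reference_score=10.0, credit_per_ref_hour=5.0):
    """Credit from CPU time scaled to a reference machine.

    host_score and reference_score stand for any benchmark in which a
    faster CPU scores higher; every value here is hypothetical.
    """
    # Convert the host's CPU hours into equivalent reference-machine hours.
    normalized_hours = cpu_hours * host_score / reference_score
    return normalized_hours * credit_per_ref_hour


# 15 CPU hours on a host twice as fast as the reference machine.
print(normalized_credit(cpu_hours=15, host_score=20.0))
```

Scaling by a benchmark score means two hosts that do the same amount of work get similar credit even if one of them needs more CPU hours to do it.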

Thanks for your suggestion of reducing the time for the task. Our decision to go with 24 hours was based on the experience with the Test4Theory project. In order to reduce the task time we would need to eliminate as much of the start-up overhead as possible.

Porting the code base to Windows was tried initially and is not an option. It is not just the initial porting effort but also the challenge of ensuring that it is in sync. As the installed base of PCs decreases we may have to start looking at tablets and mobiles, but by that time they hopefully will have increased in power.

When we move to production the main entry point will be the LHC@home portal.

http://lhcathome.web.cern.ch/

Please let me know if you have any feedback for this. Just use the contact-us form and I will look out for the message.

http://lhcathome.web.cern.ch/contact-us

Regards,

Laurence
ID: 1420


©2024 CERN