61) Message boards : Number crunching : issue of the day (Message 1479)
Posted 18 Nov 2015 by Profile Tern
Post:
Linux host (Ubuntu) - after today's OS update and restart:

CMS "Waiting to run (Scheduler wait: Please update/recompile VirtualBox Kernel Drivers.)"

Downloading VBox 5.0.10 now (was on 5.0.8). Sigh.
62) Message boards : Number crunching : Expect errors eventually (Message 1477)
Posted 16 Nov 2015 by Profile Tern
Post:
Now oddly enough, vboxheadless shows to be running under boinc userid, but there is no VM shown in the VBox window under _my_ userid. Windows DOES show boinc tasks there. I've been afraid to quit out of VBox, even though it seems to be two different processes, just in case it would kill the CMS task. Maybe on the next one.

This is expected behaviour, when you've installed BOINC in protected execution mode (service mode).


Unless that's the default (only choice?) on Mac, I don't think I selected service mode. Know I was offered the choice on Windows and didn't take it. Of course, the Mac is the only machine I have that isn't just for BOINC, Photoshop, media center, or NAS RAID storage, so the Windows boxes are all single-user. Shrug. Doesn't bother me either way, I'm NOT a fan of VBox and will probably not run it once CMS goes to production mode. Didn't know what I was getting into when I agreed to run it here. My policy is, to get me to sign up, a project MUST have a Mac app (with rare exceptions). I hadn't dealt with the "Mac but through VBox" issue before, but until it becomes the only choice, I'll modify my policy to be "Mac native" app available.

Comes from the years I spent as a Mac developer. :-)
63) Message boards : Number crunching : Expect errors eventually (Message 1476)
Posted 16 Nov 2015 by Profile Tern
Post:
Good to hear. I tried 5.0.10 on my work Windows7 box today, and still got the dreaded message after 10 minutes that it failed to enter a running state in a timely manner. I'm starting to wonder if it's the Kaspersky anti-virus, I've got as much of its "real time" checking turned off as possible, but it still manages to give me grief from time to time.


Not Kaspersky, as I'm not running it. Or any non-built-in antivirus, since these hosts are only connected to the net for BOINC, do no web browsing. That's what the Mac is for! :-)
64) Message boards : Number crunching : Expect errors eventually (Message 1473)
Posted 16 Nov 2015 by Profile Tern
Post:
Okay, can't prove it yet, but VirtualBox 5.0.10 SEEMS to have solved the hypervisor problem. CMS ran on all my Windows hosts overnight without any timeouts waiting for me this morning. :-)

And on the Mac... IT GOT WORK! Don't know for sure if it was the change to 5.0.10, or the fact that I had VBox actually RUNNING overnight when CMS decided to fetch work, but regardless, there is a task actually running on the Mac finally. (#1!) Now oddly enough, vboxheadless shows to be running under boinc userid, but there is no VM shown in the VBox window under _my_ userid. Windows DOES show boinc tasks there. I've been afraid to quit out of VBox, even though it seems to be two different processes, just in case it would kill the CMS task. Maybe on the next one.
65) Message boards : Number crunching : Expect errors eventually (Message 1467)
Posted 16 Nov 2015 by Profile Tern
Post:
Correction, was running VBox 4.3.12 or w/e that came w/ BOINC on the Windows boxes; after having to restart two or three times each on three hosts in a couple of hours, went ahead and downloaded 5.0.10 on all of them. Only been running a half hour so far, but none have failed yet, which is an improvement...

Trying other ideas on the Mac side, but of course it's "not the highest priority project" so it doesn't want work right now! :-/
66) Message boards : Number crunching : Expect errors eventually (Message 1463)
Posted 15 Nov 2015 by Profile Tern
Post:

The latest Vboxwrapper should resolve that issue.

See: https://github.com/BOINC/boinc/releases/tag/vboxwrapper%2F26178

----- Rom


Except that update was released on CMS Oct 22, the tasks I currently have running were sent Nov 11-14, gave error message today, and the Mac last gave the 'not installed' message this morning.

Win10 Home, latest BOINC. Mac OS X 10.11.1 "El Capitan", ditto. VBox 5.0.8 on all - just got notice of 5.0.10 for Mac, downloading it now.

If nothing else, can somebody change the Hypervisor timeout value from 24 hours to something more reasonable??? Or do I need to reset the project on all hosts to get the new wrapper, or what?
67) Message boards : Number crunching : Expect errors eventually (Message 1461)
Posted 15 Nov 2015 by Profile Tern
Post:
All resumed! :-)

edit: Except half of them say "Hypervisor failed to enter an online state in a timely manner" and are sleeping for a day, and of course the Mac STILL says "Virtualbox not installed"...
68) Message boards : Number crunching : Expect errors eventually (Message 1459)
Posted 15 Nov 2015 by Profile Tern
Post:
Have suspended tasks in progress - please let us know when we should resume!
69) Message boards : Number crunching : issue of the day (Message 1434)
Posted 7 Nov 2015 by Profile Tern
Post:
what time-zone are you in, looks to be UTC +6?


Believe so - Dallas/Chicago Central Time.

Was it doing other work at the time?


DENIS@Home, Malariacontrol, CSG probably; have CMS set at 50% to minimize swapping, but 1-task limit keeps it down to 1 core per host. That box has 4 (slow) cores, AMD APU.

[quote]It looks to me like the /cvmfs file system in your VM might have become corrupted, possibly due to a network interruption during an update (cvmfs is a read-only file system used by CMS to distribute information; it is locally cached and updated as files are read and synched something like rsync does). I'd suggest you do a project reset on that host to get a clean VM image.[/quote]

Will do! Thanks.
70) Message boards : Number crunching : issue of the day (Message 1431)
Posted 7 Nov 2015 by Profile Tern
Post:
"VirtualBox exited unexpectedly" on Linux system error message; CMS BOINC task showed in "Running" state with % Complete 100%, time remaining "-". Rebooted to get clean VBox start, no change, still 100% complete but running. Gave it another couple hours then aborted it. I see nothing odd in stderr - looks like it finished okay! (26 hours run time, no credit). Problem when it shut down VBox? Big issue if a task NEVER completes...

User 306, task 69452, wu 64594, host 782.
71) Message boards : Number crunching : issue of the day (Message 1426)
Posted 5 Nov 2015 by Profile Tern
Post:
Never mind, that message was from 10/22, about wrapper 26178. Already knew that didn't change anything.
72) Message boards : Number crunching : issue of the day (Message 1425)
Posted 5 Nov 2015 by Profile Tern
Post:
Got project message that new wrapper was in place to fix "VirtualBox not running" on Mac (I think, it was talking about Mac path changes anyway). Still getting same error here - unless there are still old tasks in the queue?

Silly question... since the task is not being downloaded and failing, but is never being downloaded at all, how is a wrapper change going to fix the problem?
73) Message boards : Macintosh : BOINC fine, but can't get CMS to work. (Message 1424)
Posted 5 Nov 2015 by Profile Tern
Post:
Just got project message of new wrapper that fixes problem - but trying to get a task still gives same error.

Silly question maybe... since the task isn't getting here and failing, but instead is never being sent at all, how is fixing the wrapper going to change anything?
74) Message boards : Macintosh : BOINC fine, but can't get CMS to work. (Message 1423)
Posted 4 Nov 2015 by Profile Tern
Post:
Running VBox 5.0.8 now, whatever current wrapper is, still having same problem. Can't get any work as "VirtualBox is not running". Found similar case on another board, Rom 'fixed' BOINC, but I'm running the fixed version...
75) Message boards : Number crunching : Expect errors eventually (Message 1418)
Posted 2 Nov 2015 by Profile Tern
Post:
One of the issues is that BOINC credit is only assigned at the end of a task, and as volunteers who care about credit like to see this increase regularly, the VM is restarted every 24 hours so that the credit can be assigned. We could increase this time but it is also good to periodically restart the VM, and it is necessary in the case of updates.


I think the biggest problem you're going to have when you go live is that 24-hour choice. Your reply DID clear up some of my questions, thanks for that! IMHO, if you dropped from 24 to 12, or better yet, 6 or 4 hours, you'd solve a lot of problems - probably worth the overhead cost. On "normal" PCs, it is very likely that your task is going to be swapped out, often, and take a whole lot more than 24 hours "wall time". That vastly increases the chances of something going wrong - reboots, power failures, user error, network outages, systems turned off for a few days... Failed tasks are not a big deal for the project - just resend them. But for the user, they are very frustrating. Realize that when you say "24 hours", you aren't talking wall time, but "slot time" executing on the host - with CPU time within that being dependent on the number of jobs executed. 24 hours is a VERY long time compared to other BOINC projects. I'm guessing, but I'd say the average is much less than one hour per task! Yoyo is the only other one _I'm_ running that regularly ties up my slower systems for day after day. It's quite annoying.
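
To put rough numbers on the "longer task, more chances of something going wrong" point, here is a minimal sketch. It assumes (my assumption, not anything measured on CMS) that each wall-clock hour carries the same independent chance of an interruption, and that any interruption kills the task. That overstates the risk, since many interruptions are survivable, but it shows how quickly the odds fall as the task length grows. The function name and the 1% figure are hypothetical.

```python
def completion_probability(wall_hours, hourly_interrupt_prob):
    """Chance a task survives `wall_hours` with no interruption.

    Assumes independent, identical risk in every wall-clock hour
    (a simplification chosen for illustration only).
    """
    if not 0.0 <= hourly_interrupt_prob <= 1.0:
        raise ValueError("hourly_interrupt_prob must be between 0 and 1")
    if wall_hours < 0:
        raise ValueError("wall_hours must be non-negative")
    return (1.0 - hourly_interrupt_prob) ** wall_hours


if __name__ == "__main__":
    # Hypothetical 1% chance per hour of a reboot, outage, etc.
    for hours in (4, 6, 12, 24, 72):
        print(hours, round(completion_probability(hours, 0.01), 3))
```

Whatever the real per-hour risk is, the shape is the same: the survival chance shrinks geometrically with wall time, so a task swapped out often enough to take three days of wall time is far more exposed than a 4-hour one.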

We realize that the use of VMs is an extra barrier to entry and will result in less resources but it is our only option. We really hope that we can make this easier in the future. For example, checkout the CERN Public Computing Challenge that enables you to control that VM via a Web browser.

https://test4theory.cern.ch/challenge/


Looked, downloaded, it's running VBox on my Mac no problem. Unlike BOINC! :-/ I wouldn't say VM is your _only_ option... I realize the manpower involved, but there are development environments that can create cross-platform applications with relative ease - you don't have to rewrite everything for each OS, especially since you have no UI to worry about with BOINC. I'd hate to lose the Mac option, but realistically BOINC is large-majority Windows; I don't think you're going to get much power from phones and tablets and such. So, if VM turns out to not work well, you're only looking at two compilations for each source change to get Linux and Windows. If you do go GPU, you're looking at multiple versions anyway...


As you pointed out, each task runs multiple jobs but the data is uploaded after each job. The reason why a task does not run a single job is to improve efficiency as there is an overhead of booting the OS. Also as we use a custom HTTP-based caching file system called CVMFS, there is a network overhead associated with the first job and the subsequent ones benefit. This approach is identical to how CMS is running on standard cloud resources such as OpenStack. In fact we have now started calling this project 'the volunteer cloud'.


If data is uploaded after each job, you have a REAL credit issue. Simply that you don't award partial credit for tasks that do "n" jobs before failing! You've used up the host's CPU time without "payment"... The way most volunteers (I think) look at BOINC is that you're "renting" their CPU time and paying in "credits". Sure, some will donate their time to a project of interest without caring about credit, but many (most?) do like to see those credits climb, at a _reasonable_ rate (ahem, Bitcoin, ahem...). You're going to have to come up with some way to give a "base" amount for "slot rent", plus an amount for each job done. (Wall time plus CPU time somehow.) The more jobs/task, the larger this issue is going to become. Credit though is not the important point right now, getting tasks to run successfully to completion is, and I'm having problems with that even though I'm micromanaging everything right now.
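
The "slot rent plus per-job" idea could look something like the sketch below. This is only an illustration of the shape of the formula, not how BOINC or CMS actually grants credit; the function name and both rates are made-up placeholders.

```python
def task_credit(wall_hours, jobs_done, slot_rate=1.0, job_rate=0.5):
    """Credit = rent for the time the slot was held + a bonus per finished job.

    slot_rate and job_rate are hypothetical values chosen for illustration.
    """
    if wall_hours < 0 or jobs_done < 0:
        raise ValueError("wall_hours and jobs_done must be non-negative")
    return slot_rate * wall_hours + job_rate * jobs_done


if __name__ == "__main__":
    # A task that dies after 10 hours with 30 jobs uploaded still earns something.
    print(task_credit(10, 30))  # 25.0
```

The point of the two terms is that a task which fails partway through still gets paid for the jobs it already uploaded and for the time it held the slot.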

Again I'm unclear on whether a task has all its "jobs" at initiation or not... if it does, then you gain nothing by running on and on after the jobs are done? Simply exiting when complete might solve a lot of issues. Forget "24 hours" and just track "how many jobs per task" give the best result, balancing overhead with returned-valids. Would eliminate the "last job cut off" issue as well, when the time is up. I would think it would make things easier on your end as well!

And, you could offer say, three "applications" - "short", "medium", and "long", that the users could pick from under preferences. That'd be something like 50/75/100 jobs/task. Always nice to give the user options.
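
A minimal sketch of what I mean by counting jobs instead of watching the clock. Everything here is hypothetical (the tier table, `run_task`, and the two callbacks stand in for whatever the real wrapper does); the only thing it is meant to show is a task that exits as soon as its job quota is done or the queue runs dry.

```python
# Hypothetical job quotas for the three suggested "applications".
JOBS_PER_TASK = {"short": 50, "medium": 75, "long": 100}


def run_task(tier, next_job, run_job):
    """Run up to JOBS_PER_TASK[tier] jobs, then exit; no fixed timer involved.

    next_job() returns the next job, or None when nothing is left.
    run_job(job) executes one job. Both are placeholders.
    """
    limit = JOBS_PER_TASK[tier]
    completed = 0
    while completed < limit:
        job = next_job()
        if job is None:  # queue is empty: stop early rather than idle
            break
        run_job(job)
        completed += 1
    return completed


if __name__ == "__main__":
    queue = iter(range(60))  # pretend 60 jobs are waiting
    done = run_task("short", lambda: next(queue, None), lambda job: None)
    print(done)  # 50
```

With a quota like this there is no "last job cut off" case, because the task never starts a job it does not intend to finish.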

GPU computing is on our radar and something that may be important for us in the future. The future support for GPUs in VBox is unclear and something that we would need to clarify if we head in this direction.


GPU computing is of course optional for any project. I don't know the structure of your algorithms, so I don't know if it would give you a big boost or not. Thus this is something that only you can decide. The point of the project is, at bottom, to produce useful work for CMS, not just to keep credit-chasers happy.

We understand that the most important aspect of volunteer computing is the volunteers themselves who are donating their resources and hence the importance of good communication. This project is currently in development and hence not at the level of maturity of projects such as Seti/Einstein/Rosetta etc. but we hope to get there. If there are aspects that you think we can improve, please let us know.


Website updates! Take what has come up in the forums as issues (VBox, credit, task-vs-jobs, etc.) and explain those things thoroughly, on or near the home page. Users shouldn't have to wade through the forums to find things like your post above to learn what is going on. This will VASTLY reduce the amount of time you have to spend answering repetitive questions in the forums.

Thanks!!
76) Message boards : Macintosh : BOINC fine, but can't get CMS to work. (Message 1417)
Posted 2 Nov 2015 by Profile Tern
Post:
Thanks for the try! Still same on Mac, have CMS working on Windows and Linux just fine. Urg!
77) Message boards : Number crunching : Expect errors eventually (Message 1410)
Posted 2 Nov 2015 by Profile Tern
Post:
NEW problem! Task 68947 (user 306, work unit 64089, host 748) ran its 22 hours and completed, 15 hrs CPU time... but is showing on the website as "error while computing" and "194 (0xc2) EXIT_ABORTED_BY_CLIENT". There are some strange lock file errors in the log, but they're a half-hour before it called boinc_finish and then for some reason fired VBox up again?!?
78) Message boards : Number crunching : Expect errors eventually (Message 1408)
Posted 1 Nov 2015 by Profile Tern
Post:
Bringing all the issues you raise onto one project or Twiki page is on my "to do" list, but in all honesty my involvement was only supposed to be "one or two hours a week", but it's turned into more like 24/7.


Understand completely. "Hey Bill, the family business needs to upgrade our computers. Your background is in IT..." (Mainframes and Macs, NOT Windows or Linux!) And here I sit with a half-dozen partially built systems a month later... building racks today! :-/ The fact that I've never dealt with GPUs before and half of ours keep overheating isn't helping. At least BOINC is pointing out the problem before it gets to a user who will never look at the temperatures. The "low end" R9-280s seem just as bad as the GTX 780Ti's too. Nobody wants to pay for _new_ GPUs. Or water cooling, although I insisted on AIOs for the CPUs at least. I'm still in the "swap that GPU into that case and see if it runs any better" stage, (and throwing in more case fans) as any of the GPUs would be fine for our purposes in any of the systems. All that matters is having the right CPU horsepower in the right place. If I wind up with a 280 in the 5820K box, and a 780Ti in an AMD, it won't matter. They're all overpowered for what we do, which may make all this worrying about temperatures unnecessary. Suspect somebody bought the 780s to play games after hours...

I'm sure _you_ get what I'm saying about projects and credits - just look at your own sig and compare your SETI/Einstein to all the CERN stuff. !!! I'm happy to help with whatever CPU time and message board time I can give; as I've said, I remember when Rosetta, Predictor, and Sztaki were starting up. And when SETI moved to BOINC, for that matter. Talk about confusion due to poor communication! The communication to the volunteers is where MOST projects fall down badly. Rosetta is probably the best-communicating project out there (or used to be, haven't looked lately, haven't needed to). It's a tossup on who is the worst - figuring out how to actually do _anything_ with Bitcoin is almost impossible because the "miners" speak their own jargon, while the CERN projects have great LOOKING websites that you can't find anything on... and when you do, it's a link out to Oracle's VBox support site, which is even worse. Sigh.

I'll keep crunching CMS tasks in the hopes that they'll help out somehow. I signed on with you instead of any of the other CERN projects _because_ you were a "startup". If anybody has a functioning Mac on CMS and could go to the Q&A section to give me a hand, I'll add a Mac or two to the farmlet.
79) Message boards : Number crunching : Expect errors eventually (Message 1406)
Posted 1 Nov 2015 by Profile Tern
Post:
One other thought - much shorter "tasks" would allow you to lift the one-task-per-host limit and get you a lot more cores executing at any given time. BOINC is already swapping out your task on my hosts as other projects get higher priority, so there's no gain to one-task since you can't enforce it remaining in a running state... I'm playing games with my hosts, suspending projects and such, to try to keep CMS running and sending returns as close to every-24 as I can. The "average BOINCer" is more likely to return one task a week, if you're lucky.
80) Message boards : Number crunching : Expect errors eventually (Message 1405)
Posted 1 Nov 2015 by Profile Tern
Post:
That is the exact same thing that happens to my Windows box at work. If you ever manage to find out what causes it, do let me know, because I've looked hard and long without success. What version of VirtualBox are you running? The version that BOINC serves out works, but it's old (4.3.12 IIRC).
I notice that you had got some work done before it locked up [http://boincai05.cern.ch/CMS-dev/result.php?resultid=68770]. Mine always bombs at start-up. There may be clues in that log for anyone more au fait with the workings of VirtualBox.


No new clues here. LHC message boards show quite a few of the same issue, seems to be all with Windows 10. (No resolutions, and the one "patch" recommended just killed one of my CMS tasks instead of resurrecting it.) My guess is that VBox (at least the version that comes with BOINC, which is what I'm running) and Win10 don't get along very well. Yet another reason why using VBox is, in spite of some advantages, not a great idea IMHO. The main problem of course is blocking yourselves off from using GPU computing... I realize that decision was made at a much higher CERNish level...

My Linux box meanwhile just (slowly) returned its first CMS task, no problem.

Maybe silly question, maybe I misunderstand the project (there's not a lot of info to be found, which is expected at this stage). You have "tasks" that run on the host for 24 hours. That "task" runs "jobs", which are much shorter in duration (loaded with the task? retrieved as available?). At the end of each "job", nothing is sent back to you. At the end of the "task", you get all the data.

Meanwhile, BOINC runs "BOINCmgr", that runs on the host 24/7. That program runs "tasks", which are much shorter in duration. These are retrieved from the project as available. At the end of each "task", the data is sent back to the project. Unless I'm missing something, your "task/job" split is simply replicating what BOINC already does. I don't understand why each "job" is not a single "task". The only advantage I see is in getting your results in blocks of 50/75/100/whatever instead of one at a time. And you create the must-run-non-stop issue, and the what-about-credit issue, and the must-remain-reliable-for-24-hours issue, and... Would it not make a lot more sense to bundle "x" jobs (whether that be 1 or 100) into a task and have the task exit when the jobs are finished? If there is some reasoning behind your approach that I'm missing, great - but that reasoning is exactly what should be "front and center" on the project website! Otherwise, when you go "live", you're going to have a whole lot of people just as confused as I am.

I personally use "credits" just to balance my project workload. Bitcoin has .01% of my current CPU time (funding BOINCStats) while CMS has 15%. This may just reward projects that don't give much credit, but there are plenty of BOINC "credit chasers" who do things just the other way around. I realize it's too early to be worried VERY much about how you'll handle credits, but doing away with the 24-hour tasks would sure simplify things. I'm afraid that to be "competitive" in the credit arena, you'll have to give a bazillion credits per task and just become another BitcoinUtopia, in order to get the number of hosts you want. The average "set and forget" BOINC user is not going to return very many good results with your current setup, will get very little credit, and will quickly move on to other projects.

I'm sure all of these things have been considered - but since I can't easily locate the data, I figured I'd throw in my 2 credits worth! :-)




©2024 CERN