41) Message boards : Number crunching : issue of the day (Message 1735)
Posted 30 Jan 2016 by Profile Tern
Post:
New problem?... Mac OS 10.11.2, BOINC 7.6.22. "Show VM Console" button gives error message "Missing application. Please download and install the CoRD application from http://cord.sourceforge.net".

If that's something we need to have, that information needs to be posted with "system requirements" when the user signs up, along with what the heck it is, and why we need it. (Ideally with a link...)

Had a runaway CMS task (33 hours with 8 left to go) and was trying the vLHC fix of editing the checkpoint file, part of which requires checking the VM, which of course on a Mac can't be done from VBox itself... Sigh. And no, the checkpoint-edit fix didn't work, tried repeatedly. Finally aborted the task. No biggie at this point!
42) Message boards : News : Migrating to vLHC@home (Message 1734)
Posted 30 Jan 2016 by Profile Tern
Post:
It's not a very well-documented comparison, IMHO. It has a lot to do with how long a project has existed.
vLHCathome (formerly Test4Theory, so renaming is not a problem) has existed for 4½ years now, since the end of their alpha phase.

E.g.: SETI@Home: active users 8.89%
World Community Grid: active users 11.26%


True - the first month, every project is at 100% active. My point isn't that vLHC is "low", because it's not, it's about right. The issue is that CMS-Dev is "high", which speaks well for the volunteers who self-selected; most have stuck around. That means most will probably continue to stick around. At least 99 of 'em! That is true at the other "dev" projects as well; the volunteers are more involved than those at "production" projects.

Yay for renaming! :-)
43) Message boards : News : Constructive suggestions please (Message 1733)
Posted 30 Jan 2016 by Profile Tern
Post:
Sending the results by BOINC would result in the same saturation and failing jobs.

One could first try to use a (better) compression method before uploading??

If that does not lead to a big improvement, the server should not be so critical when results come in parts.
The average upload speed of each client is reported by the hosts, and I can't imagine that there is no BOINC server setting to exclude clients with too little bandwidth from getting a CMS task.


You are absolutely right that using BOINC does nothing for the QUANTITY of data. That's where compression is very much needed (if it's not already in place), or a drastic reduction in what's sent back. Is all that data really needed? How much is overhead that could go away if "batched" instead of "real time"? Who knows? I was astounded to see my uplink saturated (it's only DSL, but still!) by five or six hosts running one CMS task each. That's a LOT of data being created!
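On the compression idea: even something as blunt as running the results file through zlib's gzip interface before upload would be a start. A minimal sketch - the file names are invented, and I have no idea what format the results are actually in:

    // Gzip-compress a results file before handing it to the upload step.
    #include <zlib.h>
    #include <cstdio>

    bool compress_results(const char* in_path, const char* out_path) {
        FILE* in = fopen(in_path, "rb");
        if (!in) return false;
        gzFile out = gzopen(out_path, "wb9");   // 9 = maximum compression
        if (!out) { fclose(in); return false; }
        char buf[65536];
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
            if (gzwrite(out, buf, (unsigned)n) == 0) {  // 0 = write error
                fclose(in);
                gzclose(out);
                return false;
            }
        }
        fclose(in);
        return gzclose(out) == Z_OK;
    }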

Even on _download_, I see a problem. CMS "polls" the server looking for work once per SECOND when it doesn't have any. That's ridiculously excessive; my router log files overfloweth. (It was this "is there work" URL that my new security software didn't like, which of course meant the "blocked" message went to /dev/null {or the VM, same thing} instead of to a browser, which made that fun to find...)
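Even without a redesign, plain exponential backoff in that polling loop would cut the traffic by orders of magnitude. A sketch of what I mean, with fetch_job() as a stand-in for whatever the VM actually calls:

    // Back off from a 1-second poll to an hourly one while there's no work.
    #include <algorithm>
    #include <unistd.h>

    bool fetch_job() { return false; }  // stand-in: ask the job server for work

    void poll_loop() {
        unsigned delay = 1;                          // seconds
        for (;;) {
            if (fetch_job()) {
                delay = 1;                           // got work: reset backoff
            } else {
                sleep(delay);                        // idle: wait, then retry
                delay = std::min(delay * 2, 3600u);  // cap at one poll per hour
            }
        }
    }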

On job failures, I don't see how sending via BOINC would have the same problem, or at least not to the same extent. BOINC will retry as needed if there's a transmission failure. Currently I guess there are no retries from within the VM. (?) This fix assumes that the server, as you say, must stop being so time-critical, since the current problem is that the data must arrive "now" or it's a job failure. It's not the retries themselves that would be the "fix", but the fact that the server wouldn't (couldn't) be so demanding; if the data didn't arrive until tomorrow, that's still okay, as long as it arrives. (Which assumes we aren't generating data faster than we can send it, of course.) Each task would only be sending one large file, instead of many small ones or a stream, which should also improve failure rates a bit. Given the difficulty of changing this part, though, I would look very hard at some way to reduce the volume of data transferred first, since that will really be needed anyway. (vLHC allows two CMS tasks per host? Fun.)

I don't know if the "average upload speed" check would help or not. I would guess that each of my hosts reports roughly the same value, which would be the TOTAL available upload speed, which is more than enough. (For one host, even with CMS.) I doubt it's smart enough to count the hosts on that line and divide... or somehow test all the hosts simultaneously. The only way that check would completely solve the problem is if all the hosts were running CMS at the time the data was collected - or 24/7 if it's a running average, in which case I'd report "average upload speed 0 bps" :-( and never get any work.
44) Message boards : News : Migrating to vLHC@home (Message 1731)
Posted 30 Jan 2016 by Profile Tern
Post:
Just some stats for why a separate "dev" site might be better than using vLHC@Home: (In addition to the message board issue.)

CMS-Dev
Users 189
Active users 99 (52.38%)

vLHC@Home
Users 14,458
Active users 2,349 (16.25%)

"Active users" is anyone granted credit in the last month. Any project with over 50% "active" is doing VERY well and probably has a lot of very involved volunteers. Data from BOINCStats.
45) Message boards : News : Migrating to vLHC@home (Message 1729)
Posted 30 Jan 2016 by Profile Tern
Post:
Hi Bill,

I personally do not see this project as a whine-fest. Negative feedback is more likely to result in change and hence improvements.


You're very welcome. :-)

We potentially have 6 LHC-related applications (SixTrack, Test4Theory, ALICE, ATLAS, CMS and LHCb) and as a result could have between 1 and 12 projects. I think you have convinced me that the best scenario would be to have two projects, one production and one development, as they represent different communities. Will create a poll with the possibilities to see what the consensus is on this.


Excellent! It _can_ be done "in project", and that's probably simpler from the project's viewpoint, but there's a reason SETI and Einstein set up separate projects for beta testing. There are definite advantages. I'm pretty sure how the poll will come out, but by all means get a consensus!

With respect to the beta, maybe we have different semantics. For me, beta means no guaranteed service level and hence things may break. The expectations on a beta app or dev project should be similar. However, although most CMS-dev volunteers are also in vLHC@home, I understand the point of different communities and focus. What about renaming this project to LHC-dev or something similar? Have any other projects renamed themselves?


We agree on "beta"... the confusion/annoyance on my part is having it be beta in BOTH places. Why bother with CMS-Dev at all in that case? <shrug>

I can't think offhand of any projects that have just renamed themselves, but it _shouldn't_ be difficult on the BOINC side (I could be wrong!). The problems will come on the analysis and stats sites, as each of them will have to make the changes separately. Free-DC already calls this "LHC Dev@Home", for whatever reason (maybe clairvoyance?), so it would just be a URL change for Bok. Boincstats calls it "CERN CMS-dev", so the change would be more complex, but Willie can handle it. (I can see his reaction when he hears I've said that...) BOINC Synergy calls it "CMS dev" and will probably be the problem site, as I'm not sure anybody is even doing any maintenance there any more - there are projects I've been running for months that they still haven't added. Those are the three I use; I know there are many more. Still, that's _their_ problem.

The analysis site, WuProp, calls it "CMS-dev" and the application is "CMS Simulation". (I have, at this instant, 3,722.38 hours recorded 'done' on your application - more than any other but Rosetta, but of course WuProp didn't exist when I started, and I didn't find them until recently, or SETI would be highest). If you're not familiar with them, you can find out exactly how each CPU brand and model performs on YOUR app. Great site, tons of data, but hard to navigate, and has no clue about jobs and tasks, so probably not real useful to you anyway.

If it's not a big deal to just close "CMS-Dev" and create a new "LHC-Dev", rather than renaming it, that would obviously be easier on the third-parties. It'd be better for those of us with existing credit, on the other hand, if you managed to rename it. (Which as an interested party, I'd vote for.) "Inactive Projects" are always a drag on sigs and stats. I'd go with whatever is easier for you, which means talking to some BOINC staff about how a rename would work.

To paraphrase your suggestion, it is essentially to use BOINC as a batch system as originally intended. ATLAS@home is doing this and seems to be doing quite well. The other model is closer to the cloud paradigm, and BOINC in this context is an elastic provisioning system for cloud resources, upon which we overlay a batch system. We can't really do this topic justice in a forum thread.

Credit is always a complex topic. With clouds, metering is based on the weighted wall time of a slot (a reservation on the system); the CPU time represents the value that is extracted from that reservation. The reservation is under the control of the volunteer, but the efficiency is more dependent on the application itself.


But when you reserve time on a cloud computing system, YOU specify "I want 24 hours on each of 100 cores of your Xeon E5-2699's, at $x per core-hour." Then the efficiency is up to you and your app. With BOINC, you're saying "I want 24 hours on one core of some random processor that may be an i7-4790K or may be a 10-year-old Celeron, I don't care which, and I'll pay the same (in credits) for either. I'm willing to have zero control over whether the computer in question does one job for me or 100." Great, but don't be surprised when many of your volunteers deliberately put your app on their _slowest_ machines and reserve the fast ones for projects that "pay" based on work done...

My Linux box is a Gigabyte Brix AMD quad-core APU. Very slow, but it was $300 _complete_. "Harpoon" (Win10) is an i7-5820K six-core, twelve-thread, water cooled and overclocked to 4GHz, with a fast GPU and an ASIC hanging off of it. It gets me almost 3,000,000 credits per day, easily. It was, um, more than $300... Which of those would any "normal" (i.e., not me...) volunteer put CMS on, given that either system is going to return the exact same credit per day, while tying up a core slot that could be used for something else? Uh... let me think...

(I just looked. Brix has 15,273 CMS credits, while Harpoon has 38,110. So I'm not real bright about credits...)

The suspension and reboot problems are quite complex as the jobs contain monitoring. If the jobs don't report, the frameworks consider them zombie jobs. So it is more of a time issue than a network issue. Fixing issues like this is non-trivial, and it is the kind of work that is not seen here.


AHA! See, now this all begins to make some sense. That's the piece I was missing, the job monitoring and frameworks issue. That makes everything more complicated, so everything I've come up with is now obviously oversimplified. That will require me to put on my analyst hat for a while... I'll get back with you offline, if I come up with anything that might be useful.
46) Message boards : News : Constructive suggestions please (Message 1716)
Posted 29 Jan 2016 by Profile Tern
Post:
By far the greatest issue to resolve is failing hosts.


I mostly agree with this, except maybe the "greatest" part, the network issues are bigger for me personally - the terminology is "host back-off" in BOINC, and there's a method of doing it. I posted a long-winded (of course!) answer to Laurence's list in the other thread, including this issue; I can copy it here if you like, or you can grab it there... I didn't see this thread in time.
47) Message boards : News : Migrating to vLHC@home (Message 1715)
Posted 29 Jan 2016 by Profile Tern
Post:
Hi Bill,

First of all I think it is important to highlight that we all agree on the importance of communication. Feedback is welcome and critical for the success of these projects.


If I didn't know that you valued feedback, I wouldn't be giving any! There are many projects where I don't even bother posting any more, because I know it's pointless.

The challenge that we have (and you have pointed out the perception of the consequences) is that we are trying to marry two paradigms: the BOINC world and the mammoth experiment frameworks. Such partnerships require compromises on both sides to succeed, but without the community there is no volunteer computing. Understanding the community is an important, unique and interesting aspect of this.


It's been a learning curve for me as well, understanding that other paradigm, and how it affects the BOINC side. I really do "get" why the choices that have been made have been made, even though I don't agree with some of them.

The improvements that you mention will be investigated and hopefully some fixes will be available next week:


  • no graceful exit of the last job
  • error reporting to the volunteer
  • no automatic host back off with excessive errors
  • no solution to network saturation or availability problems
  • problems when BOINC suspends a task
  • problems when the host reboots
  • failures to observe BOINC preference settings
  • problems with VBox corruption
  • no coherent credit process



Those were off the top of my head, maybe someone else can add to the list? I know I may be beating a dead horse here, but ALL of these (except VBox, which from the LHC boards also seems to be the problem with not observing preferences - a change Oracle made for security!) could be solved with one change. Namely, and oversimplified: quit "doing your own BOINC inside VBox" and use the built-in BOINC features instead. I think you can see how the other problems would be reduced or eliminated if you look at it from the simplest point - the credit issue. So, time to deal with it or not, here are my thoughts.

CMS (and T4T? I think so) give a fixed or semi-fixed amount of credit per task regardless of how much real work was done. That is because a task can "idle" for hours, doing nothing but taking up a slot. Errors aren't reported to the volunteer because one or two job failures aren't a big problem, and a task failure (with no credit) after running for 23 hours would cause a stink. Credit CAN'T be given for actual work done, because person "A" might get lucky and get a ton of jobs in his task, while person "B" does nothing for 24 hours, through no fault of his own. Network load is continuous and heavy because the task is looking for work to do, and then reporting each job's results independently.

The fix:

1) When the task is created on the server, whatever jobs are available right then (up to some count) are loaded into the task. The task is much "larger", but that's okay - BOINC can handle that. There is ZERO network traffic looking for work, because it's already there. To throw a sop to the "cheating" thread from LHC, one of the jobs could be a "dummy" that you already know the output from and can check for.

2) Instead of uploading results as they become available from each job, write them to a file. Zero network traffic again, but a large (maybe huge) file to be uploaded when the task is complete. That's okay. BOINC can handle that (including automatic retries, backoffs, observing network preferences...).

3) If there are excessive job failures within the task, the task gets a "computation error" and quits. That lets BOINC handle host back-off, tells the volunteer he has a problem, etc.

4) When the last job is completed (no "last job" issues!) the task is completed. No hanging around for hours doing nothing. BOINC sends the task back to CERN, with the job results. Credit is based on (gee!) the time it took to complete the task and the benchmarks, or CreditNew, or whatever the flavor of the week is. Faster processors complete the task faster and get more credit, because more work was done per unit time. No "slot rent". Task size can be varied by changing the number of embedded jobs, raised or lowered as makes sense to the project.
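For what it's worth, here's a rough sketch of what that task-side loop could look like. The boinc_init / boinc_fraction_done / boinc_finish calls are the real BOINC API; the job-file format and the load_jobs() / run_job() helpers are invented stand-ins, so treat this as the shape of the thing, not a design:

    // Sketch of points 1-4: jobs embedded at task creation, one results
    // file uploaded at the end, computation error on excessive failures.
    #include "boinc_api.h"
    #include <cstdio>
    #include <string>
    #include <vector>

    // Hypothetical stubs for the real job reader and event processor.
    std::vector<std::string> load_jobs(const char*) { return {"job1", "job2"}; }
    bool run_job(const std::string&, std::string& result) { result = "done"; return true; }

    int main() {
        boinc_init();
        std::vector<std::string> jobs = load_jobs("jobs.in");   // 1) already in the task
        FILE* out = fopen("results.out", "w");                  // 2) one big output file
        int failures = 0;
        for (size_t i = 0; i < jobs.size(); i++) {
            std::string result;
            if (run_job(jobs[i], result)) {
                fprintf(out, "%s\n", result.c_str());
            } else if (++failures > 5) {                        // 3) threshold is arbitrary
                fclose(out);
                boinc_finish(1);    // nonzero exit = computation error; BOINC backs the host off
            }
            boinc_fraction_done((i + 1) / (double)jobs.size());
        }
        fclose(out);
        boinc_finish(0);            // 4) done when the last job is done
    }

Everything network-related disappears from the app: the client handles the download of jobs.in and the upload of results.out, retries and network preferences included.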

I _THINK_ that the task suspension and reboot problems are really network problems - namely, this is just another way that network availability can be lost for a while. So if network availability isn't critical, these problems go away too.

If job results are "time critical" - i.e., you need to get them the second they are available - then your current system makes perfect sense. If you can wait a couple of hours, then it doesn't, it's just there because that's how it's done outside of BOINC and nobody wanted to change it. Which is, again, fine - that's a project decision - but CERN _must_ be aware of the ramifications of that decision - which is the list of "bugs" above, that will never really be "solved", no matter how many band-aids get slapped on them. Is that a problem? Not really. Just put them on the LHC home page so volunteers know what they are getting into, and expect to have a smaller number of volunteers than you would otherwise have. If that tradeoff is okay, if making these changes would be too difficult for the return expected, fine! The point is to make the decision based on the data, not because "that's the way we've always done it before BOINC".

From my reading of the LHC boards, the CMS issues are the same ones for T4T. Maybe the NEXT beta project (which of course really should have been this one) can investigate "doing it different", either what I've described, or what someone else comes up with... The nice thing is that most of the changes could be done one-at-a-time and tested, especially in a beta project where glitches are expected. I obviously haven't seen your code, but my "gut feel" is that most of these changes would be fairly simple to implement. Read from/write to a file instead of a URL? Minor. Creating and interpreting the files, yeah, a bit more complex, but not unreasonable. The other stuff is all BOINC-specific application code, nothing that would be there for any other platforms, so somebody at CERN wrote it just for this, hopefully based on the BOINC sample apps, and should be able to change it.
48) Message boards : News : Migrating to vLHC@home (Message 1714)
Posted 29 Jan 2016 by Profile Tern
Post:
Hi Steve,
The direction that we would like to go is to have one LHC@home BOINC project which has different applications for the different experiments. With this in mind we have a number of options for the testing and development environment.

  • beta apps in the LHC@home project
  • a test/devel LHC@home project
  • separate projects per application



Which option is preferred?



Not Steve, but you know I have to put in my $0.02! :-)

The "one LHC project" (for production apps) is ideal. No sense having umpteen projects that come and go with different apps as needed. Agree totally with this plan.

Beta apps in the LHC@Home project.... eh. It's okay if you never have more than one app in beta at a time, AND if you set up threads in the message boards specific to the beta app. Looking at the LHC message boards, this would be a problem - quite frankly, they are a mess - stickied threads with no postings for years, threads totally off topic, threads with "critical" titles (stickied even) giving solutions to problems that no longer exist, hanging unanswered questions... there has obviously been no effective attempt to keep up with the LHC message boards already, so adding more there is probably not a good idea.

A test/development project - ideal! (IMHO) This keeps everything much simpler. However... I would hope that a lesson has been learned HERE. Namely - don't put the app into LHC@Home until it is truly "ready for production", both from a CERN standpoint AND from a BOINC standpoint. What has been done here is a combination of having a beta project _and_ having (the same) beta app in the main project. Not good.

New project per beta app - not bad, but not good, really, because you wind up having to "invite" participants all over again. If the beta project persists but just doesn't have any work, well, your volunteers are still attached, and as soon as you GET a new app, you've got a built-in pool of people to help you with it. Plus there are all the hassles of setting up a project over and over again.


I agree that this app is not quite ready for primetime but I hope that by starting a process of continuous improvement we can get there soon.


BRAAPPP! If it's not ready for primetime (which I think we all agree on)... then why is it moving from here? If it were a few niggling bugs, I could see doing this, if the app is critical and you really, really need massive BOINC participation ASAP, and know you'll have issues to deal with, and are willing to put up with it. But you're putting it there as a beta app that not everyone will be running, so that explanation doesn't fly. What you're doing is setting the completion date BACK! You'll gain a few more volunteers over there who (unless they've already found it there) are not familiar with the app yet and will have a learning curve. You're losing some of the volunteers from here who aren't going to move. You're losing the separate message boards focused solely on this app - and unless you manually create new threads there and transfer over all the "documentation" that has been created here into them, then new people will be even more lost. I also don't get the feeling that this move is being made to get more volunteers for testing; you could do that by putting up an "invite" at LHC@Home, and get them to come _here_. Honestly, I can't figure out any logical reason for making the move at all!

Why hasn't the "continuous improvement" process already started? As in here? So far, we've seen patches and hacks and individual bug fixes (many were BOINC or VBox upgrades, not even app changes!) and had hints that a lot of work has gone on in the background making the app more usable for CERN. But honestly, we've seen very little improvement from OUR viewpoint. That's fine, and expected - you've got to get it right on your end first, obviously. But... the hope was that after it was "good" on your side, the other problems would get attention. Maybe that's the intent still, but this FEELS like "if we move the app, the new guys won't know that we've known about these problems for months and haven't fixed them yet, and will be more patient". Someone on LHC said that the boards here (CMS) were a "whine fest". That may be true... (whine! whimper!) but isn't that the POINT of having a beta? To get the (often negative) feedback from a small set of users before throwing the app out in front of thousands? If all I wanted was a bunch of "wow, that looks great", I'd never put an app in beta, I'd only show it to friends and family...

That is another advantage, BTW, of having a separate beta project - people on LHC are already USED to dealing with VBox, and the 24-hour tasks, and the incoherent credit (a popular topic over there). Thinking about it, maybe from a self-important view, if CMS had been a beta app within LHC, I never would have seen it, and you'd never have had the pleasure of getting my wonderful feedback... :-) From a non-self-important view, there are others here (10% or so, wasn't it?) who haven't dealt with anything CERN yet, and so you got to hear again, in a smaller forum, about pre-existing issues that maybe over there, people have learned to put up with - or have given up on and bailed. The fact that they aren't "ongoing issues" at LHC doesn't mean they aren't issues, just that they aren't being brought up any more.
49) Message boards : News : Migrating to vLHC@home (Message 1693)
Posted 28 Jan 2016 by Profile Tern
Post:
Relying on Oracle for anything, though... sigh. What happens if 6.0.1 is completely incompatible with CMS? That's the fear in the back of my mind for any project using VBox.


Quoting myself? Urg. Needed to add to that.

I'm a former Oracle-certified DBA, so I've dealt with them before on that level. I'm a former Sysadmin, so there too. I have nothing against the company, but their concern is the "big picture". Unless CERN has some serious "pull" with Oracle, CERN has _no_ control over what VBox fixes, doesn't fix, changes, etc. Of course this is also true with Microsoft and Apple, so maybe dealing with _one_ 500-pound Gorilla is better than dealing with two, but I am not so sure.

When I worked for a certain Fortune 100 company, we put all our reliance (my project, as a matter of fact, but I wasn't leading it) on some hardware and software manufactured by a third party. We were a big customer, but still a small part of their overall market. After several years of fighting to get certain fixes and changes, instead of succeeding, they made the product even LESS usable for us. Our solution was simple but expensive. We bought the whole freakin company and TOLD them to make the changes we wanted, and right now. The other choice was to go develop the replacement hardware we needed ourselves, and the associated software, which, without the in-house people with the expertise, would probably have been even more expensive. Not using that hardware at all wasn't an option, because that ship had sailed, and we were too committed to change course. I'm afraid that some day, CERN will realize that they either have to go Linux-only, or write three apps, and it will be even MORE painful then than if they just did it now. I've done cross-platform development and led teams that did it. It's not nice, or easy, but sometimes you just have to. The key is having the right people, and of course that requires the funding to pay them. If it's a funding issue, we (the volunteers) are stuck with VBox and any/all issues that it brings. All we can do is choose whether it's worth the hassles to each of us personally.
50) Message boards : News : Migrating to vLHC@home (Message 1690)
Posted 28 Jan 2016 by Profile Tern
Post:
... the SSD thrashing which appears to be endemic to VBox, VBox itself (which seems to get 'corrupted' and require project resets way too often) ...

Using the recommended BOINC version 7.6.22 will substantially decrease the number of problems here.
For your Win10 machine: there is a known issue with the combination of Win10 and VBox 5.0.10. Upgrade that machine to VBox 5.0.12.


This is good news. I note that until THIS posting, I didn't know that, because I haven't seen it prominently posted...

BOINC released 7.6.22 on December 30, and clearly announced that - most of my hosts have been updated, the others will be soon. All are either Mac, Linux, or Win10, no "lower" versions of Windows.

Then they RE-released it (without fanfare) on January 16, to include VBox 5.0.12. I missed that release. VBox itself of course has not asked me to update anything, nor has CMS. Now that I know, I will update as soon as I can. Thanks.

Relying on Oracle for anything, though... sigh. What happens if 6.0.1 is completely incompatible with CMS? That's the fear in the back of my mind for any project using VBox.
51) Message boards : News : Migrating to vLHC@home (Message 1689)
Posted 28 Jan 2016 by Profile Tern
Post:
Comments regarding virulent comments above:

- other projects have dedicated and separate beta projects, like SETI or Einstein; why not vLHC?

- other projects have "migrated credits": CSG, from 3 different BOINC projects. Nobody died.


The beta project question I can't address, not my area.

On comments, I assume mine were the "virulent" ones - I assure you that this was not my intent, and apologize to anyone who took them that way. I strongly WANT this project (and vLHC!) to succeed, but have equally strong convictions about which projects I personally will support (or not support) with my time and CPU power, and why. If all I cared about was credits, I'd be on BitcoinUtopia full time, from project launch. Instead, I spent weeks researching whether they were something I would support. At first the answer was "no". Then I dabbled with it. Read their boards - a lot. And the opinions of other BOINC folks. Then I finally bought an ASIC and put it on the project. (Not a simple process, I promise.) Are they perfect? Nowhere close. Are they "worthy"? Well, I came to the conclusion that they are, especially once they stopped taking a "cut" of funds and made their own project just another place you _could_ contribute, if you wished. I go through this on EVERY project, before I sign up. I'm only running 22 active projects - I know people who have run over 100. Most (I think, I could be wrong) are after the hours or the credits or the cool looking sig, and aren't concerned with the long-term success of the projects they run, or even with what the projects do. Some of those folks are here - they're the ones who AREN'T posting on the boards, and will never read this. :-/

My problem with vLHC, and with CMS, although I hoped CMS would be the place where things would change, is that the Powers That Be at CERN just don't seem to "get" BOINC. They want to use it, but they want to do things "their way" while using the software, paying only lip service to the "BOINC way", just enough to avoid controversy. That is perfectly natural, and that tendency is there at every project, but in my experience, it causes problems. Laurence seems to get my point (and Ivan does get it, I think) - but doesn't have the background experience of working with BOINC and its volunteers, so doesn't _really_ grasp the problems that I see. He's at too high a level, is the only way I can think of to put it. At least he knows there IS a "community", and that "public relations" is important. Many projects (now dead or close to it, like SZTAKI, which I fought with from day one - and am still attached to!) have never gotten there. Rosetta (my favorite project!) understood from day one. Compare my postings there to those at SZTAKI. CERN seriously needs a "BOINC evangelist" who KNOWS how BOINC works, knows the history, haunts the boards, can write clear documentation, and (most importantly) has the authority to say "we can't do that". Or at least the authority to say "if we do that, we have to COMMUNICATE, clearly and unambiguously, and in BOINC terminology, exactly what we are doing and why". Communication is key, and as great as Ivan has been, he hasn't had the answers to some of the questions, or the authority to bump things up in priority inside the project. Or to translate/modify the terminology (come on, "jobs" and "tasks"?). Nor is this his field - he's a scientist/technical guy who fell into the role and somehow did it well, simply because he was willing to devote the time and to listen and to learn.

I am frustrated because this whole "CMS-Dev" beta project seems to have solved a lot of the _internal_ problems that CERN saw, yet solved exactly what from the BOINC volunteer view? We still have no error reporting to the volunteer, no automatic host back off with excessive errors, no solution to network saturation or availability problems, no graceful exit of the last "job", problems when BOINC suspends a "task", problems when the host reboots, failures to observe BOINC preference settings, problems with VBox corruption, and - maybe least important to CERN, but MOST important in terms of getting large numbers of volunteers down the road - no coherent credit process. But the internal problems are solved, so the BOINC problems get shunted off to "later", and the app gets moved to vLHC. Hello? Wasn't the place to solve those problems _here_?

As far as CSG - I knew they had "absorbed" other projects. I was unaware that they somehow transferred the credits. I know SETI didn't, I know CPDN didn't, I know WCG didn't. However CSG managed it, if CMS does the same thing, I have no problem with it. So - who at CERN has researched the process and has a plan for doing it without causing public relations issues, and is in charge of communicating the forthcoming process to the volunteers? Oh, right. Never mind. That's relevant to BOINC, not relevant to CERN internal software issues, so they'll do that "later". Got it. Sigh. And yeah, that's probably NOT the intent, or the thought process - Laurence is way smarter than that - but it's the perception. Public Relations 101. I see my role here (self-appointed, of course) as being the annoying (but never virulent!) voice that keeps pushing CERN to actually look at what BOINC offers, and do what is necessary to get the most out of BOINC without "reinventing the wheel" and alienating volunteers. Hm. I just admitted to breaking the "rules" off to the left of the page, "no messages intended to annoy"! Oh well. :-)
52) Message boards : News : Migrating to vLHC@home (Message 1671)
Posted 28 Jan 2016 by Profile Tern
Post:
The main problem is the whole task-versus-job design. Constant (as in once per second, in my experience) polling of a server looking for work, then processing it and returning it, outside of the BOINC framework which already handles downloads, trickles, and uploads (including following the user's network preferences, automatic retries, etc.) while appearing to BOINC to be one "task", creates innumerable problems. This is in addition to the failure handling (nonexistent AFAIK, since you don't use BOINC's), the credit-for-slot-rental-not-work-done issue, the SSD thrashing which appears to be endemic to VBox, VBox itself (which seems to get 'corrupted' and require project resets way too often), and others. You really need to ASSUME that a user is on dial-up, and only has (slow) internet access for a few minutes a day (or week)! If you can't work with that, then you need to be very explicit in the requirements of the project. On the home page, NOT buried somewhere in the message boards. CPDN manages to move huge amounts of data (including trickles), all within the BOINC framework...

First, REQUIRED fix, before I would even consider running it elsewhere, is to decide how much actual work (number of jobs) a task is going to have, and send it with that many already embedded in it. Forget the 24-hour thing. If you want to reduce load on the server, fine, include "x" jobs in the task, such that the average host would take 24 hours to complete them. If you can handle more load and want shorter tasks, then include "x/4", or whatever. When an individual job is finished, if you wish, use the BOINC trickle mechanism to send back the results. Otherwise, upload them all at task completion. This one change would repair at least 75% of the networking, swap-out, error handling, time estimation, benchmarking, etc. concerns, AND solve any credit calculation issues. (Which apparently never did get dealt with here, in spite of repeated "we'll look at that later" - um, it's well past "later"...)
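And the trickle option really does exist already: boinc_send_trickle_up() is the real API call; the message tag and payload below are made up for illustration.

    // Sketch: report each finished job via BOINC's trickle-up channel
    // instead of a live connection from inside the VM. The client queues
    // the message and retries the send; the app just fires and forgets.
    #include "boinc_api.h"
    #include <cstdio>

    void report_job_done(int job_id, double cpu_secs) {
        char msg[256];
        snprintf(msg, sizeof msg,
                 "<cms_job><id>%d</id><cpu>%.0f</cpu></cms_job>",
                 job_id, cpu_secs);
        boinc_send_trickle_up((char*)"cms_job", msg);
    }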

The issues that remain are probably VBox related, and I have no idea how to deal with those. Those are, however, enough to ensure that I won't be signing up for any future projects that require VBox, unless I see on the boards that they've all been worked out. When my Linux box told me I had to recompile something in order to get VBox working again after a simple OS update, and CMS was saying VBox wasn't there, well, when I finished laughing hysterically, I did it BECAUSE it was for CMS-Dev (a beta project). Same when I first set up my MacBook and CMS couldn't "see" VBox until I'd updated it like three times to new versions. If any of this had been for a "production" project, I would have detached and never come back. Or bitched on the message boards until the PROJECT solved the problem (probably by providing a link to the needed, precompiled, software, or at least an installer script and package - NOT by providing instructions on how to find some source code on the web somewhere and then compile something).

I haven't hit any memory problems with CMS, but most of my hosts have plenty - I have seen tasks from other projects on rare occasion in a "waiting for memory" state instead of "waiting to run", so BOINC obviously has some method of dealing with this. (New since I last looked at the code, which was 10 years ago, or I just didn't see it.) I'm sure it's related to the "use x% memory when computer is in use/not in use" setting, as the times I've seen the message are when I just walked up and moved the mouse. I've never seen any project lock up a machine as Ivan described vLHC-CMS doing to him. It _had_ to ignore the user preferences to be able to do that. (Which VBox does anyway on thread priorities, but that's another subject.)

I had the discussion with Ivan on the host failures issue, and the fact that he had to ask the volunteer (which was me several times, I freely admit) to reset, disconnect, desist, bypass security software, run fewer hosts per bandwidth, etc. - these are all things that in a production environment must be fixed on YOUR end, NOT ours! It's fine in a beta project to run into problems that require us to do something - it isn't tolerable in a production project. That's where vLHC has fallen down, from what I can tell; the volunteers are responsible for "doing something" way too often, other than just signing up for the project and watching credits roll in. Some, it is true, have no problems. Others aren't so lucky, and getting help on the message boards is iffy. Yes, I am a programmer. Have been since FORTRAN was state-of-the-art. But on BOINC, at least for the non-beta projects, I try NOT to be that, I try to be a "typical user". That means complaining a lot. :-) And THAT is something I know I'm good at! (Looked around the various boards the other day and realized that combined, I have several THOUSAND postings... and that's with an 8-year hiatus from BOINC!)

I really do wish you luck. And applaud Ivan for his patience, diligence, and support on the boards. With two or three more of him, you might be okay.
53) Message boards : Number crunching : issue of the day (Message 1661)
Posted 27 Jan 2016 by Profile Tern
Post:
Is there a way for the server (or VBox wrapper) to error out the CMS tasks of a host if it produces too many errors?

There is a way for BOINC. <snip> This 500 is raised by 1 for every valid task returned and reduced by 1 for every error.
BOINC would need to be aware of an error, and that is not the case here; to BOINC it doesn't matter what's happening inside the VM.
Only single-job distribution, with the exit code passed via the wrapper to BOINC, would make it possible to eliminate hosts which only produce errors.


Actually, unless they've changed (simplified) it, the bad-host handling is even better than that. It reduces the number of tasks allowed per day VERY quickly when there are errors, down to one. Then as tasks succeed from that host, the number allowed increases slowly back up to the limit. This was put in place just for situations like this, but of course, as you say, CMS is not using the BOINC methods.
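The shape of it, as I remember it (the exact factors vary by server version, so treat the numbers as illustrative only):

    // Per-host daily quota: collapses fast on errors, creeps back slowly
    // on valid results. Details are from memory and may be off.
    #include <algorithm>

    int adjust_daily_quota(int quota, bool task_valid, int max_quota) {
        if (task_valid)
            return std::min(quota + 1, max_quota);  // +1 per valid task
        return std::max(quota / 2, 1);              // halved per error, floor of one
    }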
54) Message boards : News : Migrating to vLHC@home (Message 1660)
Posted 27 Jan 2016 by Profile Tern
Post:
If the beta app is working for you, please stop running here. Once most have migrated, no new tasks will be created and the accumulated credit can be migrated from here to vLHC@home.


A) I have no interest in running vLHC, as they have the same (self-created) problems as CMS-Dev, but are "production", and don't appear to even be trying to solve the issues. There are a LOT of issues still here that haven't been solved, so I guess you're giving up on them. Sorry, I'll put up with problems for a Beta project, hoping to help solve them, but I'm not going to spend all the time it takes to deal with VBox, poor task management, network and memory and disk overload, etc., there.

B) "Credit migrated" bothers me a LOT. I hope you rethink this, and quickly. For one thing, that is pretty much totally against the "rules" of BOINC as I understand them (and I was there when they were written in 2003-5, in fact I was tasked by Paul with documenting them for the Wiki). Because 1) if someone (like me) does NOT go over to vLHC, what happens to our credit? We earned it here, so I would hope that it would stay in the statistics sites and not just vanish... there is, after all, a "retired project" category for just this situation, and that is exactly what I naturally assumed would happen when the beta was completed, as it has for many other beta projects - but 2) if someone DOES move to vLHC, does that mean they get credit in BOTH places? That is SURE to attract some attention when it becomes common knowledge, especially from Dr. A... Give a CMS-Dev badge at vLHC if you wish, sure. But credit? No way. Different project. Credit is NOT transferrable! SETI didn't even do it. If the credit was to eventually wind up at vLHC, then CMS should have been a beta application within vLHC, NOT set up as a separate project.

I have absolutely no desire to have a single credit at vLHC, as this would totally screw up my stats, which I have carefully nurtured for over 10 years. Nor would I be happy to see the effort I've given here be lost.
55) Message boards : Number crunching : issue of the day (Message 1601)
Posted 14 Jan 2016 by Profile Tern
Post:
CPU time was right at 8:00:01. Reset project, downloading now, we'll see what happens!
56) Message boards : Number crunching : issue of the day (Message 1598)
Posted 13 Jan 2016 by Profile Tern
Post:
I have a CMS task that has been running for 30:47 and is showing 50.091% complete. Again, this is on a seldom-used iMac where this has happened before. Should this just be aborted? Lucky I even noticed it, almost didn't check that host today...

User: 306 Host: 873 Task: 75318 Work Unit: 52667

In case you need log files or w/e, I'll leave it running for now.
57) Message boards : Number crunching : CMS-dev only suitable for 24/7 BOINC-crunchers (Message 1537)
Posted 2 Jan 2016 by Profile Tern
Post:
Rebooting a host causes problems. (Win10 does this randomly for 'updates'.)
BOINC Manager switching out a CMS task causes problems.
Network bandwidth limitation causes problems.
Corrupt VBox image causes problems. (Generally, VBox causes problems, but that's another topic.)
Job running at 24-hour mark causes problems.
Laptop going to sleep or moving away from network causes problems.
Credit issue, while not a problem YET, certainly will be once "live", especially if a single failed job in a task affects credit.

I see a few "implementation" areas that can be fixed to solve some of these, and I think the project has the people with the expertise to do so - but I think the bottom line is that the current DESIGN of the project is not very suitable for "most" BOINC users. I firmly believe that two fixes are needed - 1) each task should have a "set number" of jobs, pre-loaded, with results sent at end of task, and 2) tasks should be much shorter than 24 hours. I get the "overhead" issue, but there are projects that send me tasks that are completed in 13 SECONDS! CPDN and PrimeGrid have tasks that run for days, but they don't have any network requirements. A CMS task running for 24 hours just invites any of the above issues to cause a failure.

Those of us running this now are obviously enough "into" BOINC that our systems are relatively stable and productive, yet you're already seeing a very high level of failures, especially in stage-out. This is my first VBox project, so I can't compare how LHC is to CMS, but IMHO as a 12-year BOINCer and long-time programmer/project lead, the project as it stands is a long way from "production ready". I'm not giving up; I would really like to see this succeed! This is the purpose of doing a "DEV" trial run in the first place, to find just these kinds of issues - the title of this thread is, unfortunately, all too true at present.
58) Message boards : Number crunching : issue of the day (Message 1532)
Posted 1 Jan 2016 by Profile Tern
Post:
Had a CMS job that had been running for 72 hours on my seldom-checked iMac - appeared "stuck" at 98.946% complete. Aborted it.

User 306
Task 74500
WU 69642
Host 873
Sent 28 Dec 2015, 11:30:34 UTC
Reported 31 Dec 2015, 22:56:12 UTC Aborted by user
Run Time 260,929.39
CPU Time 62,486.51
59) Message boards : Number crunching : Expect errors eventually (Message 1506)
Posted 4 Dec 2015 by Profile Tern
Post:
Oh yeah - what happens with method "b" if the host internet connection is not continuously online? There is NO way within CMS to "buffer" work (unless you COMPLETELY rewrite BOINC), so the host would connect, get a result from CMS with what, one "event" in it? Then disconnect and sit there for 24 hours, doing only that one event, before connecting again to upload it. Meanwhile having a lot of work from other projects going undone. How the heck would you pay credits for that???
60) Message boards : Number crunching : Expect errors eventually (Message 1505)
Posted 4 Dec 2015 by Profile Tern
Post:
There was some discussion yesterday on whether CMS should implement a different model, an "event server" rather than a "job server", i.e. the job starts and then requests events as it finishes the previous lot rather than the current model where each job starts, processes a given number of events and then stops. I mention this for completeness, I don't expect it to be implemented anytime soon, perhaps not even before I retire. :-( Which might be April 2017.


I have been trying to watch CPU utilization, and the current approach bothers me... SOUNDS like (sort of verified by the data I'm seeing, with some odd anomalies) the current method is to launch the result on our host and either a) do "n" events, then sit there and do nothing for the rest of the 24 hours, taking up an unnecessary "slot" on our host, or b) work for 24 hours then terminate even though there are "n-x" events still to be processed (losing the one that was in process as well).

The "different" model sounds like the plan is to completely replicate BOINC on a micro-scale, where the "result" acts like BOINC Manager, requesting work from the project, and either a) getting and processing an "event" (micro-result) or b) getting "no work available" and sitting there, again taking up an unnecessary slot on our host, until the 24 hours is up or more work becomes available. And if work becomes available at 23:59, well, oops! Lost event. Yes, if work was constantly available, no problem (except on calculating credits?) but if not, big problem.

BOTH approaches seem not only to be unnecessarily complex on YOUR side, but to have a built-in inefficiency when it comes to utilization of our volunteer computing time. From a volunteer standpoint, I still strongly maintain that the CORRECT way to do this is to stick with the current model, but terminate the result when the last event (whether it be #50 or #500) has been processed!!! This is how EVERY other project I am familiar with works (don't know about LHC because I don't run it).

BOTH CMS approaches, as implemented or described, also CREATE a huge difficulty in awarding the proper number of credits, at least if you follow the "rules" and award credit based on Cobblestones (actual event processing). Namely, how do you "pay" for the time you have occupied my host without actually processing anything? If you pay nothing (i.e., fixed credit per result, whether based on the number of events - method a - or not - method b, since you don't know how many events if work wasn't available at some point), bye-bye volunteers. Unless you pay as if the entire 24 hours were actually "used", and then you're guilty of credit inflation, and BOINC admin will have a problem with that. (See WCG and the fact that there, you can earn more "badges" by running the _slowest_ computers you can find. At least there it doesn't affect credit, only badges, so BOINC doesn't care. The super-crunchers do, though.)

You also get into the "how the heck do we describe what we're doing so new crunchers know what to expect" issue, since you'll be so much different than other projects. Surprises = more lost volunteers.
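For reference, the Cobblestone yardstick works out to 200 credits per day of computation on a host sustaining 1 GFLOPS (Whetstone). A sketch, ignoring the validation and averaging that real credit systems layer on top:

    // Cobblestones: 200 credits per GFLOPS-day of actual computation.
    double claimed_credit(double cpu_seconds, double host_gflops) {
        return 200.0 * host_gflops * cpu_seconds / 86400.0;
    }

A 24-hour task on a 1 GFLOPS host claims 200 credits; a host twice as fast finishes the same jobs in 12 hours and claims the same 200. Pay for work done, not slot rent - which is exactly what you can't compute if you don't know how many events a result actually processed.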

Quake-Catcher is the only project I know that just pays "slot rent", but their results take almost _0_ CPU time (watching a sensor) so other projects are not prevented from running. (They don't really occupy a CPU slot at all.) They also give very little credit and only have a few thousand volunteers, most of whom are inactive. They also are closing down.

The current CMS approach also causes problems with "estimated time" - my little Linux box THINKS it's going to take 30+ hours per CMS result, which means it gets less work from other projects to compensate, then has to "catch up" once the CMS result's time falls to more like reality. BOINC uses "estimated Cobblestones" to calculate this time, and CMS obviously doesn't have a clue how many Cobblestones will be done by each result beforehand with EITHER of the above approaches! Knowing how many events were "in" a result at time of sending it to us would let you calculate this with pretty good accuracy.
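As I understand it (I could be off on the details), the client's first-order estimate is just the workunit's declared flops count divided by the host's benchmarked speed:

    // BOINC's runtime estimate, to first order. rsc_fpops_est is a real
    // workunit field; the example numbers are illustrative only.
    double estimated_runtime_secs(double rsc_fpops_est, double host_flops) {
        return rsc_fpops_est / host_flops;
    }
    // e.g. 3.0e13 fpops on a 1 GFLOPS host -> 30,000 s, about 8.3 hours

Set rsc_fpops_est from the embedded event count, and the estimates fix themselves.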

