Message boards : News : Migrating to vLHC@home
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 1683 - Posted: 28 Jan 2016, 12:44:34 UTC - in response to Message 1671.  

Hi Bill,

The smallest atomic unit (no pun intended) that we have is an event, which is a simulation of a single collision. A job is a collection of events, and with the virtualized approach a BOINC task can run a number of jobs. The balancing act is how many events a job should generate, i.e. how often we should upload the result, and how long we should run before restarting the VM. By de-coupling the jobs and tasks, we can hopefully achieve an optimal approach.
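The event / job / task relationship described above can be pictured with a small data model. This is only an illustrative sketch: the class names, the `make_job` helper and the `events_per_job` parameter are assumptions made for the example, not the project's actual code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """One simulated collision: the smallest unit of work."""
    event_id: int


@dataclass
class Job:
    """A collection of events whose result is uploaded as one unit."""
    job_id: int
    events: List[Event] = field(default_factory=list)


@dataclass
class Task:
    """A BOINC task: one VM session that can run several jobs."""
    task_id: int
    jobs: List[Job] = field(default_factory=list)

    def total_events(self) -> int:
        # Sum the events over every job that ran inside this task.
        return sum(len(job.events) for job in self.jobs)


def make_job(job_id: int, events_per_job: int) -> Job:
    # events_per_job is the tuning knob: it decides how often a result
    # is uploaded. The value used below is purely illustrative.
    first = job_id * events_per_job
    return Job(job_id, [Event(first + i) for i in range(events_per_job)])


# One task that runs three jobs of 50 events each.
task = Task(task_id=1, jobs=[make_job(j, events_per_job=50) for j in range(3)])
print(task.total_events())
```

Because jobs and tasks are separate objects, the number of events per job and the number of jobs per task can be tuned independently, which is the de-coupling referred to above.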

Due to the nature of the computation, this is a heavyweight application, and the virtualization aspect adds to the complexity. It also has to integrate with the computation frameworks of the experiments, which are mammoth global infrastructures.

Of course, making everything work well, especially under error conditions, is a challenge and something that we hope to improve on over time.
Profile Steve Hawker*
Joined: 6 Mar 15
Posts: 19
Credit: 142,109
RAC: 0
Message 1687 - Posted: 28 Jan 2016, 17:50:15 UTC - in response to Message 1653.  

The CMS beta application which should be identical to this one is now available in vLHC@home.


Nitpicking possibly but here I can only run one at a time, which I prefer. But at vLHC I get two tasks and then either have to mess about with a config file or manually manage the tasks. Not the same experience as here at dev.
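The config file mentioned here is presumably BOINC's `app_config.xml`, which lives in the project's folder under the BOINC data directory and can cap how many tasks of one application run at once. A minimal sketch that generates such a file's contents follows; the application name `CMS_app_name` is a placeholder (an assumption for the example), and the real name has to be taken from the client's own records for the project.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, max_concurrent: int) -> str:
    """Return the text of an app_config.xml that limits concurrent tasks."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    # encoding="unicode" makes tostring() return a str instead of bytes.
    return ET.tostring(root, encoding="unicode")


# "CMS_app_name" is a placeholder, not the real application name.
xml_text = build_app_config("CMS_app_name", 1)
print(xml_text)
```

After saving the output as `app_config.xml` in the project folder, the client has to re-read its config files (or be restarted) before the limit takes effect.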

May I also add my voice to those who do not want to migrate to vLHC. SETI, Rosetta and Einstein have their own beta projects where they harden their apps on a smaller audience of committed crunchers. There have been and are other projects that keep their sandboxes separate. Might I urge you to retain this system.

As for whether the app is ready for primetime, I'd say yes but for the continual stream of bug reports. It's always worked for me IIRC but others experience issues. Of course it's your decision but it doesn't seem baked enough to me.

Finally, seeing as I've opened my mouth, can I beg you to move away from VBox? It seems to cause more problems than it's worth. I don't understand how a collection of brilliant minds at CERN cannot build native apps for the three main platforms. I'd even be OK with a Docker version as that seems to work fine over at Cosmology. I'll continue to crunch because I like to crunch everything, but I know serious crunchers who won't come near this project because it has VBox.

Thanks for listening

Steve
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1688 - Posted: 28 Jan 2016, 17:54:02 UTC

I would like to add that I believe it is a very bad idea for vLHC to just link to the running batch of THIS project.
Profile Tern
Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1689 - Posted: 28 Jan 2016, 18:18:50 UTC - in response to Message 1681.  

Comments regarding virulent comments above :

- other projects have dedicated and separated beta projects like Seti or Einstein, why not vLHC ?

- other projects have "migrated credits" : CSG from 3 different boinc projects. Nobody died.


The beta project question I can't address, not my area.

On comments, I assume mine were the "virulent" ones - I assure you that this was not my intent, and apologize to anyone who took them that way. I strongly WANT this project (and vLHC!) to succeed, but have equally strong convictions about which projects I personally will support (or not support) with my time and CPU power, and why. If all I cared about was credits, I'd be on BitcoinUtopia full time, from project launch. Instead, I spent weeks researching whether they were something I would support. At first the answer was "no". Then I dabbled with it. Read their boards - a lot. And the opinions of other BOINC folks. Then I finally bought an ASIC and put it on the project. (Not a simple process, I promise.) Are they perfect? Nowhere close. Are they "worthy"? Well, I came to the conclusion that they are, especially once they stopped taking a "cut" of funds and made their own project just another place you _could_ contribute, if you wished. I go through this on EVERY project, before I sign up. I'm only running 22 active projects - I know people who have run over 100. Most (I think, I could be wrong) are after the hours or the credits or the cool looking sig, and aren't concerned with the long-term success of the projects they run, or even with what the projects do. Some of those folks are here - they're the ones who AREN'T posting on the boards, and will never read this. :-/

My problem with vLHC, and with CMS, although I hoped CMS would be the place where things would change, is that the Powers That Be at CERN just don't seem to "get" BOINC. They want to use it - but they want to do things "their way" while using the software, but only paying lip service to the "BOINC way", just enough to avoid controversy. That is perfectly natural, and that tendency is there at every project, but in my experience, it causes problems. Laurence seems to get my point (and Ivan does get it, I think) - but doesn't have the background experience of working with BOINC and its volunteers, so doesn't _really_ grasp the problems that I see. He's at too high a level, is the only way I can think of to put it. At least he knows there IS a "community", and that "public relations" is important. Many projects (now dead or close to it, like SZTAKI, which I fought with from day one - and am still attached to!) have never gotten there. Rosetta (my favorite project!) understood from day one. Compare my postings there to those at SZTAKI. CERN seriously needs a "BOINC evangelist" who KNOWS how BOINC works, knows the history, haunts the boards, can write clear documentation, and (most importantly) has the authority to say "we can't do that". Or at least the authority to say "if we do that, we have to COMMUNICATE, clearly and unambiguously, and in BOINC terminology, exactly what we are doing and why". Communication is key, and as great as Ivan has been, he hasn't had the answers to some of the questions, or the authority to bump things up in priority inside the project. Or to translate/modify the terminology (come on, "jobs" and "tasks"?). Nor is this his field - he's a scientist/technical guy who fell into the role and somehow did it well, simply because he was willing to devote the time and to listen and to learn.

I am frustrated because this whole "CMS-Dev" beta project seems to have solved a lot of the _internal_ problems that CERN saw, yet solved exactly what from the BOINC volunteer view? We still have no error reporting to the volunteer, no automatic host back off with excessive errors, no solution to network saturation or availability problems, no graceful exit of the last "job", problems when BOINC suspends a "task", problems when the host reboots, failures to observe BOINC preference settings, problems with VBox corruption, and - maybe least important to CERN, but MOST important in terms of getting large numbers of volunteers down the road - no coherent credit process. But the internal problems are solved, so the BOINC problems get shunted off to "later", and the app gets moved to vLHC. Hello? Wasn't the place to solve those problems _here_?

As far as CSG - I knew they had "absorbed" other projects. I was unaware that they somehow transferred the credits. I know SETI didn't, I know CPDN didn't, I know WCG didn't. However CSG managed it, if CMS does the same thing, I have no problem with it. So - who at CERN has researched the process and has a plan for doing it without causing public relations issues, and is in charge of communicating the forthcoming process to the volunteers? Oh, right. Never mind. That's relevant to BOINC, not relevant to CERN internal software issues, so they'll do that "later". Got it. Sigh. And yeah, that's probably NOT the intent, or the thought process - Laurence is way smarter than that - but it's the perception. Public Relations 101. I see my role here (self appointed of course) as being the annoying (but never virulent!) voice that keeps pushing CERN to actually look at what BOINC offers, and do what is necessary to get the most out of BOINC without "reinventing the wheel" and alienating volunteers. Hm. I just admitted to breaking the "rules" off to the left of the page, "no messages intended to annoy"! Oh well. :-)
Profile Tern
Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1690 - Posted: 28 Jan 2016, 18:27:52 UTC - in response to Message 1673.  

... the SSD thrashing which appears to be endemic to VBox, VBox itself (which seems to get 'corrupted' and require project resets way too often) ...

Using the recommended BOINC version 7.6.22 will substantially decrease the number of problems here.
For your Win10 machine: there is a known issue with the combination of Win10 and VBox 5.0.10. Upgrade VBox on your Win10 machine to 5.0.12.


This is good news. I note that until THIS posting, I didn't know that, because I haven't seen it prominently posted...

BOINC released 7.6.22 on December 30, and clearly announced that - most of my hosts have been updated, the others will be soon. All are either Mac, Linux, or Win10, no "lower" versions of Windows.

Then they RE-released it (without fanfare) on January 16, to include VBox 5.0.12. I missed that release. VBox itself of course has not asked me to update anything, nor has CMS. Now that I know, I will update as soon as I can. Thanks.

Relying on Oracle for anything, though... sigh. What happens if 6.0.1 is completely incompatible with CMS? That's the fear in the back of my mind for any project using VBox.
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1691 - Posted: 28 Jan 2016, 18:32:44 UTC

But the internal problems are solved, so the BOINC problems get shunted off to "later", and the app gets moved to vLHC. Hello? Wasn't the place to solve those problems _here_?


I agree.
I also have the impression that somebody got shouted at: "We need results!"
Then, caution was thrown to the wind and suddenly the CMS-dev development was declared fit for duty.
Not a very good practice.
Profile Tern
Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1693 - Posted: 28 Jan 2016, 19:09:09 UTC - in response to Message 1690.  

Relying on Oracle for anything, though... sigh. What happens if 6.0.1 is completely incompatible with CMS? That's the fear in the back of my mind for any project using VBox.


Quoting myself? Urg. Needed to add to that.

I'm a former Oracle-certified DBA, so I've dealt with them before on that level. I'm a former Sysadmin, so there too. I have nothing against the company, but their concern is the "big picture". Unless CERN has some serious "pull" with Oracle, CERN has _no_ control over what VBox fixes, doesn't fix, changes, etc. Of course this is also true with Microsoft and Apple, so maybe dealing with _one_ 500-pound Gorilla is better than dealing with two, but I am not so sure.

When I worked for a certain Fortune 100 company, we put all our reliance (my project, as a matter of fact, but I wasn't leading it) on some hardware and software manufactured by a third party. We were a big customer, but still a small part of their overall market. After several years of fighting to get certain fixes and changes, instead of succeeding, they made the product even LESS usable for us. Our solution was simple but expensive. We bought the whole freakin company and TOLD them to make the changes we wanted, and right now. The other choice was to go develop the replacement hardware we needed ourselves, and the associated software, which, without the in-house people with the expertise, would probably have been even more expensive. Not using that hardware at all wasn't an option, because that ship had sailed, and we were too committed to change course. I'm afraid that some day, CERN will realize that they either have to go Linux-only, or write three apps, and it will be even MORE painful then than if they just did it now. I've done cross-platform development and led teams that did it. It's not nice, or easy, but sometimes you just have to. The key is having the right people, and of course that requires the funding to pay them. If it's a funding issue, we (the volunteers) are stuck with VBox and any/all issues that it brings. All we can do is choose whether it's worth the hassles to each of us personally.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 402
Message 1699 - Posted: 29 Jan 2016, 0:56:38 UTC - in response to Message 1689.  

Bill, I accept your comments, criticisms and frustrations in equal parts. My problem is that I'm serving at least two masters (you can probably guess who), and one isn't quite as enthusiastic as the other.
Now, I've been asked to present a coherent document Wednesday week on what needs to be done to bring CMS@Home up to production quality. I'm trying to think of a way that I can submit my opinion, you the volunteers can submit your opinions, and CERN IT can comment on feasibility. Some sort of Twiki page perhaps, if there's a way to restrict it to CMS-dev Volunteers only? But I have a very urgent report to prepare for next Thursday, so I need to go into a hermit's retreat until the 5th.
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1700 - Posted: 29 Jan 2016, 1:12:21 UTC

But I have a very urgent report to prepare for next Thursday


Thanks for taking the time to reply. Much appreciated.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 1702 - Posted: 29 Jan 2016, 10:42:10 UTC - in response to Message 1687.  
Last modified: 29 Jan 2016, 12:06:14 UTC

Hi Steve,

I think that we found the reason why you got two tasks for the beta and changed the configuration so hopefully you will only get one in the future.

The direction that we would like to go is to have one LHC@home BOINC project which has different applications for the different experiments. With this in mind we have a number of options for the testing and development environment.

  • beta apps in the LHC@home project
  • a test/devel LHC@home project
  • separate projects per application



Which option is preferred?

I agree that this app is not quite ready for primetime but I hope that by starting a process of continuous improvement we can get there soon.

Unfortunately we need the virtualized approach and VBox is what the work has been based on up to now. We can try to address the issues but if this becomes a blocking issue we can review our options.

Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 1707 - Posted: 29 Jan 2016, 12:05:46 UTC - in response to Message 1689.  

Hi Bill,

First of all I think it is important to highlight that we all agree on the importance of communication. Feedback is welcome and critical for the success of these projects.

The challenge that we have (and you have pointed out the perception of the consequences) is that we are trying to marry two paradigms: the BOINC world and the mammoth experiment frameworks. Such partnerships require compromises on both sides to succeed, but without the community there is no volunteer computing. Understanding the community is an important, unique and interesting aspect of this.

The improvements that you mention will be investigated and hopefully some fixes will be available next week:


  • no graceful exit of the last job
  • no error reporting to the volunteer
  • no automatic host back off with excessive errors
  • no solution to network saturation or availability problems
  • problems when BOINC suspends a task
  • problems when the host reboots
  • failures to observe BOINC preference settings
  • problems with VBox corruption
  • no coherent credit process

Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 1708 - Posted: 29 Jan 2016, 12:24:31 UTC - in response to Message 1699.  
Last modified: 29 Jan 2016, 12:42:58 UTC

Hi Ivan,

Yes, independent and authoritative feedback from the volunteers would be very good to have. My suggestion would be to form some kind of community representation with our most active and vocal volunteers. Can they self-organize this?

EDIT: I would suggest we create a new thread on this topic.
[AF>Belgique] MK
Joined: 5 Jan 16
Posts: 3
Credit: 1,385
RAC: 0
Message 1713 - Posted: 29 Jan 2016, 17:15:41 UTC - in response to Message 1669.  
Last modified: 29 Jan 2016, 17:18:15 UTC

Hi MK,

I think that this is already the case. For those who have more powerful and dedicated setups we can create multicore versions of the app, e.g. 2 cores, 4 cores, etc.



Hi Laurence,
my computer is currently running 2 CMS VMs at the same time, so there is no limit of 1 CMS VM per computer. It would be great if you could add a setting for the number N of VMs running CMS; I believe this must be easy to implement.
Profile Tern
Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1714 - Posted: 29 Jan 2016, 17:54:51 UTC - in response to Message 1702.  

Hi Steve,
The direction that we would like to go is to have one LHC@home BOINC project which has different applications for the different experiments. With this in mind we have a number of options for the testing and development environment.

  • beta apps in the LHC@home project
  • a test/devel LHC@home project
  • separate projects per application



Which option is preferred?



Not Steve, but you know I have to put in my $0.02! :-)

The "one LHC project" (for production apps) is ideal. No sense having umpteen projects that come and go with different apps as needed. Agree totally with this plan.

Beta apps in the LHC@Home project.... eh. It's okay if you never have more than one app in beta at a time, AND if you set up threads in the message boards specific to the beta app. Looking at the LHC message boards, this would be a problem - quite frankly, they are a mess - stickied threads with no postings for years, threads totally off topic, threads with "critical" titles (stickied even) giving solutions to problems that no longer exist, hanging unanswered questions... there has obviously been no effective attempt to keep up with the LHC message boards already, so adding more there is probably not a good idea.

A test/development project - ideal! (IMHO) This keeps everything much simpler. However... I would hope that a lesson has been learned HERE. Namely - don't put the app into LHC@Home until it is truly "ready for production", both from a CERN standpoint AND from a BOINC standpoint. What has been done here is a combination of having a beta project _and_ having (the same) beta app in the main project. Not good.

New project per beta app - not bad, but not good, really, because you wind up having to "invite" participants all over again. If the beta project persists but just doesn't have any work, well, your volunteers are still attached, and as soon as you GET a new app, you've got a built-in pool of people to help you with it. Plus there are all the hassles of setting up a project over and over again.


I agree that this app is not quite ready for primetime but I hope that by starting a process of continuous improvement we can get there soon.


BRAAPPP! If it's not ready for primetime, (which I think we all agree on)... then why is it moving from here? If it was a few niggling bugs, I could see doing this, if the app is critical and you really, really need massive BOINC participation ASAP, and know you'll have issues to deal with, and are willing to put up with it. But you're putting it there as a beta app that not everyone will be running, so that explanation doesn't fly. What you're doing is setting the completion date BACK! You'll gain a few more volunteers over there who, (unless they've already found it there) are not familiar with the app yet and will have a learning curve. You're losing some of the volunteers from here who aren't going to move. You're losing the separate message boards focused solely on this app - and unless you manually create new threads there and transfer over all the "documentation" that has been created here into them, then new people will be even more lost. I also don't get the feeling that this move is being made to get more volunteers for testing; you could do that by putting up an "invite" at LHC@Home, and get them to come _here_. Honestly, I can't figure out any logical reason for making the move at all!

Why hasn't the "continuous improvement" process already started? As in here? So far, we've seen patches and hacks and individual bug fixes (many were BOINC or VBox upgrades, not even app changes!) and had hints that a lot of work has gone on in the background making the app more usable for CERN. But honestly, we've seen very little improvement from OUR viewpoint. That's fine, and expected - you've got to get it right on your end first, obviously. But... the hope was that after it was "good" on your side, the other problems would get attention. Maybe that's the intent still, but this FEELS like "if we move the app, the new guys won't know that we've known about these problems for months and haven't fixed them yet, and will be more patient". Someone on LHC said that the boards here (CMS) were a "whine fest". That may be true... (whine! whimper!) but isn't that the POINT of having a beta? To get the (often negative) feedback from a small set of users before throwing the app out in front of thousands? If all I wanted was a bunch of "wow, that looks great", I'd never put an app in beta, I'd only show it to friends and family...

That is another advantage, BTW, of having a separate beta project - people on LHC are already USED to dealing with VBox, and the 24-hour tasks, and the incoherent credit (a popular topic over there). Thinking about it, maybe from a self-important view, if CMS had been a beta app within LHC, I never would have seen it, and you'd never have had the pleasure of getting my wonderful feedback... :-) From a non-self-important view, there are others here (10% or so, wasn't it?) who haven't dealt with anything CERN yet, and so you got to hear again, in a smaller forum, about pre-existing issues that maybe over there, people have learned to put up with - or have given up on and bailed. The fact that they aren't "ongoing issues" at LHC doesn't mean they aren't issues, just that they aren't being brought up any more.
Profile Tern
Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1715 - Posted: 29 Jan 2016, 18:55:46 UTC - in response to Message 1707.  

Hi Bill,

First of all I think it is important to highlight that we all agree on the importance of communication. Feedback is welcome and critical for the success of these projects.


If I didn't know that you valued feedback, I wouldn't be giving any! There are many projects where I don't even bother posting any more, because I know it's pointless.

The challenge that we have (and you have pointed out the perception of the consequences) is that we are trying to marry two paradigms: the BOINC world and the mammoth experiment frameworks. Such partnerships require compromises on both sides to succeed, but without the community there is no volunteer computing. Understanding the community is an important, unique and interesting aspect of this.


It's been a learning curve for me as well, understanding that other paradigm, and how it affects the BOINC side. I really do "get" why the choices that have been made have been made, even though I don't agree with some of them.

The improvements that you mention will be investigated and hopefully some fixes will be available next week:


  • no graceful exit of the last job
  • no error reporting to the volunteer
  • no automatic host back off with excessive errors
  • no solution to network saturation or availability problems
  • problems when BOINC suspends a task
  • problems when the host reboots
  • failures to observe BOINC preference settings
  • problems with VBox corruption
  • no coherent credit process



Those were off the top of my head, maybe someone else can add to the list? I know I may be beating a dead horse here, but ALL of these (except VBox, which from the LHC boards also seems to be the problem with not observing preferences - a change Oracle made for security!) could be solved with one change. Namely, and oversimplified, quit "doing your own BOINC inside VBox" instead of using the built-in BOINC features. I think you can see how the other problems would be reduced or eliminated if you look at it from the simplest point - the credit issue. So, time to deal with it or not, here are my thoughts.

CMS (and T4T? I think so) give a fixed or semi-fixed amount of credit per task regardless of how much real work was done. That is because a task can "idle" for hours, doing nothing but taking up a slot. Errors aren't reported to the volunteer because one or two job failures aren't a big problem, and a task failure (with no credit) after running for 23 hours would cause a stink. Credit CAN'T be given for actual work done, because person "A" might get lucky and get a ton of jobs in his task, while person "B" does nothing for 24 hours, through no fault of his own. Network load is continuous and heavy because the task is looking for work to do, and then reporting each job's results independently.

The fix.

1) When the task is created on the server, whatever jobs are available right then (up to some count) are loaded into the task. The task is much "larger", but that's okay - BOINC can handle that. There is ZERO network traffic looking for work, because it's already there. To throw a sop to the "cheating" thread from LHC, one of the jobs could be a "dummy", that you already know the output from and can check for.

2) Instead of uploading results as they become available from each job, write them to a file. Zero network traffic again, but a large (maybe huge) file to be uploaded when the task is complete. That's okay. BOINC can handle that (including automatic retries, backoffs, observing network preferences...)

3) If there are excessive job failures within the task, the task gets a "computation error" and quits. That lets BOINC handle host back off, tells the volunteer he has a problem, etc.

4) When the last job is completed (no "last job" issues!) the task is completed. No hanging around for hours doing nothing. BOINC sends the task back to CERN, with the job results.

Credit is based on (gee!) the time it took to complete the task and the benchmarks, or CreditNew, or whatever the flavor of the week is. Faster processors complete the task faster and get more credit, because more work was done per unit time. No "slot rent". Task size can be varied by changing the number of jobs embedded, raised or lowered as makes sense to the project.
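The four steps of the proposed fix can be sketched roughly as follows. This is an illustration of the idea only, not CMS or BOINC code: `run_task`, `ComputationError`, `simulate`, the job dictionaries and the `max_failures` threshold are all hypothetical names and values chosen for the example.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


class ComputationError(Exception):
    """Raised when too many jobs inside one task have failed."""


def run_task(
    jobs: List[dict],
    run_job: Callable[[dict], dict],
    results_path: str,
    max_failures: int = 2,
) -> Dict[str, int]:
    """Run every pre-loaded job locally and write all results to one file."""
    results = []
    failures = 0
    for job in jobs:  # step 1: the jobs arrived with the task, no polling
        try:
            results.append({"job_id": job["job_id"], "output": run_job(job)})
        except Exception:
            failures += 1
            if failures > max_failures:  # step 3: fail the whole task
                raise ComputationError(f"{failures} jobs failed")
    with open(results_path, "w") as fh:  # step 2: one file, uploaded at the end
        json.dump(results, fh)
    # step 4: returning here ends the task right after the last job
    return {"completed": len(results), "failed": failures}


def simulate(job: dict) -> dict:
    # Stand-in for the real simulation; only echoes the event count.
    return {"events_done": job["n_events"]}


out_path = os.path.join(tempfile.mkdtemp(), "results.json")
summary = run_task(
    [{"job_id": i, "n_events": 10} for i in range(3)], simulate, out_path
)
print(summary)
```

In this arrangement the only network activity is the initial download and the final upload, both of which the BOINC client would handle; whether the experiment frameworks can tolerate results arriving in one batch is a separate question.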

I _THINK_ that the task suspension and reboot problems are really network problems - namely, this is just another way that network availability can be lost for a while. So if network availability isn't critical, these problems go away too.

If job results are "time critical" - i.e., you need to get them the second they are available - then your current system makes perfect sense. If you can wait a couple of hours, then it doesn't, it's just there because that's how it's done outside of BOINC and nobody wanted to change it. Which is, again, fine - that's a project decision - but CERN _must_ be aware of the ramifications of that decision - which is the list of "bugs" above, that will never really be "solved", no matter how many band-aids get slapped on them. Is that a problem? Not really. Just put them on the LHC home page so volunteers know what they are getting into, and expect to have a smaller number of volunteers than you would otherwise have. If that tradeoff is okay, if making these changes would be too difficult for the return expected, fine! The point is to make the decision based on the data, not because "that's the way we've always done it before BOINC".

From my reading the LHC boards, the CMS issues are the same ones for T4T. Maybe the NEXT beta project (which of course really should have been this one) can investigate "doing it different", either what I've described, or what someone else comes up with... The nice thing is that most of the changes could be done one-at-a-time and tested, especially in a beta project where glitches are expected. I obviously haven't seen your code, but my "gut feel" is that most of these changes would be fairly simple to implement. Read from/write to a file instead of a URL? Minor. Creating and interpreting the files, yeah, a bit more complex, but not unreasonable. The other stuff is all BOINC-specific application code, nothing that would be there for any other platforms, so somebody at CERN wrote it just for this, hopefully based on the BOINC sample apps, and should be able to change it.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 1725 - Posted: 29 Jan 2016, 23:39:50 UTC - in response to Message 1715.  

Hi Bill,

I personally do not see this project as a whine-fest. Negative feedback is more likely to result in change and hence improvements.

We potentially have 6 LHC related applications (SixTrack, Test4Theory, ALICE, ATLAS, CMS and LHCb) and as a result could have between 1 and 12 projects. I think you have convinced me that the best scenario would be to have two projects, one production and one development, as they represent different communities. I will create a poll with the possibilities to see what the consensus is on this.

With respect to the beta, maybe we have different semantics. For me beta means no guaranteed service level and hence things may break. The expectations on a beta app or dev project should be similar. However, although most CMS-dev volunteers are also in vLHC@home, I understand the point of different communities and focus. What about renaming this project to LHC-dev or something similar? Has any other project renamed itself?

To paraphrase your suggestion, it is essentially to use BOINC as a batch system as originally intended. ATLAS@home is doing this and seems to be doing quite well. The other model is closer to the cloud paradigm, and BOINC in this context is an elastic provisioning system for cloud resources, upon which we overlay a batch system. We can't really do a discussion on this topic justice in a forum thread.

Credit is always a complex topic. With clouds, metering is based on the weighted wall time of a slot (a reservation on the system); the CPU time represents the value that is extracted from that reservation. The reservation is under the control of the volunteer, but the efficiency is more dependent on the application itself.
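
As a rough illustration of that distinction (a sketch only; the `weight` factor, the function names and the numbers are made-up assumptions, not how any project actually computes credit): metering charges for the reserved wall time, while efficiency is the fraction of that reservation that ends up as CPU time.

```python
def metered_usage(wall_hours: float, weight: float = 1.0) -> float:
    """Weighted wall time of a slot. `weight` is a hypothetical per-slot factor."""
    return wall_hours * weight


def efficiency(cpu_hours: float, wall_hours: float) -> float:
    """Fraction of the reserved wall time that was actually spent on the CPU."""
    if wall_hours <= 0:
        raise ValueError("wall_hours must be positive")
    return cpu_hours / wall_hours


# Example: a 24-hour reservation in which the application used 18 CPU-hours.
usage = metered_usage(24.0, weight=1.5)
eff = efficiency(18.0, 24.0)
print(f"metered: {usage} weighted hours, efficiency: {eff:.0%}")
```

The volunteer controls the first number (how long the slot is reserved); the application mostly determines the second.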

The suspension and reboot problems are quite complex as the jobs contain monitoring. If the jobs don't report, the frameworks consider them zombie jobs. So it is more of a time issue than a network issue. Fixing issues like this is non-trivial and the kind of work that is not seen here.
ID: 1725 · Rating: 0
Tern
Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1729 - Posted: 30 Jan 2016, 2:33:09 UTC - in response to Message 1725.  

Hi Bill,

I personally do not see this project as a whine-fest. Negative feedback is more likely to result in change and hence improvements.


You're very welcome. :-)

We potentially have 6 LHC-related applications (SixTrack, Test4Theory, ALICE, ATLAS, CMS and LHCb) and as a result could have between 1 and 12 projects. I think you have convinced me that the best scenario would be to have two projects, one production and one development, as they represent different communities. I will create a poll with the possibilities to see what the consensus is on this.


Excellent! It _can_ be done "in project", and that's probably simpler from the project's viewpoint, but there's a reason SETI and Einstein set up separate projects for beta testing. There are definite advantages. I'm pretty sure how the poll will come out, but by all means get a consensus!

With respect to the beta, maybe we have different semantics. For me beta means no guaranteed service level and hence things may break. The expectations on a beta app or dev project should be similar. However, although most CMS-dev volunteers are also in vLHC@home, I understand the point of different communities and focus. What about renaming this project to LHC-dev or something similar? Have any other projects renamed themselves?


We agree on "beta"... the confusion/annoyance on my part is having it be beta in BOTH places. Why bother with CMS-Dev at all in that case? <shrug>

I can't think offhand of any projects that have just renamed themselves, but it _shouldn't_ be difficult on the BOINC side (I could be wrong!) The problems will come on the analysis and stats sites, as each of them will have to make the changes separately. Free-DC already calls this "LHC Dev@Home", for whatever reason (maybe clairvoyance?), so it would just be a URL change for Bok. Boincstats calls it "CERN CMS-dev", so the change would be more complex, but Willie can handle it. (I can see his reaction when he hears I've said that...) BOINC Synergy calls it "CMS dev" and will probably be the problem site, as I'm not sure anybody is even doing any maintenance there any more - there are projects I've been running for months that they still haven't added. Those are the three I use, I know there are many more. Still, that's _their_ problem.

The analysis site, WuProp, calls it "CMS-dev" and the application is "CMS Simulation". (I have, at this instant, 3,722.38 hours recorded 'done' on your application - more than any other but Rosetta, but of course WuProp didn't exist when I started, and I didn't find them until recently, or SETI would be highest). If you're not familiar with them, you can find out exactly how each CPU brand and model perform on YOUR app. Great site, tons of data, but hard to navigate, and has no clue about jobs and tasks, so probably not real useful to you anyway.

If it's not a big deal to just close "CMS-Dev" and create a new "LHC-Dev", rather than renaming it, that would obviously be easier on the third-parties. It'd be better for those of us with existing credit, on the other hand, if you managed to rename it. (Which as an interested party, I'd vote for.) "Inactive Projects" are always a drag on sigs and stats. I'd go with whatever is easier for you, which means talking to some BOINC staff about how a rename would work.

To paraphrase your suggestion, it is essentially to use BOINC as a batch system as originally intended. ATLAS@home is doing this and seems to be doing quite well. The other model is closer to the cloud paradigm, and BOINC in this context is an elastic provisioning system for cloud resources, upon which we overlay a batch system. We can't really do a discussion on this topic justice in a forum thread.

Credit is always a complex topic. With clouds, metering is based on the weighted wall time of a slot (a reservation on the system); the CPU time represents the value that is extracted from that reservation. The reservation is under the control of the volunteer, but the efficiency is more dependent on the application itself.


But when you reserve time on a cloud computing system, YOU specify "I want 24 hours on each of 100 cores of your Xeon E5-2699s, at $x per core-hour." Then the efficiency is up to you and your app. With BOINC, you're saying "I want 24 hours on one core of some random processor that may be an i7-4790K or may be a 10-year-old Celeron, I don't care which, and I'll pay the same (in credits) for either. I'm willing to have zero control over whether the computer in question does one job for me or 100." Great, but don't be surprised when many of your volunteers deliberately put your app on their _slowest_ machines and reserve the fast ones for projects that "pay" based on work done... My Linux box is a Gigabyte Brix AMD quad-core APU. Very slow, but it was $300 _complete_. "Harpoon" (Win10) is an i7-5820K six-core, twelve-thread, water cooled and overclocked to 4GHz, with a fast GPU and an ASIC hanging off of it. It gets me almost 3,000,000 credits per day, easily. It was, um, more than $300... Which of those would any "normal" (i.e., not me...) volunteer put CMS on, given that either system is going to return the exact same credit per day, while tying up a core slot that could be used for something else? Uh... let me think...

(I just looked. Brix has 15,273 CMS credits, while Harpoon has 38,110. So I'm not real bright about credits...)

The suspension and reboot problems are quite complex as the jobs contain monitoring. If the jobs don't report, the frameworks consider them zombie jobs. So it is more of a time issue than a network issue. Fixing issues like this is non-trivial and the kind of work that is not seen here.


AHA! See, now this all begins to make some sense. That's the piece I was missing, the job monitoring and frameworks issue. That makes everything more complicated, so everything I've come up with is now obviously oversimplified. That will require me to put on my analyst hat for a while... I'll get back with you offline, if I come up with anything that might be useful.
ID: 1729 · Rating: 0
Tern
Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1731 - Posted: 30 Jan 2016, 9:37:02 UTC

Just some stats for why a separate "dev" site might be better than using vLHC@Home: (In addition to the message board issue.)

CMS-Dev
Users 189
Active users 99 (52.38%)

vLHC@Home
Users 14,458
Active users 2,349 (16.25%)

"Active users" is anyone granted credit in the last month. Any project with over 50% "active" is doing VERY well and probably has a lot of very involved volunteers. Data from BOINCStats.
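
For anyone who wants to check the percentages, here is a quick sketch in Python. The function name `active_share` is just illustrative; the counts are the BOINCStats figures quoted above.

```python
def active_share(active_users: int, total_users: int) -> float:
    """Percentage of registered users granted credit in the last month."""
    return round(100.0 * active_users / total_users, 2)


# (active users, total users) per project, as quoted from BOINCStats.
projects = {
    "CMS-Dev": (99, 189),
    "vLHC@Home": (2349, 14458),
}

for name, (active, total) in projects.items():
    print(f"{name}: {active_share(active, total):.2f}% active")
```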
ID: 1731 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 861,827
RAC: 35
Message 1732 - Posted: 30 Jan 2016, 10:19:51 UTC - in response to Message 1731.  

Just some stats for why a separate "dev" site might be better than using vLHC@Home: (In addition to the message board issue.)

CMS-Dev
Users 189
Active users 99 (52.38%)

vLHC@Home
Users 14,458
Active users 2,349 (16.25%)

"Active users" is anyone granted credit in the last month. Any project with over 50% "active" is doing VERY well and probably has a lot of very involved volunteers. Data from BOINCStats.

It's not a very well documented comparison, IMHO. It has a lot to do with how long a project has existed.
vLHCathome (formerly Test4Theory, so renaming is not a problem) has existed for 4½ years now since its alpha phase.

E.g.: SETI@Home: active users 8.89%
World Community Grid: active 11.26%
ID: 1732 · Rating: 0
Tern
Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1734 - Posted: 30 Jan 2016, 10:58:18 UTC - in response to Message 1732.  

It's not a very well documented comparison, IMHO. It has a lot to do with how long a project has existed.
vLHCathome (formerly Test4Theory, so renaming is not a problem) has existed for 4½ years now since its alpha phase.

E.g.: SETI@Home: active users 8.89%
World Community Grid: active 11.26%


True - in the first month, every project is at 100% active. My point isn't that vLHC is "low", because it's not, it's about right. The issue is that CMS-Dev is "high", which speaks well for the volunteers who self-selected; most have stuck around. That means most will probably continue to stick around. At least 99 of 'em! That is true at the other "dev" projects as well; the volunteers are more involved than those at "production" projects.

Yay for renaming! :-)
ID: 1734 · Rating: 0
©2024 CERN