21) Message boards : Number crunching : Credit Per App statistics (Message 5043)
Posted 10 Jul 2017 by Profile Tern
Post:
Bump.
22) Message boards : Number crunching : Credit Issues (Message 5031)
Posted 4 Jul 2017 by Profile Tern
Post:
the focus is on preparing new releases, testing new features and debugging issues


Yes... but "issues" includes ANYTHING "BOINC-related", because these sub-projects will eventually move to production and have the same problems if they aren't fixed here. Thus giving low credits will continue to be a problem there, if you don't work out how many credits should be granted...

Also, I suppose the 7 other people currently running Theory tasks here are all either employees or really really dedicated to helping this specific project, and that's great - and if you can do all your development and testing and debugging with that small a number of users, and a limited number of sample hosts, well, congratulations! The point is that some of us came here because we want to help, but you seem to be doing your best to chase us off. (How many users have signed up but have earned NO credit in the last, say, three months?) Wouldn't it be better to have hundreds of users to test for you? WuProp says I gave 5000+ core-hours to -dev last year. About 500 this year. Stats site says I'm "number 31" in work contributed on this project - but of course, that may be WAY wrong, since it's not being updated...

If the goal of this project is as you originally stated when you started up, to get BOINC volunteers to help, by testing your programs on Macs, different flavors of Windows, Linux, etc., different CPUs, different environments - then you have to actually FIX problems (i.e., credits) that are pointed out, over and over and over again, instead of just saying "if you don't like it, go away and pester the production project". Or you wind up with a handful of people running Linux on Xeons in-house, and production learns the hard way that, say, Theory won't work on Windows 10... Oops, we didn't have any volunteers to test that!

These ARE things that CAN be fixed, I would assume very easily. Exporting the credits and badges in XML correctly per BOINC standards, as EVERY other project does? And as YOU did, until apparently a month or so ago? Um. Yeah.

Think of it this way. It's payday, but instead of direct-depositing your money in the bank, your boss sends you an email that says "you worked x hours and earned $x this month, great job, you really helped the company". What can you do with that email? Nothing! That's the difference between credits earned here that show up ONLY here, and credits that are exported so they can be "spent" - seen in our sigs, etc. You're already paying way below "minimum wage", now you're insulting us by not even giving us that pittance! If credits are as worthless as you say they are, then why hoard them like gold? (Or anti-matter...) That's not even considering the "out of business" sign on the front door (stats sites showing you "offline" all month) that I guess we should all just somehow know to ignore, and come in to work anyway...
23) Message boards : Number crunching : Credit Issues (Message 5026)
Posted 3 Jul 2017 by Profile Tern
Post:
Another problem - compare these two Theory tasks:

Task 347055 | Work unit 334337 | Computer 1700 | Sent 1 Jul 2017, 10:39:22 UTC | Reported 2 Jul 2017, 4:16:44 UTC | Completed and validated | Run time 59,655.44 s | CPU time 126,154.80 s | Credit 538.15 | Theory Simulation v3.02 (vbox64_mt_mcore) x86_64-pc-linux-gnu

Task 347053 | Work unit 334335 | Computer 873 | Sent 1 Jul 2017, 10:37:14 UTC | Reported 1 Jul 2017, 23:56:37 UTC | Completed and validated | Run time 45,720.30 s | CPU time 104,398.40 s | Credit 1,236.30 | Theory Simulation v3.02 (vbox64_mt_mcore) x86_64-apple-darwin

Same day, same time, work units only 2 digits apart, Linux box took longer to run (wall and CPU) but got < 1/2 the credit? Your credit algorithm is SERIOUSLY screwed up. I can't imagine a way that these two tasks were so much different that you can reasonably say the Mac could have done "over twice as much work" in that time.

Suggested fix: If they're all similar, just give 4000 credits per Theory task and be done with it.
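
For anyone who wants to check the arithmetic on the two Theory tasks above, here is a minimal sketch in Python. It assumes the three numbers after "Completed and validated" are run time (seconds), CPU time (seconds) and credit, as on the standard BOINC task list. The function name credit_per_cpu_hour is made up for illustration and is not part of any BOINC tool.

```python
def credit_per_cpu_hour(cpu_seconds: float, credit: float) -> float:
    """Credit granted per hour of CPU time."""
    return credit / (cpu_seconds / 3600.0)

# Figures copied from the two task rows above.
linux_rate = credit_per_cpu_hour(126_154.80, 538.15)
mac_rate = credit_per_cpu_hour(104_398.40, 1_236.30)

print(f"Linux: {linux_rate:.1f} credits per CPU-hour")
print(f"Mac:   {mac_rate:.1f} credits per CPU-hour")
print(f"Mac/Linux ratio: {mac_rate / linux_rate:.2f}")
```

On these numbers the Linux task earned roughly 15 credits per CPU-hour and the Mac task roughly 43, so the gap per unit of work is even wider than the raw credit totals suggest.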
24) Message boards : Number crunching : Credit Issues (Message 5025)
Posted 3 Jul 2017 by Profile Tern
Post:
MAJOR problem: Credits here are not being exported in XML, thus are not showing up at the statistics sites, thus are not being updated in our sigs. Likewise badges are not being exported. All work done here is just going into a "black hole" and not being properly credited to the volunteers.

Side effect: The stats site that I use for "what do I need to run now" has shown LHCathome-dev as being "offline" now for over a month. Therefore, I haven't bothered doing any -dev work for the last month. I was surprised when I hit "allow new work" on -dev on one host, more or less by accident, and actually GOT new work, since the project has been showing as "offline" for so long - I'd assumed there were major problems here (which obviously there are) or that you'd shut down! If nobody knows you exist and are functioning - even long-time contributors - it's hard to get volunteers to do the work... and if the volunteers are not going to get the credit in their sigs even if they DO the work...

Has the project even noticed that there aren't any volunteers, and maybe wondered why?

-----

So-so problem: Credits here are still EXTREMELY, ludicrously, insultingly, low. Example:

Task 347053 | Work unit 334335 | Computer 873 | Sent 1 Jul 2017, 10:37:14 UTC | Reported 1 Jul 2017, 23:56:37 UTC | Completed and validated | Run time 45,720.30 s | CPU time 104,398.40 s | Credit 1,236.30 | Theory Simulation v3.02 (vbox64_mt_mcore) x86_64-apple-darwin

vs:

Task 4796778 | Work unit 2286190 | Computer 1055 | Sent 1 Jul 2017, 17:31:29 UTC | Reported 2 Jul 2017, 10:22:37 UTC | Completed and validated | Run time 13,983.65 s | CPU time 40,417.26 s | Credit 5,258.61 | Amicable Numbers up to 10^20 v2.00 (mt) x86_64-apple-darwin

Both were mt tasks using 3 cores. Same machine, a few hours apart. -dev task took roughly 3x as long wall time, almost 3x as much CPU time... and got 1/4 the credit. THAT'S A FACTOR OF 12! Even if you think Amicable Numbers gives "too much" credit, comparing to almost ANY other project will give similar results, just not as extreme. As we've tried to tell you since the beginning, you need to at LEAST triple, preferably quadruple (or more), the credit you give, to even be in the ballpark of "correct".
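
The "factor of 12" is the rounded product of "3x the time" and "1/4 the credit". A quick sketch in Python, using only the figures from the two task rows above, shows how the exact ratios come out. The variable names are illustrative only.

```python
# (wall seconds, CPU seconds, credit), copied from the two task rows above.
theory = (45_720.30, 104_398.40, 1_236.30)
amicable = (13_983.65, 40_417.26, 5_258.61)

wall_ratio = theory[0] / amicable[0]     # how much longer Theory ran
cpu_ratio = theory[1] / amicable[1]      # how much more CPU time Theory used
credit_ratio = amicable[2] / theory[2]   # how much more credit Amicable paid

# Gap in credit per unit of time: Amicable's rate divided by Theory's rate.
rate_gap_wall = wall_ratio * credit_ratio
rate_gap_cpu = cpu_ratio * credit_ratio

print(f"wall time ratio:   {wall_ratio:.2f}")
print(f"CPU time ratio:    {cpu_ratio:.2f}")
print(f"credit ratio:      {credit_ratio:.2f}")
print(f"rate gap (wall):   {rate_gap_wall:.1f}")
print(f"rate gap (CPU):    {rate_gap_cpu:.1f}")
```

Measured by CPU time the gap is about 11, and by wall time about 14, so 12 is a fair round number either way.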

Note the "users in last 24 hours" for Theory: at the moment it's 8. Now look at Amicable Numbers: it's 218. Which project is doing more useful work? Hmm... Which project is MUCH newer, thus has had much less time to attract volunteers? Yep. Amic has only been around a few months. So for all (7?) of you who think "credits don't matter, it's all about the work being done" - great philosophy, but lousy real-world outcome.

-----

Annoyance: My Mac and Linux hosts are churning out Theory tasks, however slowly and for however little payout. My Windows host that's available to -dev right now was returning nothing but errors, even after resetting project. Yes, it's the latest BOINC and VBox, but there's no incentive on my end to further trace the problem and try to fix it, I just hit "no new work" again.
25) Message boards : Number crunching : Specifications for each application (Message 4398)
Posted 2 Dec 2016 by Profile Tern
Post:
ALICE is also not covered in the FAQ.
26) Message boards : Number crunching : Respect My Limits! (Message 4305)
Posted 9 Nov 2016 by Profile Tern
Post:
Simple fix for memory and disk space issues. PUT IT ON THE PREFERENCES PAGE!

If I'd KNOWN that each task required 2GB RAM, I would have known to only allow one at a time. If I'd KNOWN that each application I ran would require 5GB disk, I would have selected only one application at a time to run. COMMUNICATION!!!! :-)

(Would be nice to put up there that 'does not abide by BOINC memory limitations', too. But then the home page for the project still doesn't even mention that VirtualBox is required - volunteers don't find that out until after joining...)
27) Message boards : Number crunching : Credit Issues (Message 4304)
Posted 9 Nov 2016 by Profile Tern
Post:
Just had a vLHC task (#283824) that ran 129,064.72 seconds. (Almost 36 hrs.)
Got 275.59 credits for it.
RIDICULOUS!

PrimeGrid tasks that run that long get between 4 and 5 THOUSAND credits. ClimatePrediction likewise.
THIS MUST BE FIXED! There is NO reason for anyone to waste their CPU cycles on this project if this is the reward they're going to get. Yeah, I know the project doesn't care about credits. Guess what? The users do. No users, no project. Don't pull the "CPU time vs wall time" bit, either. Not the user's problem if the project doesn't do anything with the 36 hours they were given.

Benchmark and Theory and CMS are all 'low credit' but at least REASONABLE. Obviously vLHC is not. I'm giving vLHC one more task, then I'll move on to check ALICE, then probably shelve -dev for another few months. Still hoping you get your act together.
28) Message boards : Number crunching : Respect My Limits! (Message 4291)
Posted 6 Nov 2016 by Profile Tern
Post:
My cycle brought me back to this project and I turned on the new applications. No problems (other than those I've expressed before, design or VBOX related) with CMS, THEORY, or BENCHMARK. Have not yet got an ALICE task to complete successfully, will do that next, have concentrated on LHCb.

Problem: Estimated time on download shows approx 50 minutes. When it starts running, this quickly climbs to 1:10:30:00 or more (over one day, almost one and a half) which causes BOINC scheduler problems - I don't get work I need for other projects until the last minute. Then to aggravate the situation, the job completes successfully anywhere from 3.4 to 6.8 hours later (although estimate never drops accordingly).

Problem: Disk space required is absurd. Very quickly climbs to at least 2GB, eventually will pass 4.5GB! I had to raise the BOINC allocation to even get these to run (found out quickly not to run more than one dev app at a time) and think some crashes may be due to running out of available disk space.

MAJOR problem: Does not abide by BOINC memory utilization settings. Requires 2.07GB RAM. With 4GB present, set at "50% when in use", not only will LHCb not suspend "Waiting on Memory", but other projects will also not suspend. (Appears they don't know how much LHCb is taking - they do suspend if LHCb not present.) Trying to run two LHCb tasks (which should have one running and one waiting) instead runs both and crashes.
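
To spell out the memory numbers, here is a rough sketch in Python of the check I would expect to happen. This is NOT BOINC's actual scheduler code, only a simplified model: the function name, the greedy first-come order and the 90% figure in the second call are assumptions made for illustration.

```python
def tasks_allowed_to_run(total_ram_gb, max_fraction, task_ram_gb):
    """Greedy model: admit tasks in order while they fit in the RAM budget."""
    budget = total_ram_gb * max_fraction
    running, used = [], 0.0
    for index, need in enumerate(task_ram_gb):
        if used + need <= budget:
            running.append(index)
            used += need
    return running

# 4 GB host, "50% when in use", two LHCb tasks at 2.07 GB each:
# the budget is 2.0 GB, so neither task fits and both should wait.
in_use = tasks_allowed_to_run(4.0, 0.5, [2.07, 2.07])
print(in_use)

# Same host with a hypothetical 90% limit: one task runs, one waits.
idle = tasks_allowed_to_run(4.0, 0.9, [2.07, 2.07])
print(idle)
```

Under this model the first call admits no tasks and the second admits only the first task, which is the "one running, one waiting" behavior described above.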
29) Message boards : News : Poll (Message 1927)
Posted 9 Feb 2016 by Profile Tern
Post:
What are sub-projects? Is it just this or something else?

https://boinc.berkeley.edu/trac/wiki/PerAppCredit


That is the server-side view of it, yes. For a user-side view, PrimeGrid has sub-projects (14 I think right now, plus several that are retired), each of which gets a "badge" when enough credits are earned in that sub-project; within each sub-project they have multiple applications (which may not be necessary at vLHC) - for example, their "Genefer" sub-project has several applications numbered GFN-15 through GFN-22, based on one of the parameter values of the prime search. Volunteers can choose to run any/all of the applications, depending on their computer power (GFN-15 is pretty quick, GFN-22 can take many days to run, even on my fastest host!!!). Regardless of _which_ "Genefer" applications they run, the credit goes toward a "GFN" sub-project badge.

I would see T4T as one sub-project and CMS as another. You'd only really have one application for each (AFAIK right now) - version numbers don't matter, so whether it's CMS v0.4 or CMS v1.5, it's still CMS. If some future sub-project XYZ wants to simulate two different things, say magnet configuration and atomic mass of target material, XYZ might have apps "XYZ Magnets" and "XYZ Target" - both would go toward earning the volunteer credit toward their "XYZ" badge color.

Likewise if future sub-project ZZZ has CPU and GPU versions of the application - two different applications but one sub-project, and one badge. Users can choose to run either or both applications.
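
The application/sub-project relationship described above can be sketched as a simple lookup plus a sum. This is a Python illustration only: the mapping reuses the example names from this post, the credit values are made up, and nothing here reflects how the BOINC server or PrimeGrid actually stores per-app credit.

```python
from collections import defaultdict

# Application -> sub-project. Names follow the examples above;
# the mapping itself is only an illustration.
APP_TO_SUBPROJECT = {
    "GFN-15": "GFN",
    "GFN-22": "GFN",
    "XYZ Magnets": "XYZ",
    "XYZ Target": "XYZ",
    "CMS": "CMS",
}

def credit_by_subproject(results):
    """Sum credit per sub-project from (application, credit) pairs."""
    totals = defaultdict(float)
    for app, credit in results:
        totals[APP_TO_SUBPROJECT[app]] += credit
    return dict(totals)

# Made-up credit values, purely for illustration.
sample = [
    ("GFN-15", 100.0),
    ("GFN-22", 2500.0),
    ("XYZ Magnets", 40.0),
    ("XYZ Target", 60.0),
    ("CMS", 300.0),
]
totals = credit_by_subproject(sample)
print(totals)
```

Whichever application earned the credit, it lands in one total per sub-project, and a badge would be awarded from that total.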

I’m not positive if you can see this without being logged in as me, but at
<http://stats.free-dc.org/stats.php?page=userbycpid&cpid=c5c24472bf7b8fd0675d4a1a454f0104>

right below the basic stats is a banner of badges. The first one (Collatz) is a single-app project, so only one badge. The next block of 4 (Asteroids) please ignore, they do things in a totally screwy way. Then you see the small square boring (hint!) badges that PrimeGrid awards. I have 11 of them in varying colors. At this instant, their “PSP” sub-project badge is bronze, their “GFN” sub-project badge is ruby, their “PPS Sv” badge is teal. That tells the world that I’ve done a whole bunch of PPS Sieve, quite a bit of GFN, and not so much PSP work. They give no clue as to which GFN applications earned me the credits for that GFN badge, but that’s okay, that’s what WuProp is for. And that’s a totally different conversation. :-) I’m not a “badge chaser” - I’ve seen banners that go completely across the page and have many dozens of badges on them. Badges WILL get people to come run your sub-projects, believe me.
30) Message boards : News : Migrating to vLHC@home (Message 1907)
Posted 7 Feb 2016 by Profile Tern
Post:
Hi Bill,

I would suggest that issues relating to the vLHC forum be directed there. Communication and support via the message boards are indeed important and something that is often overlooked.


My point was missed. I don't care about the vLHC forum, as I don't use it, except to find VBox answers that might be there and not here. And therefore, not being attached to vLHC, can't post there. The point was that doing away with the idea of a "dev" project and doing a beta app THERE, would mean dealing with the issues there. Plenty of issues here, why add more? I was giving ammunition in support of having two different projects.

<snipped lots of interesting info about CERN>

My second point was also missed. While I think the communication HERE is FAR better than the communication at vLHC, there is also a shortage of "customer service" going on here. The fact that NO ONE is looking at and responding to the "Q&A" section, was my example. Lots of Q's, no A's. Even after my posting, nobody has gone to look there.

My third point was that CERN (whichever sub-part is irrelevant) just created a massive mess at two projects, vLHC and CMS-Dev, with the ill-timed whatever-it-was. I started to say "release of CMS at vLHC", but no, CMS has been there for a while. Then I started to say "shutdown of CMS-Dev", but that didn't fit either. I'm not sure what was done by who or why, I am just seeing sudden numerous bizarre problems in multiple places. (Which have even multiplied since my original posting.) SOMEBODY (one or more), employer irrelevant, made a really, really poor decision, or series of decisions, or untested code changes, somewhere, this last week or so. We, the volunteers, at both projects, are having to deal with the fallout. Having been through situations like this before, when chaos starts to take over at a project, MY reaction is generally just to back off and wait for the project to get their act together. If I had the time to chase one of the problems so I could contribute, fine, but I have no "spare" time at the moment to be able to help, so better just to stay out of the way. My machines are running unattended, if they are producing data that can be used by someone, great.
31) Message boards : News : Migrating to vLHC@home (Message 1854)
Posted 4 Feb 2016 by Profile Tern
Post:
Regarding vLHC message boards (the need to clean up, before dumping more applications on vLHC):

http://lhcathome2.cern.ch/vLHCathome/forum_thread.php?id=1703
title "Virtualbox not installed"
Posted: 15 Aug 2015, 19:30:45 UTC yet still on Page 1

NEVER ANSWERED. At all. By anyone. Rang a bell with me because I had the same exact problem here. (See my very first posting, http://boincai05.cern.ch/CMS-dev/forum_thread.php?id=89 - which, by the way, was ALSO never answered, other than by volunteer "m", who at least TRIED to help... I still have no answer for the problem, it just "started working" at some point.) Mine is still the one-and-only post over in Q&A:Mac, so it's not like it's gotten buried... AFAIK, nobody at either project has ever even looked at the problem. No telling how many people DID give up, as I was going to if it hadn't fixed itself just before I did.

More problematic:
Very first stickied thread in Number Crunching is "How to end a run on task gracefully". It started in 2012. They still have the problem today (as we do in CMS-Dev). However, the "How to" (post #1) is no longer correct, the "new correct" (but questionable) fix is way down in page 4.

Stickied threads 1500+ days old, if still relevant, indicate a major project problem. If not still relevant, they indicate a major project communication (boards) problem!

Also noted from random checking, most of the problems currently being reported over there, especially by new users, seem VERY familiar. With lots of people saying "THIS NEEDS TO BE COMMUNICATED AT SIGNUP, not buried somewhere in the message boards!" Gee, where have I heard that before?

CMS-Dev message board checking problems exist also, BTW, so it's "CERN-wide", not just vLHC: Poor MarkRBright has been waiting since 16 May 2015 for an answer to his question in Q&A:Windows, and "Agus" has been waiting 7 days now. Understaffing everywhere? Prioritization problems?

Oh yeah, the "CMS Simulation Beta Tasks" thread at vLHC... Yep, CMS was ready to move over there all right! Smooth move, CERN! Looks like they "shut down" (but not really) CMS-Dev, fired up at vLHC, and then immediately MADE MAJOR CHANGES TO THE APP!!!!! WTF?

"Let's spend months in development, alpha and beta testing, get a version 0.4 that almost works (other than a few dozen known bugs we can't seem to fix), decide overnight to renumber it 1.0 and release it. Then the next day, take what was going to be version 0.5, with a ton of totally untested changes, and without any communication whatsoever, call it 1.1 and send it out to thousands of users." 8-(
32) Message boards : News : Updated Agent (Message 1825)
Posted 3 Feb 2016 by Profile Tern
Post:
Great news! Will be VERY helpful.
33) Message boards : Number crunching : VBox issues (Message 1822)
Posted 2 Feb 2016 by Profile Tern
Post:
I must have had to upgrade VBox three or four times


You did not HAVE to.
VirtualBox was just reminding you that there was a new version (so do a lot of other programs).


Actually, I wouldn't have even looked to see if there WAS an upgrade. I have no reason whatsoever to ever "run" the VirtualBox application otherwise, so I would never see the upgrade notice. But according to the boards (here or at vLHC, which I started reading because sometimes something pops up there that's relevant), some issue "might" be resolved by upgrading, so I did. Sometimes it solved the problem; other times it didn't, or it required a new "wrapper" or other project fix, or whatever.

Then... when I didn't immediately get .12 to replace .10, and was "chastised" for not getting it when I got BOINC .22, and was told that VBox .10 "doesn't work well with Windows 10" {!!!!!}, even though BOINC didn't _include_ .12 with .22 until a week later, and didn't publicize the change... I wound up with .14 instead, since apparently .12 was "the current version" for only a few days, which... wait for it... "may be too new - you may have to downgrade to .12"!!! Not happening. I'll run .14 and let everything else catch up. Or not.

So no, I didn't _have_ to upgrade. In that sense, I misspoke. I CHOSE to upgrade, in order to keep running CMS successfully. And believe me, there were many times I almost threw up my hands and gave up, instead. All because of VBox in one way or another. Upgrading VBox on 8 hosts, 3 OSes, spread across 3 physical locations, after-hours, when the furthest-away one tends to crash when you do anything to it "remotely", is non-trivial. I'll be happier when my "data center" (okay, air-conditioned and UPS'd closet) is completed in the new offices, (A/C went in last week - of course, POWER hasn't made it there yet... or a ceiling...) and everything is in one place. With fiber internet!!! And (maybe) another 4790K!!! Sigh. Another month or three. "Real Soon Now" in the famous saying. The other I'm hearing way too much of is "it's only a SLIGHT cost overrun..." thus the "maybe" on the 4790K... which the new hire really needs... URG!

-----
My new sig (actual conversation I had, while working very late one night, and actually way-too relevant to the topic above): "Why are you still there?" "My boss is an a$$%01@." "But you're self-employed!" "Yeah, I know..." :-)
34) Message boards : Number crunching : VBox issues (Message 1818)
Posted 2 Feb 2016 by Profile Tern
Post:
Oops. Missed a big one.

15?) Not following BOINC preference settings, for either task priority or RAM usage. Not "giving back" RAM when host goes in use, possibly locking up host, as Ivan experienced - and not running at "idle" priority like every other BOINC app. Even on hosts that aren't BOINC-only, I like to let BOINC at least _run_ while in use - but with VBox active, I can't do that, I have to shut down BOINC completely (or suspend task) to get adequate response from the host. Makes it totally impractical for "daily use" machines.
35) Message boards : Number crunching : VBox issues (Message 1817)
Posted 2 Feb 2016 by Profile Tern
Post:
Oh yeah - the fact that in the last few months, since I joined this project (and it's my only VBox project) I must have had to upgrade VBox three or four times, from 5.0.8 (or earlier, don't remember) to 5.0.14 now. Seems like they do weekly releases?!?!? SHOULD only have to upgrade on a "major" release - like only when BOINC is updated. Minor ones shouldn't break everything!
36) Message boards : Number crunching : VBox issues (Message 1816)
Posted 2 Feb 2016 by Profile Tern
Post:
Off the top of my head and in no particular order:

1) "VM Hypervisor Failed to Enter..."
2) Host returning bad job results: "Sounds like a corrupt VM, try resetting the project"
3) "VirtualBox is not installed" message for days, when it is, and no clue how to fix (or how it got fixed, it just started working)
4) Certain version (older or newer) of VBox required in order to work correctly depending on the "wrapper of the week" at the project. No way to know what version is wanted. BOINC provides one version, project wants another, and "Check for updates" in VBox gives you yet another. None happen automatically (which I understand is for security, just like BOINC itself must be manually updated, while everything _else_ can be automatic. Must have one "safe point" in the loop. But now we have two.)
5) (Actually partially part of 'no user feedback from project on task results'.) No error message when version of VBox is "wrong" and causing failures
6) Disk thrashing - mainly on SSDs - extremely high number of reads/writes per minute, totally unnecessary for project, just "because it's VBox". Supposedly getting better with newer versions...
7) Requirement to leave suspended tasks in memory
8) Not pausing/shutting down/whatever correctly when host rebooted, task swapped out/suspended, BOINC restarted, etc.
9) "Missing Application - download and install CoRD from sourceforge" when hit "Show VM Console" on Mac
10) On Mac, VBox in BOINC runs as a different user - can't look at console anyway. Unless that's what "Show VM Console" is supposed to do...
11) Need to look at VM Console AT ALL, for any reason. Should be completely in background! Make any problems visible in BOINC Messages.
12) Linux OS update results in "you must recompile xxxx to get VBox working again". Help from boards was "you should install xyzzy to auto-recompile xxxx every time". Come on folks, your grandmother is supposed to be able to run this! (Yes, even on Linux.)
Lucky 13) Whatever problems will appear, unexpectedly, in the NEXT version of VBox, that nobody, project staff, admins, volunteers, etc. are expecting until it happens. In other words, nobody on the BOINC side is involved on the Oracle side to do any beta testing or anything else. At least with the OSes, Apple/MS/etc provide betas that someone _could_ look at, and given the size of the audience, usually that includes plenty of BOINCers.
37) Message boards : News : Poll (Message 1795)
Posted 2 Feb 2016 by Profile Tern
Post:
option 2++ with

- sub-projects in prod
- option to test beta WUs in the same project, for those interested
- separate project for alpha testing, for those interested

You can mix VB and no-VB sub-project as long as you clearly explain the requisites, and easy option on the project preference page to allow each user select / unselect its own projects (plus beta yes/no).

All this on a profile basis (home / work / etc) like standard boinc setup, of course.

Primegrid works very well like that, with a long list of sub-project with their own prerequisites (very detailed) and you can check precisely those you want to do.


I agree in full with this option. vLHC becomes the "umbrella" for sub-projects T4T, CMS, etc. - credit kept separately by sub-project for the stats sites à la PrimeGrid and many others (but as mentioned elsewhere, unlike SETI). PrimeGrid is, as said, THE one to look at for Preferences pages to select your sub-projects. CMS-Dev becomes vLHC-Dev, basically a clone of the 'new' vLHC but with sub-projects that come and go as new applications are developed by CERN. Message board folders in Dev for each separate application.

I personally don't see the need for beta testing in the main project at all, but some like it, so whatever. It does let the main user base "try out" an app early if they so desire, but it isn't a substitute for being "production-ready" before leaving the Dev sandbox.
38) Message boards : News : Constructive suggestions please (Message 1750)
Posted 31 Jan 2016 by Profile Tern
Post:
IIRC (I can't check on my home machines, I don't have the bandwidth to run the project).


Ivan, if one sentence ever summarized the current state of CMS, that would have to be the one...
39) Message boards : News : Constructive suggestions please (Message 1749)
Posted 31 Jan 2016 by Profile Tern
Post:
I'm surprised at your experience of how often an idle task polls the server. I won't argue with your logs, it seems that's something for CERN IT to comment upon. From the above, I'd guess that it's an HTCondor matter, and perhaps there's an adjustable parameter to change its default timing which almost certainly assumes a local cluster of compute nodes.


Just dumped the logs again - it's (maybe) MORE than once per second. Get one, two, or three "cmsfrontier" accesses before the seconds counter ticks over to the next second. This was from 1800 UTC-6 on (an hour ago, roughly). Possible that's from three hosts, I think I have 3 tasks in progress, although I thought one or two were swapped out - which would work out to the once-per-second-per-host I saw before. I scrolled through a few hundred screenfuls of the log, reading every 'nth' page, it was consistent.
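
For anyone wanting to repeat the check without scrolling through screenfuls, here is a rough sketch in Python that counts matching log lines per second. The log format (timestamp as the first whitespace-separated field) and the sample lines, including the example hostnames, are invented for illustration. A real proxy or router log will look different and the parsing would need adjusting.

```python
from collections import Counter

def accesses_per_second(lines, needle="cmsfrontier"):
    """Count matching log lines per timestamp (first field, 1 s resolution)."""
    counts = Counter()
    for line in lines:
        if needle in line:
            timestamp = line.split(maxsplit=1)[0]
            counts[timestamp] += 1
    return counts

# Invented sample lines; not taken from any real log.
sample_log = [
    "18:00:01 GET http://cmsfrontier.example/query",
    "18:00:01 GET http://cmsfrontier.example/query",
    "18:00:01 GET http://other.example/",
    "18:00:02 GET http://cmsfrontier.example/query",
]
counts = accesses_per_second(sample_log)
for timestamp, hits in sorted(counts.items()):
    print(timestamp, hits)
```

Dividing the per-second counts by the number of hosts with a task in progress would show whether this really is about once per second per host.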

Not the place for it, but... Did notice earlier that a couple more (Windows) CMS tasks are now well past the 24-hour mark, and the one here on the Mac laptop is showing 7 hours done and only 7 hours remaining. Bizarre. Not aborting anything, just to see what happens. The "run-ons" started after updating to BOINC 7.6.22 but leaving VBox at whatever, 10 I think, from the December BOINC release. Windows (10) won't let me update to VBox 14 (the current one, although the BOINC version is 12 - VBox automatically goes to 14 when you check for updates) because the "file certificate is bad" on the download. Thank you Oracle, this is the same problem as an update a few months ago! Did get the Macs updated with no problems. Going to manually hack in the 14 update on the Windows boxes next. Sigh. They also all want to reboot for Windows updates. Last week I didn't catch the message and they did it at some random time anyway.

I don't think the "running past 24 hours" problem was on the original list - it definitely needs to be, and at a pretty high priority. Nothing like taking a slot for days and then telling the volunteer "oh, just abort it" to make people happy...
40) Message boards : News : Constructive suggestions please (Message 1743)
Posted 30 Jan 2016 by Profile Tern
Post:
Bill, some of your posts display a few misconceptions about how CMS-dev works


No problem here, I know that I don't know... have to make assumptions on a lot of things, the more info you and Laurence give, the fewer assumptions and the closer I hope I get to understanding.

... join its pool of worker nodes. My batch of jobs have been put in the HTCondor queue ... The server sends a job out to the client (i.e. volunteer's host), and it goes through the whole processing chain. At the end the client sends the result file to the data-bridge at CERN and reports logs and status back to RAL.


The pool, etc, are at least somewhat similar to what cgminer does with BitCoin pools, so I have some concept here. (Along with how BitcoinUtopia wraps that up for BOINC.) Thanks!

Questions (no rush on these, I know you have plenty to do, just something to think about later) - I see several log files in my Slots directories being created as a task runs. Are all of these sent back to "RAL"? If this is part of the network load, are they all really needed on your end? Could some be eliminated? Sent only if there is an error? Or is the whole load, or the large majority of it, the result (ROOT) file?

It's then available again in the HTCondor pool for another job, and so on, until the host task reaches its (arbitrary) 24-hour lifetime. Should contact be lost with the server (this appears to include the user suspending the task, or stopping BOINC, etc.), the server (in my incomplete understanding) abandons the job and puts it back at the head of the queue. ...


It is the "contact lost with the server" point where we (well, someone who can change code...) need a lot more information, eventually. Is this a handshake while a job is running? Is this contact only started when there is a result file to send back, and then a timer starts that errs if the result file is not completely back in "x" seconds? Is it just FTP with a minimal (or no) packet retry count? There is NO way to stop BOINC from swapping out a CMS task, at any point. You can ASK that it be kept in memory while swapped out, but many may not even do that. PCs _will_ reboot, or have power failures, or users who suspend tasks, or quit BOINC for a few hours/days/forever. Laptops will suddenly be on battery. Desktop PCs will go "in use" when prefs say to run BOINC only while idle. Other apps will saturate a network (I'm thinking running an online game on one machine while another runs BOINC, just for one example.) So, if you want "mostly good" job results, this is where something has to give at your end. The other choice is to shrug and write off the jobs, and just don't worry about it - they'll be resent. As long as the volunteer isn't penalized in some way (at least not much) when this happens, in other words gets credit for the CPU time (not slot time) even though the job (not the task) "failed", it won't affect anything at our end.

The result files are ROOT files (root.cern.ch) and are, to the best of my knowledge, already highly compressed; further attempts at compression are likely to be counterproductive.


Meh. Then the question becomes "what is in the file, and what is actually needed, and how much is fluff that doesn't matter but isn't a problem anywhere but BOINC". If absolutely nothing can be reduced, there's a real problem here that MAY make CMS totally unusable on BOINC... or at the least very drastically reduce the number of volunteers that CAN produce good results.

I'm surprised at your experience of how often an idle task polls the server. I won't argue with your logs, it seems that's something for CERN IT to comment upon. From the above, I'd guess that it's an HTCondor matter, and perhaps there's an adjustable parameter to change its default timing which almost certainly assumes a local cluster of compute nodes.


Yeah, once a minute would be great and wouldn't hurt anything much. Once every 5-10 seconds would be livable. I'm going on what I saw when chasing the "frontier blocked" error, which was a while back, so something may have changed since, or the once-every-second thing may have been BECAUSE it was being blocked, or, or... but it's something to look into.

I'll get back with more, once I have a chance to dig into how BU handles wrapping its tasks for the BitCoin Pools. I suspect (assuming again) that I know where the "server load" issue came from that made CERN put more than one "job" in a "task" to begin with, and THAT may be fixable...

