Constructive suggestions please
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 597 |
As mentioned elsethread, I have to prepare a summary of required/desired improvements to CMS@Home to take it up to production readiness. Please post suggestions and criticisms in this thread. Please keep it short and non-personal, as I'll have to de-serialise the thread to make my report. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 597 |
Here's a list from one of Laurence's posts to be getting on with. Please feel free to add to or comment on the items.
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
By far the greatest issue to resolve is failing hosts. They need to be:
a) notified, by failing the BOINC task;
b) informed that something is wrong, so that they can take corrective action;
c) banned, if they do not respond and continue to produce failed results.
If there is a failure on the server side, host VMs should be paused (if possible) and resumed after the server issue is corrected. For b), it still needs to be determined by what means and about what. c) is quite drastic, but perhaps only for a limited period of time.
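To make point a) concrete: below is a minimal, purely hypothetical sketch of how a watchdog inside the VM could propagate repeated job failures as a failed BOINC task, so the volunteer sees the error and the server can back the host off. The script, the run_job.sh path and the failure threshold are all assumptions, not the actual CMS@Home bootstrap.

```python
#!/usr/bin/env python3
"""Hypothetical watchdog: fail the BOINC task after repeated job failures.

Nothing here is taken from the real CMS@Home VM; paths, names and thresholds
are placeholders for illustration only.
"""
import subprocess
import sys
import time

MAX_CONSECUTIVE_FAILURES = 3        # assumed threshold before giving up on this host
TASK_LIFETIME_SECONDS = 24 * 3600   # the task's nominal 24-hour lifetime

def run_one_job() -> int:
    # Placeholder for whatever launches a single job inside the VM.
    return subprocess.run(["/opt/cms/run_job.sh"]).returncode  # hypothetical path

def main() -> int:
    deadline = time.time() + TASK_LIFETIME_SECONDS
    failures = 0
    while time.time() < deadline:
        rc = run_one_job()
        if rc == 0:
            failures = 0
            continue
        failures += 1
        if failures >= MAX_CONSECUTIVE_FAILURES:
            # A non-zero exit makes the surrounding BOINC task fail, so the
            # volunteer sees a computation error instead of silently burning CPU.
            return rc
    return 0

if __name__ == "__main__":
    sys.exit(main())
```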
Send message Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0 |
"By far the greatest issue to resolve is failing hosts."

I mostly agree with this, except maybe the "greatest" part; the network issues are bigger for me personally. The terminology is "host back-off" in BOINC, and there's a method of doing it. I posted a long-winded (of course!) answer to Laurence's list in the other thread, including this issue; I can copy it here if you like, or you can grab it there... I didn't see this thread in time.
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks, Bill. I have read it. "host back-off" sounds a bit harmless, but so be it. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 2 |
"By far the greatest issue to resolve is failing hosts."

BOINC is probably not aware of the network issues. It's the VM uploading the job results, and on a slow line (possibly with more client VMs sharing the same line) the upload link is saturated, causing a time-out on the CMS server, which then marks the result as failed. BOINC itself normally contacts the CMS server every 24 hours to report the task finished (only a few bytes) and ask for a new task. Sending the results via BOINC would result in the same saturation and failing jobs.

One could first try to use a (better) compression method before uploading? If that does not lead to a big improvement, the server should not be so strict when results come in slowly or in parts. The average upload speed of each client is reported by the hosts, and I can't imagine that there is no BOINC server setting to exclude clients with too little bandwidth from getting a CMS task.
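Some back-of-envelope arithmetic (illustrative only, using the ~66 MB-per-job figure quoted later in this thread) shows why a shared DSL uplink saturates: one result already ties up a 1 Mbit/s uplink for close to ten minutes, and several VMs finishing around the same time multiply that.

```python
# Back-of-envelope only: how long one result upload occupies a given uplink.
RESULT_MB = 66        # ~66 MB per job for the current 250-events/job batch
UPLINK_MBPS = 1.0     # e.g. a typical ADSL uplink

def upload_minutes(result_mb: float, uplink_mbps: float, concurrent: int = 1) -> float:
    # Concurrent uploads share the uplink, so each takes 'concurrent' times longer.
    return (result_mb * 8 / uplink_mbps) * concurrent / 60

print(f"{upload_minutes(RESULT_MB, UPLINK_MBPS):.1f} min for one VM")              # ~8.8 min
print(f"{upload_minutes(RESULT_MB, UPLINK_MBPS, concurrent=5):.0f} min for five VMs")  # ~44 min
```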
Send message Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0 |
"Sending the results by BOINC would result in the same saturation and failing jobs."

You are absolutely right that using BOINC does nothing for the QUANTITY of data. So compression is very much needed, especially if not already in place, or a drastic reduction in what's sent back. Is all that data really needed? How much is overhead that could go away if "batched" instead of "real time"? Who knows? I was astounded to see my uplink saturated (it's only DSL, but still!) by five or six hosts running one CMS task each. That's a LOT of data being created!

Even on _download_, I see a problem. CMS "polls" the server looking for work once per SECOND when it doesn't have any. That's ridiculously excessive; my router log files overfloweth. (It was this "is there work" URL that my new security software didn't like, which of course meant the "blocked" message went to /dev/null {or the VM, same thing} instead of to a browser, which made that fun to find...)

On job failures, I don't see how sending via BOINC would have the same problem, or not to the same extent. BOINC will retry as needed if there's a transmission failure. Currently I guess there are no retries from within the VM. (?) This fix assumes that the server, as you say, MUST NOT be so time-critical, since the current problem is that the data must arrive "now" or it's a job failure. It's not the retries themselves that would be the "fix", but the fact that the server wouldn't (couldn't) be so demanding, and if the data didn't arrive until tomorrow, that's still okay, as long as it arrives. (Which assumes we aren't generating data faster than we can send it, of course.) Each task would only be sending one large file, instead of many small ones or a stream, which should also improve failure rates a bit. Given the difficulty of changing this part, though, I would be looking very strongly at some way to reduce the volume of data transferred first, since that will really be needed anyway. (vLHC allows two CMS tasks per host? Fun.)

I don't know if the "average upload speed" check would help or not. I would guess that each of my hosts reports roughly the same value, which would be the TOTAL available upload speed, which is more than enough. (For one host, even with CMS.) I doubt it's smart enough to count the hosts on that line and divide... or somehow test all the hosts simultaneously. The only way that would completely solve the problem is if all the hosts were running CMS at the time the data was collected, or 24/7 if it's a running average; "average upload speed 0 bps". :-( Then I'd never get any work.
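On the "no retries from within the VM" point: as a purely illustrative sketch (the upload() callable is hypothetical, not the real data-bridge client), a retry loop with exponential back-off is the kind of mechanism that would let a result arrive late rather than fail on the first timeout.

```python
import random
import time

def upload_with_backoff(upload, payload, max_attempts=6, base_delay=30.0):
    """Retry an upload with exponential back-off plus jitter.

    'upload' is a hypothetical callable that raises on failure; nothing here
    is taken from the actual CMS@Home upload path.
    """
    for attempt in range(max_attempts):
        try:
            return upload(payload)
        except OSError:
            if attempt == max_attempts - 1:
                raise
            # 30 s, 60 s, 120 s, ... with up to 25% random jitter to avoid
            # every VM on a saturated line retrying at the same moment.
            delay = base_delay * (2 ** attempt)
            time.sleep(delay * random.uniform(1.0, 1.25))
```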
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 597 |
Bill, some of your posts display a few misconceptions about how CMS-dev works; that's OK, I'm hazy on some of the technical details myself. I don't have time to do a full exposition, but the TL;DR version goes like this: the VM sets itself up as an HTCondor client (Google HTCondor if you have the time) and contacts a server at RAL to join its pool of worker nodes. My batch of jobs has been put in the HTCondor queue at that server by a process that's irrelevant here, but let's call it CRAB3 in case you see that acronym elsewhere. The server sends a job out to the client (i.e. the volunteer's host), and it goes through the whole processing chain. At the end the client sends the result file to the data-bridge at CERN and reports logs and status back to RAL. It's then available again in the HTCondor pool for another job, and so on, until the host task reaches its (arbitrary) 24-hour lifetime.

Should contact be lost with the server (this appears to include the user suspending the task, stopping BOINC, etc.), the server (in my incomplete understanding) abandons the job and puts it back at the head of the queue. Should the job return an error code, the server schedules it for a retry, so long as it hasn't already been tried three times, and puts it at the tail of the job queue.

The result files are ROOT files (root.cern.ch) and are, to the best of my knowledge, already highly compressed; further attempts at compression are likely to be counterproductive.

I'm surprised at your experience of how often an idle task polls the server. I won't argue with your logs; it seems that's something for CERN IT to comment upon. From the above, I'd guess that it's an HTCondor matter, and perhaps there's an adjustable parameter to change its default timing, which almost certainly assumes a local cluster of compute nodes.

Cheers, ivan
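To put the requeue/retry behaviour described above in concrete terms, here is a toy model (plain Python, not actual HTCondor or CRAB3 code) of a queue where lost contact requeues the job at the head and an error code retries it at the tail, up to three attempts.

```python
from collections import deque

MAX_RETRIES = 3   # per the description above: retried unless already tried three times

queue = deque()   # job ids waiting for a volunteer VM
attempts = {}     # job id -> number of error retries so far

def on_contact_lost(job_id):
    # Host vanished (task suspended, BOINC stopped, ...): requeue at the head.
    queue.appendleft(job_id)

def on_job_error(job_id):
    # Job ran but returned an error code: retry at the tail, up to the limit.
    attempts[job_id] = attempts.get(job_id, 0) + 1
    if attempts[job_id] < MAX_RETRIES:
        queue.append(job_id)
    # otherwise the job is dropped from the batch
```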
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
I'd like to add one: that a properly working, automatic job-submission (from whatever system the experiment uses) and proxy-generation system be implemented. At present this seems to take a lot of manual work, and when it fails (failures which volunteers are often the first to notice) the reasons often aren't found, let alone fixed. A production system has got to run largely unattended.
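As an illustration of the kind of unattended check that could help here, the sketch below (assuming the standard voms-proxy-info grid client is installed; the warning hook is just a placeholder, not an existing CMS@Home component) reports when the proxy is close to expiring, so it can be renewed or an operator alerted before jobs start failing.

```python
#!/usr/bin/env python3
"""Sketch of an unattended proxy-lifetime check (e.g. run hourly from cron).

Assumes the standard 'voms-proxy-info' grid client is on the PATH; the alert
is a placeholder print, not an existing CMS@Home component.
"""
import subprocess
import sys

WARN_BELOW_SECONDS = 24 * 3600   # warn when less than a day of lifetime remains

def proxy_time_left() -> int:
    out = subprocess.run(["voms-proxy-info", "--timeleft"],
                         capture_output=True, text=True, check=True)
    return int(out.stdout.strip())

def main() -> int:
    left = proxy_time_left()
    if left < WARN_BELOW_SECONDS:
        # Replace with mail/monitoring integration; unattended renewal itself
        # would need a robot certificate or similar credential.
        print(f"WARNING: proxy expires in {left // 3600} h", file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```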
Send message Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0 |
"Bill, some of your posts display a few misconceptions about how CMS-dev works"

No problem here, I know that I don't know... I have to make assumptions on a lot of things; the more info you and Laurence give, the fewer assumptions and the closer I hope I get to understanding.

"... join its pool of worker nodes. My batch of jobs have been put in the HTCondor queue ... The server sends a job out to the client (i.e. volunteer's host), and it goes through the whole processing chain. At the end the client sends the result file to the data-bridge at CERN and reports logs and status back to RAL."

The pool, etc., are at least somewhat similar to what cgminer does with BitCoin pools, so I have some concept here. (Along with how BitcoinUtopia wraps that up for BOINC.) Thanks!

Questions (no rush on these, I know you have plenty to do, just something to think about later): I see several log files in my Slots directories being created as a task runs. Are all of these sent back to "RAL"? If this is part of the network load, are they all really needed on your end? Could some be eliminated? Sent only if there is an error? Or is the whole load, or the large majority of it, the result (ROOT) file?

"It's then available again in the HTCondor pool for another job, and so on, until the host task reaches its (arbitrary) 24-hour lifetime. Should contact be lost with the server (this appears to include the user suspending the task, or stopping BOINC, etc.), the server (in my incomplete understanding) abandons the job and puts it back at the head of the queue. ..."

It is the "contact lost with the server" point where we (well, someone who can change code...) need a lot more information, eventually. Is this a handshake while a job is running? Is this contact only started when there is a result file to send back, and then a timer starts that errs if the result file is not completely back in "x" seconds? Is it just FTP with a minimal (or no) packet retry count?

There is NO way to stop BOINC from swapping out a CMS task, at any point. You can ASK that it be kept in memory while swapped out, but many may not even do that. PCs _will_ reboot, or have power failures, or users who suspend tasks, or quit BOINC for a few hours/days/forever. Laptops will suddenly be on battery. Desktop PCs will go "in use" when prefs say to run BOINC only while idle. Other apps will saturate a network (I'm thinking of running an online game on one machine while another runs BOINC, just for one example). So, if you want "mostly good" job results, this is where something has to give at your end. The other choice is to shrug, write off the jobs, and just not worry about it - they'll be resent. As long as the volunteer isn't penalized in some way (at least not much) when this happens, in other words gets credit for the CPU time (not slot time) even though the job (not the task) "failed", it won't affect anything at our end.

"The result files are ROOT files (root.cern.ch) and are, to the best of my knowledge, already highly compressed; further attempts at compression are likely to be counterproductive."

Meh. Then the question becomes: what is in the file, what is actually needed, and how much is fluff that doesn't matter but isn't a problem anywhere but BOINC? If absolutely nothing can be reduced, there's a real problem here that MAY make CMS totally unusable on BOINC... or at the least very drastically reduce the number of volunteers that CAN produce good results.

"I'm surprised at your experience of how often an idle task polls the server. I won't argue with your logs, it seems that's something for CERN IT to comment upon. From the above, I'd guess that it's an HTCondor matter, and perhaps there's an adjustable parameter to change its default timing which almost certainly assumes a local cluster of compute nodes."

Yeah, once a minute would be great and wouldn't hurt anything much. Once every 5-10 seconds would be livable. I'm going on what I saw when chasing the "frontier blocked" error, which was a while back, so something may have changed since, or the once-every-second thing may have been BECAUSE it was being blocked, or, or... but it's something to look into.

I'll get back with more, once I have a chance to dig into how BU handles wrapping its tasks for the BitCoin pools. I suspect (assuming again) that I know where the "server load" issue came from that made CERN put more than one "job" in a "task" to begin with, and THAT may be fixable...
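For the polling rate, something like the back-off below would land in the "livable" range described above. It is only a sketch (the ask_for_work callable is hypothetical, not the real work-fetch agent), showing the poll interval doubling towards once a minute while the server keeps answering that there is no work.

```python
import time

MIN_POLL_S = 5     # poll every few seconds while work is flowing
MAX_POLL_S = 60    # back off to once a minute when the queue is empty

def poll_loop(ask_for_work):
    """'ask_for_work' is a hypothetical callable returning a job or None."""
    interval = MIN_POLL_S
    while True:
        job = ask_for_work()
        if job is not None:
            yield job
            interval = MIN_POLL_S                 # reset once work appears again
        else:
            interval = min(interval * 2, MAX_POLL_S)
        time.sleep(interval)
```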
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 2 |
@Bill Did you ever press the "Show graphics" button in BOINC Manager on a running CMS-dev task? You'll be directed to the VM's logs. There is a lot of data there, surely raising even more questions ;)

I found there that each job of the current batch (250 events/job) creates a result of about 66 MB of data. The former batch with 150 events/job created about 45 MB of result data to upload.

You wrote "slots". I suppose you meant the BOINC data slot directories. Nothing from there is sent to the server.
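Putting those two figures on the same scale, the implied output size per simulated event is roughly a quarter to a third of a megabyte:

```python
# Output size per simulated event, from the two figures quoted above.
batches = {
    "current (250 events/job, ~66 MB)": (66, 250),
    "previous (150 events/job, ~45 MB)": (45, 150),
}
for name, (mb, events) in batches.items():
    print(f"{name}: about {mb * 1024 / events:.0f} kB per event")
# current (250 events/job, ~66 MB): about 270 kB per event
# previous (150 events/job, ~45 MB): about 307 kB per event
```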
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 597 |
I see that CP beat me to the punch on a couple of points...

"Questions (no rush on these, I know you have plenty to do, just something to think about later) - I see several log files in my Slots directories being created as a task runs. Are all of these sent back to "RAL"? If this is part of the network load, are they all really needed on your end? Could some be eliminated? Sent only if there is an error? Or is the whole load, or the large majority of it, the result (ROOT) file?"

Actually you may see more than that if you run boincmgr, select a running CMS task and click on "Show Graphics". That starts up a web page in your default browser showing logs from a Web-server built into the VM. IIRC (I can't check on my home machines, I don't have the bandwidth to run the project) the log that I get back on the RAL Condor server (RAL = Rutherford Appleton Laboratory, a science lab in Oxfordshire) is called _condor_stdout on the VM. It's typically 120 kB and incorporates several of the logs, including the cmsRun stdout log, but also lots of debug stuff about the file stage-out and the infamous FINISHING line I often quote with the exit status.

The ROOT file contains all the information about the response the whole CMS detector (i.e. its several subdetectors) sees as a result of the proton-proton collisions being simulated. In a targeted investigation some of this may not be needed, and the configuration file can request that various datasets be "dropped" from the output, but what I'm running at present is a general background process and all information is retained.
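For readers unfamiliar with how that "dropping" is expressed: a CMSSW job configuration is itself Python (it needs a CMSSW environment to actually run), and the output module takes a list of keep/drop commands. The snippet below only illustrates the syntax; the product pattern is a placeholder, not a suggestion of what could safely be removed from this workflow.

```python
# Illustration only: the shape of CMSSW "outputCommands" that drop products
# from the output ROOT file. The pattern below is a placeholder.
import FWCore.ParameterSet.Config as cms

process = cms.Process("DEMO")
process.out = cms.OutputModule(
    "PoolOutputModule",
    fileName=cms.untracked.string("output.root"),
    outputCommands=cms.untracked.vstring(
        "keep *",                      # default: keep everything
        "drop *_exampleProducer_*_*",  # hypothetical product to drop
    ),
)
```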
Send message Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0 |
"I'm surprised at your experience of how often an idle task polls the server. I won't argue with your logs, it seems that's something for CERN IT to comment upon. From the above, I'd guess that it's an HTCondor matter, and perhaps there's an adjustable parameter to change its default timing which almost certainly assumes a local cluster of compute nodes."

Just dumped the logs again - it's (maybe) MORE than once per second. I get one, two, or three "cmsfrontier" accesses before the seconds counter ticks over to the next second. This was from 1800 UTC-6 on (an hour ago, roughly). Possibly that's from three hosts; I think I have 3 tasks in progress, although I thought one or two were swapped out - which would work out to the once-per-second-per-host I saw before. I scrolled through a few hundred screenfuls of the log, reading every 'nth' page, and it was consistent.

Not the place for it, but... I did notice earlier that a couple more (Windows) CMS tasks are now well past the 24-hour mark, and the one here on the Mac laptop is showing 7 hours done and only 7 hours remaining. Bizarre. Not aborting anything, just to see what happens. The "run-ons" started after updating to BOINC 7.6.22 but leaving VBox at whatever, 10 I think, from the December BOINC release. Windows (10) won't let me update to VBox 14 (the current one, although the BOINC version is 12 - VBox automatically goes to 14 when you check for updates) because the "file certificate is bad" on the download. Thank you Oracle, this is the same problem as an update a few months ago! I did get the Macs updated with no problems. Going to manually hack in the 14 update on the Windows boxes next. Sigh. They also all want to reboot for Windows updates. Last week I didn't catch the message and they did it at some random time anyway.

I don't think the "running past 24 hours" problem was on the original list - it definitely needs to be, and at a pretty high priority. Nothing like taking a slot for days and then telling the volunteer "oh, just abort it" to make people happy...
Send message Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0 |
"IIRC (I can't check on my home machines, I don't have the bandwidth to run the project)."

Ivan, if one sentence ever summarized the current state of CMS, that would have to be the one...
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 597 |
"IIRC (I can't check on my home machines, I don't have the bandwidth to run the project)."

Sure, but it's more an indictment of the state of broadband in the UK. I'm currently getting about 3.6 Mbps down, 1 Mbps up; up until last Easter I was getting 6 Mbps, and for a long time thereafter less than 3 Mbps. My ISP has just been acquired by BT (that makes 4 or 5 takeovers since I started...) and I'm hoping for an "introductory" offer soon. The local distribution box is by my front fence and I believe it is fibre-equipped. With a bit of luck I could have ~60 Mbps FTTC or even FTTP by the end of the year.
Send message Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
I had already posted this in a different thread:
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Communications. This is a BIG issue here. Changes are made without prior announcement. There are still unsolved issues here, and for good measure the vLHC CMS tasks are back on again. No information on whether the problem here has been solved, or anything. It seems not even the admins are talking to each other. Is it just me, or is this somewhat chaotic? Please feel free to disagree. So, my suggestion is: please announce changes.
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Communication is always the issue. My view on distributed computing is that it is more about sociology than technology, as before you get services to communicate you have to get their developers and operators to communicate. Here we have a distributed system with people located in different places (and countries) with different priorities, working together as a collaboration to achieve a common goal. Parts of the system can stop working after unannounced changes such as an OS upgrade. As the services that we depend upon are run by different teams (not even within the same organisation, let alone management lines), we may not know when or what is going to change and only see the consequences if things don't go according to plan.

This is the same for any online service that one may use, such as Google or Facebook. You are not told the details of the service provision; you only experience the quality of service provided. Also, as operations effort is prioritised by impact, large-capacity production services are considered first in line, and hence smaller-capacity pilot services may not receive the support required. Thus we could potentially have a catch-22 situation where we can't scale up because we don't have enough operational support, but won't get the required amount of operational support until we scale up.

So I would not say that this is chaotic, just the nature of distributed computing based upon multiple independent services. But you are right in pointing out that communication is key.
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks, Laurence. That clarifies a few things. Somewhat tricky. Do I understand correctly that currently the volunteers' resources are deployed to crunch for some other section of CERN? We are not working on the current batch?
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
The resources are used to crunch (generate simulations of single collisions) for the CMS experiment. Quoting from http://home.cern/about/experiments/cms:

"The CMS experiment is one of the largest international scientific collaborations in history, involving 4300 particle physicists, engineers, technicians, students and support staff from 182 institutes in 42 countries."

Ivan can probably say more about the CMS internals. Simulation is just one activity. They have to read and store the data from the detector (~30 PB/year for all the experiments), and both simulated and real data follow the same analysis chain. This requires a global computing infrastructure to link all the data centres at those 182 institutes into one seamless system that is in operation 24/7. What we are trying to do is to plug into that system to syphon off the jobs that are suitable (non-data-intensive) to be run here.