Message boards : Number crunching : Multiple Jobs In A Single Host
Joined: 12 Sep 14 · Posts: 1069 · Credit: 334,882 · RAC: 0
At the moment we have set max_jobs_in_progress to 1. This means that when you join this project, only one VM should be started no matter how many cores you have available. This has been done to avoid being greedy and swamping your machine with VMs. Are any of you running more than one CMS-dev VM on your host, and if so, how did you manage to override this parameter?
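For context, this kind of per-host limit is set on the project's server, not by volunteers. In a stock BOINC server the cap goes in the project's config.xml; a minimal sketch (the option name is the stock BOINC one -- CMS-dev's actual configuration is not shown in this thread):

    <config>
        <!-- allow each host at most one workunit in progress at a time -->
        <max_wus_in_progress>1</max_wus_in_progress>
    </config>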
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 728
I'm not, but when we get closer to production I'd like to be able to select myself how many to run -- e.g. I have 20-core servers with 128 GB of RAM that should be able to run more than one instance. (Not to mention that 60-core, 240-thread Xeon Phi languishing in the lab, tho' I don't think VirtualBox will run in its limited resources... :-)
Joined: 6 Mar 15 · Posts: 19 · Credit: 142,109 · RAC: 0
I'm not, but when we get closer to production I'd like to be able to select myself how many to run --

Oh yes, please please please!!!
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 728
I'm not, but when we get closer to production I'd like to be able to select myself how many to run --

Just keep in mind that this is a very resource-intensive project, even as we're running it in pre-beta. I had a case recently where tasks were failing because a 5+ GB VM was left in a slot directory where subsequent tasks were run; as they accumulated results, the directory exceeded 10 GB, a limit in our set-up, and the tasks failed.

Also, we generate extensive network traffic, in the tens to hundreds of megabytes. This is really not viable on my home network, as I have (had! It's been very flaky since the Sunday before Easter) a maximum of 6 Mbps, or roughly 45 MB/min. Given that 6 Mbps just barely covers a 720p video stream from iPlayer, I get some interruptions to the BBC's News at Six. Luckily I now have the option of fibre-to-the-cabinet, and the cabinet lives in my front hedge, so I can get 60 Mbps or so if I ever get the round tuits. At a guess, I'd say one needs at least a 10 Mbps connexion to avoid CMS@Home intruding excessively into one's other (home -- I have up to 1 Gbps at work :-) network activities.

There's also the issue of disk-space usage, but I haven't really quantified that yet. And of course memory: our current VMs have a 1 GB memory space each. And don't forget, all of these parameters are subject to change once we start processing "real-life" work flows. I don't mean to be a Cassandra but, all in all, this will never be a set-and-forget project like SETI@Home can usually be.
Joined: 5 Apr 15 · Posts: 3 · Credit: 1,606,870 · RAC: 0
I had a case recently where tasks were failing because a 5+ GB VM was left in a slot directory where subsequent tasks were run; as they accumulated results, the directory exceeded 10 GB, a limit in our set-up, and the tasks failed.

This just happened to me with my first two CMS-dev WUs, except it was work from other projects that was getting computation errors due to lack of disk space. This had me puzzled until I found the two VDIs left in the slots directory, which meant that between the slots and the regular files, CMS-dev was taking more than 12.5 GB of space on the 128 GB SSD I use for BOINC -- and this without a single task running at the moment. Now that I know to keep an eye out for leftovers it won't be a problem for me, but others might be in for an unpleasant surprise if this isn't fixed.
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 728
I had a case recently where tasks were failing because a 5+ GB VM was left in a slot directory where subsequent tasks were run; as they accumulated results, the directory exceeded 10 GB, a limit in our set-up, and the tasks failed.

Well, hopefully we've found the cause of that -- see this thread if you haven't already.

But back to our subject! The question of running more than one task at a time has arisen again. It doesn't make much sense at present, given that we're not producing "real" data yet, but how do we allow it without overloading smaller machines? vLHC allows two tasks to be active at a time. In principle the user can use app_config.xml in the project directory to limit the number of concurrent jobs. I just tried on both a Windows and a Linux machine, using

    <app_config>
        <app>
            <name>vboxwrapper</name>
            <max_concurrent>1</max_concurrent>
        </app>
    </app_config>

The documentation says that the project needs to be reset if this is changed, but I had to stop and restart BOINC as well.

Oops, the Windows machine is the one that won't start a CMS task properly, and it looks like it's having the same problem with vLHC -- "VM Hypervisor failed to enter an online state in a timely fashion". The Linux box is now running just one task, though, with the other in "waiting to run" state. I'll let it play for a couple of days and see what happens.
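For anyone following along: app_config.xml goes in the project's own directory under the BOINC data directory. For vLHC that would presumably be something like

    projects/lhcathome2.cern.ch_vLHCathome/app_config.xml

(the directory name is derived from the project's master URL, so treat the exact path as illustrative and check your own projects/ folder).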
Joined: 8 Apr 15 · Posts: 781 · Credit: 12,324,905 · RAC: 1,506
Yeah, over at vLHC the usual way members get one or two tasks is just setting the preferences on their account. Most still run x2, but we still have a few that run just one task at a time. And there we can also run either the regular tasks or the new Databridge tasks; since on occasion one type will not have any tasks, mine is set to run x2 with either version, which usually keeps my hosts running (like right now, when the Databridge tasks are empty, so mine switched back to the other version). http://lhcathome2.cern.ch/vLHCathome/top_hosts.php?sort_by=total_credit

That would work here too -- now or in the future. (And nice not having those "snapshots" with Databridge.)

Mad Scientist For Life
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 728
Well, the .xml file I posted above does seem to work for the vLHC tasks. It's a little bit chicken-and-egg as to what order you need to do i) creating the file; ii) resetting the project; and iii) stopping and restarting BOINC -- but it works itself out in short order. Now the question is: can we ship a one-job-at-a-time default file, and then raise the per-PC job limit, so that people who want, and have the capacity, to run more than one can easily edit the file to that end?
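Such a shipped default might look like the file posted above, with a comment inviting volunteers to raise the limit -- a sketch, reusing the app name from earlier in this thread:

    <app_config>
        <app>
            <name>vboxwrapper</name>
            <!-- raise this value to run more CMS tasks at once,
                 if your machine has the cores, RAM and bandwidth -->
            <max_concurrent>1</max_concurrent>
        </app>
    </app_config>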
Joined: 4 May 15 · Posts: 64 · Credit: 55,584 · RAC: 0
Well, the .xml file I posted above does seem to work for the vLHC tasks. It's a little bit chicken-and-egg as to what order you need to do i) creating the file; ii) resetting the project; and iii) stopping and restarting BOINC -- but it works itself out in short order.

Where are you seeing anything about resetting the project? My recipe would be: i) create the file; ii) issue the 'Read config files' command from BOINC Manager. If you have two tasks running at once and set <max_concurrent> to 1, one of them will stop. Simple as that.
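(On a headless machine, the same re-read can presumably be triggered from the command line with boinccmd --read_cc_config, which asks the running client to re-read its configuration files, app_config.xml included -- worth verifying on your own set-up.)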
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 728
In http://boinc.berkeley.edu/wiki/client_configuration#Application_configuration:

"If you remove app_config.xml, or one of its entries, you must reset the project in order to restore the proper values."

Perhaps I'm misreading it.
Joined: 20 May 15 · Posts: 217 · Credit: 6,145,905 · RAC: 1,205
That's what it says! I do know that if all you change is the max_concurrent value, then just doing a 'Read config files' will take effect immediately.
Joined: 4 May 15 · Posts: 64 · Credit: 55,584 · RAC: 0
I think the operative word in that note might be 'remove' -- either the whole file, or some entry types within it. I suspect some entries, notably the thread-count settings for MT applications, may have a delayed impact -- they might only come into effect when a new task is started, or when new work is fetched from the project concerned. And the GPU count is slow to display, even though it comes into effect immediately. Altogether, it's a slightly clumsy and perhaps unfinished (though incredibly useful) mechanism. I do have an editing account for that wiki, so if I can think of a better form of words (or if anyone else can suggest one), I can post it.
Joined: 29 May 15 · Posts: 147 · Credit: 2,842,484 · RAC: 0
In http://boinc.berkeley.edu/wiki/client_configuration#Application_configuration:

Hm, this depends on what you are changing. If you allow the application one or more extra cores, or take them away, it is enough to say "Read config files" and BOINC can react to it. If you change the number of CPUs a VM can use, that only takes effect with the next VM/WU that starts, so resetting may be a good idea there. For VMs that are already running, the number of allowed cores is not changed.
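For reference, the per-VM CPU count is the <app_version> side of app_config.xml, rather than the <app> stanza shown earlier in the thread. A sketch, with an assumed app name and plan class and illustrative values:

    <app_config>
        <app_version>
            <app_name>vboxwrapper</app_name>
            <plan_class>vbox64_mt</plan_class>
            <!-- schedule two CPUs per VM; as noted above, only VMs
                 started after the change pick this up -->
            <avg_ncpus>2</avg_ncpus>
        </app_version>
    </app_config>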
Joined: 20 Mar 15 · Posts: 243 · Credit: 886,442 · RAC: 0
Note that there is a "bug": whilst the app_config file will allow only one task to run at a time, the work-fetch process doesn't take this into account, so more tasks can be downloaded, only for the "excess" to sit there waiting. This can also result in idle cores (presumably the one that would have run the waiting task). It may be possible to circumvent this using the avg_ncpus setting, but it didn't work for me: setting it to 1 allowed only one task to be downloaded, but after a few seconds BOINC simply went back and got another. This is average CPUs and can be fractional, so maybe setting it to a lower value might work, but I've not tried.
Joined: 20 Mar 15 · Posts: 243 · Credit: 886,442 · RAC: 0
Note that there is a "bug": whilst the app_config file will allow only one task to run at a time, the work-fetch process doesn't take this into account, so more tasks can be downloaded, only for the "excess" to sit there waiting. This can also result in idle cores (presumably the one that would have run the waiting task). It may be possible to circumvent this using the avg_ncpus setting, but it didn't work for me: setting it to 1 allowed only one task to be downloaded, but after a few seconds BOINC simply went back and got another.

Looks as though this may have been fixed in v7.6.9. From the version history: "client: fix job scheduling bug that starves CPU instances". I haven't specifically tested it, but it is running OK on ATLAS so far. Thanks, Rom.
Joined: 4 May 15 · Posts: 64 · Credit: 55,584 · RAC: 0
Note that there is a "bug": whilst the app_config file will allow only one task to run at a time, the work-fetch process doesn't take this into account, so more tasks can be downloaded, only for the "excess" to sit there waiting. This can also result in idle cores (presumably the one that would have run the waiting task). It may be possible to circumvent this using the avg_ncpus setting, but it didn't work for me: setting it to 1 allowed only one task to be downloaded, but after a few seconds BOINC simply went back and got another.

That was a very specific bug -- introduced by mistake in v7.6.3 -- that Cliff Harding found and we worked through from here. Unless you were seeing similar symptoms in cpu_sched_debug logging, specifically like

    [cpu_sched_debug] using 2.00 out of 6 CPUs

I doubt this change is relevant to you. Work fetch and app_config.xml still aren't hooked up.
Joined: 17 Aug 15 · Posts: 17 · Credit: 228,358 · RAC: 0
Note that there is a "bug": whilst the app_config file will allow only one task to run at a time, the work-fetch process doesn't take this into account, so more tasks can be downloaded, only for the "excess" to sit there waiting. This can also result in idle cores (presumably the one that would have run the waiting task).

Quite true. It is not a problem here that I have found thus far, since the limit set for the CMS tasks is only 1. But to prevent CPU starvation on other projects, you can adjust the "resource share" for each project accordingly. For example, I have 6 cores available (2 are reserved for GPUs), and if I want 2 cores for Project A and 4 cores for Project B, I use app_config to limit Project A to 2 work units max. Then I adjust the resource share to 50 for Project A, which sets those downloads to 1/3 of the total (the arithmetic is spelled out below). I don't need to do anything for Project B, and it all works out well most of the time. Occasionally there are mis-estimates of the running time, but they get corrected eventually. You don't actually need the app_config at all in that case if you are willing to live with long-term averages, but there are projects that take a lot of memory where I do want a limit at all times.
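The share arithmetic, spelled out (resource shares are relative weights; this assumes Project B keeps the default share of 100):

    Project A's fraction = share_A / (share_A + share_B)
                         = 50 / (50 + 100)
                         = 1/3 of work fetch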
Joined: 20 Mar 15 · Posts: 243 · Credit: 886,442 · RAC: 0
Note that there is a "bug": whilst the app_config file will allow only one task to run at a time, the work-fetch process doesn't take this into account, so more tasks can be downloaded, only for the "excess" to sit there waiting. This can also result in idle cores (presumably the one that would have run the waiting task). It may be possible to circumvent this using the avg_ncpus setting, but it didn't work for me: setting it to 1 allowed only one task to be downloaded, but after a few seconds BOINC simply went back and got another.

You're right. Thanks, Richard.
Joined: 19 Aug 15 · Posts: 46 · Credit: 3,626,553 · RAC: 92
I followed these instructions for vLHC to set up all my computers to run >1 task; I still use it on one computer to crank out more work. http://lhcathome2.cern.ch/vLHCathome/forum_thread.php?id=1154&postid=13411#13411

On Atlas@Home I make use of app_config to tune the number of WUs my PCs can handle, based on the amount of RAM, as it stomps your computer if you let it run free. The only computer that's close to working right is the one with 80 GB of RAM.
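The RAM-based tuning amounts to capping concurrency so the VMs fit in memory. A sketch with made-up numbers (the app name and per-VM footprint are illustrative, not measured ATLAS figures):

    <app_config>
        <app>
            <name>ATLAS</name>
            <!-- hypothetical sizing: at ~4 GB per VM on an 80 GB host,
                 the ceiling is 80/4 = 20; cap well below that for headroom -->
            <max_concurrent>8</max_concurrent>
        </app>
    </app_config>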