Message boards : Number crunching : Multiple Jobs In A Single Host
Joined: 12 Sep 14 · Posts: 1069 · Credit: 334,882 · RAC: 0
At the moment we have set max_jobs_in_progress to 1. This means that when you join this project, only one VM should be started no matter how many cores you have available. This has been done to avoid being greedy and swamping your machine with VMs. Are any of you running more than one CMS-dev VM on your host, and if so, how did you manage to override this parameter?
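For context, this kind of per-host limit is set on the project's server, not by volunteers. In a stock BOINC server the cap goes in the project's config.xml; a minimal sketch (the option name is the stock BOINC one -- CMS-dev's actual configuration is not shown in this thread):

    <config>
        <!-- allow each host at most one workunit in progress at a time -->
        <max_wus_in_progress>1</max_wus_in_progress>
    </config>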
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 728
I'm not, but when we get closer to production I'd like to be able to select myself how many to run -- e.g. I have 20-core servers with 128 GB of RAM that should be able to run more than one instance. (Not to mention that 60-core, 240-thread Xeon Phi languishing in the lab, tho' I don't think VirtualBox will run in its limited resources... :-)
Joined: 6 Mar 15 · Posts: 19 · Credit: 142,109 · RAC: 0
I'm not, but when we get closer to production I'd like to be able to select myself how many to run --

Oh yes, please please please!!!
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 728
I'm not, but when we get closer to production I'd like to be able to select myself how many to run --

Just keep in mind that this is a very resource-intensive project, even as we're running it in pre-beta. I had a case recently where tasks were failing because a 5+ GB VM was left in a slot directory where subsequent tasks were run; as they accumulated results, the directory exceeded 10 GB, a limit in our set-up, and the tasks failed.

Also, we generate extensive network traffic, in the tens to hundreds of megabytes. This is really not viable on my home network, as I have (had! It's been very flaky since the Sunday before Easter) a maximum of 6 Mbps, or roughly 45 MB/min. Given that 6 Mbps just barely covers a 720p video stream from iPlayer, I get some interruptions to the BBC's News at Six. Luckily I now have the option of fibre-to-the-cabinet, and the cabinet lives in my front hedge, so I can get 60 Mbps or so if I ever get the round tuits. At a guess, I'd say one needs at least a 10 Mbps connexion to avoid CMS@Home intruding excessively into one's other (home -- I have up to 1 Gbps at work :-) network activities.

There's also the issue of disk-space usage, but I haven't really quantified that yet. And of course memory: our current VMs have a 1 GB memory space each. And don't forget, all of these parameters are subject to change once we start processing "real-life" work flows. I don't mean to be a Cassandra but, all in all, this will never be a set-and-forget project like SETI@Home can usually be.
Joined: 5 Apr 15 · Posts: 3 · Credit: 1,606,870 · RAC: 0
I had a case recently where tasks were failing because a 5+ GB VM was left in a slot directory where subsequent tasks were run; as they accumulated results, the directory exceeded 10 GB, a limit in our set-up, and the tasks failed.

This just happened to me with my first two CMS-dev WUs, except it was work from other projects that was getting computation errors due to lack of disk space. This had me puzzled until I found the two VDIs left in the slots directory, which meant that between the slots and the regular files, CMS-dev was taking more than 12.5 GB of space on the 128 GB SSD I use for BOINC -- and this without a single task running at the moment. Now that I know to keep an eye out for leftovers it won't be a problem for me, but others might be in for an unpleasant surprise if this isn't fixed.
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 728
I had a case recently where tasks were failing because a 5+ GB VM was left in a slot directory where subsequent tasks were run; as they accumulated results, the directory exceeded 10 GB, a limit in our set-up, and the tasks failed.

Well, hopefully we've found the cause of that -- see this thread if you haven't already.

But back to our subject! The question of running more than one task at a time has arisen again. It doesn't make much sense at present, given that we're not producing "real" data yet, but how do we allow it without overloading smaller machines? vLHC allows two tasks to be active at a time. In principle the user can use app_config.xml in the project directory to limit the number of concurrent jobs. I just tried on both a Windows and a Linux machine, using

    <app_config>
        <app>
            <name>vboxwrapper</name>
            <max_concurrent>1</max_concurrent>
        </app>
    </app_config>

The documentation says that the project needs to be reset if this is changed, but I had to stop and restart BOINC as well.

Oops, the Windows machine is the one that won't start a CMS task properly, and it looks like it's having the same problem with vLHC -- "VM Hypervisor failed to enter an online state in a timely fashion". The Linux box is now running just one task, though, with the other in "waiting to run" state. I'll let it play for a couple of days and see what happens.
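For anyone following along: app_config.xml goes in the project's own directory under the BOINC data directory. For vLHC that would presumably be something like

    projects/lhcathome2.cern.ch_vLHCathome/app_config.xml

(the directory name is derived from the project's master URL, so treat the exact path as illustrative and check your own projects/ folder).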
Joined: 8 Apr 15 · Posts: 781 · Credit: 12,324,905 · RAC: 1,506
Yeah, over at vLHC the usual way members get one or two tasks is just setting the preferences on their account. Most still run x2, but we still have a few that run just one task at a time. And there we can also run either the regular tasks or the new Databridge tasks; since on occasion one type will not have any tasks, mine is set to run x2 with either version, which usually keeps my hosts running (like right now, when the Databridge tasks are empty, so mine switched back to the other version). http://lhcathome2.cern.ch/vLHCathome/top_hosts.php?sort_by=total_credit

That would work here too -- now or in the future. (And nice not having those "snapshots" with Databridge.)

Mad Scientist For Life
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 728
Well, the .xml file I posted above does seem to work for the vLHC tasks. It's a little bit chicken-and-egg as to what order you need to do i) creating the file; ii) resetting the project; and iii) stopping and restarting BOINC -- but it works itself out in short order. Now the question is: can we ship a one-job-at-a-time default file, and then raise the per-PC job limit, so that people who want, and have the capacity, to run more than one can easily edit the file to that end?
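Such a shipped default might look like the file posted above, with a comment inviting volunteers to raise the limit -- a sketch, reusing the app name from earlier in this thread:

    <app_config>
        <app>
            <name>vboxwrapper</name>
            <!-- raise this value to run more CMS tasks at once,
                 if your machine has the cores, RAM and bandwidth -->
            <max_concurrent>1</max_concurrent>
        </app>
    </app_config>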
Joined: 4 May 15 · Posts: 64 · Credit: 55,584 · RAC: 0
Well, the .xml file I posted above does seem to work for the vLHC tasks. It's a little bit chicken-and-egg as to what order you need to do i) creating the file; ii) resetting the project; and iii) stopping and restarting BOINC -- but it works itself out in short order.

Where are you seeing anything about resetting the project? My recipe would be: i) create the file; ii) issue the 'Read config files' command from BOINC Manager. If you have two tasks running at once and set <max_concurrent> to 1, one of them will stop. Simple as that.
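(On a headless machine, the same re-read can presumably be triggered from the command line with boinccmd --read_cc_config, which asks the running client to re-read its configuration files, app_config.xml included -- worth verifying on your own set-up.)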
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 728
In http://boinc.berkeley.edu/wiki/client_configuration#Application_configuration:

"If you remove app_config.xml, or one of its entries, you must reset the project in order to restore the proper values."

Perhaps I'm misreading it.
Joined: 20 May 15 · Posts: 217 · Credit: 6,145,905 · RAC: 1,205
That's what it says! I do know that if all you change is the max_concurrent value, then just doing a 'Read config files' will take effect immediately.
Joined: 4 May 15 · Posts: 64 · Credit: 55,584 · RAC: 0
I think the operative word in that note might be 'remove' -- either the whole file, or some entry types within it. I suspect some entries, notably the thread-count settings for MT applications, may have a delayed impact -- they might only come into effect when a new task is started, or when new work is fetched from the project concerned. And the GPU count is slow to display, even though it comes into effect immediately. Altogether, it's a slightly clumsy and perhaps unfinished (though incredibly useful) mechanism. I do have an editing account for that wiki, so if I can think of a better form of words (or if anyone else can suggest one), I can post it.
Joined: 29 May 15 · Posts: 147 · Credit: 2,842,484 · RAC: 0
In http://boinc.berkeley.edu/wiki/client_configuration#Application_configuration:

Hm, this depends on what you are changing. If you allow the application one or more extra cores, or take them away, it is enough to say "Read config files" and BOINC can react to it. If you change the number of CPUs a VM can use, that only takes effect with the next VM/WU that starts, so resetting may be a good idea there. For VMs that are already running, the number of allowed cores is not changed.
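For reference, the per-VM CPU count is the <app_version> side of app_config.xml, rather than the <app> stanza shown earlier in the thread. A sketch, with an assumed app name and plan class and illustrative values:

    <app_config>
        <app_version>
            <app_name>vboxwrapper</app_name>
            <plan_class>vbox64_mt</plan_class>
            <!-- schedule two CPUs per VM; as noted above, only VMs
                 started after the change pick this up -->
            <avg_ncpus>2</avg_ncpus>
        </app_version>
    </app_config>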
Joined: 20 Mar 15 · Posts: 243 · Credit: 886,442 · RAC: 0
Note that there is a "bug": whilst the app_config file will allow only one task to run at a time, the work-fetch process doesn't take this into account, so more tasks can be downloaded, only for the "excess" to sit there waiting. This can also result in idle cores (presumably the one that would have run the waiting task). It may be possible to circumvent this using the avg_ncpus setting, but it didn't work for me: setting it to 1 allowed only one task to be downloaded, but after a few seconds BOINC simply went back and got another. This is average CPUs and can be fractional, so maybe setting it to a lower value might work, but I've not tried.
Joined: 20 Mar 15 · Posts: 243 · Credit: 886,442 · RAC: 0
Note that there is a "bug": whilst the app_config file will allow only one task to run at a time, the work-fetch process doesn't take this into account, so more tasks can be downloaded, only for the "excess" to sit there waiting. This can also result in idle cores (presumably the one that would have run the waiting task). It may be possible to circumvent this using the avg_ncpus setting, but it didn't work for me: setting it to 1 allowed only one task to be downloaded, but after a few seconds BOINC simply went back and got another.

Looks as though this may have been fixed in v7.6.9. From the version history: "client: fix job scheduling bug that starves CPU instances". I haven't specifically tested it, but it is running OK on ATLAS so far. Thanks, Rom.
Joined: 4 May 15 · Posts: 64 · Credit: 55,584 · RAC: 0
Note that there is a "bug": whilst the app_config file will allow only one task to run at a time, the work-fetch process doesn't take this into account, so more tasks can be downloaded, only for the "excess" to sit there waiting. This can also result in idle cores (presumably the one that would have run the waiting task). It may be possible to circumvent this using the avg_ncpus setting, but it didn't work for me: setting it to 1 allowed only one task to be downloaded, but after a few seconds BOINC simply went back and got another.

That was a very specific bug -- introduced by mistake in v7.6.3 -- that Cliff Harding found and we worked through from here. Unless you were seeing similar symptoms in cpu_sched_debug logging, specifically like

    [cpu_sched_debug] using 2.00 out of 6 CPUs

I doubt this change is relevant to you. Work fetch and app_config.xml still aren't hooked up.
Joined: 17 Aug 15 · Posts: 17 · Credit: 228,358 · RAC: 0
Note that there is a "bug": whilst the app_config file will allow only one task to run at a time, the work-fetch process doesn't take this into account, so more tasks can be downloaded, only for the "excess" to sit there waiting. This can also result in idle cores (presumably the one that would have run the waiting task).

Quite true. It is not a problem here that I have found thus far, since the limit set for the CMS tasks is only 1. But to prevent CPU starvation on other projects, you can adjust the "resource share" for each project accordingly. For example, I have 6 cores available (2 are reserved for GPUs), and if I want 2 cores for Project A and 4 cores for Project B, I use app_config to limit Project A to 2 work units max. Then I adjust the resource share to 50 for Project A, which sets those downloads to 1/3 of the total (the arithmetic is spelled out below). I don't need to do anything for Project B, and it all works out well most of the time. Occasionally there are mis-estimates of the running time, but they get corrected eventually. You don't actually need the app_config at all in that case if you are willing to live with long-term averages, but there are projects that take a lot of memory where I do want a limit at all times.
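The share arithmetic, spelled out (resource shares are relative weights; this assumes Project B keeps the default share of 100):

    Project A's fraction = share_A / (share_A + share_B)
                         = 50 / (50 + 100)
                         = 1/3 of work fetch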
Joined: 20 Mar 15 · Posts: 243 · Credit: 886,442 · RAC: 0
Note that there is a "bug": whilst the app_config file will allow only one task to run at a time, the work-fetch process doesn't take this into account, so more tasks can be downloaded, only for the "excess" to sit there waiting. This can also result in idle cores (presumably the one that would have run the waiting task). It may be possible to circumvent this using the avg_ncpus setting, but it didn't work for me: setting it to 1 allowed only one task to be downloaded, but after a few seconds BOINC simply went back and got another.

You're right. Thanks, Richard.
Joined: 19 Aug 15 · Posts: 46 · Credit: 3,626,553 · RAC: 92
I followed these instructions for vLHC to set up all my computers to run >1 task; I still use it on one computer to crank out more work. http://lhcathome2.cern.ch/vLHCathome/forum_thread.php?id=1154&postid=13411#13411

On Atlas@Home I make use of app_config to tune the number of WUs my PCs can handle, based on the amount of RAM, as it stomps your computer if you let it run free. The only computer that's close to working right is the one with 80 GB of RAM.
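The RAM-based tuning amounts to capping concurrency so the VMs fit in memory. A sketch with made-up numbers (the app name and per-VM footprint are illustrative, not measured ATLAS figures):

    <app_config>
        <app>
            <name>ATLAS</name>
            <!-- hypothetical sizing: at ~4 GB per VM on an 80 GB host,
                 the ceiling is 80/4 = 20; cap well below that for headroom -->
            <max_concurrent>8</max_concurrent>
        </app>
    </app_config>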