New Multi-core version V1.9
Author | Message |
---|---|
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
This is a test for the multi-core version of the theory app. Expect things to break. If you don't like things breaking, please crunch on the production project. |
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Is this working for anyone? I am having difficulty. Two tasks were downloaded, and both have 1.5 CPUs :( One is running and one is waiting to start. What do I need to do to get one task that has 3 CPUs? |
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 61 |
This is working for me now, after I stopped the VM following a first try that left a zombie java process. After that, 2 cores are allocated (I set the system to 2 cores) and 2 science processes are running, but only one is displayed in the running log.
16:02:28 +0200 2016-07-13 [INFO] New Job Starting
16:02:29 +0200 2016-07-13 [INFO] Condor JobID: 1178883
16:02:31 +0200 2016-07-13 [INFO] New Job Starting
16:02:31 +0200 2016-07-13 [INFO] Condor JobID: 1178881
16:02:34 +0200 2016-07-13 [INFO] MCPlots JobID: 31917765
16:02:36 +0200 2016-07-13 [INFO] MCPlots JobID: 31917806
I also increased the VM's memory to 1024 MB (512 MB for each of the 2 cores).
You have to think about using containers if you want to run more than 1 job in a VM. |
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
I decreased the number of tasks given out by the project to 1, and I now get one task with 1.5 CPUs. My machine has four cores and 3 are available for BOINC. Condor, which runs inside the VM, dynamically creates as many job slots as there are CPUs and splits the memory evenly between them. So for Theory we just need to end up with about 600 MB per core, but I am a little confused about how to do this. The other thing we need to do is sort out the multiple job logs ... |
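A minimal sketch of the memory arithmetic described above, assuming Condor splits the VM's memory evenly with one slot per CPU and using the ~600 MB per core target for Theory; the function names are hypothetical:

```python
def memory_per_slot_mb(vm_memory_mb: int, ncpus: int) -> int:
    """Memory each Condor job slot gets if the VM's memory is split evenly (assumed)."""
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    return vm_memory_mb // ncpus


def vm_memory_for_target_mb(ncpus: int, per_core_mb: int = 600) -> int:
    """VM memory needed so that every slot gets at least per_core_mb."""
    return ncpus * per_core_mb


if __name__ == "__main__":
    # 2 cores sharing 1024 MB, as in the earlier post: 512 MB per slot.
    print(memory_per_slot_mb(1024, 2))
    # 3 cores at the ~600 MB per core target: 1800 MB for the VM.
    print(vm_memory_for_target_mb(3))
```

Going the other way (from a per-core target to the VM size) is the useful direction here, since the VM memory is what gets configured.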
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I have been using multicore for quite a while now. I simply allocated more cores (1.3) to a task in app_config.xml. Works fine. |
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 61 |
The other thing we need to do is sort out the multiple job logs ...
2016-07-13 18:08:39 (6424): Guest Log: [INFO] Theory application starting. Check log files.
2016-07-13 18:10:09 (6424): Guest Log: [[IINNFFOO]] NNeeww JJobo bS tSatratritnigng
2016-07-13 18:10:09 (6424): Guest Log: [[IINNFFOO]] NNeeww JJobo bS tSatratritnigng
2016-07-13 18:10:09 (6424): Guest Log: [INFO] Co[ndIoNr FJOo]b ICDo:n d o1r 1J801o0b6I
2016-07-13 18:10:09 (6424): Guest Log: 1180105
2016-07-13 18:10:20 (6424): Guest Log: [INFO] MCPlots JobID: 31919035
2016-07-13 18:10:20 (6424): Guest Log: [INFO] MCPlots JobID: 31918954
... and how to decide when it's time for a graceful shutdown without losing CPU time, because 1 core is idle while the other one is still busy with its last job ... or 1 core is idle because no jobs are available and the other core is still busy. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
What is going to be achieved here?
1. Have one task with one job at a time running on multiple cores
2. Run multiple jobs simultaneously in one task (VM), with one core or more per job
In my opinion, running multiple jobs in parallel in one VM is asking for trouble and makes no sense. |
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
The goal is to run multiple jobs in parallel within the same VM. This approach reduces the disk and memory requirements at the cost of some idle CPU. That cost is roughly half the run time of one job multiplied by the number of cores; divide half the run time of one job by the time the VM was up and you get the fraction of time lost to idling. It may not be for everyone, but for the power volunteers it is an option. The good thing about multicore is that if the number of cores is one, it is what we have now. |
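The same rough estimate written out, with symbols introduced here only for illustration (n cores, average job run time T_job, VM uptime T_VM, idle core-time I) and the post's simplifying assumption that each core idles for about half a job at the end of the task:

```latex
I \approx n \cdot \frac{T_{\text{job}}}{2},
\qquad
f_{\text{idle}} = \frac{I}{n\,T_{\text{VM}}} \approx \frac{T_{\text{job}}}{2\,T_{\text{VM}}},
\qquad
\eta = 1 - f_{\text{idle}}
```

Because n cancels, the idle fraction depends only on how long a job runs compared with how long the VM lives.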
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 61 |
What do I need to do to get one task that has 3 CPUs?
For testing, I have a mt_mcore task running with 3 processors dedicated to the VM, using an app_config.xml with this app_version part:
<app_version>
  <app_name>Theory</app_name>
  <plan_class>vbox64_mt_mcore</plan_class>
  <avg_ncpus>3.000000</avg_ncpus>
  <cmdline>--nthreads 3 --memory_size_mb 1536</cmdline>
</app_version>
Three Theory jobs have started, consuming about 1.01 GB RAM with 0.44 GB free and no swap used so far. Because the logs are missing I am not sure, but 2 pythia8 jobs and probably 1 pythia6 job are running at the moment.
21:19:12 +0200 2016-07-13 [INFO] New Job Starting
21:19:12 +0200 2016-07-13 [INFO] New Job Starting
21:19:12 +0200 2016-07-13 [INFO] Condor JobID: 1181998
21:19:12 +0200 2016-07-13 [INFO] New Job Starting
21:19:12 +0200 2016-07-13 [INFO] Condor JobID: 1181997
21:19:12 +0200 2016-07-13 [INFO] Condor JobID: 1181996
21:19:17 +0200 2016-07-13 [INFO] MCPlots JobID: 31920864
21:19:17 +0200 2016-07-13 [INFO] MCPlots JobID: 31920920
21:19:18 +0200 2016-07-13 [INFO] MCPlots JobID: 31920834
I'll let it run overnight and see how it ends after 12 or 18 hours. |
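A minimal sketch that builds the same kind of app_version entry for any core count, assuming 512 MB per core (as in the 3-core, 1536 MB configuration above) and the app name and plan class shown there. The helper name is hypothetical, and this sketch puts both wrapper options in a single cmdline element:

```python
import xml.etree.ElementTree as ET


def app_version_xml(ncpus: int, mb_per_core: int = 512,
                    app_name: str = "Theory",
                    plan_class: str = "vbox64_mt_mcore") -> str:
    """Return an <app_version> element (as a string) for app_config.xml.

    Assumption: the VM gets mb_per_core MB per CPU, and --nthreads and
    --memory_size_mb are passed together in one <cmdline> string.
    """
    av = ET.Element("app_version")
    ET.SubElement(av, "app_name").text = app_name
    ET.SubElement(av, "plan_class").text = plan_class
    ET.SubElement(av, "avg_ncpus").text = f"{ncpus:.6f}"
    ET.SubElement(av, "cmdline").text = (
        f"--nthreads {ncpus} --memory_size_mb {ncpus * mb_per_core}"
    )
    return ET.tostring(av, encoding="unicode")


if __name__ == "__main__":
    # Wrap the entry in the <app_config> root and pretty-print it (Python 3.9+).
    root = ET.Element("app_config")
    root.append(ET.fromstring(app_version_xml(3)))
    ET.indent(root)
    print(ET.tostring(root, encoding="unicode"))
```

Deriving the memory from the core count keeps the per-slot share constant when the number of processors is changed.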
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I am seeing the slot number added to the job wrapper output. However, that does not help, as all jobs are running in the same Task/slot. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Apparently, these slot numbers are VM-internal, not BOINC slots. Never mind. |
Joined: 26 Feb 15 Posts: 26 Credit: 5,042,431 RAC: 910 |
How will the BOINC client know to run only one of these tasks, which will use all CPU threads, and no other CPU tasks of any kind from any other project? Or even tasks from other applications on the same project? Also, assuming a single task using 4 CPU threads within the VM, how will BOINC know to credit 4x threads of CPU time? Here is the thing: without using a VM, BOINC can already do all this natively. Several projects already use this method. Are we trying to re-create something already available, just so that it can be in a VM? Reno, NV Team: SETI.USA |
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 61 |
I am seeing the slot number added to the job wrapper output.
With my 3-processor VM I see this now:
running-slot1.log 14-Jul-2016 08:02 24K
running-slot2.log 14-Jul-2016 08:03 61K
running-slot3.log 14-Jul-2016 08:03 31K
running.log 14-Jul-2016 08:02 24K
In my case the log of 1 job is written twice: running-slot1.log and running.log have the same contents.
Edit: Since about 01:20 CEST the finished_XX.logs are all complete and no longer cut off when a new job starts in another directory. stdout.log provides a bit more information.
At the moment 1 sherpa, 1 herwig++ and 1 pythia6 job are running, and the VM uses 1.3 GB of memory. The swap file was hardly used overnight: 3584 KB. |
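With one running-slotN.log per slot, a single chronological view can be rebuilt by merging them instead of letting the slots write into one stream. A minimal sketch, assuming every line starts with a timestamp in the format shown in the running logs (HH:MM:SS +ZZZZ YYYY-MM-DD) and that each slot's file is already in time order; the function names and sample lines are hypothetical, and reading the actual files is left out:

```python
import heapq
from datetime import datetime

# Timestamp prefix as seen in the running logs, e.g. "21:19:12 +0200 2016-07-13".
TIMESTAMP_FORMAT = "%H:%M:%S %z %Y-%m-%d"


def line_time(line: str) -> datetime:
    """Parse the timestamp from the first three fields of a log line."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(stamp, TIMESTAMP_FORMAT)


def merge_slot_logs(slot_logs: dict[str, list[str]]) -> list[str]:
    """Merge per-slot log lines into one time-ordered list, tagging each with its slot.

    heapq.merge requires each input to be sorted, so every slot's lines
    are assumed to be in chronological order already.
    """
    streams = [
        [(line_time(line), slot, line) for line in lines if line.strip()]
        for slot, lines in slot_logs.items()
    ]
    return [f"[{slot}] {line}" for _, slot, line in heapq.merge(*streams)]


if __name__ == "__main__":
    # Made-up sample lines in the logged format, for illustration only.
    sample = {
        "slot1": ["21:19:12 +0200 2016-07-13 [INFO] New Job Starting",
                  "21:19:17 +0200 2016-07-13 [INFO] MCPlots JobID: ..."],
        "slot2": ["21:19:14 +0200 2016-07-13 [INFO] New Job Starting"],
    }
    for merged in merge_slot_logs(sample):
        print(merged)
```

Tagging each merged line with its slot keeps the two "New Job Starting" messages distinguishable, which the single interleaved stream could not do.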
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
The big question is how a task is ended.
1. When the first job ends after the 12h mark: all remaining jobs would be abandoned.
2. When the longest-running job slot ends: this would mean that any job slot that frees up after the 12h mark must not get a new job; otherwise the 18h cutoff would eventually be reached and some job(s) would be abandoned. |
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
It should be 2. If a job runs for 90 minutes on average, all slots should be empty on average within 45 minutes after the 12h mark. The number of cores in use will decrease as a step function. The trade-off in this case would be about 6.25% idle time (45 of 720 minutes) to save 8 GB of disk space per core and reduce some memory usage. |
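Option 2 can be simulated in a few lines. A minimal sketch, assuming all slots start together at time 0, run their jobs back to back, and stop taking new jobs once the 12-hour (720-minute) mark is reached; the function name and the sample job lengths are hypothetical:

```python
def drain_after_cutoff(job_queues: list[list[float]], cutoff_min: float):
    """Simulate option 2: a slot may start a new job only before the cutoff.

    job_queues holds one list of job durations (minutes) per slot.
    Returns (vm_end, busy, idle): when the VM shuts down, total busy
    core-minutes, and core-minutes slots sit empty before shutdown.
    """
    slot_ends = []
    busy = 0.0
    for durations in job_queues:
        t = 0.0
        for d in durations:
            if t >= cutoff_min:  # past the cutoff: this slot gets no new job
                break
            t += d
            busy += d
        slot_ends.append(t)  # also reached early if the queue runs out
    vm_end = max(slot_ends)  # VM shuts down when the longest-running slot ends
    idle = sum(vm_end - end for end in slot_ends)
    return vm_end, busy, idle


if __name__ == "__main__":
    vm_end, busy, idle = drain_after_cutoff([[100] * 8, [90] * 9], 720)
    print(vm_end, busy, idle)
    print(f"idle fraction: {idle / (2 * vm_end):.1%}")
```

A slot whose queue runs out before the cutoff (no jobs available) simply stops early, and its remaining time until shutdown counts as idle.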
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Hi zombie67, I am new to multi-core with BOINC, so I can't answer all your questions, but AFAIK BOINC does support multi-core VMs, as this is what we are experimenting with here. We need to run in a VM because the software we use is not available for Windows, which 80% of the volunteers use. Our tasks are embarrassingly parallel, so we can either run one job per VM and run many VMs, or run many jobs in one VM. There is always a trade-off, and here it is some idle CPU in exchange for reduced disk space and network usage. If we can support both approaches, then it is up to you which you prefer to run. |
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 61 |
I returned the first task that ran the whole sequence: http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=221315
At 09:15 the 12 hours of elapsed time were over.
At 10:53 the job in slot 2 finished and no new job was started.
At 11:30 the job in slot 1 finished and no new job was started.
At 12:00 the job in slot 3 finished (started at 09:06) and the VM got its shutdown signal.
After last night's change, the Condor and MCPlots JobIDs were no longer added to BOINC's stderr.txt. In the newly started task they are. |
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
So at the end of the task we have 97 minutes of idle time and 2378 minutes of productive time, giving a 4% idle time overhead for reducing the overhead of multiple VMs. Has anyone got a spare 8 cores to play with? The maximum we can do at the moment is 10 cores, but I can easily increase this. EDIT: We could halve that overhead by doubling the lifetime of the VM. |
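The 97 idle minutes follow from the finish times reported for result 221315: slot 2 sat empty from 10:53 to 12:00 (67 minutes) and slot 1 from 11:30 to 12:00 (30 minutes). A quick check, taking the 2378 productive minutes as given; the function name is hypothetical:

```python
from datetime import datetime


def idle_minutes(shutdown: str, slot_finish_times: list[str]) -> int:
    """Sum the minutes each slot sat empty between its last job and VM shutdown.

    Times are same-day "HH:MM" strings.
    """
    fmt = "%H:%M"
    end = datetime.strptime(shutdown, fmt)
    return sum(
        int((end - datetime.strptime(t, fmt)).total_seconds() // 60)
        for t in slot_finish_times
    )


if __name__ == "__main__":
    # Finish times of slots 1, 2 and 3; the VM shut down at 12:00.
    idle = idle_minutes("12:00", ["11:30", "10:53", "12:00"])
    productive = 2378  # productive minutes, as stated in the post
    print(idle)                              # 97
    print(f"{idle / productive:.1%}")        # 4.1%
    # Doubling the VM lifetime roughly doubles productive time while the
    # end-of-task idle time stays about the same.
    print(f"{idle / (2 * productive):.1%}")  # 2.0%
```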
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 61 |
Has anyone got a spare 8 cores to play with? The maximum we can do at the moment is 10 cores, but I can easily increase this.
I'll set up my i7 with 8 threads (4 cores, hyperthreaded) to run a VM with 8 processors, but I will set VBoxHeadless.exe to priority 'below normal' and set BOINC's CPU usage to 90% (Execution Cap 90) to keep my machine responsive. I'll start it later today so that I can watch the processes end tomorrow morning. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 330 |
So at the end of the task we have 97 minutes of idle time and 2378 minutes of productive time, giving a 4% idle time overhead for reducing the overhead of multiple VMs.
I've got a 20-core (128 GB) and a 12-core (64 GB) that I could try. :-) |
©2024 CERN