New Multi-core version V1.9
Author | Message |
---|---|
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
We do consider this a problem. There are a few options: 1) Run short backfill jobs (~10 min). 2) Set a 90-minute grace period for jobs to finish. 3) Use an algorithm for the grace period: stop when the idle time of the inactive slots exceeds the runtime of the active slots. 4) Try to define an estimated runtime for jobs to improve scheduling. 5) A combination of the above. Option 2 is easy to do, and adding option 3 on top would be the best optimization. |
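The grace-period rule in option 3 can be sketched in a few lines of Python. This is only an illustration of the rule as worded in the post, not the project's implementation: the function name `should_stop_vm`, the use of per-slot times in seconds, and the choice to sum over slots are all assumptions.

```python
def should_stop_vm(active_runtimes_s, idle_times_s):
    """Decide whether to stop the VM during the grace period.

    active_runtimes_s: runtime so far (seconds) of each job still running.
    idle_times_s: time (seconds) each finished slot has been sitting idle.

    Assumption: both quantities are summed over slots, so the rule compares
    work that would be thrown away by killing the running jobs against
    core time already wasted by waiting for them.
    """
    if not active_runtimes_s:
        # Nothing is running any more, so there is nothing to wait for.
        return True
    return sum(idle_times_s) > sum(active_runtimes_s)


# Hypothetical numbers: one job has run for 1 h, two slots are idle.
print(should_stop_vm([3600], [1000, 1000]))  # idle 2000 s < 3600 s: keep waiting
print(should_stop_vm([3600], [2000, 2000]))  # idle 4000 s > 3600 s: stop
```

Combining this with option 2 would simply mean also stopping once a fixed cap (for example 90 minutes) has passed, whichever comes first.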
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 875,820 RAC: 659 |
"We do consider this a problem. There are a few options" This option doesn't work, because BOINC allocates all host cores dedicated to the VM even when the VM is doing no work at all but is in the running state. BOINC isn't aware of 1 or more cores idling inside the VM and will thus not free any core for other jobs until the whole VM task has ended. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
"...and will thus not free any core" The cores are already allocated to the task. BOINC does not need to free any. EDIT: Therefore, utilizing these with short jobs makes sense. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 875,820 RAC: 659 |
"...and will thus not free any core" I think Laurence meant jobs from other BOINC projects and not vLHC. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I thought he meant jobs within the VM. Simple misunderstanding. Which one is it? |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I thought he meant jobs within the VM. Simple misunderstanding. Which one is it? EDIT: He said jobs, not tasks. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Yes, backfill jobs within the VM. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
For a start, I suggest changing the cutoff time from 18 h to 15 h. That should be simple to implement. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 875,820 RAC: 659 |
Here is an example of how an ultra-long job affects a task. I did not double-check, but the performance of this task was 61.96% overall, and it would have been 87.8% without the CPU seconds lost to idling cores. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
You have to take into account that the slot 2 task did not finish; therefore, its time has to be deducted. EDIT: If you were running a different project and 7 out of 8 cores were idling for 5 hours, would you not be upset? |
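The deduction mentioned here can be made concrete with a small calculation. The sketch below assumes efficiency is defined as useful CPU seconds divided by the core-seconds allocated to the VM; that definition, the function name `task_efficiency`, and the numbers in the example are hypothetical and not taken from the task discussed above.

```python
def task_efficiency(job_cpu_seconds, cores, wall_seconds, unfinished_cpu_seconds=0.0):
    """Fraction of allocated core time that produced finished results.

    job_cpu_seconds: CPU seconds of every job in the task, including
        jobs that were still running when the task was cut off.
    unfinished_cpu_seconds: CPU seconds spent on jobs that never
        finished; this work is lost and is therefore deducted.
    """
    useful = sum(job_cpu_seconds) - unfinished_cpu_seconds
    return useful / (cores * wall_seconds)


# Hypothetical 2-core task that ran for 1 h of wall time.
jobs = [3600, 1800]
print(task_efficiency(jobs, cores=2, wall_seconds=3600))                               # 0.75
print(task_efficiency(jobs, cores=2, wall_seconds=3600, unfinished_cpu_seconds=1800))  # 0.5
```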
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 875,820 RAC: 659 |
"You have to take into account, that the slot 2 task did not finish, therefore the time has to be deducted." Yeah, you're right. The last 6 hours of slot 2 were useless in the end. Nowadays there are other problems to get upset about, but I agree: I too want to get as much as possible out of my machine (when it is not resting). That is why, in earlier days, I ran T4T with 8 tasks even though the project was giving out only 1 task at a time, simply by running 8 BOINC clients on the 8-threaded host. However, here we're testing other possibilities and trying to get those optimized. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Yes. I think that algorithm is the best approach but I need to figure out how to do it. As you pointed out, you have to consider both the time lost by killing a running job and the time lost with idle cores. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Is there a way to identify these ultra-long tasks? I have noticed that all the ones I know of were sherpa tasks. If there is anything in the logs that could identify them, it could be used to terminate them. Or, even better, these tasks could not be issued in the first place. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 875,820 RAC: 659 |
This is just a quick scan over 1 task with 27 jobs, but as I already mentioned here, I have also had tasks with 51 and 55 jobs. The average of the 27 jobs was 6,765 seconds, the minimum run time was 501 seconds and the maximum run time was 17,464 seconds. In the past I have also seen jobs running for over 24 hours. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
There could be 2 criteria to identify an ultra-long job: 1. It needs to be a "sherpa" job. 2. Search the running-x.log for a line like "Event yyyyyy ( 21m 39s elapsed / 2h 16m 27s left ) -> ETA: Tue Jul 19 19:07" and determine whether that would fit into the remaining time (minus some safety margin). EDIT: However, I have a sherpa job that has been running for 3 h now and has not even reached the stage where this would apply. |
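The second criterion could look roughly like the Python sketch below. It assumes the progress line has exactly the shape quoted in the post ("( ... elapsed / ... left ) -> ETA:"); real logs may differ. The names `parse_progress` and `fits_in_remaining_time` and the 10-minute default margin are made up for illustration.

```python
import re

# Matches the "( 21m 39s elapsed / 2h 16m 27s left ) -> ETA:" part of the line.
PROGRESS_RE = re.compile(
    r"\(\s*(?P<elapsed>[^/()]*?)\s+elapsed\s*/\s*(?P<left>[^/()]*?)\s+left\s*\)\s*->\s*ETA:"
)
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def to_seconds(text):
    """Convert a duration such as '2h 16m 27s' to seconds."""
    return sum(int(n) * UNIT_SECONDS[u] for n, u in re.findall(r"(\d+)\s*([dhms])", text))


def parse_progress(line):
    """Return (elapsed_s, left_s), or None if the line carries no ETA."""
    m = PROGRESS_RE.search(line)
    if m is None:
        return None
    return to_seconds(m["elapsed"]), to_seconds(m["left"])


def fits_in_remaining_time(line, remaining_s, margin_s=600):
    """True/False if the job can or cannot finish in time; None if no ETA yet."""
    progress = parse_progress(line)
    if progress is None:
        return None
    _, left_s = progress
    return left_s + margin_s <= remaining_s


line = "Event yyyyyy ( 21m 39s elapsed / 2h 16m 27s left ) -> ETA: Tue Jul 19 19:07"
print(parse_progress(line))                    # (1299, 8187)
print(fits_in_remaining_time(line, 3 * 3600))  # True
print(fits_in_remaining_time(line, 2 * 3600))  # False
```

A `None` result corresponds to the case in the EDIT: the job has not yet reached the stage where an ETA is printed, so this check alone cannot decide.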
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Or, even better: search for the string "ETA:" in the log. I do not know how practical that is. (The term "ETA:" only shows up in the log at roughly 40 to 60% of the job's progress.) So, if the time remaining before the cutoff is less than the time the "sherpa" job has already been running, it needs to be terminated. (Too complicated, I guess.) |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 875,820 RAC: 659 |
"(Too complicated, i guess)" Except for sherpa, all generators are rather predictable. First line in the running slot log: ===> [runRivet] Wed Jul 20 11:14:36 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - herwig++powheg 2.5.2 LHC-UE-EE-3-7000 100000 366] The start time (11:14:36) and the number of events (100000) are the relevant fields. Let's say the 12 hours have elapsed at 12:00 CEST with 34000 events processed: it would take about 1½ hours to get this job finished. The sherpa jobs, however, are rather unpredictable. If the first line in the running slot log contains sherpa, like ===> [runRivet] Wed Jul 20 07:30:40 CEST 2016 [boinc pp jets 7000 20,-,460 - sherpa 1.4.1 default 48000 366] and you can't find "-> ETA:" in the log, the job is still in the initialization and/or optimizing phase. The job should be killed when the elapsed time is > 12 hours. For the last sherpa job I had, this pre-processing lasted a bit over 4 hours, but processing the 48,000 events in this job lasted only 20 minutes. What is wisdom here? Personally, I would prefer getting as many vLHC BOINC tasks as there are available cores and running single-core VMs. When someone has a lot of cores but is low on memory, he should use app_config.xml to reduce the number of the most RAM-demanding tasks. |
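The estimate for the predictable generators can be reproduced with a short Python sketch. It assumes the header line always has the layout of the two examples above (generator name as the seventh field inside the second pair of brackets, number of events as the second-to-last field) and that events are processed at a constant rate; both are assumptions drawn only from these examples. The function names are hypothetical.

```python
import re
from datetime import datetime

MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

HEADER_RE = re.compile(
    r"\[runRivet\]\s+\w{3}\s+(?P<mon>\w{3})\s+(?P<day>\d{1,2})\s+"
    r"(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2})\s+\w+\s+(?P<year>\d{4})\s+"
    r"\[(?P<params>[^\]]*)\]"
)


def parse_header(line):
    """Return (start_time, generator, total_events) from a runRivet header line."""
    m = HEADER_RE.search(line)
    if m is None:
        raise ValueError("not a runRivet header line")
    fields = m["params"].split()
    # The time zone (e.g. CEST) is ignored; times are treated as local.
    start = datetime(int(m["year"]), MONTHS[m["mon"]], int(m["day"]),
                     int(m["h"]), int(m["m"]), int(m["s"]))
    return start, fields[6], int(fields[-2])


def estimate_remaining_seconds(start, now, events_done, total_events):
    """Linear extrapolation: assumes a constant event rate since the start."""
    elapsed = (now - start).total_seconds()
    return elapsed * (total_events - events_done) / events_done


header = ("===> [runRivet] Wed Jul 20 11:14:36 CEST 2016 [boinc ppbar z 1960 "
          "-,-,50,120 - herwig++powheg 2.5.2 LHC-UE-EE-3-7000 100000 366]")
start, generator, total = parse_header(header)
remaining = estimate_remaining_seconds(start, datetime(2016, 7, 20, 12, 0, 0), 34000, total)
print(generator, total, round(remaining / 60), "minutes left")
```

For the example in the post this gives roughly 88 minutes, consistent with the "about 1½ hours" quoted. For sherpa jobs the constant-rate assumption breaks down during the initialization phase, which is why the post suggests a separate rule for them.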
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
"For me personally, I would prefer getting as many vLHC-BOINC tasks as available cores and run single core VM's." I agree. At least then I could assign more than one core to a task, which speeds it up significantly. Best utilization is 1.3 cores per job, or 3 jobs using 4 cores. I have had a non-sherpa task running for 7.5 h. If that happens near the 12 h mark, it would be aborted. In any case, some way to determine the run time of a job is advised. I wonder what percentage of BOINC tasks hit the 18 h mark. (Obviously, the more cores, the more likely it is to happen.) |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 875,820 RAC: 659 |
"I agree.At least then i could assign more then one core to a task, which speeds it up significantly." I'm afraid that no longer works with the new type of VM. When avg_ncpus is set higher than 1 and lower than 2, the VM will be created with 2 processors, and the VM software will now start 2 jobs in 2 slots instead of running 1 job on 2 processors using about 1.5 cores from the host. If you still want to do that, you have to use app_info.xml and fetch an older Theory.vdi. |
©2025 CERN