Message boards : Theory Application : New Multi-core version V1.9
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
We do consider this a problem. There are a few options:
1) Run short backfill jobs (~10 min).
2) Set a 90 min grace period for jobs to finish.
3) Use an algorithm for the grace period: stop when the idle time of the inactive slots exceeds the runtime of the active slots.
4) Try to define estimated runtimes for jobs to improve scheduling.
5) A combination of the above.
Doing 2) is easy, and adding 3) on top would be the best optimization.
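A minimal sketch of the decision rule in option 3, assuming a monitor inside the VM could collect per-slot idle and run times; the function and its inputs are hypothetical, not something the VM currently provides:

```python
# Option 3 sketch: end the whole task once the slots that have already run out
# of work have wasted more time than the still-active slots have accumulated.
# slot_idle_seconds / slot_run_seconds are hypothetical per-slot measurements.

def should_stop_task(slot_idle_seconds, slot_run_seconds):
    """True when the total idle time of inactive slots exceeds the total
    runtime of the slots still working on a job."""
    return sum(slot_idle_seconds) > sum(slot_run_seconds)

# Three slots idle for 2 h each while one slot has been running its job for 5 h:
print(should_stop_task([7200, 7200, 7200], [18000]))  # True  -> stop the task
# Two slots idle for 1 h each, one job running for 5 h:
print(should_stop_task([3600, 3600], [18000]))        # False -> keep waiting
```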
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 61
We do consider this a problem. There are a few options
This option doesn't work: BOINC allocates all host cores to the VM even when the VM isn't doing any work at all but is still in the running state. BOINC isn't aware that one or more cores inside the VM are idling and will thus not free any core for other jobs until the whole VM task has ended.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
...and will thus not free any core
The cores are already allocated to the task, so BOINC does not need to free any. EDIT: Therefore, utilizing these cores with short jobs makes sense.
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 61
...and will thus not free any core
I think Laurence meant jobs from other BOINC projects, not vLHC.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
I thought he meant jobs within the VM. Simple misunderstanding. Which one is it?
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
I thought he meant jobs within the VM. Simple misunderstanding. Which one is it?
EDIT: He said jobs, not tasks.
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
Yes, backfill jobs within the VM.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
For a start, I suggest changing the cutoff time from 18 to 15 hours. That should be simple to implement.
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 61
Here is an example of how an ultra-long job affects a task. I did not double-check, but the performance of this task was 61.96% overall; without the CPU seconds lost to idling cores it would have been 87.8%.
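Presumably those percentages come from a calculation along these lines; a sketch with made-up numbers, not the figures from the task above:

```python
# Illustration of how a task's efficiency figures could be derived
# (all numbers are made up, not taken from the task discussed above).
cores = 8
wall_seconds = 18 * 3600                  # how long the whole VM task ran
cpu_seconds_in_jobs = 320_000             # CPU time actually spent inside jobs
idle_core_seconds = 150_000               # core-seconds wasted by idle slots

total_core_seconds = cores * wall_seconds
overall = cpu_seconds_in_jobs / total_core_seconds
without_idle = cpu_seconds_in_jobs / (total_core_seconds - idle_core_seconds)
print(f"{overall:.1%} overall, {without_idle:.1%} without the idle time")
```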
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
You have to take into account that the slot 2 task did not finish, so that time has to be deducted. EDIT: If you were running a different project and 7 out of 8 cores were idling for 5 hours, would you not be upset?
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 61
You have to take into account that the slot 2 task did not finish, so that time has to be deducted.
Yeah, you're right. The last 6 hours of slot 2 were useless in the end. Nowadays there are other things to get upset about, but I agree: I too want to get as much as possible out of my machine (when it isn't resting). That's why, in earlier days, I ran T4T with 8 tasks even when the project gave out only 1 task at a time, simply by running 8 BOINC clients on the 8-threaded host. However, here we're testing other possibilities and trying to get them optimized.
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
Yes. I think that algorithm is the best approach, but I need to figure out how to do it. As you pointed out, you have to consider both the time lost by killing a running job and the time lost to idle cores.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Is there a way to identify these ultra-long jobs? I have noticed that all the ones I know of were sherpa jobs. If there is anything in the logs that could identify them, it could be used to terminate them. Or even better, such jobs should not be issued in the first place.
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 61
Just a quick scan over one task with 27 jobs, but as I already mentioned here, I have also had tasks with 51 and 55 jobs. The average run time of the 27 jobs was 6,765 seconds, the minimum was 501 seconds and the maximum was 17,464 seconds. In the past I have also seen jobs running for over 24 hours.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
There could be two criteria to identify an ultra-long job:
1. It needs to be a "sherpa" job.
2. Search the running-x.log for a line like
Event yyyyyy ( 21m 39s elapsed / 2h 16m 27s left ) -> ETA: Tue Jul 19 19:07
and determine whether that would fit into the remaining time (minus some safety margin).
EDIT: However, I have a sherpa running for 3 h now that has not even reached the stage where this would apply.
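A rough sketch of check 2, assuming the slot log contains progress lines in exactly the quoted format; the log file name, the regular expression and the 30 min safety margin are guesses, not anything the project documents:

```python
import re

# Parse a slot-log progress line like
#   "Event 100000 ( 21m 39s elapsed / 2h 16m 27s left ) -> ETA: Tue Jul 19 19:07"
# and decide whether the job still fits before the task's cutoff.
PROGRESS_RE = re.compile(
    r"\(\s*(?P<elapsed>[^/]+?)\s+elapsed\s*/\s*(?P<left>.+?)\s+left\s*\)")

def to_seconds(duration):
    """Turn a duration such as '2h 16m 27s' into seconds."""
    units = {"d": 86400, "h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)\s*([dhms])", duration))

def job_fits(log_text, task_seconds_left, safety_margin=1800):
    """True if the latest reported 'left' time, plus a safety margin, still
    fits into the time remaining before the task's cutoff. If no progress
    line exists yet, this check cannot decide (see the EDIT above)."""
    matches = PROGRESS_RE.findall(log_text)
    if not matches:
        return True
    _elapsed, left = matches[-1]
    return to_seconds(left) + safety_margin <= task_seconds_left

# The quoted line with 3 hours of task time left: 2h 16m 27s plus margin fits.
line = "Event 100000 ( 21m 39s elapsed / 2h 16m 27s left ) -> ETA: Tue Jul 19 19:07"
print(job_fits(line, task_seconds_left=3 * 3600))   # True
```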
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Or, even better: search for the string "ETA:" in the log. I do not know how practical that is (the term "ETA:" only shows up in the log at roughly 40 to 60% of the job's progress). So, if the remaining time is less than the time the "sherpa" job has already been running, it needs to be terminated. (Too complicated, I guess.)
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 61
(Too complicated, I guess.)
Except for sherpa, all generators are rather predictable. The first line in the running slot log looks like:
===> [runRivet] Wed Jul 20 11:14:36 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - herwig++powheg 2.5.2 LHC-UE-EE-3-7000 100000 366]
The relevant fields are the start time and the number of events. Let's say that at 12:00 CEST (about 45 minutes elapsed) 34,000 events have been processed: then it will take about another 1½ hours to finish this job.
The sherpas, however, are rather unpredictable. If the first line of the running slot log contains sherpa, like
===> [runRivet] Wed Jul 20 07:30:40 CEST 2016 [boinc pp jets 7000 20,-,460 - sherpa 1.4.1 default 48000 366]
and you can't find "-> ETA:" anywhere in the log, the job is still in its initialization and/or optimization phase. Such a job should be killed when its elapsed time exceeds 12 hours. For the last sherpa I had, this pre-processing lasted a bit over 4 hours, while processing the 48,000 events of the job took only 20 minutes. What is wisdom here?
For me personally, I would prefer getting as many vLHC BOINC tasks as there are available cores and running single-core VMs. Someone who has a lot of cores but is low on memory should use app_config.xml to reduce the number of the most RAM-demanding tasks.
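A sketch of that kill rule, assuming the first slot-log line always has the "===> [runRivet] ..." shape shown above; the field positions and the 12 h / 18 h cutoffs are taken from this thread, and none of this is an existing project script:

```python
import re
from datetime import datetime

# First line of a slot log, as quoted above:
#   ===> [runRivet] Wed Jul 20 07:30:40 CEST 2016 [boinc pp jets 7000 20,-,460 - sherpa 1.4.1 default 48000 366]
# Field positions are inferred from the two examples in this thread.
HEADER_RE = re.compile(r"===> \[runRivet\] (.+?) \[(.+)\]")

def parse_header(line):
    """Extract start time, generator name and requested event count."""
    when, args = HEADER_RE.match(line).groups()
    parts = when.split()                       # drop the time-zone token for parsing
    start = datetime.strptime(" ".join(parts[:4] + parts[5:]), "%a %b %d %H:%M:%S %Y")
    tokens = args.split()
    return start, tokens[-5], int(tokens[-2])  # start time, generator, n_events

def should_kill(generator, elapsed_hours, eta_seen, events_done, n_events, task_hours_left):
    """Kill rule sketched in this thread: a sherpa job that still shows no
    '-> ETA:' line is cut off after 12 h; for the predictable generators the
    remaining time is extrapolated from event progress and the job is killed
    only if that remainder no longer fits into the task."""
    if "sherpa" in generator:
        return (not eta_seen) and elapsed_hours > 12
    if events_done == 0:
        return False
    remaining_hours = elapsed_hours * (n_events - events_done) / events_done
    return remaining_hours > task_hours_left

# The herwig++ example above: 34,000 of 100,000 events after ~0.75 h,
# with 6 h of task time left -> roughly 1.5 h remaining, so keep it.
print(should_kill("herwig++powheg", 0.75, True, 34_000, 100_000, 6.0))   # False
```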
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
For me personally, I would prefer getting as many vLHC BOINC tasks as there are available cores and running single-core VMs.
I agree. At least then I could assign more than one core to a task, which speeds it up significantly. The best utilization is 1.3 cores per job, or 3 jobs using 4 cores. I have had a non-sherpa job running for 7.5 h; if that happened near the 12 h mark, it would be aborted. In any case, some way to determine the run time of a job is advisable. I wonder what percentage of BOINC tasks hits the 18 h mark (obviously, the more cores, the more likely it is to happen).
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 61
I agree. At least then I could assign more than one core to a task, which speeds it up significantly.
I'm afraid that no longer works with the new type of VM. When avg_ncpus is set higher than 1 and lower than 2, the VM will be created with 2 processors, and the VM software will now start 2 jobs in 2 slots instead of running 1 job on 2 processors using about 1.5 cores of the host. If you still want to do that, you have to use app_info.xml and fetch an older Theory.vdi.
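For reference, the kind of app_config.xml settings the last few posts refer to could look roughly like this; the element names follow the standard BOINC app_config.xml schema, but the app name and plan_class below are placeholders that have to match whatever the project actually uses:

```xml
<app_config>
  <app>
    <name>Theory</name>
    <!-- limit how many of the RAM-hungry tasks run at the same time -->
    <max_concurrent>2</max_concurrent>
  </app>
  <app_version>
    <app_name>Theory</app_name>
    <plan_class>vbox64_mt_mcore</plan_class>
    <!-- a value between 1 and 2 makes the new VM start with 2 processors
         and run 2 jobs in 2 slots, as described above -->
    <avg_ncpus>1.5</avg_ncpus>
  </app_version>
</app_config>
```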