Message boards : Theory Application : New Multi-core version V1.9
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 142
Message 3743 - Posted: 19 Jul 2016, 8:37:27 UTC - in response to Message 3742.  

We do consider this a problem. There are a few options:

1) Run short backfill jobs (~10 minutes).
2) Set a 90-minute grace period for jobs to finish.
3) Use an algorithm for the grace period: stop when the idle time of the inactive slots exceeds the runtime of the active slots.
4) Try to define an estimated runtime for jobs to improve scheduling.
5) A combination of the above.

Doing 2) is easy, and adding 3) on top of it would give the best optimization.
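A minimal sketch of the idea behind 3), assuming hypothetical per-slot timing inputs (a real implementation would have to pull these numbers from the VM's slot logs):

# Hypothetical sketch of option 3: end the grace period once the CPU time wasted
# by already-finished (idle) slots exceeds the runtime invested in the slots that
# are still working. The per-slot timing values are assumed inputs here.
def should_stop_vm(idle_seconds_per_inactive_slot, runtime_seconds_per_active_slot):
    wasted = sum(idle_seconds_per_inactive_slot)     # cores idling since their last job ended
    invested = sum(runtime_seconds_per_active_slot)  # work that would be lost by stopping now
    return wasted > invested

# Example: 3 slots idle for 2 hours each, 1 slot still running its job after 5 hours.
print(should_stop_vm([7200, 7200, 7200], [18000]))   # True -> stop the VM now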
ID: 3743
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 848,858
RAC: 1,746
Message 3744 - Posted: 19 Jul 2016, 9:51:33 UTC - in response to Message 3743.  
Last modified: 19 Jul 2016, 10:11:14 UTC

We do consider this a problem. There are a few options:

1) Run short backfill jobs (~10 minutes).

This option doesn't work, because BOINC keeps all of the host cores allocated to the VM even when the VM is doing no work at all but is still in the running state.
BOINC isn't aware that one or more cores inside the VM are idling and will thus not free any core for other jobs until the whole VM task has ended.
ID: 3744
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3745 - Posted: 19 Jul 2016, 10:20:23 UTC - in response to Message 3744.  
Last modified: 19 Jul 2016, 10:29:02 UTC

...and will thus not free any core


The cores are already allocated to the task. BOINC does not need to free any.

EDIT: Therefore, utilizing them with short jobs makes sense.
ID: 3745
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 848,858
RAC: 1,746
Message 3746 - Posted: 19 Jul 2016, 10:42:50 UTC - in response to Message 3745.  

...and will thus not free any core


The cores are already allocated to the task. BOINC does not need to free any.

EDIT: Therefore, utilizing them with short jobs makes sense.

I think Laurence meant jobs from other BOINC projects, not vLHC.
ID: 3746
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3747 - Posted: 19 Jul 2016, 10:47:15 UTC - in response to Message 3746.  

I thought he meant jobs within the VM.
Simple misunderstanding.
Which one is it?
ID: 3747
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3748 - Posted: 19 Jul 2016, 10:49:26 UTC - in response to Message 3746.  

I thought he meant jobs within the VM.
Simple misunderstanding.
Which one is it?

EDIT: He said jobs, not tasks.
ID: 3748
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 142
Message 3749 - Posted: 19 Jul 2016, 10:52:15 UTC - in response to Message 3748.  

Yes, backfill jobs within the VM.
ID: 3749
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3750 - Posted: 19 Jul 2016, 10:59:23 UTC

For a start, I suggest changing the cutoff time from 18 to 15 hours.
It should be simple to implement.
ID: 3750
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 848,858
RAC: 1,746
Message 3751 - Posted: 19 Jul 2016, 11:09:43 UTC - in response to Message 3742.  

Here is an example of how an ultra-long job affects a task.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=222063

I did not double-check, but the overall CPU efficiency of this task was 61.96%; without the CPU seconds lost to idling cores it would have been 87.8%.
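For reference, a sketch of how such a figure can be computed, assuming it is simply total CPU time divided by wall-clock time times the number of cores the task reserved (not necessarily how the numbers above were obtained):

# Rough CPU efficiency of a multi-core VM task: CPU seconds actually consumed
# divided by the CPU seconds that were reserved (elapsed wall clock x cores).
def cpu_efficiency(cpu_seconds, elapsed_seconds, ncpus):
    return cpu_seconds / (elapsed_seconds * ncpus)

# Hypothetical 8-core task: 18 h elapsed, ~89 h of CPU used -> ~0.62 efficiency.
print(round(cpu_efficiency(89 * 3600, 18 * 3600, 8), 4))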
ID: 3751
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3752 - Posted: 19 Jul 2016, 11:13:24 UTC - in response to Message 3751.  
Last modified: 19 Jul 2016, 11:25:55 UTC

You have to take into account that the slot 2 task did not finish, so that time has to be deducted.

EDIT: If you were running a different project and 7 out of 8 cores were idling for 5 hours, would you not be upset?
ID: 3752
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 848,858
RAC: 1,746
Message 3753 - Posted: 19 Jul 2016, 11:39:24 UTC - in response to Message 3752.  

You have to take into account that the slot 2 task did not finish, so that time has to be deducted.

EDIT: If you were running a different project and 7 out of 8 cores were idling for 5 hours, would you not be upset?

Yeah, you're right. The last 6 hours of slot 2 were useless in the end.

Nowadays there are other problems to get upset about, but I agree; I too want to get as much as possible out of my machine (when it's not resting).
That's why, in earlier days, I was running T4T with 8 tasks even though the project was giving out only 1 task at a time,
simply by running 8 BOINC clients on the 8-threaded host.

However, here we're testing other possibilities and trying to get them optimized.
ID: 3753
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 142
Message 3754 - Posted: 19 Jul 2016, 11:48:58 UTC - in response to Message 3752.  

Yes. I think that algorithm is the best approach, but I need to figure out how to do it.

As you pointed out, you have to consider both the time lost by killing a running job and the time lost with idle cores.
ID: 3754
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3755 - Posted: 19 Jul 2016, 12:42:13 UTC

Is there a way to identify these ultra-long tasks?

I have noticed that all the ones I know of were sherpa tasks.
If there is anything in the logs that could identify them, it could be used to terminate them.
Or, even better, such tasks could be prevented from being issued in the first place.
ID: 3755
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 848,858
RAC: 1,746
Message 3756 - Posted: 19 Jul 2016, 12:44:19 UTC - in response to Message 3754.  

Just a quick scan of 1 task with 27 jobs, but as I already mentioned here, I also had tasks with 51 and 55 jobs.

The average run time of the 27 jobs was 6,765 seconds; the minimum was 501 seconds and the maximum was 17,464 seconds.

In the past I also have seen jobs running over 24 hours.
ID: 3756
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3757 - Posted: 19 Jul 2016, 15:02:46 UTC
Last modified: 19 Jul 2016, 15:14:18 UTC

There could be two criteria to identify an ultra-long job:

1. It needs to be a "sherpa" job.

2. Search the running-x.log for

Event yyyyyy ( 21m 39s elapsed / 2h 16m 27s left ) -> ETA: Tue Jul 19 19:07

and determine whether that would fit into the remaining time (minus some safety margin).

EDIT: However, I have a sherpa that has been running for 3 hours now and has not even reached the stage where this would apply.
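A rough sketch of criterion 2, assuming the progress lines in running-x.log always follow the pattern shown above (they may well differ between generator versions):

import re

# Progress lines are assumed to look like:
#   Event 42100 ( 21m 39s elapsed / 2h 16m 27s left ) -> ETA: Tue Jul 19 19:07
# Parse the "left" part and check whether the job would fit into the task's
# remaining time, minus a safety margin.
TIME_LEFT = re.compile(r"/\s*(?:(\d+)h\s*)?(?:(\d+)m\s*)?(?:(\d+)s)?\s*left\s*\)\s*->\s*ETA:")

def seconds_left(log_line):
    m = TIME_LEFT.search(log_line)
    if not m:
        return None                                   # no ETA reported yet
    h, mins, s = (int(x) if x else 0 for x in m.groups())
    return h * 3600 + mins * 60 + s

def job_fits(log_line, task_seconds_remaining, safety_margin=1800):
    left = seconds_left(log_line)
    return left is not None and left + safety_margin < task_seconds_remaining

line = "Event 42100 ( 21m 39s elapsed / 2h 16m 27s left ) -> ETA: Tue Jul 19 19:07"
print(job_fits(line, 2 * 3600))                       # False: ~2h16m does not fit into the remaining 2h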
ID: 3757
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3758 - Posted: 19 Jul 2016, 16:26:05 UTC

Or, even better:
Search for the string "ETA:" in the log.

I do not know how practical that is.

(The term "ETA:" only shows up in the log at roughly 40 to 60% of the job's progress.)

So, if the remaining time is less than the time the "sherpa" job has already been running, it needs to be terminated.

(Too complicated, I guess.)
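That fallback rule as a sketch (both values would have to be read from the task and the slot log; the names are only illustrative):

# Kill a sherpa job once its own elapsed time already exceeds what is left of
# the task's window, since it is then very unlikely to finish in time.
def should_kill_sherpa(job_elapsed_seconds, task_seconds_remaining):
    return job_elapsed_seconds > task_seconds_remaining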
ID: 3758
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 848,858
RAC: 1,746
Message 3760 - Posted: 20 Jul 2016, 11:51:11 UTC - in response to Message 3758.  

(Too complicated, I guess.)

Except for sherpa, all generators are rather predictable.

The first line in the running slot log:
===> [runRivet] Wed Jul 20 11:14:36 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - herwig++powheg 2.5.2 LHC-UE-EE-3-7000 100000 366]

The relevant fields are the start time (Wed Jul 20 11:14:36) and the number of events (100000).

Let's say the task's 12 hours have elapsed at 12:00 CEST and this job has processed 34,000 events so far: it needs about 1½ more hours to finish.
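A sketch of that extrapolation, using the numbers from the example (start time and event total taken from the runRivet line; the events-processed count is assumed to be readable from the same log):

from datetime import datetime, timedelta

# Extrapolate the remaining time of a "predictable" (non-sherpa) job from its
# start time, total event count and the number of events processed so far.
def remaining_time(start, now, events_done, events_total):
    rate = events_done / (now - start).total_seconds()        # events per second so far
    return timedelta(seconds=(events_total - events_done) / rate)

start = datetime(2016, 7, 20, 11, 14, 36)                      # from the runRivet line
now = datetime(2016, 7, 20, 12, 0, 0)                          # 12:00 CEST
print(remaining_time(start, now, 34000, 100000))               # ~1:28, i.e. about 1.5 hours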


The sherpas, however, are rather unpredictable.

If the first line in the running slot log contains sherpa, like
===> [runRivet] Wed Jul 20 07:30:40 CEST 2016 [boinc pp jets 7000 20,-,460 - sherpa 1.4.1 default 48000 366]

and you can't find "-> ETA:" in the log,

the job is still in its initialization and/or optimization phase.
The job should be killed when the elapsed time exceeds 12 hours.
For the last sherpa I had, this pre-processing lasted a bit over 4 hours, but processing the 48,000 events in this job took only 20 minutes. What is the wise choice here?
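A sketch of that rule, assuming the slot log text and the elapsed time are available to whatever watchdog would enforce it:

# A sherpa job that has not produced a "-> ETA:" line yet is still in its
# initialization/optimization phase; kill it once it has run past the cutoff.
def should_kill_initialising_sherpa(slot_log_text, elapsed_seconds, cutoff_hours=12):
    is_sherpa = "sherpa" in slot_log_text
    no_eta_yet = "-> ETA:" not in slot_log_text
    return is_sherpa and no_eta_yet and elapsed_seconds > cutoff_hours * 3600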

Personally, I would prefer to get as many vLHC BOINC tasks as there are available cores and run single-core VMs.
Someone who has a lot of cores but is low on memory should use app_config.xml to reduce the number of the most RAM-demanding tasks.
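For the memory-limited case, an app_config.xml along these lines caps how many of these tasks run at once (the app name "Theory" is an assumption here; the real short name can be checked in client_state.xml, and the file goes into the project's folder inside the BOINC data directory):

<!-- "Theory" is assumed to be the app's short name; check client_state.xml to confirm. -->
<app_config>
  <app>
    <name>Theory</name>
    <max_concurrent>2</max_concurrent>
  </app>
</app_config>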
ID: 3760
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3761 - Posted: 20 Jul 2016, 12:10:14 UTC - in response to Message 3760.  

Personally, I would prefer to get as many vLHC BOINC tasks as there are available cores and run single-core VMs.
Someone who has a lot of cores but is low on memory should use app_config.xml to reduce the number of the most RAM-demanding tasks.


I agree. At least then I could assign more than one core to a task, which speeds it up significantly.
The best utilization is 1.3 cores per job, i.e. 3 jobs using 4 cores.

I have had a non-sherpa task running for 7.5 hours. If that happens near the 12-hour mark, it would be aborted.

In any case, some way to determine the run time of a job is advisable.

I wonder what percentage of BOINC tasks hits the 18-hour mark.

(Obviously, the more cores, the more likely it is to happen.)
ID: 3761
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 848,858
RAC: 1,746
Message 3765 - Posted: 21 Jul 2016, 12:39:01 UTC - in response to Message 3761.  

I agree. At least then I could assign more than one core to a task, which speeds it up significantly.
The best utilization is 1.3 cores per job, i.e. 3 jobs using 4 cores.

I'm afraid that no longer works with the new type of VM.
When avg_ncpus is set higher than 1 and lower than 2, the VM will be created with 2 processors, and the VM software
will now start 2 jobs in 2 slots instead of running 1 job on 2 processors using about 1.5 cores of the host.

If you still want to do that, you have to use app_info.xml and fetch an older Theory.vdi.
ID: 3765