Message boards : Theory Application : Native Theory Application in Production

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 249
Message 6233 - Posted: 18 Mar 2019, 11:17:12 UTC

Today the Native Theory application has been added to the production project.

I would like to take this opportunity to thank everyone who has been involved in the testing. Without your efforts it would be difficult to catch all the issues that occur in the different environments. It means that the production application is much more robust than it would be otherwise.

The application is currently in beta (test) until we can confirm that it is working for everyone, so please could you shift your focus to the production project.

Thanks again for all your effort and feedback!

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 249
Message 6248 - Posted: 25 Mar 2019, 11:31:57 UTC - in response to Message 6233.  
Last modified: 25 Mar 2019, 11:32:52 UTC

Once again, I would like to say a big thank you to all those who have been involved in testing this application. From my perspective the release to production went very well. So far it has not been necessary to update the version that we originally deployed, which means that we managed to catch the issues during testing. I would also like to thank those who have used their experience from the dev project to help volunteers in the production project with their problems and questions. Before we move the application out of beta, I would like to address some server-side scheduling issues, to restrict the tasks given out to hosts that fail tasks because CVMFS is not available. In the meantime, please let me know if anything needs to be improved in the sticky or the CVMFS documentation. Also, if there is any other feedback, please let me know; don't be shy.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 6249 - Posted: 25 Mar 2019, 12:09:23 UTC - in response to Message 6248.  

Hi Laurence,

You may have noticed that, since the introduction of native Theory at the production site, a problem has been (re)introduced when requesting tasks with the max # of jobs and max # of CPUs settings. I just tested it again, and as an example, with

Max # of jobs - no limit
Max # of CPUs - 1

I don't get new tasks when I have 2 tasks running and an idle core.

Thereafter I tried

Max # of jobs - no limit
Max # of CPUs - 2

and I got 2 tasks, but not more, where I expected at least 2 x 3 (available cores) = 6.

So somehow the Max # of CPUs setting is influencing the number of jobs, which it should not, because in principle we run native Theory as single-core tasks.
With a higher # of CPUs, BOINC reserves cores for nothing. The user can create a workaround with app_config.xml, but app_config.xml should not be used to work around bugs/failures.
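
For readers unfamiliar with the workaround mentioned above, here is a minimal sketch of what such an app_config.xml could look like. The post does not say which settings the workaround uses, so the choice of `max_concurrent` and `avg_ncpus` is an assumption, and the application name is a hypothetical placeholder that has to be replaced with the real one. Depending on the app version, a `<plan_class>` element may also be needed inside `<app_version>`.

```python
import xml.etree.ElementTree as ET

# Hypothetical placeholder: the real application name must be looked up,
# e.g. in the project's application list or in client_state.xml.
APP_NAME = "native_theory_app_name"


def build_app_config(app_name: str, max_concurrent: int, avg_ncpus: float = 1.0) -> str:
    """Return app_config.xml text that caps the number of concurrent tasks
    and budgets avg_ncpus cores per task for one application."""
    root = ET.Element("app_config")

    # <app>: limit how many tasks of this application run at once.
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)

    # <app_version>: tell the client how many cores to budget per task.
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "avg_ncpus").text = str(avg_ncpus)

    if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


print(build_app_config(APP_NAME, max_concurrent=3))
```

The printed text would be saved as app_config.xml in the project's directory inside the BOINC data directory and picked up after asking the client to re-read its config files. As the post says, this only hides the scheduling problem on the client side; it does not fix it on the server.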

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 6251 - Posted: 25 Mar 2019, 13:02:28 UTC

Some comments

As CMS is also back at the production server I tried to get both (CMS and Theory native) on the same client.
Unfortunately this client only gets Theory native tasks from the server.
Hence it needs manual intervention to get CMS tasks.


The credit the server awards for Theory native is rather poor.
It seems to be only 10% of normal credit, and it doesn't look like it normalizes over time.
It should be clarified why.

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 249
Message 6318 - Posted: 2 May 2019, 14:29:55 UTC - in response to Message 6248.  

The native theory app seems to be running quite nicely in production. Is there anything that needs to be addressed before we take it out of beta?

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 6319 - Posted: 2 May 2019, 16:25:12 UTC - in response to Message 6318.  

... Is there anything that needs to be addressed before we take it out of beta?

Yes.

1. Task deadlines should be extended to give longrunners a better chance to finish.

2. A method to write the job parameters to stderr.txt early should be included.
A possible solution can be found here:
https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=467

3. Credits given for Theory native are still very poor, especially compared to Theory vbox.
This may result in a bad acceptance.

m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 875,205
RAC: 450
Message 6322 - Posted: 3 May 2019, 9:15:09 UTC - in response to Message 6251.  
Last modified: 3 May 2019, 9:19:02 UTC

Some comments

As CMS is also back at the production server I tried to get both (CMS and Theory native) on the same client.
Unfortunately this client only gets Theory native tasks from the server.
Hence it needs manual intervention to get CMS tasks.


The credit the server awards for Theory native is rather poor.
It seems to be only 10% of normal credit, and it doesn't look like it normalizes over time.
It should be clarified why.


The same applies to Theory native and ATLAS (this is from the production server). The server will only send ATLAS if it can't send Theory, either because of preference settings or because there are no Theory tasks available (as happened a few weeks ago).
This is a "VBox-free" host where there have been >50 Theory tasks in a row without the server sending an ATLAS task. I thought that the server might be trying to equalise the credit between the sub-projects, but this is greater than the difference in credit, unless it's trying to make up the backlog, which, for me, will take a long time unless the credit for Theory increases considerably.

Also,
The need to run hosts continuously is a problem for me and, I expect, for many others as well, especially if the project wants to widen its potential pool of volunteers. I know that with a bit of babysitting and some clever control scripts this can be overcome to some extent, but we're getting a long way from the original ideas of BOINC.

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 249
Message 6325 - Posted: 3 May 2019, 12:04:28 UTC - in response to Message 6319.  

... Is there anything that needs to be addressed before we take it out of beta?

Yes.

1. Task deadlines should be extended to give longrunners a better chance to finish.

Do you know what parameter needs to be set? Is it rsc_fpops_est?

2. A method to write the job parameters to stderr.txt early should be included.
A possible solution can be found here:
https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=467


I am looking at this and will respond in that thread.

3. Credits given for Theory native are still very poor, especially compared to Theory vbox.
This may result in a bad acceptance.

I think this changes with rsc_fpops_est.

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 249
Message 6326 - Posted: 3 May 2019, 12:08:20 UTC - in response to Message 6322.  



The same applies to Theory native and ATLAS (this is from the production server). The server will only send ATLAS if it can't send Theory, either because of preference settings or because there are no Theory tasks available (as happened a few weeks ago).
This is a "VBox-free" host where there have been >50 Theory tasks in a row without the server sending an ATLAS task. I thought that the server might be trying to equalise the credit between the sub-projects, but this is greater than the difference in credit, unless it's trying to make up the backlog, which, for me, will take a long time unless the credit for Theory increases considerably.


The scheduling is a bit of a black box.


Also,
The need to run hosts continuously is a problem for me and, I expect, for many others as well, especially if the project wants to widen its potential pool of volunteers. I know that with a bit of babysitting and some clever control scripts this can be overcome to some extent, but we're getting a long way from the original ideas of BOINC.


It is possible to suspend the Theory tasks but not the ATLAS tasks.

marmot
Joined: 29 Apr 19
Posts: 13
Credit: 109,352
RAC: 0
Message 6330 - Posted: 3 May 2019, 12:36:15 UTC - in response to Message 6322.  
Last modified: 3 May 2019, 12:36:27 UTC

I know that with a bit of babysitting and some clever control scripts this can be overcome to some extent, but we're getting a long way from the original ideas of BOINC.


Glad to see I'm not alone in that thought.

Here are my comments in another thread: https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=465&postid=6324#6324

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 6332 - Posted: 3 May 2019, 13:21:02 UTC - in response to Message 6325.  

1. Task deadlines should be extended to give longrunners a better chance to finish.

Do you know what parameter needs to be set? Is it rsc_fpops_est?

The server parameter that sets the client's "report_deadline".

I'm not familiar with the server templates, but I doubt it's "rsc_fpops_est".
My guess is that it could be "delay_bound".

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 249
Message 6335 - Posted: 3 May 2019, 14:13:24 UTC - in response to Message 6332.  

1. Task deadlines should be extended to give longrunners a better chance to finish.

Do you know what parameter needs to be set? Is it rsc_fpops_est?

The server parameter that sets the client's "report_deadline".

I'm not familiar with the server templates, but I doubt it's "rsc_fpops_est".
My guess is that it could be "delay_bound".


delay_bound is currently set to 200000 seconds, which is 55.6 hours. What would you suggest setting it to?

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 6336 - Posted: 3 May 2019, 14:41:36 UTC - in response to Message 6335.  

delay_bound is currently set to 200000 which is 55.6 hours. What would you suggest setting it to?

The longest-running successful job I could find was 9.85 days, but out of 1,584.735 jobs only 3 were running longer than a week.

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 6337 - Posted: 3 May 2019, 14:53:16 UTC - in response to Message 6335.  
Last modified: 3 May 2019, 14:54:21 UTC

delay_bound for vbox tasks seems to be 2592000 (30 days).
If nobody disagrees, I would suggest setting it to 432000 (5 days) or 600000 (a bit less than a week).
Are there other limits we are not aware of in this thread (see: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4028&postid=38669)?

<edit>
May be obsolete due to CP's post.
</edit>
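
To make the numbers in this exchange easier to compare, here is a small sketch that converts the delay_bound values mentioned in the thread from seconds to hours and days, and checks each one against the 9.85-day job from CP's post. It assumes the deadline simply has to be at least as long as the job's run time, which ignores queueing and any time the host is switched off. The function names are made up for illustration.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86400

# delay_bound values (in seconds) mentioned in this thread.
CANDIDATES = {
    "native, current": 200000,
    "suggested (5 days)": 432000,
    "suggested (< 1 week)": 600000,
    "vbox": 2592000,
}

# Longest successful job reported in the thread, in days.
LONGEST_JOB_DAYS = 9.85


def to_hours_and_days(delay_bound_s: int) -> tuple:
    """Convert a delay_bound in seconds to (hours, days)."""
    return delay_bound_s / SECONDS_PER_HOUR, delay_bound_s / SECONDS_PER_DAY


def covers(delay_bound_s: int, runtime_days: float) -> bool:
    """True if the deadline is at least as long as the given run time."""
    return delay_bound_s >= runtime_days * SECONDS_PER_DAY


for label, seconds in CANDIDATES.items():
    hours, days = to_hours_and_days(seconds)
    verdict = "covers" if covers(seconds, LONGEST_JOB_DAYS) else "does not cover"
    print(f"{label}: {seconds} s = {hours:.1f} h = {days:.2f} d, "
          f"{verdict} a {LONGEST_JOB_DAYS}-day job")
```

Of the values discussed so far, only the 30-day vbox setting would cover the longest job; whether it is worth sizing the deadline for such rare outliers is a separate question.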

m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 875,205
RAC: 450
Message 6339 - Posted: 3 May 2019, 15:41:35 UTC - in response to Message 6326.  


Also,
The need to run hosts continuously is a problem for me and, I expect, for many others as well, especially if the project wants to widen its potential pool of volunteers. I know that with a bit of babysitting and some clever control scripts this can be overcome to some extent, but we're getting a long way from the original ideas of BOINC.


It is possible to suspend the Theory tasks but not the ATLAS tasks.

Hosts here shut down (aka switch off) during the day and run overnight (cheaper electricity, unmetered internet and I can use the heat).
There must be many (potential) volunteers who want or need to shut their computers down overnight, over the weekend or whatever without a lot of manual attention.

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 6340 - Posted: 3 May 2019, 15:59:00 UTC - in response to Message 6318.  

... Is there anything that needs to be addressed ...

Tried to integrate Theory native into my existing cgroups hierarchy but so far without full success.
Hence I either run it in my own (sub)cgroup to get a balanced scheduling among all projects or I run it in the project's cgroup to get suspend/resume running.

Needs more knowledge/testing.

marmot
Joined: 29 Apr 19
Posts: 13
Credit: 109,352
RAC: 0
Message 6343 - Posted: 3 May 2019, 17:13:01 UTC - in response to Message 6339.  


There must be many (potential) volunteers who want or need to shut their computers down overnight, over the weekend or whatever without a lot of manual attention.

From June 1 to September 30, electricity here will have peak-hour rates from 10am to 7pm that are 9x higher than off-peak, so all the BOINC Managers will be set to suspend all work during those hours. The electric company is pushing to have those new meters be default installs in all homes.

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 6346 - Posted: 3 May 2019, 19:54:41 UTC - in response to Message 6343.  


There must be many (potential) volunteers who want or need to shut their computers down overnight, over the weekend or whatever without a lot of manual attention.

From June 1 to September 30, electricity here will have peak-hour rates from 10am to 7pm that are 9x higher than off-peak, so all the BOINC Managers will be set to suspend all work during those hours. The electric company is pushing to have those new meters be default installs in all homes.

Where's the problem?
There are already 2 possible solutions:

1. Run a LHC vbox app
2. Set up a self defined VM and inside this VM run a native app.


Both can be paused during the expensive periods by BOINC or a timer of your choice.
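
As an illustration of the "timer of your choice" option, here is a minimal sketch of the decision logic, using the peak window quoted above (June 1 to September 30, 10am to 7pm). It assumes the peak rate applies on every day of the week, that 10am is inside the window and 7pm is outside it, and that `boinccmd` is on the PATH. The function names are made up for illustration, and the sketch only builds the command; it does not run it.

```python
from datetime import datetime, time

# Peak window taken from the post above; the inclusive/exclusive
# boundaries and "every day of the week" are assumptions.
PEAK_START = time(10, 0)
PEAK_END = time(19, 0)
SEASON_START = (6, 1)   # (month, day)
SEASON_END = (9, 30)    # (month, day)


def is_peak(now: datetime) -> bool:
    """True if 'now' falls inside the expensive tariff window."""
    in_season = SEASON_START <= (now.month, now.day) <= SEASON_END
    in_hours = PEAK_START <= now.time() < PEAK_END
    return in_season and in_hours


def run_mode_command(now: datetime) -> list:
    """Build the boinccmd call that suspends computing during peak hours
    and hands control back to the client's preferences otherwise."""
    mode = "never" if is_peak(now) else "auto"
    return ["boinccmd", "--set_run_mode", mode]


print(run_mode_command(datetime(2019, 7, 15, 12, 0)))
print(run_mode_command(datetime(2019, 7, 15, 20, 0)))
```

A cron job or systemd timer could call `run_mode_command(datetime.now())` every few minutes and pass the result to `subprocess.run`. Whether a given task survives being suspended still depends on the application, as discussed earlier in the thread.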

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 249
Message 6347 - Posted: 3 May 2019, 21:51:19 UTC - in response to Message 6340.  


Tried to integrate Theory native into my existing cgroups hierarchy but so far without full success.
Hence I either run it in my own (sub)cgroup to get a balanced scheduling among all projects or I run it in the project's cgroup to get suspend/resume running.

Needs more knowledge/testing.


From you or us?

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 249
Message 6348 - Posted: 3 May 2019, 21:53:13 UTC - in response to Message 6336.  

delay_bound is currently set to 200000 which is 55.6 hours. What would you suggest setting it to?
The longest running successful job I could find was 9.85 days, but out of 1,584.735 jobs only 3 were running longer than a week.


That is surprising. Were those jobs fine otherwise?