Native Theory Application in Production
Author | Message |
---|---|
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Today the Native Theory application has been added to the production project. I would like to take this opportunity to thank everyone who has been involved in the testing. Without your efforts it would be difficult to catch all the issues that occur in the different environments. It means that the production application is much more robust than it would be otherwise. The application is currently in beta (test) until we can confirm that it is working for everyone, so please could you shift your focus to the production project. Thanks again for all your effort and feedback! |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Once again I would like to say a big thank you to all those who have been involved in testing this application. From my perspective the release to production went very well. So far it has not been necessary to update the version that we originally deployed which means that we managed to catch the issues during testing. Also I would like to thank those who have used their experience from the dev project to help volunteers in the production project with their problems and questions. Before we move the application out of beta, I would like to address some server-side scheduling issues to restrict the tasks given out to hosts that fail tasks due to CVMFS not being available. In the meantime, please let me know if anything needs to be improved in the sticky or CVMFS documentation. Also, if there is any other feedback, please let me know, don't be shy. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9 |
Hi Laurence, You may have noticed that, since the introduction of the native Theory at the production site, a problem has been (re)introduced with requesting tasks using the "Max # of jobs" and "Max # of CPUs" settings. I just tested it again. As an example, with "Max # of jobs: no limit" and "Max # of CPUs: 1" I don't get new tasks when I have 2 tasks running and an idle core. Thereafter I tried "Max # of jobs: no limit" and "Max # of CPUs: 2" and I got 2 tasks, but not more, where I expected at least 2 x 3 (available cores) = 6. So somehow the Max # of CPUs is influencing the number of jobs, where it should not, because in principle we run the native Theory as single-core tasks. With a higher # of CPUs, BOINC reserves cores for nothing. The user can create a workaround with app_config.xml, but app_config.xml should not be used to work around bugs/failures. |
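For readers who want to see what the app_config.xml workaround mentioned above might look like, here is a minimal sketch that generates such a file. The element names (`app`, `max_concurrent`, `app_version`, `avg_ncpus`) are standard BOINC client configuration elements, but the app name `TheoryN`, the plan class `native_theory` and the limit of 3 are assumptions made for illustration only; check client_state.xml on your own host for the real values. Whether this fully avoids the behaviour described above is not confirmed in this thread.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, plan_class: str, max_concurrent: int) -> str:
    """Return the text of an app_config.xml that pins one app to single-core tasks.

    app_name and plan_class are placeholders supplied by the caller; they are
    NOT taken from the project and must match what the client actually reports.
    """
    root = ET.Element("app_config")

    # Limit how many tasks of this app run at the same time.
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)

    # Tell the client to budget exactly one CPU per task of this app version.
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = "1"

    ET.indent(root)  # pretty-print; requires Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical values: 3 available cores, app and plan class names assumed.
    print(build_app_config("TheoryN", "native_theory", 3))
```

The output would be saved as app_config.xml in the project's directory and picked up after the client re-reads its config files. As noted in the post, this is a workaround and not a fix for the server-side behaviour.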
Send message Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 0 |
Some comments: As CMS is also back at the production server, I tried to get both (CMS and Theory native) on the same client. Unfortunately this client only gets Theory native tasks from the server, so it needs manual intervention to get CMS tasks. The credit the server awards for Theory native is rather poor. It seems to be only 10% of normal credit and it does not appear to normalize over time. It should be clarified why. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
The native theory app seems to be running quite nicely in production. Is there anything that needs to be addressed before we take it out of beta? |
Send message Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 0 |
... Is there anything that needs to be addressed before we take it out of beta? Yes. 1. Task deadlines should be extended to give longrunners a better chance to finish. 2. A method to write the job parameters to stderr.txt early should be included. A possible solution can be found here: https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=467 3. Credits given for Theory native are still very poor, especially compared to Theory vbox. This may result in poor acceptance by volunteers. |
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
Some comments The same applies to Theory native and Atlas (this is from the production server). The server will only send Atlas if it can't send Theory, either because of preference settings or because there are no Theory tasks available (as happened a few weeks ago). This is a "VBox free" host where there have been >50 Theory tasks in a row without the server sending an Atlas task. I thought that the server might be trying to equalise the credit between the sub-projects, but this is greater than the difference in credit, unless it's trying to make up the backlog which, for me, will take a long time unless the credit for Theory increases considerably. Also, the need to run hosts continuously is a problem for me and, I expect, for many others as well, especially if the project wants to widen its potential pool of volunteers. I know that with a bit of babysitting and some clever control scripts this can be overcome to some extent, but we're getting a long way from the original ideas of BOINC. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
... Is there anything that needs to be addressed before we take it out of beta? Do you know what parameter needs to be set? Is it rsc_fpops_est?
I am looking at this and will respond in that thread.
I think this changes with rsc_fpops_est. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
The scheduling is a bit of a black box.
It is possible to suspend the Theory tasks but not the ATLAS tasks. |
Send message Joined: 29 Apr 19 Posts: 13 Credit: 109,352 RAC: 0 |
I know that with a bit of babysitting and some clever control scripts this can be overcome to some extent, but we're getting a long way from the original ideas of BOINC. Glad to see I'm not alone in that thought. Here are my comments in another thread: https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=465&postid=6324#6324 |
Send message Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 0 |
1. Task deadlines should be extended to give longrunners a better chance to finish. It is the server parameter that sets the client's "report_deadline". I'm not familiar with the server templates but I doubt it's "rsc_fpops_est". My guess is that it could be "delay_bound". |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
1. Task deadlines should be extended to give longrunners a better chance to finish. delay_bound is currently set to 200000 which is 55.6 hours. What would you suggest setting it to? |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 9 |
delay_bound is currently set to 200000 which is 55.6 hours. What would you suggest setting it to? The longest running successful job I could find was 9.85 days, but out of 1,584.735 jobs only 3 were running longer than a week. |
Send message Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 0 |
delay_bound for vbox tasks seems to be 2592000 (30 days). If nobody disagrees, I would suggest setting it to 432000 (5 days) or 600000 (a bit less than a week). Are there other limits we are not aware of in this thread (see: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4028&postid=38669)? <edit> May be obsolete due to CP's post. </edit> |
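To make the delay_bound figures quoted in this thread easier to compare, here is a small sketch that converts each value from seconds to hours and days and checks it against the 9.85-day job mentioned above. The helper names are made up for this example, and the check is a simplification: it assumes the task starts as soon as it is received and ignores any time spent waiting in the client's queue.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86400


def describe_delay_bound(seconds: int) -> str:
    """Format a delay_bound value (given in seconds) as hours and days."""
    hours = seconds / SECONDS_PER_HOUR
    days = seconds / SECONDS_PER_DAY
    return f"{seconds} s = {hours:.1f} h = {days:.2f} d"


def covers(seconds: int, runtime_days: float) -> bool:
    """True if a job of the given runtime fits inside the deadline.

    Simplification: assumes the job runs without interruption from the
    moment the task is sent, which real hosts will rarely achieve.
    """
    return runtime_days * SECONDS_PER_DAY <= seconds


if __name__ == "__main__":
    # Values taken from the posts above: current, two proposals, vbox setting.
    for value in (200000, 432000, 600000, 2592000):
        fits = covers(value, 9.85)
        print(f"{describe_delay_bound(value)} | fits a 9.85-day job: {fits}")
```

Of the four values, only the 30-day vbox setting would accommodate the longest job reported here; the two proposed values would cover jobs of up to 5 and roughly 6.9 days respectively.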
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
Hosts here shut down (aka switch off) during the day and run overnight (cheaper electricity, unmetered internet and I can use the heat). There must be many (potential) volunteers who want or need to shut their computers down overnight, over the weekend or whatever without a lot of manual attention. |
Send message Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 0 |
... Is there anything that needs to be addressed ... Tried to integrate Theory native into my existing cgroups hierarchy but so far without full success. Hence I either run it in my own (sub)cgroup to get a balanced scheduling among all projects or I run it in the project's cgroup to get suspend/resume running. Needs more knowledge/testing. |
Send message Joined: 29 Apr 19 Posts: 13 Credit: 109,352 RAC: 0 |
From June 1 to September 30, the electricity here will have peak-hour rates, 9x greater than off-peak, from 10am to 7pm, so every BOINC Manager here will be set to suspend all work during those hours. The electric company is pushing to have those new meters be default installs in all homes. |
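As an illustration of the kind of control script discussed in this thread, here is a minimal sketch that decides whether BOINC should be suspended based on the peak-hour window described above (10am to 7pm, June 1 to September 30). It only builds the `boinccmd --set_run_mode` command and prints it; the function names are invented for this example, the window is applied on every day of the week because the post does not say otherwise, and whether a native task resumes cleanly after a suspension is not established here.

```python
from datetime import datetime, time

# Peak window as described in the post above (local time).
PEAK_START = time(10, 0)
PEAK_END = time(19, 0)
SEASON_START = (6, 1)   # June 1 as (month, day)
SEASON_END = (9, 30)    # September 30 as (month, day)


def in_peak_window(now: datetime) -> bool:
    """True if 'now' falls in the expensive period (season and hours)."""
    in_season = SEASON_START <= (now.month, now.day) <= SEASON_END
    in_hours = PEAK_START <= now.time() < PEAK_END
    return in_season and in_hours


def run_mode_command(now: datetime) -> list[str]:
    """Build the boinccmd call: 'never' during peak hours, 'auto' otherwise."""
    mode = "never" if in_peak_window(now) else "auto"
    return ["boinccmd", "--set_run_mode", mode]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(run_mode_command(datetime.now())))
```

Run from a scheduler such as cron every few minutes, the printed command could be passed to `subprocess.run` to apply the run mode. Returning to `auto` rather than `always` leaves the client's own preferences in charge outside the peak window.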
Send message Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 0 |
Where's the problem? There are already 2 possible solutions: 1. Run an LHC vbox app. 2. Set up a self-defined VM and run a native app inside this VM. Both can be paused during the expensive periods by BOINC or a timer of your choice. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
From you or us? |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
delay_bound is currently set to 200000 which is 55.6 hours. What would you suggest setting it to? The longest running successful job I could find was 9.85 days, but out of 1,584.735 jobs only 3 were running longer than a week. That is surprising. Were those jobs fine otherwise? |
©2024 CERN