Thread 'Native Theory Application in Production'

Author	Message
Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 6233 - Posted: 18 Mar 2019, 11:17:12 UTC Today the Native Theory application has been added to the production project. I would like to take this opportunity to thank everyone who has been involved in the testing. Without your efforts it would be difficult to catch all the issues that occur in the different environments. It means that the production application is much more robust than it would be otherwise. The application is currently in beta (test) until we can confirm that it is working for everyone, so please could you shift your focus to the production project. Thanks again for all your effort and feedback!

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 6248 - Posted: 25 Mar 2019, 11:31:57 UTC - in response to Message 6233. Last modified: 25 Mar 2019, 11:32:52 UTC Once again I would like to say a big thank you to all those who have been involved in testing this application. From my perspective the release to production went very well. So far it has not been necessary to update the version that we originally deployed, which means that we managed to catch the issues during testing. I would also like to thank those who have used their experience from the dev project to help volunteers in the production project with their problems and questions. Before we move the application out of beta, I would like to address some server-side scheduling issues, so that fewer tasks are given out to hosts that fail tasks because CVMFS is not available. In the meantime, please let me know if anything needs to be improved in the sticky or the CVMFS documentation. If there is any other feedback, please let me know; don't be shy.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,240 RAC: 69	Message 6249 - Posted: 25 Mar 2019, 12:09:23 UTC - in response to Message 6248. Hi Laurence, You may have noticed that, since the introduction of the native Theory at the production site, a problem has been (re)introduced with requesting tasks when the max # of jobs and max # of CPUs are set. I just tested it again. As an example, with Max # of jobs: no limit and Max # of CPUs: 1, I don't get new tasks when I have 2 tasks running and an idle core. Thereafter I tried Max # of jobs: no limit and Max # of CPUs: 2, and I got 2 tasks but not more, where I expected at least 2 x 3 (available cores) = 6. So somehow the Max # of CPUs is influencing the number of jobs, where it should not, because in principle we run the native Theory as single-core tasks. With a higher # of CPUs, BOINC reserves cores for nothing. The user can create a workaround with app_config.xml, but app_config.xml should not be used to work around bugs/failures.
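For readers unfamiliar with the app_config.xml workaround mentioned in the post above, here is a minimal sketch of what such a file might contain, generated from Python. The element names follow the BOINC client's app_config.xml format, but the application name and plan class below are placeholders (assumptions, not taken from this thread); the real values would have to be looked up in the client's own state files. As the post says, this only masks the scheduling problem and is not a fix.

```python
import xml.etree.ElementTree as ET

# Placeholders (assumptions): look up the real app name and plan class
# in client_state.xml before using anything like this.
APP_NAME = "EXAMPLE_APP_NAME"
PLAN_CLASS = "EXAMPLE_PLAN_CLASS"


def build_app_config(app_name: str, plan_class: str, max_concurrent: int) -> str:
    """Return app_config.xml text that budgets one CPU per task."""
    root = ET.Element("app_config")

    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    # Upper limit on simultaneously running tasks of this app.
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)

    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    # Ask the client to count one core per task of this app version.
    ET.SubElement(version, "avg_ncpus").text = "1"

    ET.indent(root)  # pretty-print; requires Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Three concurrent tasks matches the three available cores in the example.
    print(build_app_config(APP_NAME, PLAN_CLASS, max_concurrent=3))
```

The generated text would be saved as app_config.xml in the project's directory under the BOINC data directory, and the client told to re-read its config files.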

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 544 Credit: 400,710 RAC: 0	Message 6251 - Posted: 25 Mar 2019, 13:02:28 UTC Some comments: As CMS is also back at the production server, I tried to get both (CMS and Theory native) on the same client. Unfortunately this client only gets Theory native tasks from the server. Hence it needs manual intervention to get CMS tasks. The credit the server awards for Theory native is rather poor. It seems to be only 10% of normal credit, and it doesn't look like it normalizes over time. It should be clarified why.

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 6318 - Posted: 2 May 2019, 14:29:55 UTC - in response to Message 6248. The native theory app seems to be running quite nicely in production. Is there anything that needs to be addressed before we take it out of beta?

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 544 Credit: 400,710 RAC: 0	Message 6319 - Posted: 2 May 2019, 16:25:12 UTC - in response to Message 6318. "... Is there anything that needs to be addressed before we take it out of beta?" Yes. 1. Task deadlines should be extended to give longrunners a better chance to finish. 2. A method to write the job parameters to stderr.txt early should be included. A possible solution can be found here: https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=467 3. Credits given for Theory native are still very poor, especially compared to Theory vbox. This may result in poor acceptance.

m Volunteer tester Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 6322 - Posted: 3 May 2019, 9:15:09 UTC - in response to Message 6251. Last modified: 3 May 2019, 9:19:02 UTC "Some comments: As CMS is also back at the production server, I tried to get both (CMS and Theory native) on the same client. Unfortunately this client only gets Theory native tasks from the server. Hence it needs manual intervention to get CMS tasks. The credit the server awards for Theory native is rather poor. It seems to be only 10% of normal credit, and it doesn't look like it normalizes over time. It should be clarified why." The same applies to Theory native and Atlas (this is from the production server). The server will only send Atlas if it can't send Theory, either because of preference settings or because there are no Theory tasks available (as happened a few weeks ago). This is a "VBox free" host where there have been >50 Theory tasks in a row without the server sending an Atlas task. I thought that the server might be trying to equalise the credit between the sub-projects, but this is greater than the difference in credit; unless it's trying to make up the backlog which, for me, will take a long time unless the credit for Theory increases considerably. Also, the need to run hosts continuously is a problem for me and, I expect, for many others as well, especially if the project wants to widen its potential pool of volunteers. I know that with a bit of babysitting and some clever control scripts this can be overcome to some extent, but we're getting a long way from the original ideas of BOINC.

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 6325 - Posted: 3 May 2019, 12:04:28 UTC - in response to Message 6319. "... Is there anything that needs to be addressed before we take it out of beta?" "Yes. 1. Task deadlines should be extended to give longrunners a better chance to finish." Do you know what parameter needs to be set? Is it rsc_fpops_est? "2. A method to write the job parameters to stderr.txt early should be included. A possible solution can be found here: https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=467" I am looking at this and will respond in that thread. "3. Credits given for Theory native are still very poor, especially compared to Theory vbox. This may result in poor acceptance." I think this changes with rsc_fpops_est.

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 6326 - Posted: 3 May 2019, 12:08:20 UTC - in response to Message 6322. "The same applies to Theory native and Atlas (this is from the production server). The server will only send Atlas if it can't send Theory, either because of preference settings or because there are no Theory tasks available (as happened a few weeks ago). This is a 'VBox free' host where there have been >50 Theory tasks in a row without the server sending an Atlas task. I thought that the server might be trying to equalise the credit between the sub-projects, but this is greater than the difference in credit; unless it's trying to make up the backlog which, for me, will take a long time unless the credit for Theory increases considerably." The scheduling is a bit of a black box. "Also, the need to run hosts continuously is a problem for me and, I expect, for many others as well, especially if the project wants to widen its potential pool of volunteers. I know that with a bit of babysitting and some clever control scripts this can be overcome to some extent, but we're getting a long way from the original ideas of BOINC." It is possible to suspend the Theory tasks but not the ATLAS tasks.

marmot Joined: 29 Apr 19 Posts: 13 Credit: 109,352 RAC: 0	Message 6330 - Posted: 3 May 2019, 12:36:15 UTC - in response to Message 6322. Last modified: 3 May 2019, 12:36:27 UTC "I know that with a bit of babysitting and some clever control scripts this can be overcome to some extent, but we're getting a long way from the original ideas of BOINC." Glad to see I'm not alone in that thought. Here are my comments in another thread: https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=465&postid=6324#6324

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 544 Credit: 400,710 RAC: 0	Message 6332 - Posted: 3 May 2019, 13:21:02 UTC - in response to Message 6325. "1. Task deadlines should be extended to give longrunners a better chance to finish." "Do you know what parameter needs to be set? Is it rsc_fpops_est?" It is the server parameter that sets the client's "report_deadline". I'm not familiar with the server templates, but I doubt it's "rsc_fpops_est". My guess is that it could be "delay_bound".

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 6335 - Posted: 3 May 2019, 14:13:24 UTC - in response to Message 6332. "1. Task deadlines should be extended to give longrunners a better chance to finish." "Do you know what parameter needs to be set? Is it rsc_fpops_est?" "It is the server parameter that sets the client's 'report_deadline'. I'm not familiar with the server templates, but I doubt it's 'rsc_fpops_est'. My guess is that it could be 'delay_bound'." delay_bound is currently set to 200000 seconds, which is about 55.6 hours. What would you suggest setting it to?

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,240 RAC: 69	Message 6336 - Posted: 3 May 2019, 14:41:36 UTC - in response to Message 6335. "delay_bound is currently set to 200000 seconds, which is about 55.6 hours. What would you suggest setting it to?" The longest-running successful job I could find took 9.85 days, but out of 1,584.735 jobs only 3 ran longer than a week.

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 544 Credit: 400,710 RAC: 0	Message 6337 - Posted: 3 May 2019, 14:53:16 UTC - in response to Message 6335. Last modified: 3 May 2019, 14:54:21 UTC delay_bound for vbox tasks seems to be 2592000 (30 days). If nobody disagrees, I would suggest setting it to 432000 (5 days) or 600000 (a bit less than a week). Are there other limits we are not aware of in this thread (see: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4028&postid=38669)? <edit> May be obsolete due to CP's post. </edit>
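The delay_bound figures quoted in this exchange are all in seconds, so the conversions are easy to check. The sketch below redoes the arithmetic and compares each candidate against the 9.85-day job reported earlier in the thread. The labels are only descriptive, and treating the 9.85-day figure as wall-clock time that must fit inside the deadline is an assumption.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86400

# delay_bound values quoted in the thread, in seconds.
CANDIDATES = {
    "native, current": 200_000,
    "suggested, 5 days": 432_000,
    "suggested, just under a week": 600_000,
    "vbox": 2_592_000,
}

# Longest successful job reported in the thread, in days (assumed wall-clock).
LONGEST_JOB_DAYS = 9.85


def to_hours(seconds: int) -> float:
    return seconds / SECONDS_PER_HOUR


def to_days(seconds: int) -> float:
    return seconds / SECONDS_PER_DAY


def covers(seconds: int, runtime_days: float) -> bool:
    """True if a deadline of `seconds` is at least as long as the runtime."""
    return to_days(seconds) >= runtime_days


if __name__ == "__main__":
    for label, seconds in CANDIDATES.items():
        print(
            f"{label:30s} {seconds:>9d} s = {to_hours(seconds):6.1f} h"
            f" = {to_days(seconds):5.2f} d,"
            f" covers longest job: {covers(seconds, LONGEST_JOB_DAYS)}"
        )
```

Under that assumption, only the 30-day vbox value would cover the longest job; both suggested values are shorter than 9.85 days.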

m Volunteer tester Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 6339 - Posted: 3 May 2019, 15:41:35 UTC - in response to Message 6326. "Also, the need to run hosts continuously is a problem for me and, I expect, for many others as well, especially if the project wants to widen its potential pool of volunteers. I know that with a bit of babysitting and some clever control scripts this can be overcome to some extent, but we're getting a long way from the original ideas of BOINC." "It is possible to suspend the Theory tasks but not the ATLAS tasks." Hosts here shut down (aka switch off) during the day and run overnight (cheaper electricity, unmetered internet, and I can use the heat). There must be many (potential) volunteers who want or need to shut their computers down overnight, over the weekend or whatever, without a lot of manual attention.

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 544 Credit: 400,710 RAC: 0	Message 6340 - Posted: 3 May 2019, 15:59:00 UTC - in response to Message 6318. "... Is there anything that needs to be addressed ..." I tried to integrate Theory native into my existing cgroups hierarchy, but so far without full success. Hence I either run it in my own (sub)cgroup to get balanced scheduling among all projects, or I run it in the project's cgroup to get suspend/resume working. Needs more knowledge/testing.

marmot Joined: 29 Apr 19 Posts: 13 Credit: 109,352 RAC: 0	Message 6343 - Posted: 3 May 2019, 17:13:01 UTC - in response to Message 6339. "There must be many (potential) volunteers who want or need to shut their computers down overnight, over the weekend or whatever, without a lot of manual attention." From June 1 to September 30, the electric here will have peak-hour rates from 10am to 7pm, 9x higher than off-peak, so all the BOINCMgr will be set to suspend all work during those hours. The electric company is pushing to have those new meters be default installs in all homes.

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 544 Credit: 400,710 RAC: 0	Message 6346 - Posted: 3 May 2019, 19:54:41 UTC - in response to Message 6343. "There must be many (potential) volunteers who want or need to shut their computers down overnight, over the weekend or whatever, without a lot of manual attention." "From June 1 to September 30, the electric here will have peak-hour rates from 10am to 7pm, 9x higher than off-peak, so all the BOINCMgr will be set to suspend all work during those hours. The electric company is pushing to have those new meters be default installs in all homes." Where's the problem? There are already 2 possible solutions: 1. Run an LHC vbox app. 2. Set up a self-defined VM and run a native app inside this VM. Both can be paused during the expensive periods by BOINC or a timer of your choice.
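As an illustration of the "timer of your choice" idea in the post above, the sketch below decides whether the current time falls inside the peak window described earlier in the thread (June 1 to September 30, 10am to 7pm) and maps that to a BOINC run mode. It relies on boinccmd's --set_run_mode option, which accepts always, auto, or never. Applying the window on every day of the week is an assumption, as is the idea of running this from cron or a similar scheduler, and whether a suspended task resumes cleanly depends on the application.

```python
import subprocess
from datetime import datetime, time

# Peak window taken from the thread; applying it every day is an assumption.
PEAK_START = time(10, 0)     # 10am
PEAK_END = time(19, 0)       # 7pm
PEAK_MONTHS = range(6, 10)   # June through September


def in_peak(now: datetime) -> bool:
    """True if `now` falls inside the expensive period."""
    return now.month in PEAK_MONTHS and PEAK_START <= now.time() < PEAK_END


def run_mode(now: datetime) -> str:
    """Run mode to request: 'never' during peak hours, otherwise 'auto'."""
    return "never" if in_peak(now) else "auto"


def apply_run_mode(now: datetime) -> None:
    """Ask the local BOINC client to switch mode.

    Assumes boinccmd is on PATH and is allowed to talk to the client.
    """
    subprocess.run(["boinccmd", "--set_run_mode", run_mode(now)], check=True)


if __name__ == "__main__":
    # Only prints the decision; call apply_run_mode() to actually switch.
    print(run_mode(datetime.now()))
```

Run once at the start and once at the end of the window, this would suspend and resume all BOINC work without manual attention.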

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 6347 - Posted: 3 May 2019, 21:51:19 UTC - in response to Message 6340. "I tried to integrate Theory native into my existing cgroups hierarchy, but so far without full success. Hence I either run it in my own (sub)cgroup to get balanced scheduling among all projects, or I run it in the project's cgroup to get suspend/resume working. Needs more knowledge/testing." From you or us?

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 6348 - Posted: 3 May 2019, 21:53:13 UTC - in response to Message 6336. "delay_bound is currently set to 200000 seconds, which is about 55.6 hours. What would you suggest setting it to?" "The longest-running successful job I could find took 9.85 days, but out of 1,584.735 jobs only 3 ran longer than a week." That is surprising. Were those jobs fine otherwise?

Development for LHC@home