Thread 'Constructive suggestions please'

Author	Message
ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 1786 - Posted: 1 Feb 2016, 21:50:44 UTC - in response to Message 1784. I won't go into a lot of detail at this point, but all the big experiments use simulation to predict events and test analysis strategies. The things I've been submitting lately are the "background" events, called minimum-bias. These can then get folded into simulations of specific interesting interactions, such as the t-tbar interactions we've also produced. For the work I'm currently involved in, an average of 140 min-bias events are added to each t-tbar event to simulate the operating conditions after the next-next upgrade, around 2025 or so. All the work we've done so far is mainly "proof of concept", so I've kept it that simple. The ultimate idea is for people to submit workflows that, for example, look for very rare decays and thus need not return many events -- in much the same way as the experiment actually works, using the high-level trigger to reject events that don't contain "interesting" features and cutting down the data flow enormously. We are still working to better integrate into the global CMS framework; most of you will have noticed Hassen's work on WMAgent submission becoming visible today. We also have to automate return of result files into the GRID network so that they are catalogued and available for any researcher who wants to use them.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1787 - Posted: 1 Feb 2016, 22:05:13 UTC - in response to Message 1786. Thanks for the explanation, guys! Is there any way for us volunteers to know that what we do is actually producing valid results and not computing "blindly" into some black hole? In this project, we can (dashboard), but for other jobs...?

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 1789 - Posted: 1 Feb 2016, 22:45:54 UTC - in response to Message 1787. I guess the end result that we are aiming at is something like this: http://www.bbc.com/news/science-environment-35444838 Again, Ivan (who is busy at the moment) knows better than me, but as this is very much based on large-number statistics, it is potentially difficult to say that this set of data generated by these volunteers did this. The data generated by the volunteers and elsewhere will be combined with the detector data to produce a result, i.e. a paper. What we can hope for is acknowledgements of the efforts in the papers and also in presentations of the results etc. If we are lucky, maybe in a few years we can make the BBC news too. :) What I can assure you is that the ethical question was discussed in a recent meeting and understood. Also, the effort to build and maintain this is only worth it if it makes a difference to the experiment, and as the projections of computing supply and demand over the next 5-10 years differ by a factor of 10, this project may turn out to be quite helpful.

Steve Hawker* Joined: 6 Mar 15 Posts: 19 Credit: 142,109 RAC: 0	Message 1790 - Posted: 1 Feb 2016, 22:49:21 UTC Ivan, I've had a lot of success with CMS-Dev, and some failures. But nothing has failed recently. Until I tried the vLHC app. I am used to the 24-hour run time, but when the app got to 2 days, I figured it was broken. v46.20 runs on CMS-Dev but not on vLHC, same machine. This is strange, of course. One swallow doesn't make a summer, so this is inconclusive, but maybe 46.20 is not yet ready? My constructive comment is that you need to run many thousands of WUs with zero unmitigated failures before promoting the app to production. Standard risk management/bug crunching. S.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1791 - Posted: 1 Feb 2016, 23:14:07 UTC - in response to Message 1789. Thanks Laurence. I am sure the results will be very useful. What I meant is that it is very important that the volunteer knows a result has actually been submitted, passed or failed. Without that info, you are calculating into a "black hole" (nothing comes back). I would not calculate for any project without any honest feedback. Otherwise you may be calculating for days, weeks or months for nothing. (Which kind of happened when getting credits for a cms-task without actually doing any valid work, just sitting idle.)

rbpeake Joined: 15 Apr 15 Posts: 48 Credit: 1,245,799 RAC: 5,429	Message 1792 - Posted: 1 Feb 2016, 23:22:19 UTC - in response to Message 1789. Laurence wrote: "Also, the effort to build and maintain this is only worth it if it makes a difference to the experiment, and as the projections of computing supply and demand over the next 5-10 years differ by a factor of 10, this project may turn out to be quite helpful." I think the potential and value of BOINC have been proven by Atlas@home, where BOINC has consistently been one of the largest, if not the largest, production contributors to Atlas simulations. http://atlasathome.cern.ch/atlas_job.php Also, and as a result I believe, the level of support for BOINC by Atlas has been excellent. So I would expect BOINC to be eventually as important to CMS.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1906 - Posted: 6 Feb 2016, 23:59:46 UTC It would be really nice if tasks could be provided without constantly having to ask for them. That is mostly the case for vLHC CMS tasks, not so much here. When they are put on, they are usually gone within hours. If there are no tasks available, put a short notice on the message board. Is that too much to ask for?

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 1910 - Posted: 8 Feb 2016, 10:58:03 UTC - in response to Message 1906. I know you are working on things already, but could you let us know the planned order for getting the issues resolved and some sort of timeline for that, please?

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 1921 - Posted: 9 Feb 2016, 20:11:08 UTC - in response to Message 1910. I will create a new thread to describe the plan and current progress.

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 1945 - Posted: 10 Feb 2016, 16:24:50 UTC - in response to Message 1921. Laurence wrote: "I will create a new thread to describe the plan and current progress." Thought about posting in that thread but decided you might prefer that to be left alone so it is easy to see where you are up to with things. It doesn't mention how the credit will be handled once these fixes/improvements are in place. Do you know how the credit system will work in the various 'shutdown' scenarios? For example, some events are completed successfully but then it fails (no more jobs, kernel panic, etc.) so the task is ended prematurely. Also, you mention the LHCb app; does this mean Beauty? That last worked for me back in June 2015; are you the guys doing that one as well? Without a forum there was no one to contact. Will LHCb stay where it is, or will it now be pulled under the CMS-dev umbrella, or are you going to create a new umbrella site for all the dev apps?

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 1957 - Posted: 11 Feb 2016, 8:51:49 UTC - in response to Message 1945. Credit is managed in the standard way by the BOINC client. If the task is successful, credit is given; if it fails, it isn't. I will open a new thread specifically on credit issues. Yes, the LHCb app is the Beauty@home project.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2111 - Posted: 26 Feb 2016, 23:35:23 UTC Would someone please set the "JobLeaseDuration" back to a more realistic value? THIS IS CAUSING MOST OF THE PROBLEMS. A value of six days is totally wrong (a lot of other things time out by then). Set it to 1 day at most.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 2113 - Posted: 27 Feb 2016, 13:00:30 UTC - in response to Message 2111. Rasputin42 wrote: "Would someone please set the 'JobLeaseDuration' back to a more realistic value? THIS IS CAUSING MOST OF THE PROBLEMS. A value of six days is totally wrong (a lot of other things time out by then). Set it to 1 day at most." I can try that, but it may mess up things elsewhere. We have increased timeouts for a lot of other things, including having to modify the HTCondor source at one point.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2114 - Posted: 27 Feb 2016, 20:18:37 UTC - in response to Message 2113. Thanks for the reply, Ivan. ivan wrote: "I can try that, but it may mess up things elsewhere." I think it is already quite messed up. If any timeouts are changed, it should be in small increments, with testing. A jump from 20 min to 6 days is ill-advised. This locks up the task for a long time and presumably causes some, if not all, of the stuck WNPostProc jobs. I would suggest setting it to one day. I would also like to ask which timeout values have been changed, to what and when, including their default values. The important thing is to change only ONE value at a time and watch carefully for an extended period of time (a few days), and not draw any immediate conclusion from a seemingly quick success. I hope someone keeps a record of what was changed and when.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 2115 - Posted: 27 Feb 2016, 23:48:31 UTC - in response to Message 2114. Rasputin42 wrote: "Thanks for the reply, Ivan. [ivan wrote: 'I can try that, but it may mess up things elsewhere.'] I think it is already quite messed up. If any timeouts are changed, it should be in small increments, with testing. A jump from 20 min to 6 days is ill-advised. This locks up the task for a long time and presumably causes some, if not all, of the stuck WNPostProc jobs. I would suggest setting it to one day. I would also like to ask which timeout values have been changed, to what and when, including their default values. The important thing is to change only ONE value at a time and watch carefully for an extended period of time (a few days), and not draw any immediate conclusion from a seemingly quick success. I hope someone keeps a record of what was changed and when." I'm pretty sure Laurence can dig up what he changed, when and why -- we have a pretty good e-mail audit trail too. Whether we want to put that out completely into the public domain is another thing. I actually did change JobLeaseDuration back to 1 day after my earlier message, but I'm not sure if that's responsible for the drop-off in the top job activities graph (and I'm totally at a loss to explain the behaviour of the other two graphs). Because the sample was so small, I actually changed the 100-job batch I submitted yesterday to a 5-minute JLD, and I don't think I've seen anything change in its Condor tasks. So other time-outs are having an effect. Apparently there's an HTCondor Workshop coming up in Barcelona soon. Andrew (who maintains our server at RAL) and at least one person from CERN that I know of are both going; they will be asking pertinent questions of the developers about if/how we can achieve our aims. I suppose it's too much to hope that you're going too?

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 2122 - Posted: 29 Feb 2016, 9:21:05 UTC - in response to Message 2114. Here are the configuration parameters that we have been trying, with their current values and the defaults in brackets: NOT_RESPONDING_TIMEOUT: 48h (1h); CCB_HEARTBEAT_INTERVAL: 24h (20 mins); ALIVE_INTERVAL: 48h (5 mins); JobLeaseDuration: 6 days (?); MAX_TIME_SKIP: 48h (20 mins). http://research.cs.wisc.edu/htcondor/manual/v7.7/3_3Configuration.html Suggestions are welcome. As we had to patch the code, it is clear that this is a use case that is not currently supported. This week there is an HTCondor workshop in Barcelona and I hope this will be a topic of discussion.
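To put the values in the post above on a common scale, a minimal Python sketch can convert each current value and its default to seconds and compute the ratio between them. The numbers are taken from the post; the dictionary layout and the helper names scale_factor and summarize are illustrative assumptions, not part of HTCondor. The default for JobLeaseDuration is left unknown because the post lists it as "(?)".

```python
# Durations expressed in seconds.
MINUTE = 60
HOUR = 60 * MINUTE
DAY = 24 * HOUR

# (current value, default) as quoted in the post.
# JobLeaseDuration's default is given as "(?)" there, so it is None here.
settings = {
    "NOT_RESPONDING_TIMEOUT": (48 * HOUR, 1 * HOUR),
    "CCB_HEARTBEAT_INTERVAL": (24 * HOUR, 20 * MINUTE),
    "ALIVE_INTERVAL": (48 * HOUR, 5 * MINUTE),
    "JobLeaseDuration": (6 * DAY, None),
    "MAX_TIME_SKIP": (48 * HOUR, 20 * MINUTE),
}


def scale_factor(current, default):
    """Return current/default, or None when the default is unknown."""
    if default is None:
        return None
    return current / default


def summarize(table):
    """Return (name, current_s, default_s, factor) rows for each setting."""
    return [
        (name, current, default, scale_factor(current, default))
        for name, (current, default) in table.items()
    ]


if __name__ == "__main__":
    for name, current, default, factor in summarize(settings):
        shown = "unknown" if factor is None else f"x{factor:g}"
        print(f"{name:24s} {current:>7d} s  (default: {default})  {shown}")
```

By this measure ALIVE_INTERVAL was raised furthest, by a factor of 576 (48 hours against a 5-minute default).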

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2124 - Posted: 29 Feb 2016, 12:29:23 UTC - in response to Message 2122. Hi Laurence, the current JobLeaseDuration is 85512 sec. The changes to all these values have been way too drastic, in my opinion. All suspends longer than 20 min are causing problems, as described here: http://boincai05.cern.ch/CMS-dev/forum_thread.php?id=132&postid=2120 First, I would set all variables back to default and approach any change one at a time and monitor the effect. The current settings are causing more problems and do not contribute to solving the actual issue. What do you think?

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 2125 - Posted: 29 Feb 2016, 13:29:36 UTC - in response to Message 2124. Yes, they are a little extreme, just to help identify which parameters are involved, and they now need tuning. What is a realistic target for the suspend length? As you suggest, maybe we should take it in smaller steps: 1h, 2h, 6h, 24h.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,047,486 RAC: 56	Message 2126 - Posted: 29 Feb 2016, 13:44:04 UTC - in response to Message 2125. Last modified: 29 Feb 2016, 13:45:04 UTC Laurence wrote: "Yes, they are a little extreme, just to help identify which parameters are involved, and they now need tuning. What is a realistic target for the suspend length? As you suggest, maybe we should take it in smaller steps: 1h, 2h, 6h, 24h." The default BOINC suspend when switching between different projects is 60 minutes. More useful for this project, in my opinion, is a suspend that would survive an overnight PC power-down, so 12 hours, say, seems a good suspension time to me.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2127 - Posted: 29 Feb 2016, 14:12:08 UTC - in response to Message 2126. Last modified: 29 Feb 2016, 14:27:20 UTC As a target value, that seems OK to me. We might need to approach that in small steps and see how far we can push it.
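The approach the posters converge on (one change at a time, in small steps, up to a target of about 12 hours) can be sketched as a simple plan generator. This is only an illustration: the function name tuning_plan, the choice to cap the suggested steps (1h, 2h, 6h, 24h) at the 12-hour target, and the order in which parameters are visited are all assumptions, not something the project is stated to have done.

```python
HOUR = 3600  # seconds


def tuning_plan(parameters, steps_hours, target_hours):
    """Yield (parameter, seconds) changes, one change at a time.

    Steps at or above the target are dropped and the target itself is
    used as the final step. Within each step, parameters are changed one
    after another so that any effect can be traced to a single change.
    """
    steps = sorted(h for h in set(steps_hours) if h < target_hours)
    steps.append(target_hours)
    for hours in steps:
        for name in parameters:
            yield name, hours * HOUR


if __name__ == "__main__":
    # Hypothetical choice of parameters; the step sizes are those suggested
    # in the thread, and 12 hours is the target proposed there.
    names = ["ALIVE_INTERVAL", "MAX_TIME_SKIP"]
    for name, seconds in tuning_plan(names, [1, 2, 6, 24], 12):
        print(f"set {name} to {seconds} s, then observe for a few days")
```

Each yielded change would be followed by an observation period (the thread suggests a few days) before the next one is applied.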

Development for LHC@home