Message boards : CMS Application : Problem with upgrade of BOINC server
Author | Message |
---|---|
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 123 |
Apparently there is a problem with the BOINC server after an OS upgrade to RHEL9. The server status display shows zero CMS tasks available even though there are jobs pending. This is affecting creation of new tasks, even though we do have some jobs being run. We are working on a fix. |
Send message Joined: 8 Apr 15 Posts: 781 Credit: 12,422,653 RAC: 3,337 |
Thanks Ivan |
Send message Joined: 8 Apr 15 Posts: 781 Credit: 12,422,653 RAC: 3,337 |
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3335583 Yeah, they would love that over at production: Run time 7 hours 53 min 9 sec; CPU time 18 hours 34 min 49 sec; Validate state: Valid; Credit: 18.47 |
Send message Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 0 |
So much or so many ;-) Credit. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 25 |
Quote: "Apparently there is a problem with the BOINC server after an OS upgrade to RHEL9. The server status display shows zero CMS tasks available even though there are jobs pending. This is affecting creation of new tasks, even though we do have some jobs being run." @Ivan: The jobs you created yesterday afternoon are coming through now. It seems all jobs are exactly the same. Is this on purpose for testing or is that a failure? |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 25 |
Quote: "So much or so many ;-) Credit." Yeah, the credit calculation is a mystery. Four tasks, all sent 4 Jun 2024, 6:47:30 UTC, all reported the same day, and all completed and validated (task / work unit / time reported / run time in sec / CPU time in sec / credit): 3335618 / 2417753 / 14:18:30 UTC / 23,549.03 / 89,871.33 / 1,171.84; 3335619 / 2417754 / 13:48:08 UTC / 22,026.96 / 84,156.92 / 1,096.09; 3335621 / 2417756 / 13:47:02 UTC / 21,387.70 / 79,706.69 / 1,064.28; 3335630 / 2417765 / 14:25:13 UTC / 21,868.83 / 80,232.25 / 41.84 |
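To put the four results above on a common scale, here is a small Python sketch that divides each task's credit by its CPU time. The figures are copied from the post; the function and variable names are only illustrative and are not part of BOINC.

```python
# Figures copied from the four tasks listed in the post above:
# (task id, run time in s, CPU time in s, granted credit)
tasks = [
    (3335618, 23549.03, 89871.33, 1171.84),
    (3335619, 22026.96, 84156.92, 1096.09),
    (3335621, 21387.70, 79706.69, 1064.28),
    (3335630, 21868.83, 80232.25, 41.84),
]


def credit_per_cpu_hour(cpu_seconds, credit):
    """Granted credit divided by CPU time expressed in hours."""
    return credit / (cpu_seconds / 3600.0)


# Map each task id to its credit rate.
rates = {tid: credit_per_cpu_hour(cpu, cr) for tid, _run, cpu, cr in tasks}

for tid, rate in rates.items():
    print(f"task {tid}: {rate:.2f} credit per CPU hour")
```

With these figures the first three tasks come out at roughly 47 to 48 credits per CPU hour and the fourth at under 2, even though all four have similar run times.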
Send message Joined: 8 Apr 15 Posts: 781 Credit: 12,422,653 RAC: 3,337 |
|
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 123 |
Quote: "Apparently there is a problem with the BOINC server after an OS upgrade to RHEL9. The server status display shows zero CMS tasks available even though there are jobs pending. This is affecting creation of new tasks, even though we do have some jobs being run." In which way the same? They are Monte Carlo simulations, so while the config file might be the same, they should give different results due to the pseudorandom-number generators. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 123 |
Quote: "Again" You seem to be unlucky 😢. I'm getting reasonable amounts of credit. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 123 |
OK, an update. You'll have noticed tasks are flowing again -- Laurence fixed the OS upgrade problem. We also seem to have finally cracked the new storage configuration files; all production and post-production instances are returning successful completions. For the time being, I am generating workflows for quad-core VMs while we verify that this is truly so. Therefore, in your computing preferences you should set Max # CPUs = 4 (and Max # jobs <= actual CPUs/4). This applies to both CMS@Home and CMS@Home-dev. Apologies to those of you with hosts having fewer than 4 cores... There's a discussion to be had soon as to how we proceed with multicore jobs vs. single core. There are difficulties in mixing the two, perhaps even some we haven't considered yet. I'm trying to gather my thoughts to produce an initial discussion paper. |
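As a quick illustration of the preference rule in the post above (Max # CPUs = 4, Max # jobs no larger than actual CPUs/4), here is a minimal Python sketch. The function name and its return convention are hypothetical; the real values are entered by hand in the project's computing preferences, not through any API.

```python
def suggested_preferences(total_cores, cores_per_vm=4):
    """Return (max_cpus, max_jobs) for quad-core VM tasks.

    Hypothetical helper: it only applies the rule of thumb from the post.
    Returns None when the host has fewer cores than one VM needs.
    """
    if total_cores < cores_per_vm:
        return None
    # Integer division keeps max_jobs <= total_cores / cores_per_vm.
    return cores_per_vm, total_cores // cores_per_vm


for cores in (2, 4, 8, 16):
    print(cores, "cores ->", suggested_preferences(cores))
```

For example, a 16-core host would run at most four quad-core tasks at once, and an 8-core host at most two.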
Send message Joined: 8 Apr 15 Posts: 781 Credit: 12,422,653 RAC: 3,337 |
Quote: "Again" Yes, I have never been lucky running CMS, no matter what version, as far as the credits go, but I do prefer them to be Valids. Of course I checked yours before I even mentioned that, since I knew you were running some and you don't "hide" your PCs (which should not even be allowed here, but we have several doing that, so I always have to check Statistics/Computers and go through them until I find one running a CMS to compare with). The other one I had running at the same time is still running, so eventually it should finish as Valid... as far as the credits go, I will be shocked if I even get 500... yes, I am awake at 3:45am for some reason. I will fire up a few more on the desktops later, since I am just on the 16-core laptop next to me to run things AND be able to check them when I am awake... for some strange reason the Mrs won't allow the 8 other desktops in the bedroom. Edit: well, as usual my dev-itis made me get up and go grab 8 more of the 4-cores here... OK, officially... goodnight |
Send message Joined: 8 Apr 15 Posts: 781 Credit: 12,422,653 RAC: 3,337 |
|
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 25 |
Quote: "It seems all jobs are exactly the same. Is this on purpose for testing or is that a failure?" As I wrote, it seems so (to me), because the package files had the same byte sizes. In the far past (before Grafana), we could see exactly which sub-tasks a/my machine had done or was chewing on. |
Send message Joined: 8 Apr 15 Posts: 781 Credit: 12,422,653 RAC: 3,337 |
|
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 123 |
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3335869 I'll have to think about finding your job logs on the data-bridge and looking for the cause of your bad luck. 😕 |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 123 |
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3335869 I found one log from you, for computer 4786 from Wed Jun 5 12:58:16 2024.local, so 20:58 UTC? Currently can't match that to any task in https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=4786, but it's gone midnight now and I need some sleep... |
Send message Joined: 8 Apr 15 Posts: 781 Credit: 12,422,653 RAC: 3,337 |
That is my 16-core laptop. So I tried the three 8-core desktops, and they are all getting Valids with less than 100 credits each: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3335864 (Computer ID 1866), https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3335723 (Computer ID 1848), https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3335653 (Computer ID 1816). I ran several on each; this mini-credit started May 3rd and never stopped, and I still have more running on each. Last night I decided to try that laptop on the production version of CMS multi, with no problem there, and I just started another one there to see if it does the same. https://lhcathome.cern.ch/lhcathome/result.php?resultid=411596961 |
Send message Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 0 |
Complaints about credits are useless as long as the logs show a success like: 2024-06-06 16:23:23 (6020): VM Completion Message: glidein exited with return value 0. [...] 2024-06-06 16:23:29 (6020): called boinc_finish(0). For more than a decade, credit calculation has been done by the BOINC server following a well-defined algorithm: https://boinc.berkeley.edu/trac/wiki/CreditNew The bad thing is that the algorithm partly uses its output to modify input parameters used for the next calculation. This is a "per computer/per project" property, hence the results can't be compared with other computers of the same type or between projects (like dev/prod). It's all fine as long as the computer runs series of tasks under the same load with runtimes as close together as possible. Factors disturbing the balance (examples): running the same task type under low/full load; series of long-running tasks followed by series of short-running tasks and vice versa; periods of empty backend queues (since for LHC@home BOINC treats empty envelopes as very short valids); a mix of singlecore and multicore tasks. Once a balance has been found, e.g. during a period of an empty queue, a refilled queue can result in credits being way off (even by several orders of magnitude), and it takes lots of returned valids to slowly get back to the mean (but it happens automatically). Users then tend to complain about low credits but never about "too much" credit. |
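The feedback effect described in the post above can be illustrated with a deliberately simplified toy model in Python. This is not the CreditNew algorithm (see the linked wiki page for that); it only assumes that each task's credit is scaled by a per-host running average, and that the average is then updated by the same task. The names simulate, target_credit and alpha, and all the numbers, are made up for the illustration.

```python
def simulate(runtimes, target_credit=1000.0, alpha=0.1):
    """Toy feedback loop; NOT the real CreditNew algorithm.

    Each task's credit is its runtime relative to a per-host running
    average, and that average is then updated with the same runtime.
    """
    avg = runtimes[0]  # assume the host starts out "in balance"
    credits = []
    for rt in runtimes:
        credits.append(target_credit * rt / avg)
        # Feedback step: this result changes the input for the next one.
        avg = (1 - alpha) * avg + alpha * rt
    return credits


normal, empty = 20000.0, 600.0  # seconds; illustrative values only
# Steady work, then a period of empty envelopes, then a refilled queue.
history = [normal] * 5 + [empty] * 30 + [normal] * 30
credits = simulate(history)

print("steady state:         ", round(credits[0]))
print("first empty envelope: ", round(credits[5]))
print("first task after gap: ", round(credits[35]))
print("30 tasks after gap:   ", round(credits[-1]))
```

In this toy setup the first real task after the empty-queue period is credited at more than ten times the steady-state value, and it takes dozens of returned results before the credit drifts back to within a few percent of it. A different history on the same host would push the credit the other way, which is consistent with the point that results cannot be compared between computers or projects.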
Send message Joined: 8 Apr 15 Posts: 781 Credit: 12,422,653 RAC: 3,337 |
I call BS and I had blocked your comments but decided to see what you said this time. And telling ME here what members might say in production is something for you to bother them with over there, and never me. As usual you are full of yourself. I was the first member that even had a Valid VB task, in March 2011, and you were not there, AND you must remember (probably not) that I have mentioned insane credits of over 10,000 for no reason many times. This dev site is NOT a credits contest, and the same goes for the BS about people trying to join here to add it to their long list of BOINC projects or HIDE their computers (maybe you should use your mystical powers to deal with THAT). So once again I will not read your page of info that I don't need from you, where as usual you act like you are a genius here and everyone else is here to play video games. I am only discussing this with Ivan and not you (so go to GitHub). I don't need ANY information from you, Stef. I'm surprised you didn't underline any of it. (OK, I won't read your next... post) |
Send message Joined: 24 Oct 19 Posts: 171 Credit: 543,238 RAC: 245 |
Quote: "I call BS and I had blocked your comments but decided to see what you said this time." Why not call the police and throw him in jail? |
©2024 CERN