Message boards : CMS Application : Problem with upgrade of BOINC server

Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar

Send message
Joined: 20 Jan 15
Posts: 1138
Credit: 8,132,735
RAC: 844
Message 8454 - Posted: 31 May 2024, 10:37:30 UTC

Apparently there is a problem with the BOINC server after an OS upgrade to RHEL9. The server status display shows zero CMS tasks available even though there are jobs pending. This is affecting creation of new tasks, even though we do have some jobs being run.
We are working on a fix.
ID: 8454 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Magic Quantum Mechanic
Avatar

Send message
Joined: 8 Apr 15
Posts: 777
Credit: 12,074,445
RAC: 5,318
Message 8456 - Posted: 2 Jun 2024, 10:17:16 UTC

Thanks Ivan
ID: 8456 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Magic Quantum Mechanic
Avatar

Send message
Joined: 8 Apr 15
Posts: 777
Credit: 12,074,445
RAC: 5,318
Message 8457 - Posted: 3 Jun 2024, 20:16:11 UTC - in response to Message 8456.  

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3335583


Yeah they would love that over at production

Run time 7 hours 53 min 9 sec
CPU time 18 hours 34 min 49 sec
Validate state Valid
Credit 18.47
ID: 8457 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
maeax

Send message
Joined: 22 Apr 16
Posts: 675
Credit: 1,989,119
RAC: 394
Message 8458 - Posted: 4 Jun 2024, 6:11:13 UTC - in response to Message 8457.  

So much or so many ;-) Credit.
ID: 8458 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1188
Credit: 857,561
RAC: 33
Message 8459 - Posted: 4 Jun 2024, 7:13:02 UTC - in response to Message 8454.  

Apparently there is a problem with the BOINC server after an OS upgrade to RHEL9. The server status display shows zero CMS tasks available even though there are jobs pending. This is affecting creation of new tasks, even though we do have some jobs being run.
We are working on a fix.

@Ivan: The jobs you created yesterday afternoon are coming through now.
It seems all jobs are exactly the same. Is this on purpose for testing or is that a failure?
ID: 8459 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1188
Credit: 857,561
RAC: 33
Message 8460 - Posted: 4 Jun 2024, 14:42:48 UTC - in response to Message 8458.  

So much or so many ;-) Credit.

Yeah, the credit calculation is a mystery:
Task     Work unit  Sent                     Time reported             Status                   Run time (sec)  CPU time (sec)  Credit
3335618  2417753    4 Jun 2024, 6:47:30 UTC  4 Jun 2024, 14:18:30 UTC  Completed and validated       23,549.03       89,871.33  1,171.84
3335619  2417754    4 Jun 2024, 6:47:30 UTC  4 Jun 2024, 13:48:08 UTC  Completed and validated       22,026.96       84,156.92  1,096.09
3335621  2417756    4 Jun 2024, 6:47:30 UTC  4 Jun 2024, 13:47:02 UTC  Completed and validated       21,387.70       79,706.69  1,064.28
3335630  2417765    4 Jun 2024, 6:47:30 UTC  4 Jun 2024, 14:25:13 UTC  Completed and validated       21,868.83       80,232.25     41.84
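
To put a number on the mystery, here is a small sketch that normalises the credit of the four tasks listed above by their CPU time. The figures are copied from the task list; the helper name credit_per_cpu_hour is only illustrative and is not anything BOINC itself reports.

```python
# Rows copied from the task list above:
# (task ID, run time in seconds, CPU time in seconds, granted credit)
tasks = [
    (3335618, 23549.03, 89871.33, 1171.84),
    (3335619, 22026.96, 84156.92, 1096.09),
    (3335621, 21387.70, 79706.69, 1064.28),
    (3335630, 21868.83, 80232.25, 41.84),
]


def credit_per_cpu_hour(cpu_seconds, credit):
    """Granted credit divided by the CPU time expressed in hours."""
    return credit / (cpu_seconds / 3600.0)


# Map each task ID to its credit rate so the outlier is easy to spot.
rates = {
    task_id: credit_per_cpu_hour(cpu_seconds, credit)
    for task_id, _run_seconds, cpu_seconds, credit in tasks
}

for task_id, rate in rates.items():
    print(f"{task_id}: {rate:6.2f} credits per CPU-hour")
```

The first three tasks come out at roughly 47 to 48 credits per CPU-hour, while the fourth gets under 2 for a very similar run time and CPU time.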
ID: 8460 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Magic Quantum Mechanic
Avatar

Send message
Joined: 8 Apr 15
Posts: 777
Credit: 12,074,445
RAC: 5,318
Message 8462 - Posted: 5 Jun 2024, 6:58:19 UTC

ID: 8462 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar

Send message
Joined: 20 Jan 15
Posts: 1138
Credit: 8,132,735
RAC: 844
Message 8463 - Posted: 5 Jun 2024, 9:38:47 UTC - in response to Message 8459.  

Apparently there is a problem with the BOINC server after an OS upgrade to RHEL9. The server status display shows zero CMS tasks available even though there are jobs pending. This is affecting creation of new tasks, even though we do have some jobs being run.
We are working on a fix.

@Ivan: The jobs you created yesterday afternoon are coming through now.
It seems all jobs are exactly the same. Is this on purpose for testing or is that a failure?

In which way the same? They are Monte Carlo simulations, so while the config file might be the same, they should give different results due to the pseudorandom-number generators.
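
As a minimal illustration of that point (not the CMS software; the function run_job, the config keys and the seeds are all made up for this sketch), the same configuration run with different pseudorandom seeds yields different simulated output, while repeating a seed reproduces the output exactly:

```python
import random


def run_job(config, seed):
    """Toy 'Monte Carlo job': draw a few Gaussian values from one config.

    Hypothetical stand-in for a real simulation job; only the seed differs
    between jobs that share the same configuration.
    """
    rng = random.Random(seed)  # independent generator per job
    return tuple(
        rng.gauss(config["mean"], config["sigma"])
        for _ in range(config["n_events"])
    )


# One shared configuration, as in a workflow where every job gets the same file.
config = {"mean": 91.2, "sigma": 2.5, "n_events": 5}

# Three jobs, identical config, different seeds.
results = [run_job(config, seed) for seed in (1, 2, 3)]

for seed, result in zip((1, 2, 3), results):
    print(seed, [round(value, 3) for value in result])
```

Identical input files, and even identical file sizes, therefore do not by themselves imply that the jobs produce the same events.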
ID: 8463 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar

Send message
Joined: 20 Jan 15
Posts: 1138
Credit: 8,132,735
RAC: 844
Message 8464 - Posted: 5 Jun 2024, 9:43:40 UTC - in response to Message 8462.  
Last modified: 5 Jun 2024, 9:44:04 UTC

ID: 8464 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar

Send message
Joined: 20 Jan 15
Posts: 1138
Credit: 8,132,735
RAC: 844
Message 8465 - Posted: 5 Jun 2024, 9:56:51 UTC

OK, an update. You'll have noticed tasks are flowing again -- Laurence fixed the OS upgrade problem.
We also seem to have finally cracked the new storage configuration files: all production and post-production instances are returning successful completions. For the time being, I am generating workflows for quad-core VMs while we verify that this is truly so. Therefore, in your computing preferences you should set Max # CPUs = 4 (and Max # jobs <= actual CPUs/4). This applies to both CMS@Home and CMS@Home-dev. Apologies to those of you with hosts having fewer than 4 cores...
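
The "Max # jobs <= actual CPUs/4" rule above is just integer division. A quick sketch, where the function name max_jobs_for_host is invented for this example and is not a BOINC setting:

```python
def max_jobs_for_host(actual_cpus, cpus_per_task=4):
    """Largest 'Max # jobs' value that satisfies jobs <= actual CPUs / 4."""
    return actual_cpus // cpus_per_task


# A few example core counts; hosts with fewer than 4 cores end up with 0.
for cpus in (2, 4, 8, 16):
    print(f"{cpus:2d} cores -> Max # jobs = {max_jobs_for_host(cpus)}")
```

A 16-core host can therefore run up to 4 quad-core tasks at once, and an 8-core host up to 2.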
There's a discussion to be had soon as to how we proceed with multicore jobs vs. single core. There are difficulties in mixing the two, perhaps even some we haven't considered yet. I'm trying to gather my thoughts to produce an initial discussion paper.
ID: 8465 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Magic Quantum Mechanic
Avatar

Send message
Joined: 8 Apr 15
Posts: 777
Credit: 12,074,445
RAC: 5,318
Message 8466 - Posted: 5 Jun 2024, 10:52:57 UTC - in response to Message 8464.  
Last modified: 5 Jun 2024, 11:14:42 UTC

Again
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3335623

You seem to be unlucky 😢.
I'm getting reasonable amounts of credit.


Yes I never have been lucky running CMS no matter what version as far as the Credits but I do prefer them to be Valids

Of course I checked yours before I even mentioned that, since I knew you were running some and you don't "hide" your PCs (which should not even be allowed here, but we have several doing that, so I always have to check Statistics/Computers and go through them until I find one running a CMS task to compare with).

The other one I had running at the same time is still running so eventually it should finish as Valid.....as far as the Credits go.....I will be shocked if I even get 500 ......yes I am awake at 3:45am for some reason.

I will fire up a few more on the desktops later since I am just on the 16-core laptop next to me just to run things AND be able to check them when I am awake......for some strange reason the Mrs won't allow the 8 other desktops in the bedroom

edit: well as usual my dev-itis made me get up and go grab 8 more of the 4-cores here ....ok officially ......goodnight
ID: 8466 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Magic Quantum Mechanic
Avatar

Send message
Joined: 8 Apr 15
Posts: 777
Credit: 12,074,445
RAC: 5,318
Message 8467 - Posted: 5 Jun 2024, 17:51:31 UTC

ID: 8467 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1188
Credit: 857,561
RAC: 33
Message 8468 - Posted: 5 Jun 2024, 17:52:31 UTC - in response to Message 8463.  

It seems all jobs are exactly the same. Is this on purpose for testing or is that a failure?

In which way the same? They are Monte Carlo simulations, so while the config file might be the same, they should give different results due to the pseudorandom-number generators.
As I wrote, it seems so (to me) because the package files had the same byte sizes.
In the distant past (before Grafana), we could see exactly which sub-tasks a machine of mine had done or was working on.
ID: 8468 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Magic Quantum Mechanic
Avatar

Send message
Joined: 8 Apr 15
Posts: 777
Credit: 12,074,445
RAC: 5,318
Message 8469 - Posted: 6 Jun 2024, 19:14:48 UTC

ID: 8469 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar

Send message
Joined: 20 Jan 15
Posts: 1138
Credit: 8,132,735
RAC: 844
Message 8470 - Posted: 6 Jun 2024, 22:25:35 UTC - in response to Message 8469.  

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3335869

I'll have to think about finding your job logs on the data-bridge and looking for the cause of your bad luck. 😕
ID: 8470 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar

Send message
Joined: 20 Jan 15
Posts: 1138
Credit: 8,132,735
RAC: 844
Message 8471 - Posted: 6 Jun 2024, 23:19:44 UTC - in response to Message 8470.  

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3335869

I'll have to think about finding your job logs on the data-bridge and looking for the cause of your bad luck. 😕

I found one log from you, for computer 4786 from Wed Jun 5 12:58:16 2024.local, so 20:58 UTC? Currently can't match that to any task in https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=4786, but it's gone midnight now and I need some sleep...
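
For the local-to-UTC guess above, a small sketch of the conversion. The host's UTC offset is not stated anywhere in the log line, so the offsets tried here are assumptions, and local_to_utc is just a helper written for this example:

```python
from datetime import datetime, timedelta, timezone


def local_to_utc(local_text, utc_offset_hours):
    """Convert a 'Wed Jun 5 12:58:16 2024' style local timestamp to UTC.

    utc_offset_hours is an assumed offset for the host, e.g. -8 for UTC-8.
    """
    local = datetime.strptime(local_text, "%a %b %d %H:%M:%S %Y")
    aware = local.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return aware.astimezone(timezone.utc)


stamp = "Wed Jun 5 12:58:16 2024"
for offset in (-8, -7):  # assumed offsets, not taken from the log
    print(f"UTC{offset:+d}: {local_to_utc(stamp, offset):%Y-%m-%d %H:%M:%S} UTC")
```

An offset of UTC-8 gives the 20:58 UTC guessed above; if the host were on UTC-7 instead, the same local time would be 19:58 UTC, which may matter when matching the log to a task.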
ID: 8471 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Magic Quantum Mechanic
Avatar

Send message
Joined: 8 Apr 15
Posts: 777
Credit: 12,074,445
RAC: 5,318
Message 8472 - Posted: 7 Jun 2024, 0:45:41 UTC

That is my 16 core laptop
So I tried the three 8-core desktops and they all are getting Valids with less than 100 credits each


https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3335864 Computer ID 1866
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3335723 Computer ID 1848
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3335653 Computer ID 1816

Ran several on each and this mini-credit started May 3rd and never stopped and I still have more running on each.

Last night I decided to try that laptop on the production version of CMS multi and no problem with that one and I just started another there to see if it does the same.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=411596961
ID: 8472 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 28 Jul 16
Posts: 481
Credit: 394,720
RAC: 0
Message 8473 - Posted: 7 Jun 2024, 5:53:39 UTC

Complaints about credits are useless as long as the logs show a success like:
2024-06-06 16:23:23 (6020): VM Completion Message: glidein exited with return value 0.
.
.
.
2024-06-06 16:23:29 (6020): called boinc_finish(0)
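
For anyone who wants to check their own task logs for those two lines, a rough sketch follows. It only looks for the two messages quoted above; the function name task_log_shows_success is made up here, and real logs may contain other outcomes this does not cover.

```python
import re

# The two success markers quoted above, as regular expressions.
SUCCESS_PATTERNS = (
    re.compile(r"VM Completion Message: glidein exited with return value 0\."),
    re.compile(r"called boinc_finish\(0\)"),
)


def task_log_shows_success(log_text):
    """Return True only if both success markers appear in the log text."""
    return all(pattern.search(log_text) for pattern in SUCCESS_PATTERNS)


sample_log = """\
2024-06-06 16:23:23 (6020): VM Completion Message: glidein exited with return value 0.
2024-06-06 16:23:29 (6020): called boinc_finish(0)
"""

print(task_log_shows_success(sample_log))
```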


For more than a decade, credit calculation has been done by the BOINC server following a well-defined algorithm:
https://boinc.berkeley.edu/trac/wiki/CreditNew

The bad thing is that the algorithm partly uses its output to modify input parameters used for the next calculation.
This is a "per computer/per project" property, hence the results can't be compared with other computers of the same type or between projects (like dev/prod).
It's all fine as long as the computer runs series of tasks under the same load with runtimes as close together as possible.

Factors disturbing the balance (examples):
- running the same task type under low/full load
- series of long running tasks followed by series of short running tasks and vice versa
- periods of empty backend queues (since for LHC@home BOINC treats empty envelopes as very short valids)
- a mix of singlecore and multicore tasks

Once a balance has been found, e.g. during a period of an empty queue, a refilled queue can result in credits being way off (even by several orders of magnitude) and it takes lots of returned valids to slowly get back to the mean (but it happens automatically).
Users then tend to complain about low credits but never about "too much" credit.
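
The feedback effect described above can be illustrated with a deliberately simplified toy model. This is NOT the CreditNew algorithm: the per-host running average, the scale factor, the smoothing constant and all the numbers are assumptions chosen only to show how a value that feeds back into the next calculation can drift during a run of very short tasks and then take many results to recover. The direction and size of the swing in the toy say nothing about what the real server does.

```python
def simulate_credit(runtimes, reference_runtime=20000.0,
                    smoothing=0.1, credit_per_second=0.05):
    """Toy feedback loop (hypothetical, not CreditNew).

    Each task's credit is scaled by reference_runtime / host_average, and the
    host average is then updated with that task's runtime, so every result
    changes the input of the next calculation.
    """
    host_average = reference_runtime  # start in balance
    credits = []
    for runtime in runtimes:
        scale = reference_runtime / host_average
        credits.append(runtime * credit_per_second * scale)
        # Exponential moving average: the output side feeds the next input.
        host_average = (1 - smoothing) * host_average + smoothing * runtime
    return credits


# 5 normal tasks, 50 very short ones (an "empty queue" period), 100 normal tasks.
history = [20000.0] * 5 + [600.0] * 50 + [20000.0] * 100
credits = simulate_credit(history)

print(f"steady state:            {credits[0]:10.1f}")
print(f"first task after refill: {credits[55]:10.1f}")
print(f"100 tasks later:         {credits[-1]:10.1f}")
```

In this toy, the first normal task after the short-task period is far from the steady-state value, and each further returned result pulls the figure back toward it.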
ID: 8473 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Magic Quantum Mechanic
Avatar

Send message
Joined: 8 Apr 15
Posts: 777
Credit: 12,074,445
RAC: 5,318
Message 8474 - Posted: 7 Jun 2024, 19:44:42 UTC - in response to Message 8473.  


Users then tend to complain about low credits but never about "too much" credit.

I call BS and I had blocked your comments but decided to see what you said this time.
And telling ME here what members might say in production is for you to bother them with over there and never me.
As usual you are full of yourself and I was the first member that even had a Valid VB task in March 2011 and you were not there AND you must remember (probably not) that I have mentioned insane credits of over 10,000 for no reason many times and this dev site is NOT a Credits contest and the same with the BS about people trying to join here to add it to their long list of BOINC projects or HIDE their computers. (maybe you should use your mystical powers to deal with THAT)

So once again I will not read your page of info that I don't need from you as usual where you act like you are a genius here and everyone else is here to play video games.

I am only discussing this with Ivan and not you
(so go to github)

I don't need ANY information from you Stef
I'm surprised you didn't underline any of it. (ok I won't read your next........post)
ID: 8474 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
boboviz

Send message
Joined: 24 Oct 19
Posts: 164
Credit: 366,621
RAC: 330
Message 8475 - Posted: 7 Jun 2024, 20:00:13 UTC - in response to Message 8474.  

I call BS and I had blocked your comments but decided to see what you said this time.


Why not call the police and throw him in jail?
ID: 8475 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote


©2024 CERN