Infrastructure Issues
Author | Message |
---|---|
Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
There is an issue with one of our servers that will stop new glideins (runs) from working. In theory the VMs should just idle until this is fixed. |
Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
The issue has been fixed. Jobs should be working again. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Something is really wrong. The number of connected machines plummeted to nearly zero in a very short time. CMS jobs. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,181,211 RAC: 2,023 |
"Something is really wrong. The number of connected machines plummeted to nearly zero in a very short time." Yes, I think we are on it* -- modulo its being nearly midnight Saturday at CERN of course... It seems some jobs that shouldn't be running at T3_CH_Volunteer have snuck into our queue and stolen my slots, possibly because they were submitted by a former developer for the CMS@Home project. It's not something I can fix myself, but it sounds like it should be trivial. I know, famous last words... * I.e. there are e-mails flying between RAL and CERN with proposed fixes. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
It is just sad that, whenever things seem to be going well and stable, another issue hits. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,181,211 RAC: 2,023 |
"It is just sad that, whenever things seem to be going well and stable, another issue hits." I guess that's just a reflection of the complex interdependence of many steps in the workflow process -- if you have, say, 12 steps each with a 99.5% chance of success, that still works out to only a 94.2% rate overall. To make a parallel in the real world, look at Elon Musk's rate of success in landing his SpaceX boosters back on Earth; he's had several successes lately but the last one impacted hard on the landing barge because it ran out of LOX for the engines moments before touchdown. So near, yet so far... That said, we're not losing results from the project, just efficiency as your hosts try -- and fail -- to run the spurious jobs. We're halfway through the queue of 3 batches of 300 jobs each; we should have cleared them in another several hours even if CERN doesn't get a fix in place tonight. It's a learning process. :-) |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks, Ivan. I am amazed that a system this complex still kind of works. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,181,211 RAC: 2,023 |
Basically, you don't know until you try! Have you tried following just the trials and tribulations of the LHC as it strives to get better and better physics-data runs of the proton beams? Not to forget the extra uncertainties for CMS as subsystem problems intrude on our data taking. I believe it's (mostly) public if you know where to look. The wonder is that it's working at all. BTW, in case you haven't heard, this month CMS released results from last year which suggest new physics in a resonance at ~750 GeV. (See, e.g., [1]). If I read the charts[2] correctly, we have already collected more data this year than all of last year (and are probably going to collect at least three times more by the end of this run), so the results presented at this summer's conferences could be very interesting! [1] https://en.wikipedia.org/wiki/750_GeV_diphoton_excess [2] https://twiki.cern.ch/twiki/bin/view/CMSPublic/LumiPublicResults |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,181,211 RAC: 2,023 |
OK, the rogue jobs are all but exhausted and our statistics are starting to climb again. Not sure if we enabled defences against the "invaders" but I think I can go to bed now, happy that the worst is over. Night-night... |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,181,211 RAC: 2,023 |
The current dip in the CMS@Home job graph is "intentional" -- we've unstuck some of Hassen's old WMAgent jobs and they are now running (about 200 in total). We have a new associate trying to submit more WMAgent jobs, but so far she has not had any success. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Is this dip still intentional? We are down to 30 jobs running. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,181,211 RAC: 2,023 |
"Is this dip still intentional?" No, there was a misconfiguration last night that led to jobs not being available. Unfortunately it happened around the time I went to bed, so I didn't see it until this morning. It seems the lack of jobs was logged to BOINC as a computing error and many hosts went into back-off mode, which limits the number of jobs per day until "trust" is established again. Pending correction of the problem, I've worked out how to tweak jobs in the Condor queue (similar to what I used to have to do) so that, as hosts become able to start tasks again, they should receive jobs. We got down to around 10 jobs at one point; we're slowly climbing again now. All of my running machines are in back-off; I'll try periodically to tickle them to see if their quota has increased. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
"... to tickle them to see if their quota has increased." If I hit the quota, I report a task manually ASAP to get two new ones. I have to time the manual shutdown so as not to "abandon" a job. That only works if you have at least one task running. |
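
For anyone wanting to check the figure quoted earlier in the thread (12 workflow steps at 99.5% each giving about 94.2% overall), here is a minimal Python sketch. It assumes the steps succeed or fail independently, which is a simplification, and the function name `overall_success_rate` is made up for illustration rather than taken from any CMS@Home tooling.

```python
def overall_success_rate(per_step: float, steps: int) -> float:
    """Chance that every step succeeds, assuming the steps are independent."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    # Independent events: multiply the per-step probability 'steps' times.
    return per_step ** steps


# Figures from the thread: 12 steps, each with a 99.5% chance of success.
rate = overall_success_rate(0.995, 12)
print(f"{rate:.1%}")  # prints 94.2%
```

The same function shows how quickly things degrade as a chain grows: with more steps, or a slightly lower per-step rate, the overall figure drops noticeably.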
©2024 CERN