Message boards : News : Infrastructure Issues

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 2065 - Posted: 19 Feb 2016, 15:58:45 UTC

There is an issue with one of our servers that will stop new glideins (runs) from working. In theory the VMs should just idle until this is fixed.

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 2067 - Posted: 19 Feb 2016, 20:20:24 UTC - in response to Message 2065.  

The issue has been fixed. Jobs should be working again.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3578 - Posted: 18 Jun 2016, 20:37:01 UTC

Something is really wrong. The number of connected machines plummeted to nearly zero in a very short time.

CMS jobs.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 3579 - Posted: 18 Jun 2016, 21:36:04 UTC - in response to Message 3578.  
Last modified: 18 Jun 2016, 21:46:22 UTC

"Something is really wrong. The number of connected machines plummeted to nearly zero in a very short time.

CMS jobs."

Yes, I think we are on it* -- modulo its being nearly midnight Saturday at CERN of course... It seems some jobs that shouldn't be running at T3_CH_Volunteer have snuck into our queue and stolen my slots, possibly because they were submitted by a former developer for the CMS@Home project. It's not something I can fix myself, but it sounds like it should be trivial.
I know, famous last words...

* I.e. there are e-mails flying between RAL and CERN with proposed fixes.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3580 - Posted: 18 Jun 2016, 21:51:00 UTC - in response to Message 3579.  

It is just sad that whenever it seems to be doing well and staying stable, another issue hits.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 3581 - Posted: 18 Jun 2016, 22:09:19 UTC - in response to Message 3580.  

"It is just sad that whenever it seems to be doing well and staying stable, another issue hits."

I guess that's just a reflection of the complex interdependence of many steps in the workflow process -- if you have, say, 12 steps each with a 99.5% chance of success, that still works out to only a 94.2% rate overall. To make a parallel in the real world, look at Elon Musk's rate of success in landing his SpaceX boosters back on Earth; he's had several successes lately but the last one impacted hard on the landing barge because it ran out of LOX for the engines moments before touchdown. So near, yet so far...
That said, we're not losing results from the project, just efficiency as your hosts try -- and fail -- to run the spurious jobs. We're halfway through the queue of 3 batches of 300 jobs each; we should have cleared them in another several hours even if CERN doesn't get a fix in place tonight.

It's a learning process. :-)
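
As a quick check of the arithmetic in the 12-step example above, here is a minimal Python sketch. It assumes the steps fail independently of one another, which the post implies but does not state; the function name is made up for illustration.

```python
def overall_success(per_step: float, steps: int) -> float:
    """Chance that every one of `steps` independent steps succeeds."""
    # Independent successes multiply, so the chain rate is per_step ** steps.
    return per_step ** steps


rate = overall_success(0.995, 12)
print(f"{rate:.1%}")  # 94.2%
```

The same function shows how quickly things degrade as the chain gets longer: every extra step multiplies the overall rate by another factor of 0.995.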

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3582 - Posted: 18 Jun 2016, 22:14:23 UTC - in response to Message 3581.  

Thanks, Ivan.
I am amazed that a system as complex as this still kind of works.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 3583 - Posted: 18 Jun 2016, 22:45:38 UTC - in response to Message 3582.  

Basically, you don't know until you try! Have you tried following just the trials and tribulations of the LHC as it strives to get better and better physics-data runs of the proton beams? Not to forget the extra uncertainties for CMS as subsystem problems intrude on our data taking. I believe it's (mostly) public if you know where to look. The wonder is that it's working at all.
BTW, in case you haven't heard, this month CMS released results from last year which suggest new physics in a resonance at ~750 GeV. (See, e.g., [1].) If I read the charts [2] correctly, we have already collected more data this year than all of last year (and are probably going to collect at least three times more by the end of this run), so the results presented at this summer's conferences could be very interesting!

[1] https://en.wikipedia.org/wiki/750_GeV_diphoton_excess
[2] https://twiki.cern.ch/twiki/bin/view/CMSPublic/LumiPublicResults

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 3584 - Posted: 19 Jun 2016, 2:04:21 UTC - in response to Message 3579.  

OK, the rogue jobs are all but exhausted and our statistics are starting to climb again. Not sure if we enabled defences against the "invaders" but I think I can go to bed now, happy that the worst is over.
Night-night...

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 3593 - Posted: 21 Jun 2016, 13:53:21 UTC

The current dip in the CMS@Home job graph is "intentional" -- we've unstuck some of Hassen's old WMAgent jobs and they are now running (about 200 in total). We have a new associate trying to submit more WMAgent jobs but up until now she's not had success.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3595 - Posted: 22 Jun 2016, 10:33:26 UTC - in response to Message 3593.  

Is this dip still intentional?
We are down to 30 jobs running.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 3596 - Posted: 22 Jun 2016, 11:08:23 UTC - in response to Message 3595.  

"Is this dip still intentional?
We are down to 30 jobs running."

No, there was a misconfiguration last night that led to jobs not being available. Unfortunately it happened around the time I went to bed, so I didn't see it until this morning. Seems like the lack of jobs was logged to BOINC as a computing error and many hosts went into back-off mode, which limits the number of jobs per day until "trust" is established again.
Pending correction of the problem, I've worked out how to tweak jobs in the Condor queue (similar to what I used to have to do) so that, as hosts become able to start tasks again, they should receive jobs. We got down to around 10 jobs at one point; we're slowly climbing again now.
All of my running machines are in back-off; I'll try periodically to tickle them to see if their quota has increased.
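
The back-off behaviour described above can be pictured with a toy model. This is only a sketch under stated assumptions: the post says errors limit a host's jobs per day until "trust" is established again, but the specific rule used here (drop by one on an error, double on a success, capped at a maximum) and all the numbers are illustrative. They are not taken from this thread, and the function names are hypothetical.

```python
def update_quota(quota: int, succeeded: bool, max_quota: int) -> int:
    """Return a host's new daily job quota after one reported result.

    Assumed rule (illustrative only): an error lowers the quota by one,
    never below 1; a success doubles it, never above max_quota.
    """
    if succeeded:
        return min(quota * 2, max_quota)
    return max(quota - 1, 1)


def simulate(outcomes: list[bool], start: int, max_quota: int) -> list[int]:
    """Apply update_quota to a sequence of results and record each quota."""
    quota = start
    history = []
    for ok in outcomes:
        quota = update_quota(quota, ok, max_quota)
        history.append(quota)
    return history


# Three errors in a row, then three good results.
print(simulate([False] * 3 + [True] * 3, start=4, max_quota=8))
# [3, 2, 1, 2, 4, 8]
```

Under these assumptions a host that hits a run of errors is throttled quickly but never locked out completely, and a few good results are enough to bring it back to full quota.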

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3598 - Posted: 22 Jun 2016, 14:14:23 UTC

"...to tickle them to see if their quota has increased."

If I hit the quota, I report a task manually ASAP to get two new ones.
I have to time the manual shutdown so as not to "abandon" a job.

That only works if you have at least one task running.