Message boards : News : Infrastructure Issues
Laurence
Project administrator, developer, tester
Joined: 12 Sep 14 · Posts: 1021 · Credit: 274,753 · RAC: 0
Message 2065 - Posted: 19 Feb 2016, 15:58:45 UTC

There is an issue with one of our servers that will stop new glideins (runs) from working. In theory the VMs should just idle until this is fixed.
Laurence
Message 2067 - Posted: 19 Feb 2016, 20:20:24 UTC - in response to Message 2065.  

The issue has been fixed. Jobs should be working again.
Rasputin42
Volunteer tester
Joined: 16 Aug 15 · Posts: 965 · Credit: 1,201,381 · RAC: 0
Message 3578 - Posted: 18 Jun 2016, 20:37:01 UTC

Something is really wrong. The number of connected machines plummeted to nearly zero in a very short time.

CMS jobs.
ivan
Volunteer moderator, project administrator, developer, tester, scientist
Joined: 20 Jan 15 · Posts: 1093 · Credit: 6,893,316 · RAC: 0
Message 3579 - Posted: 18 Jun 2016, 21:36:04 UTC - in response to Message 3578.  
Last modified: 18 Jun 2016, 21:46:22 UTC

> Something is really wrong. The number of connected machines plummeted to nearly zero in a very short time.
>
> CMS jobs.

Yes, I think we are on it* -- modulo its being nearly midnight Saturday at CERN of course... It seems some jobs that shouldn't be running at T3_CH_Volunteer have snuck into our queue and stolen my slots, possibly because they were submitted by a former developer for the CMS@Home project. It's not something I can fix myself, but it sounds like it should be trivial.
I know, famous last words...

* I.e. there are e-mails flying between RAL and CERN with proposed fixes.
Rasputin42
Message 3580 - Posted: 18 Jun 2016, 21:51:00 UTC - in response to Message 3579.  

It is just sad that whenever things seem to be running well and stable, another issue hits.
ivan
Message 3581 - Posted: 18 Jun 2016, 22:09:19 UTC - in response to Message 3580.  

> It is just sad that whenever things seem to be running well and stable, another issue hits.

I guess that's just a reflection of the complex interdependence of many steps in the workflow process -- if you have, say, 12 steps each with a 99.5% chance of success, that still works out to only a 94.2% rate overall. To make a parallel in the real world, look at Elon Musk's rate of success in landing his SpaceX boosters back on Earth; he's had several successes lately but the last one impacted hard on the landing barge because it ran out of LOX for the engines moments before touchdown. So near, yet so far...
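Ivan's back-of-the-envelope figure is easy to verify; a one-line sanity check (an illustration, not project code):

```python
# 12 independent workflow steps, each with a 99.5% chance of success.
# The overall success rate is the product of the per-step rates.
p_step = 0.995
n_steps = 12
p_overall = p_step ** n_steps
print(f"{p_overall:.1%}")  # prints 94.2%
```

With independent failures, even very reliable individual steps compound into a noticeably lower end-to-end rate, which is why long workflows feel fragile.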
That said, we're not losing results from the project, just efficiency as your hosts try -- and fail -- to run the spurious jobs. We're halfway through the queue of 3 batches of 300 jobs each; we should clear the rest in another several hours even if CERN doesn't get a fix in place tonight.

It's a learning process. :-)
Rasputin42
Message 3582 - Posted: 18 Jun 2016, 22:14:23 UTC - in response to Message 3581.  

Thanks, Ivan.
I am amazed that a system as complex as this still (kind of) works.
ivan
Message 3583 - Posted: 18 Jun 2016, 22:45:38 UTC - in response to Message 3582.  

Basically, you don't know until you try! Have you tried following just the trials and tribulations of the LHC as it strives to get better and better physics-data runs of the proton beams? Not to forget the extra uncertainties for CMS as subsystem problems intrude on our data taking. I believe it's (mostly) public if you know where to look. The wonder is that it's working at all.
BTW, in case you haven't heard, this month CMS released results from last year which suggest new physics in a resonance at ~750 GeV. (See, e.g., [1].) If I read the charts[2] correctly, we have already collected more data this year than in all of last year (and are probably going to collect at least three times more by the end of this run), so the results presented at this summer's conferences could be very interesting!

[1] https://en.wikipedia.org/wiki/750_GeV_diphoton_excess
[2] https://twiki.cern.ch/twiki/bin/view/CMSPublic/LumiPublicResults
ivan
Message 3584 - Posted: 19 Jun 2016, 2:04:21 UTC - in response to Message 3579.  

OK, the rogue jobs are all but exhausted and our statistics are starting to climb again. Not sure if we enabled defences against the "invaders" but I think I can go to bed now, happy that the worst is over.
Night-night...
ivan
Message 3593 - Posted: 21 Jun 2016, 13:53:21 UTC

The current dip in the CMS@Home job graph is "intentional" -- we've unstuck some of Hassen's old WMAgent jobs and they are now running (about 200 in total). We have a new associate trying to submit more WMAgent jobs, but so far she has not had success.
Rasputin42
Message 3595 - Posted: 22 Jun 2016, 10:33:26 UTC - in response to Message 3593.  

Is this dip still intentional?
We are down to 30 jobs running.
ivan
Message 3596 - Posted: 22 Jun 2016, 11:08:23 UTC - in response to Message 3595.  

> Is this dip still intentional?
> We are down to 30 jobs running.

No, there was a misconfiguration last night that led to jobs not being available. Unfortunately it happened around the time I went to bed, so I didn't see it until this morning. Seems like the lack of jobs was logged to BOINC as a computing error and many hosts went into back-off mode, which limits the number of jobs per day until "trust" is established again.
Pending correction of the problem, I've worked out how to tweak jobs in the Condor queue (similar to what I used to have to do), so as hosts become able to start tasks again they should receive jobs. We got down to around 10 jobs at one point; we're slowly climbing again now.
All of my running machines are in back-off; I'll try periodically to tickle them to see if their quota has increased.
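The "back-off" behaviour Ivan describes can be pictured as a per-host daily quota that shrinks on errors and recovers on successes. The sketch below is a simplified model of that trust mechanism, not BOINC's actual implementation; the constant and the halving/doubling policy are illustrative assumptions:

```python
# Simplified model of a BOINC-style per-host daily task quota.
# Assumption (not the real BOINC policy): errors halve the quota
# down to a floor of 1, successes double it back up to the cap.

QUOTA_MAX = 8  # hypothetical per-host daily task limit

def update_quota(quota, task_succeeded):
    """Shrink the quota after an errored task; let it recover after a valid one."""
    if task_succeeded:
        return min(QUOTA_MAX, quota * 2)   # "trust" rebuilds
    return max(1, quota // 2)              # errors cut the quota, floor of 1

# A run of "no jobs available" computing errors drives the quota to its floor:
q = QUOTA_MAX
for _ in range(4):
    q = update_quota(q, task_succeeded=False)
print(q)  # prints 1
```

This is why hosts that hit the misconfiguration window could only fetch a trickle of tasks afterwards: each has to complete a few tasks successfully before its quota climbs back to normal.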
Rasputin42
Message 3598 - Posted: 22 Jun 2016, 14:14:23 UTC

> to tickle them to see if their quota has increased.
If I hit the quota, I report a task manually ASAP to get two new ones.
I have to time the manual shutdown so as not to "abandon" a job.

That only works, if you have at least one task running.



©2020 CERN