Dip?
Author | Message |
---|---|
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 46 |
CMS running jobs are falling. I haven't seen anything yet. Anyone? A link to a failed WU? It's way past my bed-time. See you in 6+ hours and hope it's a false alarm... |
Send message Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 0 |
Laurence sent a message yesterday about a dev server upgrade today. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 46 |
"Laurence sent a message yesterday about a dev server upgrade today." That shouldn't have started to affect us last night. In any event, the trend didn't continue. We've fewer jobs running now, but still more than this time last week. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I guess the high numbers last week were down to the fact that Theory was down. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 46 |
"I guess the high numbers last week were down to the fact that Theory was down." Exactly. Laurence, Ben, et al. opened CMS@Home up to "production" status at vLHC@Home, so users there started gobbling down jobs. For a while we nearly tripled our previous job rate; we're now settling down at about twice. BTW, the new workflow we've been running the last month or so is now approaching 10 billion events processed. I've heard back from the people who requested the jobs that the results look good, but no reports of analysis data yet. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
"I've heard back from the people who requested the jobs that the results look good," It is nice that it is actually being used, unlike Seti, where it is very doubtful that there will ever be a "result" of any kind. |
Send message Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 0 |
The CMS Jobs dashboard has shown more red than green since midnight. Is this OK? |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 46 |
"The CMS Jobs dashboard has shown more red than green since midnight. Is this OK?" We're not sure where that is coming from. Other monitoring doesn't show a problem. It went away after a while, but may be coming back. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 46 |
Has anyone spotted where the sudden increase of running CMS jobs today has come from? I checked earlier and there wasn't a spike in new sign-ups in -dev or production projects. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
There was an issue with the Theory jobs so CMS has stolen some extra resources :) EDIT: Shouldn't this be in an anti-dip thread? |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 46 |
"There was an issue with the Theory jobs so CMS has stolen some extra resources :)" Ah, OK. "EDIT: Shouldn't this be in an anti-dip thread?" Yes, well, ...but... |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 46 |
"I've heard back from the people who requested the jobs that the results look good," You may have noticed a reference to my trip to Ambleside last week for a collaboration meeting for GRIDPP, the organisation of Universities and Laboratories which runs the UK Grid computing network for the LHC experiments (plus some extra for other science projects). I gave a talk on CMS@Home, including a partial analysis of the data we have been generating. The talk is publicly available, though obviously without my running commentary on the slides. You should be able to view it as PDF or PowerPoint(TM). |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
No jobs? I am trying a new OS. Is it me, or is there a problem? |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 46 |
"No jobs?" There are jobs. I'm puzzled by the dip; I'll take a closer look. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 46 |
Puzzled by this. Anyone seen it before in StartLog?
09/07/16 21:22:19 PERMISSION DENIED to submit-side@matchsession from host 130.246.180.120 for command 442 (REQUEST_CLAIM), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 130.246.180.120,lcggwms02.gridpp.rl.ac.uk, hostname size = 1, original ip address = 130.246.180.120
09/07/16 21:22:19 Request accepted.
09/07/16 21:22:19 Remote owner is cms005@lcggwms02.gridpp.rl.ac.uk
09/07/16 21:22:19 State change: claiming protocol successful
09/07/16 21:22:19 Changing state: Unclaimed -> Claimed
09/07/16 21:22:20 PERMISSION DENIED to submit-side@matchsession from host 130.246.180.120 for command 501 (DELEGATE_GSI_CRED_STARTD), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
09/07/16 21:22:20 GLEXEC_STARTER is false, cancelling delegation
09/07/16 21:22:20 PERMISSION DENIED to submit-side@matchsession from host 130.246.180.120 for command 444 (ACTIVATE_CLAIM), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
09/07/16 21:22:20 Got activate_claim request from shadow (130.246.180.120)
09/07/16 21:22:20 Remote job ID is 1375089.0
09/07/16 21:22:20 Got universe "VANILLA" (5) from request classad
09/07/16 21:22:20 State change: claim-activation protocol successful
09/07/16 21:22:20 Changing activity: Idle -> Busy |
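For anyone wanting to check whether these denials are routine, one option is to tally them per command over a longer stretch of StartLog and compare with an older log. A minimal sketch in Python, assuming every denial line has the same shape as in the excerpt above; the function name `tally_denials` and the regular expression are illustrative and not part of Condor itself.

```python
import re
from collections import Counter

# Assumed line shape, copied from the excerpt in this thread:
# "<date> <time> PERMISSION DENIED to ... for command <id> (<NAME>), access level <LEVEL>: reason: ..."
DENIED = re.compile(
    r"PERMISSION DENIED .*? for command (?P<id>\d+) \((?P<name>\w+)\), "
    r"access level (?P<level>\w+)"
)


def tally_denials(lines):
    """Count PERMISSION DENIED entries, keyed by (command name, access level)."""
    counts = Counter()
    for line in lines:
        match = DENIED.search(line)
        if match:
            counts[(match.group("name"), match.group("level"))] += 1
    return counts
```

Passing an open log file (or any iterable of lines) to `tally_denials` returns a `Counter`; if the same commands show up at a similar rate in a log from a day when jobs ran normally, the messages are probably unrelated to the dip.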
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 46 |
No, that must be a red herring. I've got a CMS job now and it's running -- admittedly on the production project, not here on -dev. So, jobs are available. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 46 |
It might be a reporting problem again. The graphs we use for CMS Jobs show job completion down to ~40/hour, but I'm seeing twice that on the condor server; it was down a bit earlier on, but still above 60/hour. For the last three full hours:
[cms005@lcggwms02:~] > ls -l 160903_112049:ireid_crab_BPH-RunIISummer15GS-00046_T/job*.txt.gz|grep '7 20'|wc
85 765 10285
[cms005@lcggwms02:~] > ls -l 160903_112049:ireid_crab_BPH-RunIISummer15GS-00046_T/job*.txt.gz|grep '7 19'|wc
65 585 7865
[cms005@lcggwms02:~] > ls -l 160903_112049:ireid_crab_BPH-RunIISummer15GS-00046_T/job*.txt.gz|grep '7 18'|wc
68 612 8228 |
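The same hourly count can be sketched in Python by grouping the files on their modification time rather than grepping the `ls -l` text, which avoids the pattern '7 20' also matching days 17 or 27. This is only a sketch: `completions_per_hour` is a hypothetical helper, and it assumes, as the commands above do, that each finished job leaves one `job*.txt.gz` file whose modification time marks its completion.

```python
import time
from collections import Counter
from pathlib import Path


def completions_per_hour(job_dir, pattern="job*.txt.gz"):
    """Count files matching `pattern` in `job_dir`, grouped by the local
    (day of month, hour) of each file's modification time."""
    counts = Counter()
    for path in Path(job_dir).glob(pattern):
        mtime = time.localtime(path.stat().st_mtime)
        counts[(mtime.tm_mday, mtime.tm_hour)] += 1
    return counts
```

Called with the task directory, it returns a `Counter` in which, for example, the key `(7, 20)` holds the number of job files last modified between 20:00 and 20:59 on the 7th, in the local time zone of the machine running it.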
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Did you find out what caused the last dip a few days ago? |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 46 |
"Did you find out what caused the last dip a few days ago?" Yeah, that was my mistake. Somehow I misread the queue status last Friday night, after a rather stressful day, and went to bed without submitting a new batch when I should have... Woke up to find it empty, so a quick CRAB3 submission of another 10,000 jobs quickly ensued. That reminds me, I'd better submit a new batch tonight, too! |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 46 |
There is definitely something wrong with the Dashboard reporting. The headline numbers on the "CMS Jobs" page come (in)directly from the condor server and correspond with what I see when I log into the machine. The Dashboard plots are not consistent with those data. I see that the number of jobs in WNPostProc reported by Dashboard is much higher than usual, but I'm not sure if that's enough to account for the discrepancies. I still think it's a Dashboard communications problem. |
©2024 CERN