Message boards : CMS Application : Dip?
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 728
Message 3845 - Posted: 29 Jul 2016, 0:03:16 UTC

The number of running CMS jobs is falling. I haven't spotted a cause yet. Has anyone seen anything? A link to a failed WU?
It's way past my bed-time. See you in 6+ hours and hope it's a false alarm...

maeax
Joined: 22 Apr 16
Posts: 677
Credit: 2,002,766
RAC: 2
Message 3846 - Posted: 29 Jul 2016, 5:56:27 UTC

Laurence sent a message yesterday about a dev server upgrade today.

ivan
Message 3847 - Posted: 29 Jul 2016, 7:30:27 UTC - in response to Message 3846.  

Laurence sent a message yesterday about a dev server upgrade today.

That shouldn't have started to affect us last night.
In any event, the trend didn't continue. We have fewer jobs running now, but still more than this time last week.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3848 - Posted: 29 Jul 2016, 7:48:34 UTC - in response to Message 3847.  

I guess the high numbers last week were down to the fact that Theory was down.

ivan
Message 3853 - Posted: 29 Jul 2016, 9:47:51 UTC - in response to Message 3848.  

I guess the high numbers last week were down to the fact that Theory was down.

Exactly. Laurence, Ben, et al. opened CMS@Home up to "production" status at vLHC@Home, so users there started gobbling down jobs. For a while we nearly tripled our previous job rate; we're now settling down at about double.

BTW, the new workflow we've been running the last month or so is now approaching 10 billion events processed. I've heard back from the people who requested the jobs that the results look good, but no reports of analysis data yet.

Rasputin42
Message 3855 - Posted: 29 Jul 2016, 10:02:30 UTC - in response to Message 3853.  

I've heard back from the people who requested the jobs that the results look good,


It is nice that it is actually being used, unlike SETI, where it is very doubtful that there will ever be a "result" of any kind.

maeax
Message 3950 - Posted: 4 Aug 2016, 8:43:12 UTC

The CMS Jobs dashboard has shown more red than green since midnight. Is this OK?

ivan
Message 3952 - Posted: 4 Aug 2016, 13:09:11 UTC - in response to Message 3950.  

The CMS Jobs dashboard has shown more red than green since midnight. Is this OK?

We're not sure where that is coming from. Other monitoring doesn't show a problem. It went away after a while, but may be coming back.

ivan
Message 3990 - Posted: 7 Aug 2016, 19:19:57 UTC

Has anyone spotted where the sudden increase of running CMS jobs today has come from? I checked earlier and there wasn't a spike in new sign-ups in -dev or production projects.

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 3991 - Posted: 7 Aug 2016, 19:21:21 UTC - in response to Message 3990.  
Last modified: 7 Aug 2016, 19:22:09 UTC

There was an issue with the Theory jobs so CMS has stolen some extra resources :)

EDIT: Shouldn't this be in an anti-dip thread?

ivan
Message 3997 - Posted: 7 Aug 2016, 22:09:21 UTC - in response to Message 3991.  

There was an issue with the Theory jobs so CMS has stolen some extra resources :)
Ah, OK.
EDIT: Shouldn't this be in an anti-dip thread?

Yes, well, ...but...

ivan
Message 4109 - Posted: 6 Sep 2016, 15:37:01 UTC - in response to Message 3855.  

I've heard back from the people who requested the jobs that the results look good,


It is nice that it is actually being used, unlike SETI, where it is very doubtful that there will ever be a "result" of any kind.

You may have noticed a reference to my trip to Ambleside last week for a collaboration meeting of GridPP, the organisation of universities and laboratories which runs the UK Grid computing network for the LHC experiments (plus some extra for other science projects).
I gave a talk on CMS@Home, including a partial analysis of the data we have been generating. The talk is publicly available, though obviously without my running commentary on the slides. You should be able to view it as PDF or PowerPoint(TM).

Rasputin42
Message 4110 - Posted: 7 Sep 2016, 17:07:35 UTC

No jobs?

I am trying a new OS.

Is it me, or is there a problem?

ivan
Message 4111 - Posted: 7 Sep 2016, 18:39:45 UTC - in response to Message 4110.  

No jobs?

I am trying a new OS.

Is it me, or is there a problem?

There are jobs. I'm puzzled by the dip; I'll take a closer look.

ivan
Message 4112 - Posted: 7 Sep 2016, 19:27:14 UTC - in response to Message 4111.  

Puzzled by this. Anyone seen it before in StartLog?
09/07/16 21:22:19 PERMISSION DENIED to submit-side@matchsession from host 130.246.180.120 for command 442 (REQUEST_CLAIM), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 130.246.180.120,lcggwms02.gridpp.rl.ac.uk, hostname size = 1, original ip address = 130.246.180.120
09/07/16 21:22:19 Request accepted.
09/07/16 21:22:19 Remote owner is cms005@lcggwms02.gridpp.rl.ac.uk
09/07/16 21:22:19 State change: claiming protocol successful
09/07/16 21:22:19 Changing state: Unclaimed -> Claimed
09/07/16 21:22:20 PERMISSION DENIED to submit-side@matchsession from host 130.246.180.120 for command 501 (DELEGATE_GSI_CRED_STARTD), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
09/07/16 21:22:20 GLEXEC_STARTER is false, cancelling delegation
09/07/16 21:22:20 PERMISSION DENIED to submit-side@matchsession from host 130.246.180.120 for command 444 (ACTIVATE_CLAIM), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
09/07/16 21:22:20 Got activate_claim request from shadow (130.246.180.120)
09/07/16 21:22:20 Remote job ID is 1375089.0
09/07/16 21:22:20 Got universe "VANILLA" (5) from request classad
09/07/16 21:22:20 State change: claim-activation protocol successful
09/07/16 21:22:20 Changing activity: Idle -> Busy
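
For anyone who wants to triage a StartLog excerpt like the one above, here is a rough Python sketch. It is not part of HTCondor: the function name `summarise_startlog` and the regular expressions are assumptions based only on the lines quoted in this post, and other log formats may need different patterns. It tallies the PERMISSION DENIED messages by command name and lists the state and activity transitions, so you can see at a glance whether a claim went through despite the denials.

```python
import re
from collections import Counter

# Timestamp prefix as it appears in the excerpt, e.g. "09/07/16 21:22:19 ".
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")
# Command number and name inside a PERMISSION DENIED message.
DENIED_RE = re.compile(r"PERMISSION DENIED .* for command (\d+) \((\w+)\)")
# Transitions such as "Changing state: Unclaimed -> Claimed".
CHANGE_RE = re.compile(r"Changing (state|activity): (\w+) -> (\w+)")


def summarise_startlog(lines):
    """Return (denials per command name, list of (kind, old, new) transitions)."""
    denied = Counter()
    transitions = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank or unrecognised lines
        message = match.group(2)
        denial = DENIED_RE.search(message)
        if denial:
            denied[denial.group(2)] += 1
            continue
        change = CHANGE_RE.search(message)
        if change:
            transitions.append(change.groups())
    return denied, transitions


# Five lines copied from the excerpt above.
SAMPLE = """
09/07/16 21:22:19 PERMISSION DENIED to submit-side@matchsession from host 130.246.180.120 for command 442 (REQUEST_CLAIM), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 130.246.180.120,lcggwms02.gridpp.rl.ac.uk, hostname size = 1, original ip address = 130.246.180.120
09/07/16 21:22:19 State change: claiming protocol successful
09/07/16 21:22:19 Changing state: Unclaimed -> Claimed
09/07/16 21:22:20 PERMISSION DENIED to submit-side@matchsession from host 130.246.180.120 for command 444 (ACTIVATE_CLAIM), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
09/07/16 21:22:20 Changing activity: Idle -> Busy
"""

if __name__ == "__main__":
    denied, transitions = summarise_startlog(SAMPLE.splitlines())
    for command, count in denied.items():
        print(f"denied {command}: {count}")
    for kind, old, new in transitions:
        print(f"{kind}: {old} -> {new}")
```

On those five sample lines the sketch reports one denial each for REQUEST_CLAIM and ACTIVATE_CLAIM alongside both transitions (Unclaimed to Claimed, Idle to Busy). In other words, the slot still ends up claimed and busy, which is why the denials alone don't prove that jobs are failing to start.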

ivan
Message 4113 - Posted: 7 Sep 2016, 19:33:49 UTC - in response to Message 4112.  
Last modified: 7 Sep 2016, 19:34:55 UTC

No, that must be a red herring. I've got a CMS job now and it's running -- admittedly on the production project, not here on -dev. So, jobs are available.

ivan
Message 4114 - Posted: 7 Sep 2016, 20:17:31 UTC - in response to Message 4113.  
Last modified: 7 Sep 2016, 20:18:15 UTC

It might be a reporting problem again. The graphs we use for CMS Jobs show job completion down to ~40/hour, but I'm seeing about twice that on the condor server. The rate there was a bit lower earlier on, but still above 60/hour. For the last three full hours:
[cms005@lcggwms02:~] > ls -l 160903_112049:ireid_crab_BPH-RunIISummer15GS-00046_T/job*.txt.gz|grep '7 20'|wc
85 765 10285
[cms005@lcggwms02:~] > ls -l 160903_112049:ireid_crab_BPH-RunIISummer15GS-00046_T/job*.txt.gz|grep '7 19'|wc
65 585 7865
[cms005@lcggwms02:~] > ls -l 160903_112049:ireid_crab_BPH-RunIISummer15GS-00046_T/job*.txt.gz|grep '7 18'|wc
68 612 8228
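
A note on reading those numbers: `wc` prints lines, words and characters, so the first column is the job count (85, 65 and 68 per hour), and each `ls -l` line has nine fields, hence 765 = 85 × 9 words. The grep pattern matches the day and hour as text, so '7 20' would also match files dated the 17th or 27th. As an alternative, here is a minimal Python sketch that buckets the same files by the hour of their modification time. The function name `completions_per_hour` is hypothetical, and it assumes that a file's mtime marks when the job finished and that UTC buckets are acceptable (the `ls` output above is in the server's local time).

```python
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path


def completions_per_hour(directory, pattern="job*.txt.gz"):
    """Count files matching `pattern` in `directory`, keyed by the UTC hour of their mtime."""
    counts = Counter()
    for path in Path(directory).glob(pattern):
        # Assumption: the file's modification time is when the job's output landed.
        mtime = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        counts[mtime.strftime("%Y-%m-%d %H:00")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python completions.py <task directory>
    for hour, count in sorted(completions_per_hour(sys.argv[1]).items()):
        print(hour, count)
```

Keying on the full date and hour avoids the ambiguity of matching substrings in the listing, and the per-hour totals can then be compared directly against what the Dashboard plots show.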

Rasputin42
Message 4115 - Posted: 7 Sep 2016, 21:06:35 UTC - in response to Message 4114.  

Did you find out what caused the last dip a few days ago?

ivan
Message 4116 - Posted: 7 Sep 2016, 21:11:01 UTC - in response to Message 4115.  
Last modified: 7 Sep 2016, 21:11:30 UTC

Did you find out what caused the last dip a few days ago?

Yeah, that was my mistake. Somehow I misread the queue status last Friday night, after a rather stressful day, and went to bed without submitting a new batch when I should have... Woke up to find it empty, so a quick CRAB3 submission of another 10,000 jobs quickly ensued. That reminds me, I'd better submit a new batch tonight, too!

ivan
Message 4117 - Posted: 8 Sep 2016, 0:00:07 UTC - in response to Message 4114.  

There is definitely something wrong with the Dashboard reporting. The headline numbers on the "CMS Jobs" page come (in)directly from the condor server and correspond with what I see when I log into the machine. The Dashboard plots are not consistent with those data. I see that the number of jobs in WNPostProc reported by Dashboard is much higher than usual, but I'm not sure if that's enough to account for the discrepancies.
I still think it's a Dashboard communications problem.


©2024 CERN