Thread 'Dip?'

Author	Message
ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 3845 - Posted: 29 Jul 2016, 0:03:16 UTC The number of running CMS jobs is falling. I haven't seen anything yet. Anyone? A link to a failed WU? It's way past my bed-time. See you in 6+ hours and hope it's a false alarm...

maeax Joined: 22 Apr 16 Posts: 793 Credit: 4,220,534 RAC: 4,567	Message 3846 - Posted: 29 Jul 2016, 5:56:27 UTC Laurence sent a message yesterday about a dev server upgrade today.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 3847 - Posted: 29 Jul 2016, 7:30:27 UTC - in response to Message 3846. "Laurence sent a message yesterday about a dev server upgrade today." That shouldn't have started to affect us last night. In any event, the trend didn't continue. We've fewer jobs running now, but still more than this time last week.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 3848 - Posted: 29 Jul 2016, 7:48:34 UTC - in response to Message 3847. I guess the high numbers last week were down to the fact that Theory was down.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 3853 - Posted: 29 Jul 2016, 9:47:51 UTC - in response to Message 3848. "I guess the high numbers last week were down to the fact that Theory was down." Exactly. Laurence, Ben, et al. opened CMS@Home up to "production" status at vLHC@Home so users there started gobbling down jobs. For a while we nearly tripled our previous job rate; we're now settling down at about twice the previous rate. BTW, the new workflow we've been running the last month or so is now approaching 10 billion events processed. I've heard back from the people who requested the jobs that the results look good, but no reports of analysis data yet.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 3855 - Posted: 29 Jul 2016, 10:02:30 UTC - in response to Message 3853. "I've heard back from the people who requested the jobs that the results look good," It is nice that it is actually being used. Unlike Seti, where it is very doubtful that there will ever be a "result" of any kind.

maeax Joined: 22 Apr 16 Posts: 793 Credit: 4,220,534 RAC: 4,567	Message 3950 - Posted: 4 Aug 2016, 8:43:12 UTC The CMS Jobs dashboard has shown more red than green since midnight. Is this OK?

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 3952 - Posted: 4 Aug 2016, 13:09:11 UTC - in response to Message 3950. "The CMS Jobs dashboard has shown more red than green since midnight. Is this OK?" We're not sure where that is coming from. Other monitoring doesn't show a problem. It went away after a while, but may be coming back.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 3990 - Posted: 7 Aug 2016, 19:19:57 UTC Has anyone spotted where the sudden increase of running CMS jobs today has come from? I checked earlier and there wasn't a spike in new sign-ups in -dev or production projects.

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 3991 - Posted: 7 Aug 2016, 19:21:21 UTC - in response to Message 3990. Last modified: 7 Aug 2016, 19:22:09 UTC There was an issue with the Theory jobs so CMS has stolen some extra resources :) EDIT: Shouldn't this be in an anti-dip thread?

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 3997 - Posted: 7 Aug 2016, 22:09:21 UTC - in response to Message 3991. "There was an issue with the Theory jobs so CMS has stolen some extra resources :)" Ah, OK. "EDIT: Shouldn't this be in an anti-dip thread?" Yes, well, ...but...

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 4109 - Posted: 6 Sep 2016, 15:37:01 UTC - in response to Message 3855. "I've heard back from the people who requested the jobs that the results look good," "It is nice that it is actually being used. Unlike Seti, where it is very doubtful that there will ever be a 'result' of any kind." You may have noticed a reference to my trip to Ambleside last week for a collaboration meeting for GridPP, the organisation of universities and laboratories which runs the UK Grid computing network for the LHC experiments (plus some extra for other science projects). I gave a talk on CMS@Home, including a partial analysis of the data we have been generating. The talk is publicly available, though obviously without my running commentary on the slides. You should be able to view it as PDF or PowerPoint(TM).

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 4110 - Posted: 7 Sep 2016, 17:07:35 UTC No jobs? I am trying a new OS. Is it me, or is there a problem?

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 4111 - Posted: 7 Sep 2016, 18:39:45 UTC - in response to Message 4110. "No jobs? I am trying a new OS. Is it me, or is there a problem?" There are jobs. I'm puzzled by the dip; I'll take a closer look.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 4112 - Posted: 7 Sep 2016, 19:27:14 UTC - in response to Message 4111. Puzzled by this. Anyone seen it before in StartLog?
09/07/16 21:22:19 PERMISSION DENIED to submit-side@matchsession from host 130.246.180.120 for command 442 (REQUEST_CLAIM), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 130.246.180.120,lcggwms02.gridpp.rl.ac.uk, hostname size = 1, original ip address = 130.246.180.120
09/07/16 21:22:19 Request accepted.
09/07/16 21:22:19 Remote owner is cms005@lcggwms02.gridpp.rl.ac.uk
09/07/16 21:22:19 State change: claiming protocol successful
09/07/16 21:22:19 Changing state: Unclaimed -> Claimed
09/07/16 21:22:20 PERMISSION DENIED to submit-side@matchsession from host 130.246.180.120 for command 501 (DELEGATE_GSI_CRED_STARTD), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
09/07/16 21:22:20 GLEXEC_STARTER is false, cancelling delegation
09/07/16 21:22:20 PERMISSION DENIED to submit-side@matchsession from host 130.246.180.120 for command 444 (ACTIVATE_CLAIM), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
09/07/16 21:22:20 Got activate_claim request from shadow (130.246.180.120)
09/07/16 21:22:20 Remote job ID is 1375089.0
09/07/16 21:22:20 Got universe "VANILLA" (5) from request classad
09/07/16 21:22:20 State change: claim-activation protocol successful
09/07/16 21:22:20 Changing activity: Idle -> Busy
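For anyone wanting to check their own StartLog for the same pattern, here is a rough Python sketch that tallies the "PERMISSION DENIED" entries by host and command name. The regular expression is inferred only from the three denied lines quoted in this post, so it is an assumption about the layout and may not match other log versions; `denied_commands` and `SAMPLE` are hypothetical names used for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the three "PERMISSION DENIED" lines quoted in the post;
# it is an assumption about the layout, not a documented log format.
DENIED = re.compile(
    r"PERMISSION DENIED to (?P<peer>\S+) from host (?P<host>\S+) "
    r"for command (?P<code>\d+) \((?P<name>\w+)\)"
)


def denied_commands(log_text):
    """Count denied requests per (host, command name) pair."""
    return Counter((m["host"], m["name"]) for m in DENIED.finditer(log_text))


# A few lines copied from the post, used here as sample input.
SAMPLE = """\
09/07/16 21:22:19 PERMISSION DENIED to submit-side@matchsession from host 130.246.180.120 for command 442 (REQUEST_CLAIM), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 130.246.180.120,lcggwms02.gridpp.rl.ac.uk, hostname size = 1, original ip address = 130.246.180.120
09/07/16 21:22:19 Request accepted.
09/07/16 21:22:20 PERMISSION DENIED to submit-side@matchsession from host 130.246.180.120 for command 501 (DELEGATE_GSI_CRED_STARTD), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
09/07/16 21:22:20 PERMISSION DENIED to submit-side@matchsession from host 130.246.180.120 for command 444 (ACTIVATE_CLAIM), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
"""

if __name__ == "__main__":
    # Print one line per (host, command) pair with its count.
    for (host, name), count in sorted(denied_commands(SAMPLE).items()):
        print(host, name, count)
```

This only counts the denials; it says nothing about whether they matter, and in the log above each denied command is followed by a line showing the request going through anyway.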

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 4113 - Posted: 7 Sep 2016, 19:33:49 UTC - in response to Message 4112. Last modified: 7 Sep 2016, 19:34:55 UTC That must be a red herring. I've got a CMS job now and it's running -- admittedly on the production project, not here on -dev. So, jobs are available.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 4114 - Posted: 7 Sep 2016, 20:17:31 UTC - in response to Message 4113. Last modified: 7 Sep 2016, 20:18:15 UTC It might be a reporting problem again. The graphs we use for CMS Jobs show job completion down to ~40/hour, but I'm seeing twice that on the condor server; it was down a bit earlier on, but still above 60/hour. For the last three full hours:
[cms005@lcggwms02:~] > ls -l 160903_112049:ireid_crab_BPH-RunIISummer15GS-00046_T/job*.txt.gz|grep '7 20'|wc
85 765 10285
[cms005@lcggwms02:~] > ls -l 160903_112049:ireid_crab_BPH-RunIISummer15GS-00046_T/job*.txt.gz|grep '7 19'|wc
65 585 7865
[cms005@lcggwms02:~] > ls -l 160903_112049:ireid_crab_BPH-RunIISummer15GS-00046_T/job*.txt.gz|grep '7 18'|wc
68 612 8228
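The same per-hour count can be done without relying on a `grep` against the `ls -l` date column. The following is a minimal Python sketch, assuming that each `job*.txt.gz` file corresponds to one completed job and that the file modification time is a fair proxy for completion time. The function name `completions_per_hour` is hypothetical, and the hours are bucketed in UTC, whereas `ls -l` shows the server's local time.

```python
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path


def completions_per_hour(directory, pattern="job*.txt.gz"):
    """Count files matching `pattern` in `directory`, bucketed by mtime hour (UTC).

    Assumption: one matching file per completed job, as the shell
    one-liners in the post imply.
    """
    counts = Counter()
    for path in Path(directory).glob(pattern):
        mtime = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        counts[mtime.strftime("%Y-%m-%d %H:00")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_jobs.py <task directory>
    for hour, n in sorted(completions_per_hour(sys.argv[1]).items()):
        print(hour, n)
```

Bucketing on the full date and hour avoids the accidental matches that a pattern like '7 20' could pick up elsewhere in an `ls -l` line, for example in a file size.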

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 4115 - Posted: 7 Sep 2016, 21:06:35 UTC - in response to Message 4114. Did you find out what caused the last dip a few days ago?

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 4116 - Posted: 7 Sep 2016, 21:11:01 UTC - in response to Message 4115. Last modified: 7 Sep 2016, 21:11:30 UTC "Did you find out what caused the last dip a few days ago?" Yeah, that was my mistake. Somehow I misread the queue status last Friday night, after a rather stressful day, and went to bed without submitting a new batch when I should have... Woke up to find it empty, so a CRAB3 submission of another 10,000 jobs quickly ensued. That reminds me, I'd better submit a new batch tonight, too!

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 4117 - Posted: 8 Sep 2016, 0:00:07 UTC - in response to Message 4114. There is definitely something wrong with the Dashboard reporting. The headline numbers on the "CMS Jobs" page come (in)directly from the condor server and correspond with what I see when I log into the machine. The Dashboard plots are not consistent with those data. I see the number of jobs in WNPostProc reported by Dashboard is much higher than usual, but I'm not sure if that's enough to account for the discrepancies. I still think it's a Dashboard communications problem.

Development for LHC@home