Message boards : CMS Application : Dip?

ivan
Message 4118 - Posted: 8 Sep 2016, 14:12:22 UTC - in response to Message 4117.  
Last modified: 8 Sep 2016, 14:27:03 UTC

I've just discovered that there is a problem with normal CRAB3 jobs, the symptom being jobs stuck in post-processing. The cause is apparently a certificate which expired yesterday. I'm not sure if this affects us, but it's a possible reason for so many jobs in PostProc.
[Edit] Maybe not. I checked the most recent jobs shown as still in PostProc and they have definitely finished, staged their files and run PostProc, so I'm still leaning towards a Dashboard glitch. [/Edit]
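
(If you want to check a certificate's validity yourself, a generic openssl one-liner does it; cert.pem below is just a placeholder for whichever certificate is suspect, not the actual file involved here:)

# Print the expiry date of a certificate in PEM format.
openssl x509 -in cert.pem -noout -enddate    # prints e.g. notAfter=Sep  7 12:00:00 2016 GMT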

ivan
Message 4119 - Posted: 9 Sep 2016, 18:01:07 UTC - in response to Message 4118.  
Last modified: 9 Sep 2016, 18:05:44 UTC

I'm still puzzled, but the weekend is not the time to bother our ex-student on the Dashboard team. There is a huge discrepancy between the actual condor server and the Dashboard results (and plots):

Dashboard:
Task: 160903_112049:ireid_crab_BPH-RunIISummer15GS-00046_T
NJobTotal: 10000 Pending: 135 Running: 0 Unknown: 28 Cancelled: 0 Success: 9083 Failed: 193 WNPostProc: 561 ToRetry: 0

Server:
[cms005@lcggwms02:~] > ./stats.sh 160903_112049:ireid_crab_BPH-RunIISummer15GS-00046_T
      7   NodeStatus = 3; /* "STATUS_SUBMITTED" */
   9801   NodeStatus = 5; /* "STATUS_DONE" */
    192   NodeStatus = 6; /* "STATUS_ERROR" */


Dashboard:
Task: 160907_211730:ireid_crab_BPH-RunIISummer15GS-00046_U
NJobTotal: 9914 Pending: 7906 Running: 153 Unknown: 4 Cancelled: 0 Success: 1027 Failed: 25 WNPostProc: 797 ToRetry: 2

Server:
[cms005@lcggwms02:~] > ./stats.sh 160907_211730:ireid_crab_BPH-RunIISummer15GS-00046_U
   6652   NodeStatus = 1; /* "STATUS_READY" */ (in pending queue)
   1258   NodeStatus = 3; /* "STATUS_SUBMITTED" */ (~258 running, ~1000 in local queue)
   2059   NodeStatus = 5; /* "STATUS_DONE" */
     31   NodeStatus = 6; /* "STATUS_ERROR" */
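
(We never see inside stats.sh, but the lines it prints match HTCondor DAGMan node-status records, so a minimal sketch that would produce the same counts might look like this; the status-file path is hypothetical, since the real layout on lcggwms02 isn't shown:)

#!/bin/bash
# stats.sh sketch: count DAGMan node states for a given task.
# The node status file contains lines like:
#   NodeStatus = 5; /* "STATUS_DONE" */
# Hypothetical path; adjust to wherever the task's status file lives.
grep 'NodeStatus =' "/home/cms005/tasks/$1/node_status" | sort | uniq -c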

ivan
Message 4120 - Posted: 15 Sep 2016, 13:02:28 UTC - in response to Message 4119.  

The Dashboard team think they have fixed the problem with bad status, but only for new batches. We'll know when the current one drains, sometime over the weekend.

ivan
Message 4125 - Posted: 18 Sep 2016, 7:19:56 UTC - in response to Message 4120.  

The Dashboard team think they have fixed the problem with bad status, but only for new batches. We'll know when the current one drains, sometime over the weekend.

Early days yet, but 500 jobs have completed and Dashboard seems to have gone back to "normal".

Rasputin42
Message 4136 - Posted: 20 Sep 2016, 17:42:13 UTC

What's up?

All down??

ivan
Message 4137 - Posted: 20 Sep 2016, 19:21:25 UTC - in response to Message 4136.  
Last modified: 20 Sep 2016, 19:22:12 UTC

What's up?

All down??

Yes, and no. There was a problem yesterday that seemingly coincided with CERN IT taking down a faulty network router to roll back its firmware to avoid a bug. Access to the conditions database was failing, and with it all the jobs. As well, the condor server stopped moving ready jobs into the execution queue, so we ran out of jobs around midnight.
Things were supposed to have been fixed, but since we weren't running jobs I couldn't tell if the frontier errors had gone away. Then tonight I had a thought: since all the required condor processes seemed to be running, maybe it was the batch that was hung, not the server. So I submitted a short batch of 1,000 jobs and, lo and behold, they all got moved into the execution queue (which normally holds ~1,000 per batch). They are being picked up and run; the count is slowly increasing as more hosts are able to pick them up. It'll be a few hours before I can tell if the frontier error has been cured, but if the jobs were failing early the errors will come in early, and some have been running almost an hour already.
NJobTotal: 1000 Pending: 970 Running: 30 Unknown: 0 Cancelled: 0 Success: 0 Failed: 0 WNPostProc: 0 ToRetry: 0
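
(For the curious, counts like these can be read straight off the submit host with generic HTCondor queries; an illustrative sketch, not the project's actual monitoring:)

# JobStatus 1 = idle (waiting in the queue), 2 = running
condor_q -totals -constraint 'JobStatus == 1'
condor_q -totals -constraint 'JobStatus == 2'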

Rasputin42
Message 4138 - Posted: 20 Sep 2016, 19:37:23 UTC

Thanks for the info.
I have been fiddling with all sorts of Linux distros.
What a mess!
Quite interesting, though.
I've kind of settled on Scientific Linux, for now.
I will see how it goes.

It has been very quiet here for the last few weeks. Nothing new going on.

ivan
Message 4140 - Posted: 20 Sep 2016, 22:10:32 UTC - in response to Message 4138.  

Thanks for the info.
I have been fiddling with all sorts of Linux distros.
What a mess!
Quite interesting, though.
I've kind of settled on Scientific Linux, for now.
I will see how it goes.

It has been very quiet here for the last few weeks. Nothing new going on.

Well, there was the usual "August effect", which sometimes baffles me as I'm used to taking summer holidays in December, with the Christmas/New-Year break added on...

Anyway, things are looking good at present. Not many jobs finished yet, but no errors so far. I'm wondering if some of the errors we'd been seeing, especially the failures to connect to the cvmfs servers, were actually due to the flaky CERN border network router?

ivan
Message 4294 - Posted: 7 Nov 2016, 22:37:51 UTC

Did anyone see a reason for the dip in running jobs tonight? I only monitor actual Condor jobs; if there are task failures, I don't see them unless they affect my host.

Rasputin42
Message 4295 - Posted: 7 Nov 2016, 23:20:00 UTC

No, but has there been any progress in finding out why only single-core operation is possible?

ivan
Message 4296 - Posted: 7 Nov 2016, 23:48:24 UTC - in response to Message 4295.  

No, but has there been any progress in finding out why only single-core operation is possible?

Oh, really? I hadn't noticed that.
I have two tasks running on my -dev host, each an 8-core task:
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=283633

My app_config.xml is
[eesridr@brphab BOINC]$ cat projects/lhcathomedev.cern.ch_vLHCathome-dev/app_config.xml
<app_config>
  <project_max_concurrent>1</project_max_concurrent>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app>
    <name>ALICE</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app>
    <name>CMS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app>
    <name>LHCb</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app>
    <name>Theory</name>
    <max_concurrent>1</max_concurrent>
  </app>
</app_config>


and my preferences in Your Account are

Separate preferences for work

Resource share: 50
Use CPU
Run only the selected applications:
    CMS Simulation: yes
    LHCb Simulation: no
    Theory Simulation: no
    ATLAS Simulation: no
    ALICE Simulation: no
    Sixtrack Simulation: no
    Benchmark Application: no
If no work for selected applications is available, accept work from other applications? no
Max # jobs: 2
Max # CPUs: 8


It seems the web preferences override app_config? Or else I'm confused this late at night...
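
(A side note, offered as an assumption rather than a tested recipe: app_config.xml can also pin the cores per task via an app_version block, independently of the "Max # CPUs" web preference. The plan_class below is a placeholder and must match whatever the project actually sends:)

<app_config>
  <app>
    <name>CMS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <!-- placeholder plan_class; replace with the one the project uses -->
  <app_version>
    <app_name>CMS</app_name>
    <plan_class>vbox64_mt</plan_class>
    <avg_ncpus>8</avg_ncpus>
  </app_version>
</app_config>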

ivan
Message 4297 - Posted: 7 Nov 2016, 23:50:12 UTC - in response to Message 4294.  

Did anyone see a reason for the dip in running jobs tonight? I only monitor actual Condor jobs; if there are task failures, I don't see them unless they affect my host.

Ah, I'm seeing errors in the production project!

ivan
Message 4298 - Posted: 7 Nov 2016, 23:52:25 UTC - in response to Message 4297.  
Last modified: 7 Nov 2016, 23:52:39 UTC

Did anyone see a reason for the dip in running jobs tonight? I only monitor actual Condor jobs; if there are task failures, I don't see them unless they affect my host.

Ah, I'm seeing errors in the production project!

May have been transient; no errors since 2021 UTC.

ivan
Message 4310 - Posted: 10 Nov 2016, 23:21:02 UTC - in response to Message 4298.  

Did anyone see a reason for the dip in running jobs tonight? I only monitor actual Condor jobs; if there are task failures, I don't see them unless they affect my host.

Ah, I'm seeing errors in the production project!

May have been transient; no errors since 2021 UTC.

I may have spoken too soon; all my -dev tasks are ending with errors now, but production seems OK. This may explain why our running task count is down.
I've emailed Laurence; I don't think it's anything I can sort out.

ivan
Message 4311 - Posted: 11 Nov 2016, 17:29:21 UTC - in response to Message 4310.  

Did anyone see a reason for the dip in running jobs tonight? I only monitor actual Condor jobs; if there are task failures, I don't see them unless they affect my host.

Ah, I'm seeing errors in the production project!

May have been transient; no errors since 2021 UTC.

I may have spoken too soon; all my -dev tasks are ending with errors now, but production seems OK. This may explain why our running task count is down.
I've emailed Laurence; I don't think it's anything I can sort out.

Laurence fixed this last night; at least I have a task running now, and another waiting...

Rasputin42
Message 4312 - Posted: 13 Nov 2016, 12:55:55 UTC
Last modified: 13 Nov 2016, 13:02:25 UTC

Now we have a DIP!

No tasks? (There are jobs!)

ivan
Message 4313 - Posted: 13 Nov 2016, 19:39:33 UTC - in response to Message 4312.  
Last modified: 13 Nov 2016, 19:40:28 UTC

Now we have a DIP!

No tasks? (There are jobs!)

I wondered when someone would notice... :-/ Late Friday afternoon we started getting stage-out errors, a syndrome you might recall has happened at least twice before this year! I contacted Laurence, who alerted CERN IT. For a while I thought it was fixed, as I saw result files appearing on the Databridge. However, I soon realised they were also disappearing from the Databridge! Files typically had a lifetime of 30 to 60 minutes before disappearing.
So there's some sort of time-out leading to a "failure" error and the files being deleted. Happily, the jobs are mostly not being reported as total failures, and most are being requeued by Condor. However, there appear to be no successes at all, according to Dashboard anyway.
So Laurence has stopped CMS@Home jobs from being allocated on vLHCathome (tho' not, AFAICT, here on -dev) pending full manpower being available tomorrow at CERN to find and fix the root problem.
As usual, I apologise for the problem; as usual, I am powerless to do anything to fix it. But "watch this space!", there are changes planned in the next week which may lead to more reliable service (and remove the need for me to continually monitor the project and submit new task batches every second day...)
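
(One could estimate the 30-to-60-minute file lifetime described above with a generic probe like the following; the URL is a placeholder, not the real Databridge endpoint:)

#!/bin/bash
# Poll a staged-out result file until it disappears, then report its lifetime.
URL="https://databridge.example.cern.ch/cms/output_1234.root"
START=$(date +%s)
while curl -sfI "$URL" > /dev/null; do
  sleep 60
done
echo "File gone after $(( ($(date +%s) - START) / 60 )) minutes"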

Crystal Pellet
Message 4314 - Posted: 13 Nov 2016, 20:00:27 UTC
Last modified: 13 Nov 2016, 20:03:32 UTC

BOINC's feeder is not running: https://lhcathome.cern.ch/vLHCathome-dev/server_status.php

At least when using the new secure project URL.
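
(A quick scripted check of the kind that catches this, as a sketch: it just scrapes the status page, and assumes stopped daemons are marked "Not Running" as BOINC status pages usually do:)

#!/bin/bash
# Warn if the feeder shows as stopped on the server status page.
PAGE="https://lhcathome.cern.ch/vLHCathome-dev/server_status.php"
if curl -s "$PAGE" | grep -i 'feeder' | grep -qi 'not running'; then
  echo "feeder appears to be down on $PAGE"
fi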

Ben Segal
Message 4315 - Posted: 14 Nov 2016, 9:26:25 UTC - in response to Message 4314.  

BOINC's feeder is not running: https://lhcathome.cern.ch/vLHCathome-dev/server_status.php

At least when using the new secure project URL.

Thanks for the heads up!
It's been restarted.

Ben and Nils

ivan
Message 4331 - Posted: 16 Nov 2016, 17:37:58 UTC

After several days of head-scratching we now seem to be back in business, though there are probably some long-running jobs that will pollute the result pool when they eventually fail. We have a new Data Bridge for the results now, but that should be transparent to the volunteers.