Message boards : CMS Application : Dip?
Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

I've just discovered that there is a problem with normal CRAB3 jobs, the symptom being jobs stuck in post-processing. The cause is apparently a certificate which expired yesterday. I'm not sure if this affects us, but it's a possible reason for so many jobs in PostProc.

[Edit] Maybe not. I checked the most recent jobs shown as still in PostProc and they have definitely finished, staged their files and run PostProc, so I'm still leaning towards a Dashboard glitch. [/Edit]

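(For anyone who wants to check a certificate themselves, a one-liner along these lines prints the expiry date; the path is just a placeholder for wherever the service certificate or proxy actually lives.)

# Print the expiry date and subject of a certificate (path is hypothetical)
openssl x509 -in /path/to/service_certificate.pem -noout -enddate -subject
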
Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

I'm still puzzled, but the weekend is not the time to bother our ex-student on the Dashboard team. There is a huge discrepancy between the actual condor server and the Dashboard results (and plots):

Dashboard:
Task: 160903_112049:ireid_crab_BPH-RunIISummer15GS-00046_T
NJobTotal: 10000  Pending: 135  Running: 0  Unknown: 28  Cancelled: 0
Success: 9083  Failed: 193  WNPostProc: 561  ToRetry: 0

Server:
[cms005@lcggwms02:~] > ./stats.sh 160903_112049:ireid_crab_BPH-RunIISummer15GS-00046_T
   7 NodeStatus = 3; /* "STATUS_SUBMITTED" */
9801 NodeStatus = 5; /* "STATUS_DONE" */
 192 NodeStatus = 6; /* "STATUS_ERROR" */

Dashboard:
Task: 160907_211730:ireid_crab_BPH-RunIISummer15GS-00046_U
NJobTotal: 9914  Pending: 7906  Running: 153  Unknown: 4  Cancelled: 0
Success: 1027  Failed: 25  WNPostProc: 797  ToRetry: 2

Server:
[cms005@lcggwms02:~] > ./stats.sh 160907_211730:ireid_crab_BPH-RunIISummer15GS-00046_U
6652 NodeStatus = 1; /* "STATUS_READY" */      (in pending queue)
1258 NodeStatus = 3; /* "STATUS_SUBMITTED" */  (~258 running, ~1000 in local queue)
2059 NodeStatus = 5; /* "STATUS_DONE" */
  31 NodeStatus = 6; /* "STATUS_ERROR" */

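(For the curious: the server-side numbers come from a small script on the submit host. I haven't posted stats.sh itself, but assuming it simply tallies the DAGMan node-status file for the task, it boils down to something like the sketch below; the directory layout is a guess, not the real one on lcggwms02.)

#!/bin/bash
# Rough equivalent of a stats.sh: count DAG nodes by NodeStatus in the
# DAGMan node-status file for the task name given as the first argument.
TASK="$1"
STATUS_FILE="/path/to/crab/tasks/${TASK}/node_status.txt"   # hypothetical path
grep 'NodeStatus =' "$STATUS_FILE" | sort | uniq -c
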
Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

The Dashboard team think they have fixed the problem with bad status, but only for new batches. We'll know when the current one drains, sometime over the weekend.

Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

> The Dashboard team think they have fixed the problem with bad status, but only for new batches. We'll know when the current one drains, sometime over the weekend.

Early days yet, but 500 jobs have completed and Dashboard seems to have gone back to "normal".

Joined: 16 Aug 15  Posts: 966  Credit: 1,211,816  RAC: 0

What's up? All down??

Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

> What's up?

Yes, and no. There was a problem yesterday that seemingly coincided with CERN IT taking down a faulty network router to roll back its firmware and avoid a bug. Access to the conditions database was failing, and along with it all the jobs. As well, the condor server stopped moving ready jobs into the execution queue, so we ran out of jobs around midnight.

Things were supposed to have been fixed, but since we weren't running jobs I couldn't tell whether the Frontier errors had gone away. Then tonight I had a thought: since all the required condor processes seemed to be running, maybe it was the batch that was hung, not the server. So I submitted a short batch of 1,000 jobs, and lo and behold! they all got moved into the execution queue (which normally holds ~1,000 per batch). They are being picked up and run, and the count is slowly increasing as more hosts pick them up. It'll be a few hours before I can tell if the Frontier error has been cured; but if jobs were failing early the errors will come in early, and some have been running for almost an hour already.

NJobTotal: 1000  Pending: 970  Running: 30  Unknown: 0  Cancelled: 0
Success: 0  Failed: 0  WNPostProc: 0  ToRetry: 0

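(If you have shell access to a reasonably recent HTCondor submit host and want to see the queue state directly, checks along these lines are enough; JobStatus is the standard ClassAd attribute, with 1 = Idle, 2 = Running, 5 = Held.)

# Totals for the local schedd queue
condor_q -totals

# Or break the queue down by JobStatus (1 = Idle, 2 = Running, 5 = Held)
condor_q -autoformat JobStatus | sort -n | uniq -c
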
Joined: 16 Aug 15  Posts: 966  Credit: 1,211,816  RAC: 0

Thanks for the info. I have been fiddling with all sorts of Linux distros. What a mess! Quite interesting, though. I've kind of settled on Scientific Linux for now; I'll see how it goes. It has been very quiet here for the last few weeks. Nothing new going on.

Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

> Thanks for the info.

Well, there was the usual "August effect", which sometimes baffles me as I'm used to taking summer holidays in December, with the Christmas/New-Year break added on... Anyway, things are looking good at present. Not many jobs finished yet, but no errors so far. I'm wondering if some of the errors we'd been seeing, especially the failures to connect to the cvmfs servers, were actually due to the flaky CERN border network router?

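(For what it's worth, on a machine with the CVMFS client installed you can test connectivity to the repositories directly; a probe along these lines is the usual first check, though volunteers running the stock VM images won't normally need to do this.)

# Is the CMS repository mounted and reachable?
cvmfs_config probe cms.cern.ch

# Which Stratum-1 server and proxy is the client actually using?
cvmfs_config stat -v cms.cern.ch
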
Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

Did anyone see a reason for the dip in running jobs tonight? I only monitor actual Condor jobs; if there are task failures I don't see them unless they affect my host.

Joined: 16 Aug 15  Posts: 966  Credit: 1,211,816  RAC: 0

No, but has there been any progress in finding out why only single-core operation is possible?

Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

> No, but has there been any progress in finding out why only single-core operation is possible?

Oh, really? I hadn't noticed that. I've two tasks running on my -dev host, each an 8-core task:
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=283633

My app_config.xml is

[eesridr@brphab BOINC]$ cat projects/lhcathomedev.cern.ch_vLHCathome-dev/app_config.xml
<app_config>
  <project_max_concurrent>1</project_max_concurrent>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app>
    <name>ALICE</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app>
    <name>CMS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app>
    <name>LHCb</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app>
    <name>Theory</name>
    <max_concurrent>1</max_concurrent>
  </app>
</app_config>

and my preferences in Your Account are

Separate preferences for work
Resource share: 50
Use CPU
Run only the selected applications:
  CMS Simulation: yes
  LHCb Simulation: no
  Theory Simulation: no
  ATLAS Simulation: no
  ALICE Simulation: no
  Sixtrack Simulation: no
  Benchmark Application: no
If no work for selected applications is available, accept work from other applications? no
Max # jobs: 2
Max # CPUs: 8

Seems preferences over-ride app_config? Or else I'm confused this late at night...

Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

> Did anyone see a reason for the dip in running jobs tonight? I only monitor actual Condor jobs; if there are task failures I don't see them unless they affect my host.

Ah, I'm seeing errors in the production project!

Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

> Did anyone see a reason for the dip in running jobs tonight? I only monitor actual Condor jobs; if there are task failures I don't see them unless they affect my host.

May have been transient; no errors since 2021 UTC.

Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

> Did anyone see a reason for the dip in running jobs tonight? I only monitor actual Condor jobs; if there are task failures I don't see them unless they affect my host.

I may have spoken too soon; all my -dev tasks are ending this way now, but production seems OK. This may explain why our running task count is down. I've emailed Laurence; I don't think it's anything I can sort out.

Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

> Did anyone see a reason for the dip in running jobs tonight? I only monitor actual Condor jobs; if there are task failures I don't see them unless they affect my host.

Laurence fixed this last night; at least I have a task running now, and another waiting...

Joined: 16 Aug 15  Posts: 966  Credit: 1,211,816  RAC: 0

Now we have a DIP! No tasks? (There are jobs!)

Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

> Now we have a DIP!

I wondered when someone would notice... :-/

Late Friday afternoon we started getting stage-out errors, a syndrome you might recall has happened at least twice before this year! I contacted Laurence, who alerted CERN IT. For a while I thought it was fixed, as I saw result files appearing on the Databridge. However, I soon realised they were also disappearing from the Databridge! Files typically had a lifetime of 30 to 60 minutes before disappearing. So there's some sort of time-out leading to a "failure" error and the files being deleted. Happily, they are mostly not being reported as total failures, and most are being requeued by Condor. However, there appear to be no successes at all, according to Dashboard anyway.

So, Laurence has stopped CMS@Home jobs being allocated on vLHCathome (tho' not, AFAICT, here on -dev) pending full manpower being available tomorrow at CERN to find and fix the root problem. As usual, I apologise for the problem; as usual, I am powerless to do anything to fix it. But "watch this space!": there are changes planned in the next week which may lead to a more reliable service (and remove the need for me to continually monitor the project and submit new task batches every second day...).

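(Purely for illustration, this is the kind of loop that shows how long a file survives on a storage endpoint before it vanishes; the URL below is a placeholder, not a real Data Bridge path.)

#!/bin/bash
# Poll a (hypothetical) result-file URL every 5 minutes and log whether it
# is still present, to measure its lifetime before deletion.
URL="https://data-bridge.example.cern.ch/cms-output/some_result.tgz"   # placeholder
while true; do
    if curl -sfI -o /dev/null "$URL"; then
        echo "$(date -u '+%F %T') present"
    else
        echo "$(date -u '+%F %T') gone"
    fi
    sleep 300
done
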
Joined: 13 Feb 15  Posts: 1188  Credit: 862,257  RAC: 50

BOINC's feeder is not running: https://lhcathome.cern.ch/vLHCathome-dev/server_status.php
At least when using the new secure project URL.

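(A quick way to check the daemon list from a shell, if you don't want to load the page in a browser; the grep just pulls out the lines around the feeder entry, so the exact output depends on the page layout.)

# Fetch the server status page and show the lines mentioning the feeder
curl -s https://lhcathome.cern.ch/vLHCathome-dev/server_status.php | grep -i -B1 -A2 feeder
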
Joined: 12 Sep 14  Posts: 65  Credit: 544  RAC: 0

> BOINC's feeder is not running: https://lhcathome.cern.ch/vLHCathome-dev/server_status.php

Thanks for the heads up! It's been restarted.

Ben and Nils

Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

After several days of head-scratching we now seem to be back in business, but there are probably some long-running jobs that will pollute the result pool when they eventually fail. We have a new Data Bridge for the results now, but that should be transparent to the volunteers.