Message boards : CMS Application : Dip?
Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

I've just discovered that there is a problem with normal CRAB3 jobs, the symptom being jobs stuck in post-processing. The cause is apparently a certificate which expired yesterday. I'm not sure if this affects us, but it's a possible reason for so many jobs in PostProc.

[Edit] Maybe not. I checked the most recent jobs shown as still in PostProc and they have definitely finished, staged their files and run PostProc, so I'm still leaning towards a Dashboard glitch. [/Edit]

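(For anyone who wants to check a certificate themselves, a one-liner along these lines prints the expiry date; the path is just a placeholder for wherever the service certificate or proxy actually lives.)

# Print the expiry date and subject of a certificate (path is hypothetical)
openssl x509 -in /path/to/service_certificate.pem -noout -enddate -subject
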
Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

I'm still puzzled, but the weekend is not the time to bother our ex-student on the Dashboard team. There is a huge discrepancy between the actual condor server and the Dashboard results (and plots):

Dashboard:
Task: 160903_112049:ireid_crab_BPH-RunIISummer15GS-00046_T
NJobTotal: 10000  Pending: 135  Running: 0  Unknown: 28  Cancelled: 0
Success: 9083  Failed: 193  WNPostProc: 561  ToRetry: 0

Server:
[cms005@lcggwms02:~] > ./stats.sh 160903_112049:ireid_crab_BPH-RunIISummer15GS-00046_T
   7 NodeStatus = 3; /* "STATUS_SUBMITTED" */
9801 NodeStatus = 5; /* "STATUS_DONE" */
 192 NodeStatus = 6; /* "STATUS_ERROR" */

Dashboard:
Task: 160907_211730:ireid_crab_BPH-RunIISummer15GS-00046_U
NJobTotal: 9914  Pending: 7906  Running: 153  Unknown: 4  Cancelled: 0
Success: 1027  Failed: 25  WNPostProc: 797  ToRetry: 2

Server:
[cms005@lcggwms02:~] > ./stats.sh 160907_211730:ireid_crab_BPH-RunIISummer15GS-00046_U
6652 NodeStatus = 1; /* "STATUS_READY" */      (in pending queue)
1258 NodeStatus = 3; /* "STATUS_SUBMITTED" */  (~258 running, ~1000 in local queue)
2059 NodeStatus = 5; /* "STATUS_DONE" */
  31 NodeStatus = 6; /* "STATUS_ERROR" */

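(For the curious: the server-side numbers come from a small script on the submit host. I haven't posted stats.sh itself, but assuming it simply tallies the DAGMan node-status file for the task, it boils down to something like the sketch below; the directory layout is a guess, not the real one on lcggwms02.)

#!/bin/bash
# Rough equivalent of a stats.sh: count DAG nodes by NodeStatus in the
# DAGMan node-status file for the task name given as the first argument.
TASK="$1"
STATUS_FILE="/path/to/crab/tasks/${TASK}/node_status.txt"   # hypothetical path
grep 'NodeStatus =' "$STATUS_FILE" | sort | uniq -c
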
Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

The Dashboard team think they have fixed the problem with bad status, but only for new batches. We'll know when the current one drains, sometime over the weekend.

Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

> The Dashboard team think they have fixed the problem with bad status, but only for new batches. We'll know when the current one drains, sometime over the weekend.

Early days yet, but 500 jobs have completed and Dashboard seems to have gone back to "normal".

Joined: 16 Aug 15  Posts: 966  Credit: 1,211,816  RAC: 0

What's up? All down??

Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

> What's up?

Yes, and no. There was a problem yesterday that seemingly coincided with CERN IT taking down a faulty network router to roll back its firmware and avoid a bug. Access to the conditions database was failing, and along with it all the jobs. As well, the condor server stopped moving ready jobs into the execution queue, so we ran out of jobs around midnight.

Things were supposed to have been fixed, but since we weren't running jobs I couldn't tell whether the Frontier errors had gone away. Then tonight I had a thought: since all the required condor processes seemed to be running, maybe it was the batch that was hung, not the server. So I submitted a short batch of 1,000 jobs, and lo and behold! they all got moved into the execution queue (which normally holds ~1,000 per batch). They are being picked up and run, and the count is slowly increasing as more hosts pick them up. It'll be a few hours before I can tell if the Frontier error has been cured; but if jobs were failing early the errors will come in early, and some have been running for almost an hour already.

NJobTotal: 1000  Pending: 970  Running: 30  Unknown: 0  Cancelled: 0
Success: 0  Failed: 0  WNPostProc: 0  ToRetry: 0

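(If you have shell access to a reasonably recent HTCondor submit host and want to see the queue state directly, checks along these lines are enough; JobStatus is the standard ClassAd attribute, with 1 = Idle, 2 = Running, 5 = Held.)

# Totals for the local schedd queue
condor_q -totals

# Or break the queue down by JobStatus (1 = Idle, 2 = Running, 5 = Held)
condor_q -autoformat JobStatus | sort -n | uniq -c
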
Joined: 16 Aug 15  Posts: 966  Credit: 1,211,816  RAC: 0

Thanks for the info. I have been fiddling with all sorts of Linux distros. What a mess! Quite interesting, though. I've kind of settled on Scientific Linux for now; I'll see how it goes. It has been very quiet here for the last few weeks. Nothing new going on.

Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

> Thanks for the info.

Well, there was the usual "August effect", which sometimes baffles me as I'm used to taking summer holidays in December, with the Christmas/New-Year break added on... Anyway, things are looking good at present. Not many jobs finished yet, but no errors so far. I'm wondering if some of the errors we'd been seeing, especially the failures to connect to the cvmfs servers, were actually due to the flaky CERN border network router?

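(For what it's worth, on a machine with the CVMFS client installed you can test connectivity to the repositories directly; a probe along these lines is the usual first check, though volunteers running the stock VM images won't normally need to do this.)

# Is the CMS repository mounted and reachable?
cvmfs_config probe cms.cern.ch

# Which Stratum-1 server and proxy is the client actually using?
cvmfs_config stat -v cms.cern.ch
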
Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

Did anyone see a reason for the dip in running jobs tonight? I only monitor actual Condor jobs; if there are task failures I don't see them unless they affect my host.

Joined: 16 Aug 15  Posts: 966  Credit: 1,211,816  RAC: 0

No, but has there been any progress in finding out why only single-core operation is possible?

Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

> No, but has there been any progress in finding out why only single-core operation is possible?

Oh, really? I hadn't noticed that. I've two tasks running on my -dev host, each an 8-core task:
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=283633

My app_config.xml is

[eesridr@brphab BOINC]$ cat projects/lhcathomedev.cern.ch_vLHCathome-dev/app_config.xml
<app_config>
  <project_max_concurrent>1</project_max_concurrent>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app>
    <name>ALICE</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app>
    <name>CMS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app>
    <name>LHCb</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app>
    <name>Theory</name>
    <max_concurrent>1</max_concurrent>
  </app>
</app_config>

and my preferences in Your Account are

Separate preferences for work
Resource share: 50
Use CPU
Run only the selected applications:
  CMS Simulation: yes
  LHCb Simulation: no
  Theory Simulation: no
  ATLAS Simulation: no
  ALICE Simulation: no
  Sixtrack Simulation: no
  Benchmark Application: no
If no work for selected applications is available, accept work from other applications? no
Max # jobs: 2
Max # CPUs: 8

Seems preferences over-ride app_config? Or else I'm confused this late at night...

Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

> Did anyone see a reason for the dip in running jobs tonight? I only monitor actual Condor jobs; if there are task failures I don't see them unless they affect my host.

Ah, I'm seeing errors in the production project!

Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

> Did anyone see a reason for the dip in running jobs tonight? I only monitor actual Condor jobs; if there are task failures I don't see them unless they affect my host.

May have been transient; no errors since 2021 UTC.

Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

> Did anyone see a reason for the dip in running jobs tonight? I only monitor actual Condor jobs; if there are task failures I don't see them unless they affect my host.

I may have spoken too soon; all my -dev tasks are ending this way now, but production seems OK. This may explain why our running task count is down. I've emailed Laurence; I don't think it's anything I can sort out.

Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

> Did anyone see a reason for the dip in running jobs tonight? I only monitor actual Condor jobs; if there are task failures I don't see them unless they affect my host.

Laurence fixed this last night; at least I have a task running now, and another waiting...

Joined: 16 Aug 15  Posts: 966  Credit: 1,211,816  RAC: 0

Now we have a DIP! No tasks? (There are jobs!)

Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

> Now we have a DIP!

I wondered when someone would notice... :-/

Late Friday afternoon we started getting stage-out errors, a syndrome you might recall has happened at least twice before this year! I contacted Laurence, who alerted CERN IT. For a while I thought it was fixed, as I saw result files appearing on the Databridge. However, I soon realised they were also disappearing from the Databridge! Files typically had a lifetime of 30 to 60 minutes before disappearing. So there's some sort of time-out leading to a "failure" error and the files being deleted. Happily, they are mostly not being reported as total failures, and most are being requeued by Condor. However, there appear to be no successes at all, according to Dashboard anyway.

So, Laurence has stopped CMS@Home jobs being allocated on vLHCathome (tho' not, AFAICT, here on -dev) pending full manpower being available tomorrow at CERN to find and fix the root problem. As usual, I apologise for the problem; as usual, I am powerless to do anything to fix it. But "watch this space!": there are changes planned in the next week which may lead to a more reliable service (and remove the need for me to continually monitor the project and submit new task batches every second day...).

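(Purely for illustration, this is the kind of loop that shows how long a file survives on a storage endpoint before it vanishes; the URL below is a placeholder, not a real Data Bridge path.)

#!/bin/bash
# Poll a (hypothetical) result-file URL every 5 minutes and log whether it
# is still present, to measure its lifetime before deletion.
URL="https://data-bridge.example.cern.ch/cms-output/some_result.tgz"   # placeholder
while true; do
    if curl -sfI -o /dev/null "$URL"; then
        echo "$(date -u '+%F %T') present"
    else
        echo "$(date -u '+%F %T') gone"
    fi
    sleep 300
done
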
Joined: 13 Feb 15  Posts: 1188  Credit: 862,257  RAC: 50

BOINC's feeder is not running: https://lhcathome.cern.ch/vLHCathome-dev/server_status.php
At least when using the new secure project URL.

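(A quick way to check the daemon list from a shell, if you don't want to load the page in a browser; the grep just pulls out the lines around the feeder entry, so the exact output depends on the page layout.)

# Fetch the server status page and show the lines mentioning the feeder
curl -s https://lhcathome.cern.ch/vLHCathome-dev/server_status.php | grep -i -B1 -A2 feeder
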
Joined: 12 Sep 14  Posts: 65  Credit: 544  RAC: 0

> BOINC's feeder is not running: https://lhcathome.cern.ch/vLHCathome-dev/server_status.php

Thanks for the heads up! It's been restarted.

Ben and Nils

Joined: 20 Jan 15  Posts: 1139  Credit: 8,310,612  RAC: 245

After several days of head-scratching we now seem to be back in business, but there are probably some long-running jobs that will pollute the result pool when they eventually fail. We have a new Data Bridge for the results now, but that should be transparent to the volunteers.