Message boards : Number crunching : Expect errors eventually
Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,301
The upsurge in failed jobs on the CMS Jobs page is a bit worrying, but I haven't seen it in other data yet. I'll dig around at RAL.

[Later] The log data doesn't support the graph; I can only presume a miscommunication between Condor and Dashboard. Failures in the last six hours, covering the blip:

tail 151102_084842:ireid_crab_CMS_at_Home_TTbar_50ev_3/job_out*|grep 'sh FINISHING'|sort -k 7 |grep -v 'status 0'
...
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 14:19:07 GMT 2015 on 217-481-30189 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 14:46:45 GMT 2015 on 219-351-30055 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 15:36:43 GMT 2015 on 6-673-31181 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 15:37:09 GMT 2015 on 285-635-1234 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 15:43:54 GMT 2015 on 251-716-20080 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 15:56:38 GMT 2015 on 6-662-8849 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:10:56 GMT 2015 on 6-654-10796 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:19:19 GMT 2015 on 306-782-14078 with (short) status 65 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:33:04 GMT 2015 on 246-777-17398 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:37:00 GMT 2015 on 195-291-12116 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:37:22 GMT 2015 on 261-554-11706 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:42:54 GMT 2015 on 306-782-14078 with (short) status 65 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:51:41 GMT 2015 on 217-481-30189 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:51:48 GMT 2015 on 246-472-26526 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:54:49 GMT 2015 on 219-351-20902 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 17:30:16 GMT 2015 on 217-336-49 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 17:32:07 GMT 2015 on 255-751-183 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 17:33:13 GMT 2015 on 286-636-30679 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 17:37:59 GMT 2015 on 55-506-15381 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 18:03:00 GMT 2015 on 275-614-10342 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 18:29:47 GMT 2015 on 219-351-20902 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 18:51:57 GMT 2015 on 11-281-12084 with (short) status 86 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 19:19:52 GMT 2015 on 217-708-20370 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 19:26:59 GMT 2015 on 217-481-30189 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 19:35:04 GMT 2015 on 275-614-10342 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 20:12:52 GMT 2015 on 219-351-20902 with (short) status 151 ========

That's 4 or 5 an hour, not the 20+ being indicated.
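As a sanity check on that per-hour rate, a pipeline along these lines tallies the non-zero-status FINISHING lines by hour. This is just a sketch: `failures.log` is a stand-in file name, seeded here with a few sample lines in the format above; the awk field positions are taken from those lines.

```shell
# Create a small sample file in the format of the log lines above
# (hypothetical file name, for illustration only).
cat > failures.log <<'EOF'
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 14:19:07 GMT 2015 on 217-481-30189 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 14:46:45 GMT 2015 on 219-351-30055 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 15:36:43 GMT 2015 on 6-673-31181 with (short) status 0 ========
EOF

# Drop successful jobs (status 0), bucket by month/day/hour, count buckets.
# Field 6 is the month, 7 the day, 8 the HH:MM:SS timestamp.
grep 'sh FINISHING' failures.log \
  | grep -v 'status 0' \
  | awk '{print $6, $7, substr($8, 1, 2) ":00"}' \
  | sort | uniq -c
```

On the sample above this prints one bucket, `2 Nov 6 14:00`, since the 15:36 job exited with status 0 and is filtered out.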
Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,301
A brief comment here, Bill. Background: I've been delving into CMS code for 13 or 14 years now, so I know how a lot of it is structured. I spent a period with the performance group, benchmarking and optimising the slowest bits (I like to think my efforts led to the Higgs discovery analysis finishing as much as 10 days earlier than otherwise; the reality is probably more like 1-2 days). I also have some experience in porting serial code to GPUs, specifically digital hologram reconstruction, for a fantastic speedup -- tens of seconds per frame to tens of frames per second.

In my opinion CMS code doesn't lend itself well to parallelisation -- there's quite a bit of serial code that kills you under Amdahl's law. Some parts may well be good candidates for parallelisation, but by no means all. The data structures also tend to work against efficient vectorisation. CMSSW is heavily object-oriented, and data structures tend to be vectors of C++ classes, which means access to data members is via offsets from a class iterator; if instead the data were held in vectors of individual data members (as it were), you would just have an iterator running down each data vector. One of our programming boffins once showed impressive speedups by refactoring data structures in this way, but it's not been widely adopted. If you do structure your data that way then it lends itself, IMHO, to much better vectorisation/parallelisation.
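To put a number on the Amdahl's law point above: the achievable speedup is 1 / ((1 - p) + p/n), where p is the parallelisable fraction and n the core count. A quick awk calculation shows how a modest serial fraction caps the gain (the 20% serial figure is purely illustrative, not a measurement from CMSSW):

```shell
# Amdahl's law: speedup = 1 / ((1 - p) + p/n).
# p = 0.80 means 20% of the run time stays serial -- an illustrative
# number, not a CMSSW measurement.
awk 'BEGIN {
    p = 0.80
    for (n = 2; n <= 256; n *= 4)
        printf "cores=%3d  speedup=%.2f\n", n, 1 / ((1 - p) + p / n)
    # Asymptotic cap as n grows without bound: 1 / (1 - p)
    printf "cores=inf  speedup=%.2f\n", 1 / (1 - p)
}'
```

With even 20% serial code, 128 cores buy you less than a 5x speedup, which is why the serial sections dominate the picture.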
Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,301
OK, now errors are showing in the logs. Investigating...
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Any progress on the high error rate, yet?
Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,301
Any progress on the high error rate, yet?

No, I can't see anything obvious. I've just checked my mail; responses from the experts mean I need to do more checks. Obviously I'll let you know as soon as I know anything, but it's the weekend, so responses may be slow. PS: Doctor Who was rather good tonight!

[Edit] Just passed a pertinent log file to the experts -- looks like a Cloud issue, insofar as I'm understanding the jargon... [/Edit]
Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,301
Any progress on the high error rate, yet?

I might have found something mid-afternoon Friday. You know the colour of the failures on the CMS Jobs graph? Well, that may be the colour of my face come tomorrow... I've just submitted a new batch of 2000 "shorties" (25 events), which should get us through the weekend, and I'm rather expecting the rate of errors to drop.

You see, I idly did a disk check on the Condor server yesterday and found, to my horror, that there was no space left on the / partition. This is, of course, where /home resides, and thus it houses all the log files our Condor jobs create... So it's very likely that the post-processing failures that led to unnecessary retries were due to the postproc being unable to write its logs to disk!

I'd seen this danger several weeks ago, and Andrew had created a directory for me on another, larger partition to copy logs to, pending a decision on whether we needed to keep them or not. I'd moved a lot over, but then got too complacent that we had adequate disk space and hadn't checked for a while. Needless to say, I spent some time moving a few GB of log files off the / partition... You live, you learn!
Joined: 13 Feb 15 Posts: 1185 Credit: 850,198 RAC: 418
Of these 2000 jobs, 173 have finished successfully so far, but what looks strange to me is that 146 are from the site T3_CH_Volunteer (alias BOINC) and 27 are from the site "unknown".
Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,301
From this 2000 jobs 173 finished successfully so far, but what's looking strange...

The jobs are good; my face is red, but I'm not blue. All the "unknowns" I've looked at on Dashboard have a CERN IP address; the first T3_CH_Volunteer I looked at was my Linux box at work! I'm just dashing over to the Condor site to see what else the unknowns have in common...

Hmm, T3_CH_Volunteer jobs are finishing as before, with

FINISHING at Sat Nov 14 03:27:55 GMT 2015 on 9-22-5625

where the 9 is the User-ID (me) and the 22 is the Host-ID (my Linux box). The unknowns have

FINISHING at Sat Nov 14 03:21:25 GMT 2015 on jan-boinc-cms-0bbe4d55-1b84-4830-8d4a-64d45429fa15
FINISHING at Sat Nov 14 03:50:24 GMT 2015 on jan-boinc-cms-16dbe0c8-b17a-410e-a8e7-2c2c84363b7d

so it looks like I'll need to find some other way of identifying failing hosts. :-(

Now, we did make some changes to the bootstrap over the last two nights, to pick up our official SITECONFIG, because CERN IT wanted to change the port we use for squid requests (to use a new 10 Gbps link) -- it's possible that that is bypassing the code we used to set the hostname, but I'm as yet uncertain how that would lead to a CERN address. I guess we wait and see whether the proportion of them increases as more volunteers start new 24-hour tasks.
Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,301
There's some suggestion that it's not us filling up the server disk: I cleaned off >2.5 GB yesterday, but it's now back down to 400 MB free -- and we're only using 320 MB in total!
Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,301
Some suggestion that it's not us filling up the server disk. I cleaned off >2.5 GB yesterday, but it's now back down to 400 MB free -- and we're only using 320 MB in total!

...which brings us back to the original topic... We are going to run out of log-file space very soon. Obviously there is little chance that my emails to anyone responsible will be read at this hour on a Sunday morning. I've managed to move some other users' files to the backup partition, but without superuser status I can't move them all. The "errant" jobs are still writing to disk faster than anything I can now delete. Sorry, but "expect errors imminently".

[Edit]
[cms005@lcggwms02:~] > df
Filesystem     1K-blocks      Used  Available  Use%  Mounted on
/dev/sda2       10321208   9797036          0  100%  /
[/Edit] :-(
Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0
Have suspended tasks in progress - please let us know when we should resume!
Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,301
Have suspended tasks in progress - please let us know when we should resume!

Actually, it's fine now, Bill. Andrew spent some time on a Sunday morning moving other projects' logs to the backup partition -- for which I have thanked him on our behalf. The latest entries on the CMS Jobs charts show a good recovery. It just adds another entry to the list of things we need to keep a sharp watch on when we ramp up to production.

I'm currently compressing the logs on the backup partition, to good effect. I wonder if there's a way to tell Condor to compress the logs directly?
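The bulk compression described above can be done with a single find/gzip invocation. The directory name and the seven-day age cut-off below are made-up illustrations, and this says nothing about whether Condor can compress logs itself:

```shell
# Compress job logs older than a week on the backup partition.
# LOGDIR and the 7-day cut-off are illustrative assumptions.
LOGDIR=/backup/condor-logs

# -mtime +7 selects files last modified more than 7 days ago;
# "{} +" batches many files into each gzip invocation.
find "$LOGDIR" -type f -name '*.log' -mtime +7 -exec gzip -9 {} +
```

Run from cron, this keeps recent logs readable in place while old ones shrink to a fraction of their size.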
Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0
All resumed! :-)

Edit: Except half of them say "Hypervisor failed to enter an online state in a timely manner" and are sleeping for a day, and of course the Mac STILL says "Virtualbox not installed"...
Joined: 20 Mar 15 Posts: 14 Credit: 5,132 RAC: 0
All resumed! :-)

The latest Vboxwrapper should resolve that issue. See: https://github.com/BOINC/boinc/releases/tag/vboxwrapper%2F26178
----- Rom
Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0
Except that update was released on CMS on Oct 22; the tasks I currently have running were sent Nov 11-14, gave the error message today, and the Mac last gave the 'not installed' message this morning. Win10 Home, latest BOINC. Mac OS X 10.11.1 "El Capitan", ditto. VBox 5.0.8 on all -- just got notice of 5.0.10 for Mac, downloading it now.

If nothing else, can somebody change the Hypervisor timeout value from 24 hours to something more reasonable??? Or do I need to reset the project on all hosts to get the new wrapper, or what?
Joined: 13 Feb 15 Posts: 1185 Credit: 850,198 RAC: 418
If nothing else, can somebody change the Hypervisor timeout value from 24 hours to something more reasonable??? Or do I need to reset the project on all hosts to get the new wrapper, or what?

A simple restart of the BOINC client will retry starting the CMS task immediately. Vboxwrapper version 26178 is already active.
Joined: 20 Mar 15 Posts: 14 Credit: 5,132 RAC: 0
If nothing else, can somebody change the Hypervisor timeout value from 24 hours to something more reasonable??? Or do I need to reset the project on all hosts to get the new wrapper, or what?

Ah, okay, then http://boincai05.cern.ch/CMS-dev/result.php?resultid=66754 was just an old task.
----- Rom
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 28
Since restarting, everything is getting this error from the glidein stderr:

/cvmfs/cms.cern.ch/CMS@Home/agent/CMSJobAgent.sh: line 101: /home/boinc/CMSRun/glidein_startup.sh: No such file or directory

I've reset the project on a couple of hosts (one Linux, one Win7) just to get clean copies, but the problem remains. More logs... cron stdout:

02:38:09 +0000 2015-11-16 [INFO] Downloading glidein
02:39:01 +0000 2015-11-16 [INFO] Starting CMS Application - Run 3
02:39:01 +0000 2015-11-16 [INFO] Reading the BOINC volunteer's information
02:39:03 +0000 2015-11-16 [INFO] Volunteer: m (178) Host: 245
02:39:03 +0000 2015-11-16 [INFO] VMID: a248a608-bb13-4ecc-8fba-70015f0a4b90
02:39:03 +0000 2015-11-16 [INFO] Requesting an X509 credential from CMS-Dev
subject  : /O=Volunteer Computing/O=CERN/CN=m 178/CN=1839075328
issuer   : /O=Volunteer Computing/O=CERN/CN=m 178
identity : /O=Volunteer Computing/O=CERN/CN=m 178
type     : RFC 3820 compliant impersonation proxy
strength : 1024 bits
path     : /tmp/x509up_u500
timeleft : 129:34:00 (5.4 days)
02:39:09 +0000 2015-11-16 [INFO] Downloading glidein
02:39:21 +0000 2015-11-16 [INFO] Running glidein (check logs)
02:40:01 +0000 2015-11-16 [INFO] CMS glidein Run 3 ended
02:40:21 +0000 2015-11-16 [INFO] Running glidein (check logs)

and so on... cron stderr:

chmod: cannot access `/home/boinc/CMSRun/glidein_startup.sh': No such file or directory

Edit: off to bed, leaving the Win box trying its luck 'till morning. Changed my mind... just wasting network, so I've turned CMS off altogether.

Edit 2: From the latest update to the jobs plot, maybe it isn't just me.
Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0
Correction: I was running VBox 4.3.12, or whatever came with BOINC, on the Windows boxes; after having to restart two or three times each on three hosts in a couple of hours, I went ahead and downloaded 5.0.10 on all of them. They've only been running a half hour so far, but none have failed yet, which is an improvement... Trying other ideas on the Mac side, but of course it's "not the highest priority project", so it doesn't want work right now! :-/
©2024 CERN