Message boards : Number crunching : Expect errors eventually
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,973,351
RAC: 2,301
Message 1427 - Posted: 6 Nov 2015, 20:05:29 UTC
Last modified: 6 Nov 2015, 20:24:09 UTC

Upsurge in failed jobs in the CMS Jobs page is a bit worrying, but I haven't seen it in other data yet. I'll dig around at RAL.

[Later] Log data doesn't support the graph; I can only presume a miscommunication between Condor and Dashboard. Failures in the last six hours, covering the blip:
tail 151102_084842:ireid_crab_CMS_at_Home_TTbar_50ev_3/job_out*|grep 'sh FINISHING'|sort -k 7 |grep -v 'status 0'
...
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 14:19:07 GMT 2015 on 217-481-30189 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 14:46:45 GMT 2015 on 219-351-30055 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 15:36:43 GMT 2015 on 6-673-31181 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 15:37:09 GMT 2015 on 285-635-1234 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 15:43:54 GMT 2015 on 251-716-20080 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 15:56:38 GMT 2015 on 6-662-8849 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:10:56 GMT 2015 on 6-654-10796 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:19:19 GMT 2015 on 306-782-14078 with (short) status 65 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:33:04 GMT 2015 on 246-777-17398 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:37:00 GMT 2015 on 195-291-12116 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:37:22 GMT 2015 on 261-554-11706 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:42:54 GMT 2015 on 306-782-14078 with (short) status 65 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:51:41 GMT 2015 on 217-481-30189 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:51:48 GMT 2015 on 246-472-26526 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:54:49 GMT 2015 on 219-351-20902 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 17:30:16 GMT 2015 on 217-336-49 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 17:32:07 GMT 2015 on 255-751-183 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 17:33:13 GMT 2015 on 286-636-30679 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 17:37:59 GMT 2015 on 55-506-15381 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 18:03:00 GMT 2015 on 275-614-10342 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 18:29:47 GMT 2015 on 219-351-20902 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 18:51:57 GMT 2015 on 11-281-12084 with (short) status 86 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 19:19:52 GMT 2015 on 217-708-20370 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 19:26:59 GMT 2015 on 217-481-30189 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 19:35:04 GMT 2015 on 275-614-10342 with (short) status 151 ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 20:12:52 GMT 2015 on 219-351-20902 with (short) status 151 ========


That's 4 or 5 an hour, not the 20+ being indicated.
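A minimal sketch of the same tally in Python, assuming the FINISHING lines have already been pulled out of the job_out* files as in the command above. The function names are illustrative and not part of any CMS or Condor tooling; the three sample lines are copied from the listing.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the wrapper's closing line in the form quoted in the post.
FINISH_RE = re.compile(
    r"FINISHING at \w{3} (\w{3}) +(\d{1,2}) (\d{2}:\d{2}:\d{2}) \w+ (\d{4}) "
    r"on (\S+) with \(short\) status (\d+)"
)


def parse_finishing(line):
    """Return (timestamp, host, status) for a FINISHING line, else None."""
    m = FINISH_RE.search(line)
    if m is None:
        return None
    month, day, clock, year, host, status = m.groups()
    when = datetime.strptime(f"{day} {month} {year} {clock}", "%d %b %Y %H:%M:%S")
    return when, host, int(status)


def summarise_failures(lines):
    """Count non-zero exit statuses overall and per clock hour."""
    by_status, by_hour = Counter(), Counter()
    for line in lines:
        parsed = parse_finishing(line)
        if parsed is None or parsed[2] == 0:
            continue  # skip unrelated lines and successful jobs
        when, _host, status = parsed
        by_status[status] += 1
        by_hour[when.strftime("%Y-%m-%d %H:00")] += 1
    return by_status, by_hour


# Three lines copied from the listing above.
SAMPLE = [
    "======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 14:19:07 GMT 2015 on 217-481-30189 with (short) status 151 ========",
    "======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:19:19 GMT 2015 on 306-782-14078 with (short) status 65 ========",
    "======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 18:51:57 GMT 2015 on 11-281-12084 with (short) status 86 ========",
]

if __name__ == "__main__":
    statuses, hours = summarise_failures(SAMPLE)
    print(dict(statuses))
    print(dict(hours))
```

Bucketing by clock hour makes it easy to compare the log-derived rate with whatever the Dashboard plot claims for the same hour.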
ID: 1427 · Rating: 0

ivan
Message 1428 - Posted: 6 Nov 2015, 20:44:49 UTC - in response to Message 1418.  
Last modified: 6 Nov 2015, 20:47:58 UTC

A brief comment here, Bill. Background -- I've been delving into CMS code for 13 or 14 years now, so I know how a lot of it is structured. I spent a period with the performance group, benchmarking and optimising the slowest bits (I like to think my efforts led to Higgs discovery analysis finishing as much as 10 days earlier than otherwise; the reality is probably more like 1-2 days). I've also some experience in porting serial code to GPUs, specifically digital hologram reconstruction for a fantastic speedup -- tens of seconds per frame to tens of frames per second.
In my opinion CMS code doesn't lend itself well to parallelisation -- there's quite a bit of serial code that kills you under Amdahl's law. Some parts may well be good candidates for parallelisation, but by no means all.
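For readers who haven't met Amdahl's law, a small sketch of the formula follows. The parallel fractions used in the example are purely illustrative and are not measurements of CMSSW.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only part of a program can run in parallel.

    parallel_fraction: share of the serial run time that can be parallelised (0..1).
    workers: number of processors sharing the parallel part.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Illustrative fractions only: even with many workers the serial part dominates.
    for fraction in (0.5, 0.9, 0.99):
        print(fraction, round(amdahl_speedup(fraction, 1024), 2))
```

The point is the ceiling: with 10% of the run time stuck in serial code, no number of cores gets past a factor of 10.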
The data structures also tend to work against efficient vectorisation. CMSSW is heavily object oriented and data structures tend to be vectors of C++ classes, which means access to data members is via offsets from a class iterator, whereas if data were contained in vectors of data members (as it were) then you just have an iterator down each data vector. One of our programming boffins once showed impressive speedups by refactoring data structures in this way, but it's not been widely adopted. If you do structure your data that way then it lends itself, IMHO, to better vectorisation/parallelisation.
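The two layouts described above can be sketched as follows. This is only an illustration of the data arrangement: `Track` and its fields are hypothetical and not CMSSW types, and plain Python will not show the vectorisation gain that a C++ compiler could get from the second layout.

```python
from array import array
from dataclasses import dataclass


@dataclass
class Track:
    """Hypothetical object with a few data members (not a CMSSW class)."""
    pt: float
    eta: float
    phi: float


def to_soa(tracks):
    """Convert a list of objects into one contiguous array per data member."""
    return {
        "pt": array("d", (t.pt for t in tracks)),
        "eta": array("d", (t.eta for t in tracks)),
        "phi": array("d", (t.phi for t in tracks)),
    }


def sum_pt_aos(tracks):
    """Array-of-structures: every access goes through an object to reach one member."""
    return sum(t.pt for t in tracks)


def sum_pt_soa(soa):
    """Structure-of-arrays: a single pass down one contiguous array of doubles."""
    return sum(soa["pt"])


if __name__ == "__main__":
    tracks = [Track(1.0, 0.1, 0.2), Track(2.5, -0.3, 1.1)]
    print(sum_pt_aos(tracks), sum_pt_soa(to_soa(tracks)))
```

In C++ terms the first layout is roughly a vector of class instances and the second a class holding one vector per member, which is the kind of refactoring the post refers to.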
ID: 1428 · Rating: 0

Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 759
Credit: 11,762,329
RAC: 2,648
Message 1429 - Posted: 7 Nov 2015, 8:21:57 UTC

ID: 1429 · Rating: 0

ivan
Message 1430 - Posted: 7 Nov 2015, 12:24:27 UTC - in response to Message 1420.  

OK, now errors are showing in the logs. Investigating...
ID: 1430 · Rating: 0

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1433 - Posted: 7 Nov 2015, 19:44:27 UTC

Any progress on the high error rate, yet?
ID: 1433 · Rating: 0

ivan
Message 1436 - Posted: 7 Nov 2015, 23:09:31 UTC - in response to Message 1433.  
Last modified: 7 Nov 2015, 23:29:48 UTC

> Any progress on the high error rate, yet?

No, I can't see anything obvious. I just checked my mail, and responses from the experts mean I need to do more checks. Obviously I'll let you know as soon as I know anything, but it's the weekend, so responses may be slow.
PS: Doctor Who was rather good tonight!

[Edit] Just passed a pertinent log file to the experts -- looks like a Cloud issue insofar as I'm understanding the jargon... [/Edit]
ID: 1436 · Rating: 0

ivan
Message 1454 - Posted: 14 Nov 2015, 1:44:44 UTC - in response to Message 1433.  

> Any progress on the high error rate, yet?

I might have found something mid-afternoon Friday. You know the colour of the failures on the CMS Jobs graph? Well, that may be the colour of my face come tomorrow...
I've just submitted a new batch of 2000 "shorties" (25 events), which should get us through the weekend. I'm rather expecting the rate of errors to drop. You see, I idly did a disk check on the Condor server yesterday and found, to my horror, that there was no space left on the / partition. This is, of course, where /home resides and thus it houses all the log files our Condor jobs create...
So it's very likely that the post-processing failures that led to unnecessary retries were due to the postproc being unable to write the logs to disk!
I'd seen this danger several weeks ago, and Andrew had created a directory for me on another larger partition to copy logs to, pending a decision on whether we needed to keep them or not. So I'd moved a lot over and then got too complacent that we had adequate disk space and hadn't checked for a while. Needless to say, I spent some time moving a few GB of log files off the / partition... You live, you learn!
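A periodic check along these lines might have caught the full partition earlier. This is a minimal sketch using Python's standard library; the 10% threshold and the function names are assumptions for illustration, not anything the project actually runs.

```python
import shutil


def free_fraction(path="/"):
    """Fraction of the file system holding `path` that is still free to the caller."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total


def needs_attention(path="/", min_free_fraction=0.10):
    """True when free space has dropped below the chosen (assumed) threshold."""
    return free_fraction(path) < min_free_fraction


if __name__ == "__main__":
    for mount in ("/",):
        state = "LOW" if needs_attention(mount) else "ok"
        print(f"{mount}: {free_fraction(mount):.1%} free ({state})")
```

Run from cron and mailed on "LOW", something like this removes the need to remember to look.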
ID: 1454 · Rating: 0

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1185
Credit: 850,198
RAC: 418
Message 1455 - Posted: 14 Nov 2015, 8:45:05 UTC
Last modified: 14 Nov 2015, 8:45:21 UTC

Of these 2000 jobs, 173 have finished successfully so far, but what looks strange to me is that 146 are from the site T3_CH_Volunteer (alias BOINC) and 27 are from the site "unknown".
ID: 1455 · Rating: 0

ivan
Message 1456 - Posted: 14 Nov 2015, 12:02:58 UTC - in response to Message 1455.  

> Of these 2000 jobs, 173 have finished successfully so far, but what looks strange to me is that 146 are from the site T3_CH_Volunteer (alias BOINC) and 27 are from the site "unknown".

The jobs are good, my face is red but I'm not blue.
All the "unknowns" I've looked at on Dashboard have a CERN IP address; the first T3_CH_Volunteer I looked at was my Linux box at work! I'm just dashing over to the Condor site to see what else is common with the unknowns...
Hmm, T3_CH_Volunteer jobs are finishing as before with
FINISHING at Sat Nov 14 03:27:55 GMT 2015 on 9-22-5625
where the 9 is User-ID (me) and the 22 is Host-ID (my Linux box). The unknowns have
FINISHING at Sat Nov 14 03:21:25 GMT 2015 on jan-boinc-cms-0bbe4d55-1b84-4830-8d4a-64d45429fa15
FINISHING at Sat Nov 14 03:50:24 GMT 2015 on jan-boinc-cms-16dbe0c8-b17a-410e-a8e7-2c2c84363b7d
so it looks like I'll need to look for some other way of identifying failing hosts. :-(
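The two naming schemes can be told apart mechanically. A minimal sketch, assuming only what the log lines above show: the volunteer form is three dash-separated integers (User-ID, Host-ID, and a third field whose meaning isn't stated here), and the other form ends in a UUID. The "kind" labels are made up for this example.

```python
import re

# User-ID, Host-ID and an unexplained third field, e.g. "9-22-5625".
BOINC_ID_RE = re.compile(r"^(\d+)-(\d+)-(\d+)$")
# Any prefix followed by a canonical 8-4-4-4-12 hexadecimal UUID.
UUID_NAME_RE = re.compile(
    r"^(?P<prefix>.+)-"
    r"(?P<uuid>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$"
)


def classify_host(name):
    """Classify the host field that follows 'on' in a FINISHING line."""
    m = BOINC_ID_RE.match(name)
    if m:
        user_id, host_id, third = (int(g) for g in m.groups())
        return {"kind": "volunteer", "user_id": user_id,
                "host_id": host_id, "third_field": third}
    m = UUID_NAME_RE.match(name)
    if m:
        return {"kind": "uuid-named", "prefix": m.group("prefix"),
                "uuid": m.group("uuid")}
    return {"kind": "other", "raw": name}


if __name__ == "__main__":
    for host in ("9-22-5625",
                 "jan-boinc-cms-0bbe4d55-1b84-4830-8d4a-64d45429fa15"):
        print(classify_host(host))
```

Grouping failures by the returned "kind" would at least show whether the UUID-named hosts fail at a different rate from the volunteer ones.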
Now, we did make some changes to the bootstrap over the last two nights to pick up our official SITECONFIG, because CERN IT wanted to change the port we use for squid requests (to use a new 10 Gbps link). It's possible that this is bypassing the code we used to set the hostname, but I'm uncertain as yet how that would link with a CERN address. I guess we wait and see if the proportion of them increases as more volunteers start new 24-hour tasks.
ID: 1456 · Rating: 0

ivan
Message 1457 - Posted: 14 Nov 2015, 19:19:34 UTC - in response to Message 1454.  

Some suggestion that it's not us filling up the server disk. I cleaned off >2.5 GB yesterday, but it's now back down to 400 MB free -- and we're only using 320 MB in total!
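To see who is actually filling a partition, a per-directory total is the usual first step. A rough sketch follows, assuming read access to the directories in question; it sums apparent file sizes, so the figures can differ somewhat from the blocks actually allocated on disk.

```python
import os


def directory_sizes(root):
    """Map each immediate subdirectory of `root` to the total size in bytes
    of the regular files beneath it. Files directly in `root` are ignored."""
    sizes = {}
    with os.scandir(root) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                continue
            total = 0
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    try:
                        total += os.lstat(os.path.join(dirpath, name)).st_size
                    except OSError:
                        pass  # unreadable or vanished file: skip it
            sizes[entry.name] = total
    return sizes


if __name__ == "__main__":
    # Largest consumers first, for whatever directory the script is run from.
    for name, size in sorted(directory_sizes(".").items(), key=lambda kv: -kv[1]):
        print(f"{size / 1e6:10.1f} MB  {name}")
```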
ID: 1457 · Rating: 0

ivan
Message 1458 - Posted: 15 Nov 2015, 2:38:12 UTC - in response to Message 1457.  
Last modified: 15 Nov 2015, 3:01:50 UTC

> Some suggestion that it's not us filling up the server disk. I cleaned off >2.5 GB yesterday, but it's now back down to 400 MB free -- and we're only using 320 MB in total!
...which brings us back to the original topic...
We are going to run out of log-file space very soon. Obviously there is little chance that my emails to anyone responsible are going to be read at this hour on a Sunday morning. I've managed to move some other users' files to the backup partition, but without superuser status I can't move all. The "errant" jobs are still writing to disk faster than anything I can now delete. Sorry, but "expect errors imminently".

[Edit]
[cms005@lcggwms02:~] > df
Filesystem     1K-blocks     Used  Available  Use%  Mounted on
/dev/sda2       10321208  9797036          0  100%  /

[/Edit]
:-(
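The df line above shows 0 blocks available even though Used is smaller than the total. A small sketch of the arithmetic follows. The parser and its field names are illustrative. The interpretation in the comment is an assumption, since the post doesn't say what file system / uses: on ext-family file systems about 5% of blocks are reserved for root by default, which would match the gap.

```python
def parse_df_line(line):
    """Parse one data row of `df` output; tolerates '|' column separators."""
    fields = line.replace("|", " ").split()
    filesystem, blocks, used, available, use_pct = fields[:5]
    return {
        "filesystem": filesystem,
        "blocks_1k": int(blocks),
        "used_1k": int(used),
        "available_1k": int(available),
        "use_pct": int(use_pct.rstrip("%")),
        "mounted_on": " ".join(fields[5:]),
    }


def unavailable_blocks(row):
    """1K-blocks that are neither used nor available to ordinary users."""
    return row["blocks_1k"] - row["used_1k"] - row["available_1k"]


if __name__ == "__main__":
    row = parse_df_line("/dev/sda2 | 10321208 | 9797036 | 0 | 100%| /")
    gap = unavailable_blocks(row)
    # Assumption: this gap is the root-reserved fraction of the file system.
    print(gap, f"{100 * gap / row['blocks_1k']:.1f}% of the partition")
```

If that assumption holds, it would also explain why a superuser could still shuffle files around while ordinary jobs were already failing to write.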

ID: 1458 · Rating: 0

Tern
Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1459 - Posted: 15 Nov 2015, 17:17:54 UTC - in response to Message 1458.  

Have suspended tasks in progress - please let us know when we should resume!
ID: 1459 · Rating: 0

ivan
Message 1460 - Posted: 15 Nov 2015, 18:04:39 UTC - in response to Message 1459.  

> Have suspended tasks in progress - please let us know when we should resume!

Actually, it's fine now, Bill. Andrew spent some time on a Sunday morning moving other projects' logs to the backup partition -- for which I have thanked him on our behalf. Latest entries on the CMS Jobs charts show a good recovery.
It just adds another entry to the list of things we need to keep a sharp watch on when we ramp up to production.
I'm currently compressing the logs on the backup partition, to good effect. I wonder if there's a way to tell Condor to compress the logs directly?
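Compressing after the fact is easy to script. A minimal sketch with Python's standard library, assuming the logs belong to jobs that have already finished (nothing is still writing to them) and that they follow the job_out* naming seen earlier in the thread. It says nothing about whether Condor itself can be configured to do this.

```python
import gzip
import shutil
from pathlib import Path


def compress_log(path):
    """Gzip one file next to the original, then remove the original."""
    src = Path(path)
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    src.unlink()
    return dst


def compress_tree(root, pattern="job_out*"):
    """Compress every matching, not yet compressed file under `root`."""
    compressed = []
    # sorted() materialises the listing before any files are created or removed.
    for path in sorted(Path(root).rglob(pattern)):
        if path.is_file() and path.suffix != ".gz":
            compressed.append(compress_log(path))
    return compressed


if __name__ == "__main__":
    import sys
    for target in sys.argv[1:]:
        for out in compress_tree(target):
            print(out)
```

Text logs with many near-identical lines, like the FINISHING records above, usually compress very well.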
ID: 1460 · Rating: 0

Tern
Message 1461 - Posted: 15 Nov 2015, 20:11:40 UTC - in response to Message 1460.  
Last modified: 15 Nov 2015, 20:15:28 UTC

All resumed! :-)

edit: Except half of them say "Hypervisor failed to enter an online state in a timely manner" and are sleeping for a day, and of course the Mac STILL says "Virtualbox not installed"...
ID: 1461 · Rating: 0

Rom Walton (BOINC)
Joined: 20 Mar 15
Posts: 14
Credit: 5,132
RAC: 0
Message 1462 - Posted: 15 Nov 2015, 20:39:32 UTC - in response to Message 1461.  

> All resumed! :-)
>
> edit: Except half of them say "Hypervisor failed to enter an online state in a timely manner" and are sleeping for a day, and of course the Mac STILL says "Virtualbox not installed"...


The latest Vboxwrapper should resolve that issue.

See: https://github.com/BOINC/boinc/releases/tag/vboxwrapper%2F26178

----- Rom
ID: 1462 · Rating: 0

Tern
Message 1463 - Posted: 15 Nov 2015, 20:56:18 UTC - in response to Message 1462.  


> The latest Vboxwrapper should resolve that issue.
>
> See: https://github.com/BOINC/boinc/releases/tag/vboxwrapper%2F26178
>
> ----- Rom


Except that update was released on CMS on Oct 22; the tasks I currently have running were sent Nov 11-14 and gave the error message today, and the Mac last gave the 'not installed' message this morning.

Win10 Home, latest BOINC. Mac OS X 10.11.1 "El Capitan", ditto. VBox 5.0.8 on all - just got notice of 5.0.10 for Mac, downloading it now.

If nothing else, can somebody change the Hypervisor timeout value from 24 hours to something more reasonable??? Or do I need to reset the project on all hosts to get the new wrapper, or what?
ID: 1463 · Rating: 0

Crystal Pellet
Message 1464 - Posted: 15 Nov 2015, 22:03:01 UTC - in response to Message 1463.  

> If nothing else, can somebody change the Hypervisor timeout value from 24 hours to something more reasonable??? Or do I need to reset the project on all hosts to get the new wrapper, or what?

A simple restart of the BOINC client will retry starting the CMS task immediately.

Vboxwrapper version 26178 is already active.
ID: 1464 · Rating: 0

Rom Walton (BOINC)
Message 1465 - Posted: 16 Nov 2015, 0:49:23 UTC - in response to Message 1464.  

> > If nothing else, can somebody change the Hypervisor timeout value from 24 hours to something more reasonable??? Or do I need to reset the project on all hosts to get the new wrapper, or what?
>
> A simple restart of the BOINC client will retry starting the CMS task immediately.
>
> Vboxwrapper version 26178 is already active.


Ah, okay, then http://boincai05.cern.ch/CMS-dev/result.php?resultid=66754 was just an old task.

----- Rom
ID: 1465 · Rating: 0

m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 28
Message 1466 - Posted: 16 Nov 2015, 2:11:20 UTC
Last modified: 16 Nov 2015, 2:59:40 UTC

Since restarting everything, I'm getting this error from glidein stderr:-

/cvmfs/cms.cern.ch/CMS@Home/agent/CMSJobAgent.sh: line 101: /home/boinc/CMSRun/glidein_startup.sh: No such file or directory

I've reset the project on a couple of hosts (one Linux, one Win7) just to get clean copies, but the problem remains.

More logs...

cron stdout:-
02:38:09 +0000 2015-11-16 [INFO] Downloading glidein
02:39:01 +0000 2015-11-16 [INFO] Starting CMS Application - Run 3
02:39:01 +0000 2015-11-16 [INFO] Reading the BOINC volunteer's information
02:39:03 +0000 2015-11-16 [INFO] Volunteer: m (178) Host: 245
02:39:03 +0000 2015-11-16 [INFO] VMID: a248a608-bb13-4ecc-8fba-70015f0a4b90
02:39:03 +0000 2015-11-16 [INFO] Requesting an X509 credential from CMS-Dev
subject : /O=Volunteer Computing/O=CERN/CN=m 178/CN=1839075328
issuer : /O=Volunteer Computing/O=CERN/CN=m 178
identity : /O=Volunteer Computing/O=CERN/CN=m 178
type : RFC 3820 compliant impersonation proxy
strength : 1024 bits
path : /tmp/x509up_u500
timeleft : 129:34:00 (5.4 days)
02:39:09 +0000 2015-11-16 [INFO] Downloading glidein
02:39:21 +0000 2015-11-16 [INFO] Running glidein (check logs)
02:40:01 +0000 2015-11-16 [INFO] CMS glidein Run 3 ended
02:40:21 +0000 2015-11-16 [INFO] Running glidein (check logs)

and so on....

cron stderr:-
chmod: cannot access `/home/boinc/CMSRun/glidein_startup.sh': No such file or directory
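The repeating pattern in the cron stdout is easier to spot once the lines are tallied. A minimal sketch, assuming the "HH:MM:SS +ZZZZ YYYY-MM-DD [LEVEL] message" layout shown above; the function names are illustrative and the sample lines are copied from the log.

```python
import re
from collections import Counter
from datetime import datetime

# "02:38:09 +0000 2015-11-16 [INFO] Downloading glidein"
LINE_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2} [+-]\d{4} \d{4}-\d{2}-\d{2}) \[(\w+)\] (.*)$"
)


def parse_cron_line(line):
    """Return (timestamp, level, message), or None for lines in another format."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp, level, message = m.groups()
    when = datetime.strptime(stamp, "%H:%M:%S %z %Y-%m-%d")
    return when, level, message


def message_counts(lines):
    """Count how often each message text occurs."""
    parsed = (parse_cron_line(line) for line in lines)
    return Counter(p[2] for p in parsed if p is not None)


# Lines copied from the cron stdout above.
SAMPLE = [
    "02:38:09 +0000 2015-11-16 [INFO] Downloading glidein",
    "02:39:01 +0000 2015-11-16 [INFO] Starting CMS Application - Run 3",
    "subject : /O=Volunteer Computing/O=CERN/CN=m 178/CN=1839075328",
    "02:39:09 +0000 2015-11-16 [INFO] Downloading glidein",
    "02:39:21 +0000 2015-11-16 [INFO] Running glidein (check logs)",
    "02:40:01 +0000 2015-11-16 [INFO] CMS glidein Run 3 ended",
    "02:40:21 +0000 2015-11-16 [INFO] Running glidein (check logs)",
]

if __name__ == "__main__":
    for message, count in message_counts(SAMPLE).most_common():
        print(count, message)
```

Repeated "Downloading glidein" and "Running glidein" entries within a couple of minutes, together with the stderr complaint that the script doesn't exist, point at the download step rather than at the job itself.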


Edit... off to bed, leaving the Win box trying its luck 'til morning.
Changed mind.... just wasting network, so turned CMS off altogether.

Edit 2...
From the latest update to the jobs plot maybe it isn't just me.
ID: 1466 · Rating: 0

Tern
Message 1467 - Posted: 16 Nov 2015, 2:21:38 UTC - in response to Message 1464.  

Correction: I was running VBox 4.3.12, or whatever came with BOINC, on the Windows boxes; after having to restart two or three times each on three hosts in a couple of hours, I went ahead and downloaded 5.0.10 on all of them. They've only been running half an hour so far, but none have failed yet, which is an improvement...

Trying other ideas on the Mac side, but of course it's "not the highest priority project" so it doesn't want work right now! :-/
ID: 1467 · Rating: 0