glidein failures
Author | Message |
---|---|
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
Glidein is failing to load various files, e.g.

file_list.ebsf2L.lst: OK
Signature OK for main:file_list.ebsf2L.lst.
error_gen.ebsdRw.sh: OK
Signature OK for main:error_gen.ebsdRw.sh.
Sat Mar 5 02:00:44 GMT 2016 Failed to load file 'error_augment.ebsdRw.sh' from 'http://lcggwms01.gridpp.rl.ac.uk:8319/factory/stage/glidein_v3_2_7'.
Sat Mar 5 02:00:45 GMT 2016 Sleeping 335

another one

description.ebeeXx.cfg: OK
Signature OK for client:description.ebeeXx.cfg.
description.ebeeXx.cfg: OK
Signature OK for client_group:description.ebeeXx.cfg.
Sat Mar 5 02:26:35 GMT 2016 Failed to load file 'file_list.ebsf2L.lst' from 'http://lcggwms01.gridpp.rl.ac.uk:8319/factory/stage/glidein_v3_2_7'.
Sat Mar 5 02:26:36 GMT 2016 Sleeping 345
Sat Mar 5 02:32:21 GMT 2016 Sleeping 314
Sat Mar 5 02:37:35 GMT 2016 Sleeping 250
Sat Mar 5 02:41:48 GMT 2016 Sleeping 284

This has been occurring with various files on various hosts. Once a file has failed to load, that glidein run never recovers. The next run often fails on a different file. After a few runs one succeeds in getting all the files and off we go. Now it looks as though the failures are "nonzero exits" and the task restarts, often more than once. There doesn't seem to be any indication of why the downloads fail. Should this be in a log somewhere? Do I need to poke a hole in the firewall for this, or should there be a keep-alive keeping the port open? |
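For anyone wanting to tally which files fail and when, here is a minimal Python sketch that pulls the "Failed to load file" lines out of a glidein log like the one quoted above. The line layout is assumed from the excerpts in this post only; the names `find_load_failures` and `LoadFailure` are made up for illustration and are not part of glidein, and the day and month abbreviations are assumed to be English.

```python
import re
from datetime import datetime
from typing import NamedTuple

# Assumed layout, taken from the excerpts quoted in this thread:
#   Sat Mar 5 02:00:44 GMT 2016 Failed to load file '<name>' from '<url>'.
FAILED_RE = re.compile(
    r"^(?P<stamp>\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) GMT (?P<year>\d{4}) "
    r"Failed to load file '(?P<name>[^']+)' from '(?P<url>[^']+)'"
)


class LoadFailure(NamedTuple):
    when: datetime  # timestamp as logged (GMT), stored without tzinfo
    name: str       # file that failed to load
    url: str        # staging URL it was requested from


def find_load_failures(log_text: str) -> list[LoadFailure]:
    """Return one LoadFailure per 'Failed to load file' line in the log."""
    failures = []
    for line in log_text.splitlines():
        m = FAILED_RE.match(line.strip())
        if m is None:
            continue  # "OK", "Signature OK" and "Sleeping" lines are skipped
        when = datetime.strptime(
            f"{m['stamp']} {m['year']}", "%a %b %d %H:%M:%S %Y"
        )
        failures.append(LoadFailure(when, m["name"], m["url"]))
    return failures


# Sample lines copied from the post above.
SAMPLE = """\
error_gen.ebsdRw.sh: OK
Signature OK for main:error_gen.ebsdRw.sh.
Sat Mar 5 02:00:44 GMT 2016 Failed to load file 'error_augment.ebsdRw.sh' from 'http://lcggwms01.gridpp.rl.ac.uk:8319/factory/stage/glidein_v3_2_7'.
Sat Mar 5 02:00:45 GMT 2016 Sleeping 335
"""

if __name__ == "__main__":
    for failure in find_load_failures(SAMPLE):
        print(failure.when.isoformat(), failure.name, failure.url)
```

Running this over the logs from several hosts and sorting by `when` would show whether the failures cluster at particular times, which might help separate a server-side hiccup from something local.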
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
"Glidein is failing to load various files, e.g"

That looks like a network problem, or a (transient?) problem at RAL. I'll alert Andrew, but he's been in Spain all week so I daresay he's got better things to do with his weekend. :-) |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
"Glidein is failing to load various files, e.g"

It isn't a new problem. It's transient but frequent and affects several hosts. Previously it simply delayed job startup - which is how it came to light - but now it causes the task to restart, very nearly preventing any useful work. It's definitely transient, and only seems to affect these files. It looks as though my firewall (which is common to the whole LAN and could easily be the culprit) closes an incoming port when some timeout or other expires... but which port? And why only these files? This is what it looks like in stderr:

2016-03-05 06:15:20 (25107): Detected: Web Application Enabled (http://localhost:39333)
2016-03-05 06:15:20 (25107): Detected: Remote Desktop Enabled (localhost:49419)
2016-03-05 06:15:20 (25107): Preference change detected
2016-03-05 06:15:20 (25107): Setting CPU throttle for VM. (100%)
2016-03-05 06:15:21 (25107): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 600 seconds) or (Vbox_job.xml: 600 seconds))
2016-03-05 06:15:22 (25107): Guest Log: BIOS: Boot : bseqnr=1, bootseq=0032
2016-03-05 06:15:22 (25107): Guest Log: BIOS: Booting from Hard Disk...
2016-03-05 06:15:25 (25107): Guest Log: BIOS: KBD: unsupported int 16h function 03
2016-03-05 06:15:25 (25107): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000
2016-03-05 06:15:57 (25107): Guest Log: vboxguest: major 0, IRQ 20, I/O port d020, MMIO at 00000000f0400000 (size 0x400000)
2016-03-05 06:16:16 (25107): Guest Log: VBoxGuest: VBoxGuestCommonGuestCapsAcquire: pSession(0xffff8800761cce10), OR(0x0), NOT(0xffffffff), flags(0x0)
2016-03-05 06:16:16 (25107): Guest Log: VBoxGuest: VBoxGuestCommonGuestCapsAcquire: pSession(0xffff8800761cca10), OR(0x0), NOT(0xffffffff), flags(0x0)
2016-03-05 06:46:01 (25107): VM Completion File Detected.
2016-03-05 06:46:01 (25107): Powering off VM.
2016-03-05 06:46:02 (25107): Successfully stopped VM.
2016-03-05 06:46:02 (25107): Deregistering VM. (boinc_2f5baf1e8d65dce5, slot#1)
2016-03-05 06:46:02 (25107): Removing network bandwidth throttle group from VM.

They don't run for long... and don't do any work. |
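One way to see where the time goes in a stderr listing like the one above is to look for the longest silences between consecutive timestamped lines. The following is only a rough Python sketch: the `YYYY-MM-DD HH:MM:SS (pid): message` layout is assumed from the excerpt in this post, and `largest_gaps` is a hypothetical helper name, not anything provided by BOINC.

```python
import re
from datetime import datetime, timedelta

# Assumed line layout, taken from the stderr excerpt in this thread:
#   2016-03-05 06:46:01 (25107): VM Completion File Detected.
STAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): (.*)$")


def largest_gaps(stderr_text: str, top: int = 3) -> list[tuple[timedelta, str, str]]:
    """Return the `top` longest silences as (gap, message_before, message_after)."""
    entries = []
    for line in stderr_text.splitlines():
        m = STAMP_RE.match(line.strip())
        if m:
            stamp = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            entries.append((stamp, m.group(2)))
    # Pair each entry with the one that follows it and measure the difference.
    gaps = [(b[0] - a[0], a[1], b[1]) for a, b in zip(entries, entries[1:])]
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]


# Sample lines copied from the post above.
SAMPLE = """\
2016-03-05 06:15:57 (25107): Guest Log: vboxguest: major 0, IRQ 20, I/O port d020, MMIO at 00000000f0400000 (size 0x400000)
2016-03-05 06:16:16 (25107): Guest Log: VBoxGuest: VBoxGuestCommonGuestCapsAcquire: pSession(0xffff8800761cca10), OR(0x0), NOT(0xffffffff), flags(0x0)
2016-03-05 06:46:01 (25107): VM Completion File Detected.
2016-03-05 06:46:01 (25107): Powering off VM.
"""

if __name__ == "__main__":
    for gap, before, after in largest_gaps(SAMPLE, top=1):
        print(gap, "|", before, "->", after)
```

On the sample lines the longest silence is the stretch between the last guest log entry and "VM Completion File Detected", which is the period in which the glidein inside the VM was sleeping rather than working.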
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
Thanks for the extra info. Let's hope an expert can come up with a suggestion soon! |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I guess you have tried the standard procedure (powering down modem and router)? Maybe clearing the DNS cache? |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
"I guess you have tried the standard procedure (powering down modem and router)?"

Thanks for the thought, but yes; I'm not usually keen on the "turn it off and on again" approach, but I don't know any other (hopefully) certain way to clear the router's DNS cache, and the modem (and much else) is on the same power lead so got done by default. The two hosts (1031 and 1033) from which I copied the logs are both Linux, no DNS cache there. It also affects Windows hosts but they're not running CMS at the moment. I've tried repeatedly running nslookup on the RAL URL but it's never failed. If the problem were at RAL everybody would see it, although it's always a file of the set from that URL.

Edit: Your reply set off some thoughts. Different boxes have got different combinations of ISP and OpenDNS servers set. Perhaps I'll set something like Google's DNS on all, just for the moment, to see if anything changes. It won't be the first time that an ISP's DNS has been a bit questionable. |
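The repeated nslookup test mentioned above can be automated so that it runs long enough to catch an intermittent failure. This is a hedged Python sketch, not a replacement for nslookup: it uses the system resolver through `socket.getaddrinfo`, so it goes through whatever DNS servers and caches the host is configured with. The function name `probe_dns` and its parameters are invented for illustration.

```python
import socket
import time
from collections import Counter
from typing import Callable


def probe_dns(host: str, attempts: int = 10, delay: float = 1.0,
              resolver: Callable = socket.getaddrinfo) -> Counter:
    """Resolve `host` repeatedly; tally each address seen and each failure.

    `resolver` defaults to the system resolver and is a parameter only so
    that the function can be exercised without touching the network.
    """
    results: Counter = Counter()
    for i in range(attempts):
        try:
            infos = resolver(host, None)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            results[f"error: {exc}"] += 1
        else:
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the address. Count each distinct address once.
            for addr in sorted({info[4][0] for info in infos}):
                results[addr] += 1
        if i < attempts - 1:
            time.sleep(delay)
    return results
```

Calling `probe_dns("lcggwms01.gridpp.rl.ac.uk", attempts=20, delay=30)` on an affected host while a glidein is starting would show whether any lookups fail or return an unexpected address. A clean result would not rule out the firewall, since name resolution and the HTTP download on port 8319 take different paths through it.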
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
The two Linux boxes have each started new BOINC tasks without problems on original DNS servers. Host 1031 started cmsRun at 20 min and event 1 at 27 min. The cmsRun time is about normal for me but event 1 has been closer to 40 min. ADSL down speed is ~5.3Mb/s. Host 1033 started cmsRun at 13 min and event 1 at 17 min. I've never seen anything that fast before and no, the ADSL hasn't miraculously speeded up. Similar hardware and software in each case. Off to bed while they get some real work done... I hope such good results as we do produce are being put to good use. They are... aren't they? |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
Another pair:-

Setting X509_USER_PROXY to canonical path /tmp/x509up_u500
signature.ebsf2L.sha1: OK
signature.g2ifBz.sha1: OK
signature.ebeeXx.sha1: OK
signature.ebeeXx.sha1: OK
description.ebsf2L.cfg: OK
Signature OK for main:description.ebsf2L.cfg.
description.g2ifBz.cfg: OK
Signature OK for entry:description.g2ifBz.cfg.
description.ebeeXx.cfg: OK
Signature OK for client:description.ebeeXx.cfg.
description.ebeeXx.cfg: OK
Signature OK for client_group:description.ebeeXx.cfg.
-- ~62 "OK" lines snipped --
Signature OK for client_group:preentry_file_list.ebeeXx.lst.
constants.ebeeXx.cfg: OK
Signature OK for client_group:constants.ebeeXx.cfg.
condor_vars.eb89he.lst: OK
Signature OK for client_group:condor_vars.eb89he.lst.
untar.eb89he.cfg: OK
Signature OK for client_group:untar.eb89he.cfg.
No signature for /home/boinc/CMSRun/glide_QUJwyt/client_group_main/nodes.blacklist.
cat_consts.eb89he.sh: OK
Signature OK for client_group:cat_consts.eb89he.sh.
check_blacklist.eb89he.sh: OK
Signature OK for client_group:check_blacklist.eb89he.sh.
aftergroup_preentry_file_list.eb89he.lst: OK
Signature OK for client:aftergroup_preentry_file_list.eb89he.lst.
file_list.g2ifBz.lst: OK
Signature OK for entry:file_list.g2ifBz.lst.
constants.g2ifBz.cfg: OK
Signature OK for entry:constants.g2ifBz.cfg.
Tue Mar 8 01:19:14 GMT 2016 Failed to load file 'condor_vars.g2ifBz.lst' from 'http://lcggwms01.gridpp.rl.ac.uk:8319/factory/stage/glidein_v3_2_7/entry_volunteer'.
Tue Mar 8 01:19:16 GMT 2016 Sleeping 327

and

Setting X509_USER_PROXY to canonical path /tmp/x509up_u500
signature.ebsf2L.sha1: OK
signature.g2ifBz.sha1: OK
signature.ebeeXx.sha1: OK
signature.ebeeXx.sha1: OK
description.ebsf2L.cfg: OK
Signature OK for main:description.ebsf2L.cfg.
description.g2ifBz.cfg: OK
Signature OK for entry:description.g2ifBz.cfg.
description.ebeeXx.cfg: OK
Signature OK for client:description.ebeeXx.cfg.
description.ebeeXx.cfg: OK
Signature OK for client_group:description.ebeeXx.cfg.
file_list.ebsf2L.lst: OK
Signature OK for main:file_list.ebsf2L.lst.
error_gen.ebsdRw.sh: OK
Signature OK for main:error_gen.ebsdRw.sh.
error_augment.ebsdRw.sh: OK
Signature OK for main:error_augment.ebsdRw.sh.
setup_script.ebsdRw.sh: OK
Signature OK for main:setup_script.ebsdRw.sh.
constants.ebsf2L.cfg: OK
Signature OK for main:constants.ebsf2L.cfg.
condor_vars.ebsdRw.lst: OK
Signature OK for main:condor_vars.ebsdRw.lst.
untar.ebsdRw.cfg: OK
-- ~60 "OK" lines snipped --
Signature OK for client_group:cat_consts.eb89he.sh.
check_blacklist.eb89he.sh: OK
Signature OK for client_group:check_blacklist.eb89he.sh.
aftergroup_preentry_file_list.eb89he.lst: OK
Signature OK for client:aftergroup_preentry_file_list.eb89he.lst.
file_list.g2ifBz.lst: OK
Signature OK for entry:file_list.g2ifBz.lst.
constants.g2ifBz.cfg: OK
Signature OK for entry:constants.g2ifBz.cfg.
condor_vars.g2ifBz.lst: OK
Signature OK for entry:condor_vars.g2ifBz.lst.
untar.f6ievT.cfg: OK
Signature OK for entry:untar.f6ievT.cfg.
Tue Mar 8 01:46:17 GMT 2016 Failed to load file 'nodes.blacklist' from 'http://lcggwms01.gridpp.rl.ac.uk:8319/factory/stage/glidein_v3_2_7/entry_volunteer'.
Tue Mar 8 01:46:18 GMT 2016 Sleeping 296

There follow several "sleeping" entries, but nothing to indicate that glidein tries again to load the file, and the run eventually fails. These are successive runs from the same host (1031). The next one made it to the end and job 7854 is running as I write.

The only idea I have at the moment is that maybe the large number of successive downloads - there were other hosts running ATLAS etc. - somehow triggers the "TCP flood" detector in the firewall, although there's nothing in the log (such as it is). This detector is intended to prevent DoS attacks. It's not clear whether it applies only to incoming traffic or to both directions. It can be turned on or off for TCP and/or UDP. I've turned it off for TCP to see if there's any effect, although I wouldn't be happy with this as a permanent arrangement. |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
I've now been through all the "odd" firewall settings without success. The "failed to load" failures are still occurring. Since the hosts are off during the day, and this far exceeds whatever suspend/resume timing works at the moment, they always start a new job soon after booting. I'm not sure how this affects things. In any event it will contaminate any baseline run you want to do, so I'll give up for now. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
It sounds more and more like a hardware problem. Noisy power supplies can cause all sorts of strange behavior. I would try a different router and/or modem, if you can. Is there anybody out there who has experienced anything like it? Please post! |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
"It sounds more and more like a hardware problem."

Well, it was, or is, and I did, or am. Are you sitting comfortably? The story so far...

After a lot of swapping spare kit around, it wasn't the power supplies nor the modem. But using the spare router fixed it. However, the spare router is twice the size of the regular one and won't fit in the space. So I got another router of the same type as the original. The problem returned. It would seem that the problem is somehow the result of the way the regular router (Netgear FVS114) handles things, which must be different from how the spare (Netgear FVS318v3) works.

So at the moment things are still working with the spare router (together with its power supply and the modem and its power supply) balanced on the edge of the bench, in grave danger of me catching a foot in a loop of cable and bringing the whole edifice crashing to the ground. The way jobs start has somehow been changed now, so I plan to quietly try the original router when nobody's looking and hope things work. Stay tuned for the next thrilling episode... |
©2025 CERN