Message boards : Number crunching : glidein failures
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 75
Message 2242 - Posted: 5 Mar 2016, 2:49:34 UTC
Last modified: 5 Mar 2016, 3:03:16 UTC

Glidein is failing to load various files, e.g.:

file_list.ebsf2L.lst: OK
Signature OK for main:file_list.ebsf2L.lst.
error_gen.ebsdRw.sh: OK
Signature OK for main:error_gen.ebsdRw.sh.
Sat Mar 5 02:00:44 GMT 2016 Failed to load file 'error_augment.ebsdRw.sh' from 'http://lcggwms01.gridpp.rl.ac.uk:8319/factory/stage/glidein_v3_2_7'.
Sat Mar 5 02:00:45 GMT 2016 Sleeping 335


Another one:

description.ebeeXx.cfg: OK
Signature OK for client:description.ebeeXx.cfg.
description.ebeeXx.cfg: OK
Signature OK for client_group:description.ebeeXx.cfg.
Sat Mar 5 02:26:35 GMT 2016 Failed to load file 'file_list.ebsf2L.lst' from 'http://lcggwms01.gridpp.rl.ac.uk:8319/factory/stage/glidein_v3_2_7'.
Sat Mar 5 02:26:36 GMT 2016 Sleeping 345
Sat Mar 5 02:32:21 GMT 2016 Sleeping 314
Sat Mar 5 02:37:35 GMT 2016 Sleeping 250
Sat Mar 5 02:41:48 GMT 2016 Sleeping 284



This has been occurring with various files on various hosts. Once a file has failed to load, that glidein run never recovers. The next run often fails on a different file. After a few runs, one succeeds in getting all the files and off we go.
Now it looks as though the failures are "nonzero exits" and the task restarts, often more than once. There doesn't seem to be any indication of why the downloads fail. Should this be in a log somewhere? Do I need to poke a hole in the firewall for this, or should there be a keep-alive holding the port open?
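
One way to see whether the failures can be reproduced outside glidein is to fetch one of the affected files repeatedly from the same host. The following is only a rough Python sketch: the base URL and file name are copied from the log above, but joining them with "/" is an assumption about how glidein builds its requests, and the function names `fetch_once`, `probe` and `summarise` are made up for illustration.

```python
import http.client
import time
import urllib.request

# Copied from the glidein log above. Joining them with "/" is an assumption
# about the request glidein actually makes.
BASE_URL = "http://lcggwms01.gridpp.rl.ac.uk:8319/factory/stage/glidein_v3_2_7"
FILE_NAME = "error_augment.ebsdRw.sh"


def fetch_once(url, timeout=30):
    """Fetch url once; return (ok, detail), detail being a size or the error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, "%d bytes" % len(resp.read())
    except (OSError, http.client.HTTPException) as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False, repr(exc)


def probe(url, attempts=5, pause=1.0, fetch=fetch_once):
    """Fetch url `attempts` times; return a list of (attempt, ok, detail)."""
    results = []
    for n in range(1, attempts + 1):
        ok, detail = fetch(url)
        results.append((n, ok, detail))
        if n < attempts:
            time.sleep(pause)
    return results


def summarise(results):
    """Return (successes, failures) counted from probe() results."""
    good = sum(1 for _, ok, _ in results if ok)
    return good, len(results) - good
```

Calling `probe(BASE_URL + "/" + FILE_NAME, attempts=50)` and printing each tuple would show the error text for any failed attempt, which is the detail the glidein log leaves out.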
ID: 2242 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,944,473
RAC: 3,018
Message 2247 - Posted: 5 Mar 2016, 10:30:16 UTC - in response to Message 2242.  

Glidein is failing to load various files, e.g.:

Sat Mar 5 02:00:44 GMT 2016 Failed to load file 'error_augment.ebsdRw.sh' from 'http://lcggwms01.gridpp.rl.ac.uk:8319/factory/stage/glidein_v3_2_7'.

That looks like a network problem, or a (transient?) problem at RAL. I'll alert Andrew, but he's been in Spain all week so I daresay he's got better things to do with his weekend. :-)
ID: 2247 · Rating: 0
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 75
Message 2249 - Posted: 5 Mar 2016, 17:32:28 UTC - in response to Message 2247.  
Last modified: 5 Mar 2016, 17:52:08 UTC

Glidein is failing to load various files, e.g.:

Sat Mar 5 02:00:44 GMT 2016 Failed to load file 'error_augment.ebsdRw.sh' from 'http://lcggwms01.gridpp.rl.ac.uk:8319/factory/stage/glidein_v3_2_7'.

That looks like a network problem, or a (transient?) problem at RAL. I'll alert Andrew, but he's been in Spain all week so I daresay he's got better things to do with his weekend. :-)

It isn't a new problem. It is transient but frequent, and it affects several hosts. Previously it simply delayed job startup, which is how it came to light; now it causes the task to restart, very nearly preventing any useful work. It only seems to affect these files. It looks as though my firewall (which is common to the whole LAN and could easily be the culprit) closes an incoming port when some timeout or other expires... but which port? And why only these files?

This is what it looks like in stderr:

2016-03-05 06:15:20 (25107): Detected: Web Application Enabled (http://localhost:39333)
2016-03-05 06:15:20 (25107): Detected: Remote Desktop Enabled (localhost:49419)
2016-03-05 06:15:20 (25107): Preference change detected
2016-03-05 06:15:20 (25107): Setting CPU throttle for VM. (100%)
2016-03-05 06:15:21 (25107): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 600 seconds) or (Vbox_job.xml: 600 seconds))
2016-03-05 06:15:22 (25107): Guest Log: BIOS: Boot : bseqnr=1, bootseq=0032
2016-03-05 06:15:22 (25107): Guest Log: BIOS: Booting from Hard Disk...
2016-03-05 06:15:25 (25107): Guest Log: BIOS: KBD: unsupported int 16h function 03
2016-03-05 06:15:25 (25107): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000
2016-03-05 06:15:57 (25107): Guest Log: vboxguest: major 0, IRQ 20, I/O port d020, MMIO at 00000000f0400000 (size 0x400000)
2016-03-05 06:16:16 (25107): Guest Log: VBoxGuest: VBoxGuestCommonGuestCapsAcquire: pSession(0xffff8800761cce10), OR(0x0), NOT(0xffffffff), flags(0x0)
2016-03-05 06:16:16 (25107): Guest Log: VBoxGuest: VBoxGuestCommonGuestCapsAcquire: pSession(0xffff8800761cca10), OR(0x0), NOT(0xffffffff), flags(0x0)
2016-03-05 06:46:01 (25107): VM Completion File Detected.
2016-03-05 06:46:01 (25107): Powering off VM.
2016-03-05 06:46:02 (25107): Successfully stopped VM.
2016-03-05 06:46:02 (25107): Deregistering VM. (boinc_2f5baf1e8d65dce5, slot#1)
2016-03-05 06:46:02 (25107): Removing network bandwidth throttle group from VM.


They don't run for long... and don't do any work.
ID: 2249 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,944,473
RAC: 3,018
Message 2250 - Posted: 5 Mar 2016, 18:48:45 UTC - in response to Message 2249.  

Thanks for the extra info. Let's hope an expert can come up with a suggestion soon!
ID: 2250 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2251 - Posted: 5 Mar 2016, 18:55:22 UTC

I guess you have tried the standard procedure (powering down modem and router)?
Maybe clearing the dns cache?
ID: 2251 · Rating: 0
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 75
Message 2252 - Posted: 5 Mar 2016, 19:29:57 UTC - in response to Message 2251.  
Last modified: 5 Mar 2016, 19:40:01 UTC

I guess you have tried the standard procedure (powering down modem and router)?
Maybe clearing the dns cache?


Thanks for the thought, but yes. I'm not usually keen on "turn it off and on again", but I don't know any other (hopefully) certain way to clear the router's DNS cache, and the modem (and much else) is on the same power lead so it got done by default. The two hosts (1031 and 1033) from which I copied the logs are both Linux, so there is no DNS cache there. It also affects Windows hosts, but they're not running CMS at the moment. I've tried repeatedly running nslookup against the RAL URL, but it's never failed. If the fault were at RAL everybody would see it, although it's always a file of the set from that URL.

Edit: Your reply set off some thoughts. Different boxes have got different combinations of ISP and OpenDNS servers set. Perhaps I'll set something like Google's DNS on all, just for the moment, to see if anything changes. It won't be the first time that an ISP's DNS has been a bit questionable.
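
The repeated nslookup check mentioned above can also be scripted so that it runs unattended and records any inconsistent answers. This is a minimal Python sketch, not anything glidein itself does: the host name and port are taken from the failing URL, `resolve` and `check_dns` are hypothetical helper names, and the lookups go through whatever resolver the operating system is configured to use, so it cannot pick a particular DNS server.

```python
import socket

# Host name and port taken from the failing URL in the logs.
HOST = "lcggwms01.gridpp.rl.ac.uk"
PORT = 8319


def resolve(host, port):
    """Return the sorted unique addresses the system resolver gives for host."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def check_dns(host, port, attempts=10, resolver=resolve):
    """Resolve host repeatedly; return (set of distinct answers, failure count)."""
    seen = set()
    failures = 0
    for _ in range(attempts):
        try:
            seen.add(tuple(resolver(host, port)))
        except socket.gaierror:
            failures += 1
    return seen, failures
```

Running `check_dns(HOST, PORT, attempts=100)` on each box would show whether the differently configured DNS servers ever fail or disagree. More than one distinct answer, or a non-zero failure count, would point towards DNS rather than the firewall.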
ID: 2252 · Rating: 0
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 75
Message 2257 - Posted: 6 Mar 2016, 1:16:29 UTC
Last modified: 6 Mar 2016, 1:31:14 UTC

The two Linux boxes have each started new BOINC tasks without problems on the original DNS servers.
Host 1031 started cmsRun at 20 min and event 1 at 27 min. The cmsRun time is about normal for me, but event 1 has previously been closer to 40 min. ADSL down speed is ~5.3 Mb/s.
Host 1033 started cmsRun at 13 min and event 1 at 17 min. I've never seen anything that fast before and no, the ADSL hasn't miraculously speeded up.
Similar hardware and software in each case.

Off to bed while they get some real work done... I hope such good results as we do produce are being put to good use. They are... aren't they?
ID: 2257 · Rating: 0
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 75
Message 2263 - Posted: 8 Mar 2016, 3:02:13 UTC
Last modified: 8 Mar 2016, 3:04:41 UTC

Another pair:-

Setting X509_USER_PROXY to canonical path /tmp/x509up_u500
signature.ebsf2L.sha1: OK
signature.g2ifBz.sha1: OK
signature.ebeeXx.sha1: OK
signature.ebeeXx.sha1: OK
description.ebsf2L.cfg: OK
Signature OK for main:description.ebsf2L.cfg.
description.g2ifBz.cfg: OK
Signature OK for entry:description.g2ifBz.cfg.
description.ebeeXx.cfg: OK
Signature OK for client:description.ebeeXx.cfg.
description.ebeeXx.cfg: OK
Signature OK for client_group:description.ebeeXx.cfg.


-- ~62 "OK" lines snipped --

Signature OK for client_group:preentry_file_list.ebeeXx.lst.
constants.ebeeXx.cfg: OK
Signature OK for client_group:constants.ebeeXx.cfg.
condor_vars.eb89he.lst: OK
Signature OK for client_group:condor_vars.eb89he.lst.
untar.eb89he.cfg: OK
Signature OK for client_group:untar.eb89he.cfg.
No signature for /home/boinc/CMSRun/glide_QUJwyt/client_group_main/nodes.blacklist.
cat_consts.eb89he.sh: OK
Signature OK for client_group:cat_consts.eb89he.sh.
check_blacklist.eb89he.sh: OK
Signature OK for client_group:check_blacklist.eb89he.sh.
aftergroup_preentry_file_list.eb89he.lst: OK
Signature OK for client:aftergroup_preentry_file_list.eb89he.lst.
file_list.g2ifBz.lst: OK
Signature OK for entry:file_list.g2ifBz.lst.
constants.g2ifBz.cfg: OK
Signature OK for entry:constants.g2ifBz.cfg.
Tue Mar 8 01:19:14 GMT 2016 Failed to load file 'condor_vars.g2ifBz.lst' from 'http://lcggwms01.gridpp.rl.ac.uk:8319/factory/stage/glidein_v3_2_7/entry_volunteer'.
Tue Mar 8 01:19:16 GMT 2016 Sleeping 327


and

Setting X509_USER_PROXY to canonical path /tmp/x509up_u500
signature.ebsf2L.sha1: OK
signature.g2ifBz.sha1: OK
signature.ebeeXx.sha1: OK
signature.ebeeXx.sha1: OK
description.ebsf2L.cfg: OK
Signature OK for main:description.ebsf2L.cfg.
description.g2ifBz.cfg: OK
Signature OK for entry:description.g2ifBz.cfg.
description.ebeeXx.cfg: OK
Signature OK for client:description.ebeeXx.cfg.
description.ebeeXx.cfg: OK
Signature OK for client_group:description.ebeeXx.cfg.
file_list.ebsf2L.lst: OK
Signature OK for main:file_list.ebsf2L.lst.
error_gen.ebsdRw.sh: OK
Signature OK for main:error_gen.ebsdRw.sh.
error_augment.ebsdRw.sh: OK
Signature OK for main:error_augment.ebsdRw.sh.
setup_script.ebsdRw.sh: OK
Signature OK for main:setup_script.ebsdRw.sh.
constants.ebsf2L.cfg: OK
Signature OK for main:constants.ebsf2L.cfg.
condor_vars.ebsdRw.lst: OK
Signature OK for main:condor_vars.ebsdRw.lst.
untar.ebsdRw.cfg: OK


-- ~60 "OK" lines snipped --

Signature OK for client_group:cat_consts.eb89he.sh.
check_blacklist.eb89he.sh: OK
Signature OK for client_group:check_blacklist.eb89he.sh.
aftergroup_preentry_file_list.eb89he.lst: OK
Signature OK for client:aftergroup_preentry_file_list.eb89he.lst.
file_list.g2ifBz.lst: OK
Signature OK for entry:file_list.g2ifBz.lst.
constants.g2ifBz.cfg: OK
Signature OK for entry:constants.g2ifBz.cfg.
condor_vars.g2ifBz.lst: OK
Signature OK for entry:condor_vars.g2ifBz.lst.
untar.f6ievT.cfg: OK
Signature OK for entry:untar.f6ievT.cfg.
Tue Mar 8 01:46:17 GMT 2016 Failed to load file 'nodes.blacklist' from 'http://lcggwms01.gridpp.rl.ac.uk:8319/factory/stage/glidein_v3_2_7/entry_volunteer'.
Tue Mar 8 01:46:18 GMT 2016 Sleeping 296


There follow several "Sleeping" entries, but nothing to indicate that glidein tries again to load the file, and the run eventually fails.
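
For comparing many runs, the log excerpts above can be boiled down to a per-run summary. The Python sketch below is only an illustration: the regular expressions are modelled on the lines quoted in this thread, other glidein versions may format them differently, and `summarise_log` is a made-up helper name.

```python
import re

# Patterns modelled on the log lines quoted in this thread (an assumption
# about the format, not taken from glidein documentation).
FAILED = re.compile(
    r"^(?P<when>.+?) Failed to load file '(?P<name>[^']+)' from '(?P<url>[^']+)'"
)
SLEEPING = re.compile(r"^(?P<when>.+?) Sleeping (?P<secs>\d+)$")
FILE_OK = re.compile(r"^(?P<name>\S+): OK$")


def summarise_log(lines):
    """Count verified files and collect load failures and total sleep time."""
    summary = {"ok": 0, "failed": [], "sleep_seconds": 0}
    for raw in lines:
        line = raw.strip()
        failed = FAILED.match(line)
        sleeping = SLEEPING.match(line)
        if failed:
            summary["failed"].append(
                (failed.group("when"), failed.group("name"), failed.group("url"))
            )
        elif sleeping:
            summary["sleep_seconds"] += int(sleeping.group("secs"))
        elif FILE_OK.match(line):
            # "Signature OK for ..." lines are deliberately not counted here.
            summary["ok"] += 1
    return summary
```

Feeding each run's log through `summarise_log(open(path))` gives the number of files fetched before the failure, which file failed, and how long the run then spent sleeping. That makes it easier to see whether failures cluster after a particular number of downloads.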

These are successive runs from the same host (1031). The next one made it to the end and job 7854 is running as I write.

The only idea I have at the moment is that maybe the large number of successive downloads (there were other hosts running ATLAS etc.) somehow triggers the "TCP flood" detector in the firewall, although there's nothing in the log (such as it is). The detector is intended to prevent DoS attacks. It's not clear whether it applies only to incoming traffic or to both directions. It can be turned on or off for TCP and/or UDP. I've turned it off for TCP to see if there's any effect, although I wouldn't be happy with this as a permanent arrangement.
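
To judge whether the flood-detector theory is even plausible, it helps to know how many connections the LAN opens in a short period. The Python sketch below assumes the detector is a simple "more than N new connections in T seconds" rule. That is a guess about how such detectors commonly work, not the router's documented behaviour, and the function name `max_in_window` is hypothetical.

```python
def max_in_window(timestamps, window):
    """Return the largest number of events within any span shorter than `window`.

    `timestamps` are connection start times in seconds (any origin).
    """
    times = sorted(timestamps)
    best = 0
    start = 0
    for end, t in enumerate(times):
        # Drop events from the left until the span is shorter than `window`.
        while t - times[start] >= window:
            start += 1
        best = max(best, end - start + 1)
    return best
```

Given connection start times collected from all hosts (for example from a packet capture), `max_in_window(times, 1.0)` gives the peak number of new connections per second, which could then be compared with whatever threshold the firewall turns out to use.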
ID: 2263 · Rating: 0
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 75
Message 2368 - Posted: 12 Mar 2016, 12:08:12 UTC

I've now been through all the "odd" firewall settings without success. The "failed to load" failures are still occurring. Since the hosts are off during the day, and this far exceeds whatever suspend/resume timing works at the moment, they always start a new job soon after booting. I'm not sure how this affects things. In any event, it will contaminate any baseline run you want to do, so I'll give up for now.
ID: 2368 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2369 - Posted: 12 Mar 2016, 12:41:56 UTC

It sounds more and more like a hardware problem.
Noisy power supplies can cause all sorts of strange behavior.
I would try a different router and/or modem, if you can.


Is there anybody out there who has experienced anything like it?

Please post!
ID: 2369 · Rating: 0
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 75
Message 3164 - Posted: 1 May 2016, 22:23:15 UTC - in response to Message 2369.  
Last modified: 1 May 2016, 23:10:00 UTC

It sounds more and more like a hardware problem.
Noisy power supplies can cause all sorts of strange behavior.
I would try a different router and/or modem, if you can

Well, it was, or is and I did, or am.
Are you sitting comfortably? The story so far...
After a lot of swapping spare kit around, it wasn't the power supplies or the modem, but using the spare router fixed it. However, the spare router is twice the size of the regular one and won't fit in the space, so I got another router of the same type as the original. The problem returned. It would seem that the problem is somehow the result of the way the regular router (Netgear FVS114) handles things, which must be different from how the spare (Netgear FVS318v3) works. So at the moment things are still working with the spare router (together with its power supply, and the modem and its power supply) balanced on the edge of the bench, in grave danger of me catching a foot in a loop of cable and bringing the whole edifice crashing to the ground. The way jobs start has somehow been changed now, so I plan to quietly try the original router when nobody's looking and hope things work.
Stay tuned for the next thrilling episode...
ID: 3164 · Rating: 0



©2024 CERN