41) Message boards : Number crunching : Feeder not running (Message 4439)
Posted 4 Dec 2016 by m
Post:


.....well except now I can't get a X2 task to start running and having the typical heartbeat problem.

Every time I get several to run in a row Valid this happens........good thing it is right in front of me.


This has been happening here overnight on LHC@home. If a job starts on its own it works, but if anything else is happening (like a 15-minute download of another image) jobs fail to start. It looks to be related to network access somehow. I was also trying to fit OS updates in... should have gone to bed. This problem seems to have generally got worse since the move to https and the LHC@home consolidation.
42) Message boards : CMS Application : New Version v47.60 (Message 4260)
Posted 30 Oct 2016 by m
Post:
Now I have one CMS task running, two Atlas tasks trying to upload and a LHCb task waiting to run. Why?
Tullio


Atlas has been missing since around 0300GMT. Even their website is down for me.

Edit:-
Back now, tho' @1545.
43) Message boards : Theory Application : Authentication errors - error 206. (Message 4246)
Posted 28 Oct 2016 by m
Post:
Several failures like this (or very similar, they don't all have all these lines) occurred overnight:-

2016-10-28 05:14:44 (2604): Guest Log: [INFO] Theory application starting. Check log files.
2016-10-28 05:14:54 (2604): Guest Log: [DEBUG] HTCondor ping
2016-10-28 05:15:24 (2604): Guest Log: [DEBUG] 1
2016-10-28 05:15:24 (2604): Guest Log: [DEBUG] DC_NOP failed!
2016-10-28 05:15:24 (2604): Guest Log: AUTHENTICATE:1003:Failed to authenticate with any method
2016-10-28 05:15:24 (2604): Guest Log: AUTHENTICATE:1004:Failed to authenticate using GSI
2016-10-28 05:15:24 (2604): Guest Log: GSI:5002:Failed to authenticate because the remote (server) side was not able to acquire its credentials.
2016-10-28 05:15:24 (2604): Guest Log: 10/28/16 05:14:44 recognized DC_NOP as command name, using command 60011.
2016-10-28 05:15:24 (2604): Guest Log: 10/28/16 05:15:20 SECMAN: required authentication with local collector failed, so aborting command DC_SEC_QUERY.
2016-10-28 05:15:24 (2604): VM Completion File Detected.
2016-10-28 05:15:24 (2604): VM Completion Message: Could not ping HTCondor.


The BOINC task fails:-

The filename or extension is too long.
(0xce) - exit code 206 (0xce)
44) Message boards : Theory Application : Unable to ping HTCondor (Message 4221)
Posted 23 Oct 2016 by m
Post:
OK well that must mean it is a 32bit-X86 OS and yeah they will not *see* any more Ram than the original 4GB

No, that's the capacity of the MB.

I left the other one running the old X86 XP Pro with just the 4GB Ram just because I want to see how long it will run (it is #1 on the vLHC stats page)

I can't get any of my 32bit XP hosts to run the Condor 32bit theory app. I've given up, at least for now; although I've looked enviously at yours...

BUT you are right about the ISP's we are forced to use with these VB tasks.

I have been running those tasks for almost 6 years now and the main problem has been VB failing because of my over-priced and usually pitiful DSL

Seen a couple of these now:-

2016-10-23 01:10:26 (2201): VM Completion Message: Could not connect to lhchomeproxy.cern.ch on port 3125

Never seen them before so something's going on...

In fact I lost my connection for about 30mins tonight and lost a few VB tasks since they would not restart VB back to where it was when it lost that connection.

That's a very good point. Hosts here shut down and restart every day; this is at the root of a lot of odd failures with Condor-based projects, I think.

Sometimes with mine it seems like the VB tasks are more of a way to test my DSL than it is with colliding particles.

Completely agree, but also feel that these projects could be better designed to allow for "consumer grade" setups (longer timeouts, compressed files, more allowance for network interruptions, low speeds etc). [rant] BOINC is, after all, designed to use "spare" capacity on existing consumer systems - not require 20 core hosts with 20Mb/s connections running 24/7. [/rant] Even my systems run OK if I leave them running all the time, but that's too expensive.
45) Message boards : Theory Application : Unable to ping HTCondor (Message 4219)
Posted 22 Oct 2016 by m
Post:
There have been a couple of failures recently, like this. Presumably there is a timeout on this, and maybe it's exacerbated by "other" recent net activity; but can the timeout be increased, please?... one fewer cause of 206 errors.

I don't know anything about using Linux here but it says that pc has never contacted the server.

And just that one failed task.

Moving to the new server required detaching from and re-attaching to the project, and this host ended up with a different host ID; the original ID was 266 - it's back there now. This also meant it got an (unwanted) task which I let run to see how it went. It doesn't usually run this project nowadays but should work.

Maybe try a VB update to a newer version (and reboot) and I always do a clean update just to make sure it works.

It doesn't (or didn't) seem broke... but you could be right; on the TODO list.

And you could add another 4GB of ram since yours is real close to not having enough to even run these VB tasks (find a good deal and you will like the way it runs just doing that)

I know, but that's all it will take. I originally hoped that BOINC would sort this out for itself ("waiting for memory") but it doesn't. So, as projects tend to increase their RAM needs over time (the original T4T was 256MB - happy days), those hosts with limited RAM now get app_config files to only allow the task combinations that will fit - a bit of trial and error there and it's a work in progress. But not really relevant to this.
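For anyone wanting to try the app_config approach, here is a minimal sketch in Python that writes out the XML. The `<max_concurrent>` and `<project_max_concurrent>` elements are standard BOINC app_config.xml settings; the app names and limits below are only placeholders (assumptions), not the values used on these hosts.

```python
import xml.etree.ElementTree as ET


def build_app_config(limits, project_max=None):
    """Return app_config.xml text limiting concurrent tasks per app.

    limits: dict mapping app name -> max concurrent tasks.
    project_max: optional cap across all apps of the project.
    """
    root = ET.Element("app_config")
    for name, max_concurrent in limits.items():
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = name
        ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    if project_max is not None:
        ET.SubElement(root, "project_max_concurrent").text = str(project_max)
    return ET.tostring(root, encoding="unicode")


# Hypothetical app names and limits; check the project's own app names
# before using anything like this.
config_text = build_app_config({"Theory": 1, "CMS": 1}, project_max=1)
print(config_text)
```

The resulting text would be saved as app_config.xml in the project's directory under the BOINC data directory, and the client told to re-read its config files.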

You have a 2-core .......did you have any other type of Boinc task running or ever had it work with another type?

Shouldn't have, but maybe... and yes.

And what type of ISP (and do you have some type of firewall or security program running?)

The firewall is common to all the hosts - no problem there. The root cause is almost certainly slow network response, possibly from ISP activity - they're always posting notices of overnight maintenance work of some sort, somewhere - plus there were all those botnet members busy DDoSing Dyn or whoever. OpenDNS is set as third choice DNS (1st and 2nd are my ISP, which may well have used Dyn) so it could have taken several seconds to get through. A longer timeout for the ping would make things a bit more tolerant of network delays.
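To illustrate what "a bit more tolerant" could look like, here is a rough sketch of a connection check with a longer timeout and a few retries. It is not the project's actual check (that runs inside the VM); the function name and the default values are my own assumptions.

```python
import socket
import time


def can_connect(host, port, timeout=30.0, attempts=3, pause=5.0):
    """Try a TCP connection several times before giving up.

    The timeout applies to each connect attempt. Name-resolution
    failures (socket.gaierror) and timeouts are both OSError
    subclasses, so a slow or failed DNS lookup is simply retried.
    """
    for attempt in range(1, attempts + 1):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError as exc:
            print(f"attempt {attempt}/{attempts} failed: {exc}")
            if attempt < attempts:
                time.sleep(pause)
    return False


# Example call (not run here):
# can_connect("lhchomeproxy.cern.ch", 3125)
```

With three attempts and a pause between them, a DNS lookup that takes several seconds to fall through to the third resolver would cost a retry rather than the whole task.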
46) Message boards : Theory Application : Unable to ping HTCondor (Message 4217)
Posted 22 Oct 2016 by m
Post:
There have been a couple of failures recently, like this. Presumably there is a timeout on this, and maybe it's exacerbated by "other" recent net activity; but can the timeout be increased, please?... one fewer cause of 206 errors.
47) Message boards : CMS Application : New Version v47.50 (Message 4197)
Posted 19 Oct 2016 by m
Post:
The 12/18 hr timeout still causes a fair bit of wasted time (and errors and frustration), but tasks behave oddly sometimes. In the course of shifting stuff to the new servers, I noticed this:-

CMS 47.50 (mt, 1 core, took 3gb)
Host# 553.
CMS Task# 274259.
From a very long stderr:-

Day/ Time
14/03.08.14 Wrapper start. (Boinc task started)
14/03.15.53 Job 6492 start.
14/06.15.33 Job finished with 0.
14/06.18.12 Job 7054 start.
14/06.57.02 VM stopped. (Normal host shutdown time)

15/01.04.44 Wrapper start. (Normal host startup time)
15/01.04.46 Job finished with 0.
15/01.04.46 Job 7054 start.
15/04.45.36 Job finished with 134.
15/04.45.46 Job 755 start.
15/06.57.02 VM stopped. (Normal shutdown time)

16/01.04.44 Wrapper start. (Normal startup time)
16/01.04.46 Job finished with 134.
16/01.04.46 Job 755 start.
16/06.12.17 Job finished with 0.
16/06.14.15 Job 5986 start.
16/06.57.02 VM stopped. (Normal shutdown time)

17/01.04.43 Wrapper start. (Normal startup time)
17/01.04.45 Job finished with 0.
17/01.04.45 Job 5986 start.
17/03.22.00 Boinc finish. (Boinc task finished OK)

Locating the jobs on Dashboard is a bit cumbersome; it's not obvious which CRAB
task they each belong to, but it looks like 6492, 7054 and 755 all finished OK at the first
attempt. 5986 failed with 61311 and got no second attempt.

Hosts are normally started by their BIOS clock and shut down by a cron job using the boinccmd "quit" command. After 3 mins the host is powered off ('nuther cron job), so the process should be fairly graceful.
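As a sketch of that sequence: the hosts here use two separate cron jobs, so the single Python function below is only an illustration of the same steps. `boinccmd --quit` is the real command; the `shutdown -h now` call and the function name are assumptions.

```python
import subprocess
import time


def stop_boinc_then_power_off(run=subprocess.run, sleep=time.sleep,
                              grace_seconds=180):
    """Ask the BOINC client to quit, wait, then power the host off.

    run and sleep are parameters so the sequence can be exercised
    without touching the real system.
    """
    # Tell the running client to exit so tasks can stop cleanly.
    run(["boinccmd", "--quit"], check=False)
    # Give the client (and any VMs) about 3 minutes to save state.
    sleep(grace_seconds)
    # Assumed power-off command; it needs root privileges.
    run(["shutdown", "-h", "now"], check=False)
```

Calling it with no arguments from a root cron entry at the usual shutdown time would do roughly what the two cron jobs do.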
Total time ca 15h53. The shutdown time is clearly not counting towards the timeout - a welcome change, but the stopping and resuming doesn't seem to work well: the job in hand is restarted after each resume (and what is exit code 134?). The last job was cut short.
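For anyone wanting to check running time against the timeout, here is a small sketch that adds up the intervals in the timeline above, pairing each "Wrapper start" with the following "VM stopped" or "Boinc finish". It measures wall-clock time between those stamps only, so it need not agree with the total quoted, which may have been counted on a different basis. The month and year are assumptions taken from the post date.

```python
from datetime import datetime, timedelta

# (start, stop) pairs copied from the timeline: each "Wrapper start"
# paired with the next "VM stopped" or "Boinc finish".
SESSIONS = [
    ("14/03.08.14", "14/06.57.02"),
    ("15/01.04.44", "15/06.57.02"),
    ("16/01.04.44", "16/06.57.02"),
    ("17/01.04.43", "17/03.22.00"),
]


def parse(stamp):
    """Parse 'day/HH.MM.SS'; October 2016 is assumed from the post date."""
    return datetime.strptime("2016-10-" + stamp, "%Y-%m-%d/%H.%M.%S")


def total_running_time(sessions):
    """Sum the wall-clock length of every (start, stop) session."""
    return sum((parse(stop) - parse(start) for start, stop in sessions),
               timedelta())


total = total_running_time(SESSIONS)
print(total)
```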

Is this how it's supposed to work? I seem to remember that the "production" standard was performance equal to the old T4T... I don't think we're quite there yet.
48) Message boards : Theory Application : A new 32bit image is available (Message 4158)
Posted 25 Sep 2016 by m
Post:
A screenshot, the cp command 11 lines down...

Edit:-
Using the old URL, http not https. (Thanks for the idea tho', maeax)
49) Message boards : Theory Application : A new 32bit image is available (Message 4155)
Posted 25 Sep 2016 by m
Post:
Fails to get x509 credential. The web server doesn't seem to be running so can't get to the logs.

The host is using a direct connection to CVMFS.
50) Message boards : Theory Application : A new 32bit image is available (Message 4154)
Posted 25 Sep 2016 by m
Post:
Ignore this one - so many fingers, all those keys...
51) Message boards : Theory Application : A new 32bit image is available (Message 4150)
Posted 23 Sep 2016 by m
Post:

Please can you send me the StartLog..

Done.
52) Message boards : Theory Application : A new 32bit image is available (Message 4149)
Posted 23 Sep 2016 by m
Post:

Starts OK but Condor exits after less than a minute.
.


It shouldn't do that. Please can you send me the StartLog. I have sent you my email address via PM.


Not easy since it terminates the BOINC task (successfully; credit, too!) I'll try to catch it out.
53) Message boards : Theory Application : A new 32bit image is available (Message 4145)
Posted 23 Sep 2016 by m
Post:
On my 64bit machine changed the os_name in the Theory_2016_04_30.xml from Linux26_64 to Linux26 so that a 32bit VM is started. I also replaced the Theory_2016_08_05.vdi image with the Theory32_2016_09_21.vdi which I manually downloaded. After doing this the new Theory task started and is running fine. Please post the results of any similar tests, especially if this is on 32bit hardware.
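The os_name change described above can also be scripted. This is a minimal sketch, assuming the file is well-formed XML containing an `<os_name>` element; the one-line sample stands in for the real Theory_2016_04_30.xml, which has more in it.

```python
import xml.etree.ElementTree as ET


def set_os_name(xml_text, new_os_name="Linux26"):
    """Return the job XML with every <os_name> element set to new_os_name."""
    root = ET.fromstring(xml_text)
    for node in root.iter("os_name"):
        node.text = new_os_name
    return ET.tostring(root, encoding="unicode")


# Stand-in for the real job file (assumed structure, for illustration only).
sample = "<vbox_job><os_name>Linux26_64</os_name></vbox_job>"
patched = set_os_name(sample)
print(patched)
```

Swapping the .vdi image would still be a manual step, as in the description.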

WinXP 32 BOINC 7.6.22 VBox 4.3.12 (1424)
Starts OK but Condor exits after less than a minute. Like this:-

2016-09-23 01:03:46 (620): Guest Log: [INFO] Theory application starting. Check log files.
2016-09-23 01:04:26 (620): Guest Log: [INFO] Condor exited with return value N/A.
2016-09-23 01:04:26 (620): Guest Log: [INFO] Shutting Down.
2016-09-23 01:04:26 (620): VM Completion File Detected.
2016-09-23 01:04:26 (620): VM Completion Message: Condor exited with return value N/A.

.
54) Message boards : Theory Application : A new 32bit image is available (Message 4132)
Posted 19 Sep 2016 by m
Post:
Task with vboxwrapper 26178 had very low cpu.


This was the first test under https.
Task finished with Cobblestones.

https://lhcathome.cern.ch/vLHCathome-dev/result.php?resultid=260038


I get similar errors to yours:-

2016-09-18 12:13:55 (1900): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2016-09-18 12:14:05 (1900): Guest Log: [DEBUG] nc: getaddrinfo: Temporary failure in name resolution
2016-09-18 12:14:05 (1900): Guest Log: [DEBUG] 1

and:-

2016-09-18 12:17:37 (1900): Guest Log: [ERROR] Could not get an x509 credential

I don't think that any useful work is done.

Edit:- It would be nice to have one that runs on 64 bit hosts without having to use AP.
55) Message boards : CMS Application : Increased failure rate "error 134 - Aborted" (Message 4105)
Posted 29 Aug 2016 by m
Post:
I'd suspect you were having transient network problems around the times these happened.

Well, it has been known... I'll keep an eye out. Thanks again.

Just checked, ADSL profile gone down from 7.96M.. 7.08M... 6.45M over the last few days. Maybe you're right.
56) Message boards : CMS Application : Increased failure rate "error 134 - Aborted" (Message 4103)
Posted 29 Aug 2016 by m
Post:
Can you point me to a task log or a job number?


920, 1394, 1361, 1325...

Did about 17 jobs last night, one (7215) failed 10034, so it seems OK now. I'm searching these by IP, don't know how else to do it, so they might have been finished elsewhere. (Still troubled by the 12/18 hour timeout...)

Thanks for taking a look, Ivan.
57) Message boards : CMS Application : Increased failure rate "error 134 - Aborted" (Message 4099)
Posted 28 Aug 2016 by m
Post:
My failure rate "error 134" has increased by a factor of maybe 10 for the
task ....0046_Q compared to task .....0046_P. At least, that's what it looks
like from my reading of Dashboard and making no allowance for the small
numbers... or anything I've done.

None of the jobs were sent for a re-try so I don't know if they would have
failed on another host, nor if a particular host is the culprit.

Any ideas?
58) Message boards : Theory Application : A new 32bit image is available (Message 4097)
Posted 28 Aug 2016 by m
Post:
I have reverted the vboxwrapper back to the old one (26178) so that it doesn't request hardware acceleration.


If I try it on a 64 bit host "as is", this happens:

28/08/2016 01:15:52 | vLHCathome-dev | Message from server: VirtualBox jobs require hardware acceleration support. Your processor does not support the required instruction set.

BOINC 7.6.22, VB 4.3.26.

....and so to bed.
59) Message boards : Theory Application : A new 32bit image is available (Message 4096)
Posted 27 Aug 2016 by m
Post:
Doesn't work for me on a 32 bit host (BOINC 7.6.22; VB 4.3.12)

When checking access to the three CERN sites, gets "Temporary failure of name resolution" for all three checks.

Subsequently when trying to get X509 credential, fails

"chmod: cannot access '/tmp/x509up_u0': No such file or directory", repeated many times, eventually failing

[ERROR] Could not get an x509 credential
/root/bootstrap: line 130: /usr/sbin/boinc-shutdown: Input/output error
60) Message boards : CMS Application : CMS versions aborts (Message 3951)
Posted 4 Aug 2016 by m
Post:
Had a couple of these recently:-

Stderr output

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
The filename or extension is too long.
(0xce) - exit code 206 (0xce)
</message>
<stderr_txt>
....
....
2016-07-31 01:25:00 (3384): Guest Log: [ERROR] Could not get an x509 credential
2016-07-31 01:25:10 (3384): Guest Log: [ERROR] The x509 proxy creation failed.
2016-07-31 01:25:10 (3384): Guest Log: [INFO] Shutting Down.
2016-07-31 01:25:10 (3384): VM Completion File Detected.
2016-07-31 01:25:10 (3384): VM Completion Message: The x509 proxy creation failed.
.
2016-07-31 01:25:10 (3384): Powering off VM.


The host is one of those connecting directly to cvmfs.
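With a very long stderr it can help to pull out just the error lines. This is a small sketch that assumes the line layout seen in the extract above (timestamp, a number in parentheses, then "Guest Log: [LEVEL] text"); the function name is my own.

```python
import re

# Matches lines such as:
# 2016-07-31 01:25:00 (3384): Guest Log: [ERROR] Could not get an x509 credential
LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((?P<num>\d+)\): "
    r"Guest Log: \[(?P<level>[A-Z]+)\] (?P<text>.*)$"
)


def guest_errors(stderr_text):
    """Return (time, text) for each Guest Log line tagged [ERROR]."""
    found = []
    for line in stderr_text.splitlines():
        match = LINE.match(line.strip())
        if match and match.group("level") == "ERROR":
            found.append((match.group("time"), match.group("text")))
    return found


sample = """\
2016-07-31 01:25:00 (3384): Guest Log: [ERROR] Could not get an x509 credential
2016-07-31 01:25:10 (3384): Guest Log: [ERROR] The x509 proxy creation failed.
2016-07-31 01:25:10 (3384): Guest Log: [INFO] Shutting Down.
2016-07-31 01:25:10 (3384): VM Completion File Detected.
"""
for when, text in guest_errors(sample):
    print(when, text)
```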



