41) Message boards : Theory Application : New Version (v3.14) (Message 6418)
Posted 3 Jul 2019 by Ray Murray
Post:
Hi CP
Could you describe your "manual intervention"? I have tried editing the elapsed time in the checkpoint but that method doesn't work with these. I see "completion file detected" in the log of those tasks. What should I create and where should I put it?
42) Message boards : Theory Application : New Version (v3.14) (Message 6414)
Posted 2 Jul 2019 by Ray Murray
Post:
I let a 3.14 run long enough to start a job then did the checkpoint edit to see how it would react. Same -240 file xfer error as before so I don't know how these will do if they are left to run to term.
Oldest one is 4hrs in, running its 3rd job. Others are around the 1hr mark, running their 1st or 2nd jobs, so start-up is OK and intermediate Job-completion uploads are OK, but there is still some doubt as to TASK completion.
43) Message boards : Theory Application : New Version (v3.13) (Message 6412)
Posted 2 Jul 2019 by Ray Murray
Post:
All of my 3.13s happily returned McPlots but ALL ended with

upload failure: <file_xfer_error>
<file_name>Theory_2279-772328-75_0_r1496801322_result</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>

… on completion, including 2 that I tried to "end gracefully" with the checkpoint edit. 2 others are still running near completion but I'm not hopeful. Couple of 3.14s running, but not far in, so we'll see how they fare.
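For anyone wanting to pull these failures out of saved task output, here is a minimal sketch that reads a `<file_xfer_error>` fragment like the one above. The function name and the returned keys are my own choices for illustration, not anything BOINC provides, and it assumes the fragment is well-formed XML.

```python
import re
import xml.etree.ElementTree as ET

# Fragment copied from the task output quoted above.
SAMPLE = """<file_xfer_error>
<file_name>Theory_2279-772328-75_0_r1496801322_result</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>"""


def parse_file_xfer_error(fragment: str) -> dict:
    """Return file name, numeric code and reason text from one error fragment.

    Hypothetical helper: the key names are illustrative only.
    """
    root = ET.fromstring(fragment)
    file_name = (root.findtext("file_name") or "").strip()
    raw_code = (root.findtext("error_code") or "").strip()

    # The error_code text looks like "-240 (stat() failed)":
    # a signed integer, optionally followed by a reason in parentheses.
    match = re.match(r"(-?\d+)\s*(?:\((.*)\))?$", raw_code)
    code = int(match.group(1)) if match else None
    reason = match.group(2) if match else None
    return {"file_name": file_name, "code": code, "reason": reason}


if __name__ == "__main__":
    print(parse_file_xfer_error(SAMPLE))
```

Counting how many results carry code -240 against how many uploaded cleanly would show whether the failure really affects every task at completion.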
44) Message boards : Theory Application : New Version (v3.13) (Message 6405)
Posted 1 Jul 2019 by Ray Murray
Post:
Ah, so it's not a "NEW" version, updating 4.xx, it's a new image for the "old" 3.xx version (263.90 on main site) using Theory_2017_05_29 xml not Theory_2019_02_20 therefore not using cranky.
I get confused easily with the apps using a different numbering system here from what they do when they are moved over to Production.

Anyway, the ones I have all seem to be running fine although I wasn't getting the X509 problem that others were having so might be better to wait for somebody that WAS having that problem to report whether the issue is fixed.
45) Message boards : Theory Application : New Version (v3.13) (Message 6404)
Posted 1 Jul 2019 by Ray Murray
Post:
And it now runs multiple consecutive Jobs within the Task, as on the Production site rather than just a single Job, with Alt - F1 - 5 screens now available as well rather than being straight in to F2.
I'm going to guess at a similar 12hr initial limit, finishing after any Job in progress at that time completes. I've not got any that far in yet. I'm hoping that there is also a fix to give credit for work done on any Tasks that are terminated at the 18hr cut-off. I'll wait and watch any healthy ones that get that far before extending that limit.
46) Message boards : Theory Application : Windows Version (Message 6395)
Posted 6 Jun 2019 by Ray Murray
Post:
Still getting 3 or 4 a day of the "busy doing nothing" ones (per CP's screenshots) that seem to get stuck between setting up and processing then spend c.2hrs apparently doing nothing, but using a whole core to do that nothing, before ending in error for no credits. I have been Aborting any of these that I catch but have been wondering if there is anything returned that might isolate WHY they do this? If they return something useful, I will let them run but if not, I will continue to Abort them.

I have extended the time limit to allow healthy-looking, long runners a chance to complete but I would be interested to learn if anyone has found a way to "gracefully end" loopers or those that will clearly end in disk-limit errors so that at least some credits are awarded for the time, rather than none at all. (repeating from my last post but I don't know how to do it myself, or if it is even possible since there now appears to be less reliance on writing stuff to/from Boinc, out-with the VM itself.)
47) Message boards : Theory Application : Windows Version (Message 6320)
Posted 2 May 2019 by Ray Murray
Post:
Is there any way to "end gracefully" faulty jobs such as CP's screenshot (of which I seem to get quite a lot) or ones that are clearly going to Exceed disk limit (not so many), like one can with standard production VMs by editing the checkpoint to fool it into thinking it has run to term? I've tried that here and it doesn't work. Is the only option to Abort said tasks or let them error out on their own?
48) Message boards : Theory Application : Windows Version (Message 6298)
Posted 21 Apr 2019 by Ray Murray
Post:
At the rate the console is whizzing by, I suspect this one would have been an Exceeded-disk-limit failure had it been in a Production VM, but as I don't know where the corresponding logs are kept in the VM here, I can't tell for sure, so I'm letting it run for now but I'm not hopeful for its outcome.
echo "runspec=boinc pp jets 7000 170,-,2960 - sherpa 2.1.1 default 38000 46"
echo "run=pp jets 7000 170,-,2960 - sherpa 2.1.1 default"
echo "jobid=49639792"
echo "revision=2279"
echo "runid=772896"
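If it helps anyone compare job descriptions between good and bad tasks, here is a rough sketch that turns the `echo "key=value"` lines above into a dictionary. The function name is made up for illustration, and I have deliberately not guessed at what the individual fields inside `run` mean; they are only split on whitespace.

```python
import re

# Lines copied from the console output quoted above.
ECHO_LINES = '''echo "runspec=boinc pp jets 7000 170,-,2960 - sherpa 2.1.1 default 38000 46"
echo "run=pp jets 7000 170,-,2960 - sherpa 2.1.1 default"
echo "jobid=49639792"
echo "revision=2279"
echo "runid=772896"'''

# Matches: echo "key=value" (key stops at the first '=').
_ECHO_RE = re.compile(r'^\s*echo\s+"([^="]+)=([^"]*)"\s*$')


def parse_echo_lines(text: str) -> dict:
    """Collect key/value pairs from echo "key=value" lines (hypothetical helper)."""
    pairs = {}
    for line in text.splitlines():
        match = _ECHO_RE.match(line)
        if match:
            pairs[match.group(1)] = match.group(2)
    return pairs


if __name__ == "__main__":
    job = parse_echo_lines(ECHO_LINES)
    print(job["jobid"], job["revision"], job["runid"])
    # Whitespace-separated tokens of the run description, uninterpreted.
    print(job["run"].split())
```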

I'm also getting 10 or more a day, across my 3 machines, of the c.2hr Unknown Error / Incorrect Function failures as previously detailed by CP.
49) Message boards : Theory Application : Windows Version (Message 6296)
Posted 19 Apr 2019 by Ray Murray
Post:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2769602 and
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2769692 ran for more than 19hrs each but threw -240, upload failure: <file_xfer_error>, presumably because a job was still running when the time limit kicked in so there was no completion file to upload. With "ordinary" VMs that Job would simply be lost but the Task would still be credited.
50) Message boards : Theory Application : Windows Version (Message 6295)
Posted 19 Apr 2019 by Ray Murray
Post:
I've had a few of these, too, with the exact same screen output as CP's earlier screenshot. They tend to "run" for 1 - 2hrs, using ALL of a core for that time, then exit with an "unknown error code" ("incorrect function" in stderr).
Resetting the VM doesn't help as it always gets stuck at the same place.
Latest one, here, will "finish" about 20:00ish UTC so I'm just going to let it run in case it shows up anything interesting in the upload. Don't know where else to look for any other pointers.
51) Message boards : Theory Application : Windows Version (Message 6111)
Posted 27 Feb 2019 by Ray Murray
Post:
I only let my machines grab a couple of these at a time, supposedly so that I can watch for irregularities, although I normally get distracted and miss whatever I was hoping to spot. On further inspection of the previous behaviour, I find that the vdi is deleted when the last task finishes, hence the new download with each new batch of work requested. Only its .xml is retained. I had earlier mistaken the deprecated 2019_02_22, which I have now deleted.
Too late at night to investigate further tonight so I'll have another look tomorrow evening, with CP's reset suggestion as first option.
52) Message boards : Theory Application : Windows Version (Message 6109)
Posted 27 Feb 2019 by Ray Murray
Post:
I would try that if this was only happening on just 1 host but I find this to be repeatable over all 3 of my hosts.
It's not really a problem for me as I have a reasonably fast, unlimited connection but it might be interesting to see if anyone else could reproduce the behaviour.
53) Message boards : Theory Application : Windows Version (Message 6107)
Posted 27 Feb 2019 by Ray Murray
Post:
Just repeated this to make sure I hadn't mistaken it yesterday with the reversion to 4.16.

Checked all 3 hosts have 2019_02_20 vdi.
On requesting new work, if there is already a task running, only the .run file is downloaded. (all good).
However, if the host has been allowed to run dry, even though the vdi is already in the dev project folder, it downloads the 421MB vdi again. Not good for those on limited or slow connections.
54) Message boards : Theory Application : Windows Version (Message 6089)
Posted 25 Feb 2019 by Ray Murray
Post:
Picked up 7 tasks over 3 hosts yesterday. All except one completed, validated and credited with runtimes of 1 - 18 hrs. There was no Graphics output through Boinc from any of them and only Alt-F1 initial setup screen (no F2 events or Top), therefore no way to check on how any jobs were progressing other than the cpu usage.
No McPlots for any of them 8~(
No debris left in any slots or in VBox 8~)
55) Message boards : CMS Application : New version v48.30 (Message 5812)
Posted 8 Feb 2019 by Ray Murray
Post:
I spotted there were some available so I let my hosts grab one each, just for the novelty value. 1 overpowered my aging host with not enough memory and had to be aborted, no surprise, really, and the other 2 succumbed to the No_Sub_Tasks error.
Back to production Theory, I suppose.
56) Message boards : Number crunching : An Alternative Approach For VM applications (Message 5552)
Posted 30 Sep 2018 by Ray Murray
Post:
I'm not 100% clear on what's proposed either. [thinking it through while I type so might take a couple of edits]
Not tried the Native Atlas as I don't run Linux, and anyway my machines really just aren't powerful enough for what Atlas requires.
I did briefly sample Linux to try some earlier apps here (VMs inside a VM) but couldn't get my head around the command line stuff.

Let's see if I've got this right;
Windows hosts (probably the majority of private users) run Boinc which controls the VM inside of which runs the computation which downloads the relevant science package (Pythia, Sherpa etc.) as required, which all gets destroyed each time Boinc finishes a Task.
Is the new approach for the host to support a Linux guest VM which contains Boinc, which would download AND STORE the packages and keep them for whenever they are needed next, with Boinc then controlling the Native apps?

Seems there is potential for reduced bandwidth requirement after the initial setup.
Would the user need to set up Boinc within the VM?
How to earn credits if there isn't a requirement for Boinc to end and report Tasks? (Like the Christmas Challenges of a few years ago where we gathered MCPlots but no Boincs.) Might be a turnoff for some.
Would there be issues with the user pausing or shutting down the Guest VM?
One of my machines has a noisy fan so I turn Boinc down to 50% overnight. Could I still do that without getting too Linuxy?
57) Message boards : LHCb Application : New version v1.07 (Message 5527)
Posted 17 Sep 2018 by Ray Murray
Post:
3 ran and finished successfully with 1 to finish shortly. All show CPU time of 1/3 or less of elapsed time. Don't know if that's expected behaviour for running single cores as I haven't done much LHCb work, since my machines simply don't have the necessary resources to run them efficiently.
58) Message boards : LHCb Application : New version v1.06 (Message 5523)
Posted 14 Sep 2018 by Ray Murray
Post:
Two ran successfully here,
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2378976 and
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2378692
Both show several Jobs completed within the Task although they only show 86mins CPU time for the 12hr run.
59) Message boards : ALICE Application : The ALICE Application (Message 5471)
Posted 26 Jul 2018 by Ray Murray
Post:
I have a large number of OLD Alice, Atlas, LHCb, Benchmark, etc., results still listed under my Tasks. Some of these are over 2 years old so surely can't still be useful. Is it not about time for a purge of these to free up some server space? I know some purges in the past haven't gone well and have been over-aggressive, but perhaps a 6-month limit (?) might allow enough records for task comparison by the user and remove redundant tasks.
60) Message boards : Sixtrack Application : Throughput Testing (Message 5412)
Posted 20 Apr 2018 by Ray Murray
Post:
Thanks Laurence,
My concern was that a problem with a particular host might return an erroneous result, backed up by the return of a similarly erroneous result from the same host, resulting in a "wrong" answer being validated and treated as being "correct".

Still a problem with the exe download. I aborted all the offending tasks and the stuck transfer, deleted the zero-size file remnant in the Project folder and restarted Boinc, but the new download is still stuck. No tasks available.

20/04/2018 17:57:19 | lhcathome-dev | Started download of sixtrack_win64_466_sse2.exe
20/04/2018 17:57:22 | | Project communication failed: attempting access to reference site
20/04/2018 17:57:22 | lhcathome-dev | Temporarily failed download of sixtrack_win64_466_sse2.exe: transient HTTP error
20/04/2018 17:57:22 | lhcathome-dev | Backing off 00:13:50 on download of sixtrack_win64_466_sse2.exe

although other errors have shown up as
app_version download error: couldn't get input files:
<file_xfer_error>
<file_name>sixtrack_win32_466_sse2.exe</file_name>
<error_code>-120 (RSA key check failed for file)</error_code>
<error_message>signature verification failed</error_message>
and some other ones had something about the file being the "wrong size" but I can't find them, so maybe they were the ones that were cancelled by the server.
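To keep track of how long a stuck download is being backed off, here is a small sketch that splits event-log lines like the ones quoted above into timestamp, project and message, then picks out the "Backing off" entries. It assumes the day/month/year date format shown in my log (other locales will differ), and the function names are hypothetical, not part of Boinc.

```python
import re
from datetime import datetime, timedelta

# Event-log lines copied from the post above.
LOG = """20/04/2018 17:57:19 | lhcathome-dev | Started download of sixtrack_win64_466_sse2.exe
20/04/2018 17:57:22 | | Project communication failed: attempting access to reference site
20/04/2018 17:57:22 | lhcathome-dev | Temporarily failed download of sixtrack_win64_466_sse2.exe: transient HTTP error
20/04/2018 17:57:22 | lhcathome-dev | Backing off 00:13:50 on download of sixtrack_win64_466_sse2.exe"""

_BACKOFF_RE = re.compile(r"Backing off (\d+):(\d+):(\d+) on download of (\S+)")


def parse_event_log(text: str) -> list:
    """Split each line into (timestamp, project, message).

    Assumes 'dd/mm/YYYY HH:MM:SS | project | message'; the project may be empty.
    """
    events = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|", 2)]
        if len(parts) != 3:
            continue  # skip anything that is not a three-field log line
        stamp = datetime.strptime(parts[0], "%d/%m/%Y %H:%M:%S")
        events.append((stamp, parts[1], parts[2]))
    return events


def download_backoffs(events: list) -> dict:
    """Map file name -> latest back-off interval reported for its download."""
    backoffs = {}
    for _stamp, _project, message in events:
        match = _BACKOFF_RE.search(message)
        if match:
            hours, minutes, seconds = (int(g) for g in match.groups()[:3])
            backoffs[match.group(4)] = timedelta(
                hours=hours, minutes=minutes, seconds=seconds
            )
    return backoffs


if __name__ == "__main__":
    for name, delay in download_backoffs(parse_event_log(LOG)).items():
        print(name, "retry in", delay)
```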