1) Message boards : Theory Application : Veeerrrry long Pythia8 (Message 8300)
Posted 21 Jan 2024 by Profile Ray Murray
Post:
Problem solved ... sort of.
I gently suspended both running tasks individually and shut down BOINC to do something else. On restarting BOINC, the other task picked up from where it left off, but this one started from scratch again, so there is absolutely no chance of it finishing within the deadline and I've killed it. Hopefully the next recipient will be a whizzier machine and be able to complete it timeously.
2) Message boards : Theory Application : Veeerrrry long Pythia8 (Message 8297)
Posted 19 Jan 2024 by Profile Ray Murray
Post:
These are often very long but I got home today to find
https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=2377057 has been running for 17hrs already. I check those I find in case they need to be resuscitated, but this one is actually running fine and still writing to its logs, just very slowly (46 mins per 100 events 🐌).
It's running Fri Jan 19 01:22:14 UTC 2024 [boinc pp jets 13000 300 - pythia8 8.243 CP1-CR1 100000 34], so it's a 100,000-event job, but it has so far completed only 3,500, so it would only be about 50% complete at the 10-day timeout. If I run it to timeout, will it send back anything useful that it has done, flag that it is only 50% complete and be resent as a 50,000-event job, or will it be resent as another 100,000-event job so the next recipient has the same problem?

I wonder if there is any value in extending its job duration to 20 days and letting it run to completion. I wouldn't get any credit for running it to the timeout or beyond, but would its final output still be of value?
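For what it's worth, this is the back-of-envelope arithmetic behind those figures, as a rough Python sketch of my own (the numbers are just the ones quoted above, and the rate is the average so far; the task has actually slowed to ~46 mins per 100 events, so even 20 days might not be enough at the current pace):

# Rough projection from the figures above (average rate so far, not the current slower rate)
ELAPSED_HOURS = 17
EVENTS_DONE = 3_500
TOTAL_EVENTS = 100_000

rate_per_hour = EVENTS_DONE / ELAPSED_HOURS   # ~206 events/hour on average
for days in (10, 20):
    projected = rate_per_hour * days * 24
    print(f"{days} days: ~{projected / TOTAL_EVENTS:.0%} ({projected:,.0f} of {TOTAL_EVENTS:,} events)")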
I'm tempted just to kill it now and get a replacement that will complete in a timely fashion.
3) Message boards : Theory Application : All errors (Message 8259)
Posted 23 Dec 2023 by Profile Ray Murray
Post:
I reset the project to pick up the new version and clear the debris. As I have Production running too, both tasks I got failed due to

VBoxManage.exe: error: Cannot register the hard disk 'C:\ProgramData\BOINC\projects\lhcathomedev.cern.ch_lhcathome-dev\Theory_2023_12_13.vdi' {a9d19666-9f42-47d2-9b06-f58d52e3215c} because a hard disk 'C:\ProgramData\BOINC\projects\lhcathome.cern.ch_lhcathome\Theory_2023_12_13.vdi' with UUID {a9d19666-9f42-47d2-9b06-f58d52e3215c} already exists

I'll let Production run dry while I'm out at work and try -dev on its own this evening.
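For anyone else hitting this, a quick way to see which Theory images VirtualBox has registered, and spot the duplicate UUID, is to parse the output of "VBoxManage list hdds". This is only a rough Python sketch of my own, not anything the project supplies; it assumes VBoxManage(.exe) is on the PATH and that the output is the usual "Key: value" blocks separated by blank lines (the format may vary between VirtualBox versions):

import subprocess

out = subprocess.run(["VBoxManage", "list", "hdds"],
                     capture_output=True, text=True, check=True).stdout

record = {}
for line in out.splitlines() + [""]:
    if line.strip():
        key, _, value = line.partition(":")
        record[key.strip()] = value.strip()
    elif record:                      # blank line ends one disk record
        if "Theory" in record.get("Location", ""):
            print(record.get("UUID"), record.get("Location"))
        record = {}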
4) Message boards : Theory Application : All errors (Message 8254)
Posted 15 Dec 2023 by Profile Ray Murray
Post:
For the past few days (since version 5.95?) I have had nothing but errors, but have only now had the time to take a closer look. I reset the project to start afresh and saw nothing untoward in VBox, but I think I may have tracked down the cause.
I hope it's this simple:
The vbox_job has the line
<multiattach_vdi_file>Theory_2023_12_12.vdi</multiattach_vdi_file>

and the stderr has
Adding virtual disk drive to VM. (Theory_2023_12_12.vdi)

but it can't find that file, so it throws an error,
because the .vdi is actually called
Theory_2023_12_13.vdi

Production runs fine, so I have edited the -dev vbox_job accordingly, but after having so many errors I can't fetch a job to test whether that will work.
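If anyone wants to check for the same mismatch without digging through stderr, the idea is simply to compare the <multiattach_vdi_file> entry against the .vdi files actually present. A rough Python sketch; the project path and the vbox_job file name are assumptions from my own install and will differ on other machines:

import xml.etree.ElementTree as ET
from pathlib import Path

proj = Path(r"C:\ProgramData\BOINC\projects\lhcathomedev.cern.ch_lhcathome-dev")
job_file = proj / "Theory_vbox_job.xml"   # assumed name; use whichever vbox_job xml the task references
vdi_name = ET.parse(job_file).getroot().findtext("multiattach_vdi_file")

if vdi_name and not (proj / vdi_name).exists():
    print(f"vbox_job points at {vdi_name}, but the project dir only has:")
    for f in sorted(proj.glob("Theory_*.vdi")):
        print("   ", f.name)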


Boinc 7.24.1
Win64
VBox 7.0.12
5) Message boards : Sixtrack Application : Xtrack beam simulation (Message 7934)
Posted 7 Mar 2023 by Profile Ray Murray
Post:
Two Xtracks that I'm about to euthanise: 2283960 and 2283961. Both completed on Linux in a couple of seconds, but the Windows wingman Aborted after 2 days. Is that significant? Mine (Windows) are at 29 and 21 hours respectively. Both have been using all of a core each, but I'm pretty sure they haven't actually done anything.

This host only has onboard graphics. Are these supposed to go only to proper GPUs?

Boboviz's github link looks far too technical for me. If setting all that up is required to get these running, I'll set my preferences not to accept these for now and just go back to Theory tasks.
6) Message boards : Sixtrack Application : Xtrack beam simulation (Message 7916)
Posted 2 Feb 2023 by Profile Ray Murray
Post:
I don't know what the expected behaviour of these is so here are my observations in the hope that they are helpful:
I had a successful 20-second Xtrack yesterday, where the wingmen completed in a similar time but with errors, and I have another that has been running since yesterday: progress 100%, 33 hrs elapsed, none remaining. In Properties, CPU Time and Since last checkpoint are ~1/2 hr behind Elapsed Time, but in the slot, stderr says
12:29:53 (13692): app started; CPU time 20.000000, flags:
but nothing has been written to any of the slot files since yesterday. It's using all of one core. The wingman completed in 2.05s so I suspect it is stuck.
A restart of Boinc reset the clocks:
Initial estimate of 3? mins; 50% reached at 2 mins, but progress, and time remaining, got progressively slower: 91% at 7 mins, 99.60% at 23 mins with 00:00:00 remaining, using ~70% of a core; 99.999% at 33 mins; 100% at 45 mins. It's now using almost all of a core and I'll let it do that overnight, but I expect it will have to be euthanised tomorrow.
7) Message boards : General Discussion : Boinc VM App (Message 7835)
Posted 21 Oct 2022 by Profile Ray Murray
Post:
It was an experimental app from July/August 2019 that was useful in developing the apps we are running now. Those unsent tasks became "stuck" when that experiment concluded with the release of the app(s) that replaced it.
It's probably more trouble than it's worth to manually remove the offending tasks.
8) Message boards : Theory Application : New Version 5.40 (Message 7719)
Posted 2 Aug 2022 by Profile Ray Murray
Post:
Great, thanks for that clarification update 👍
9) Message boards : Theory Application : New Version 5.40 (Message 7707)
Posted 1 Aug 2022 by Profile Ray Murray
Post:
Ah, ok. I've not been paying attention recently, just been letting it run with the VBox window closed.
Really we're waiting on VirtualBox themselves for a proper fix.
10) Message boards : Theory Application : New Version 5.40 (Message 7705)
Posted 1 Aug 2022 by Profile Ray Murray
Post:
5.40 Theory -dev and Production Theory seem to have been playing nicely together overnight and today, but I would still like to see whether, on finishing, the last attached -dev task allows the .vdi to be released. The transition where no -dev task was attached but the vdi was not released, so subsequent tasks could not re-attach to it, was where I previously had postponements. As is typical when I want to watch that, I have a task that is at 60% after 14hrs, so it'll be tomorrow evening before I can watch what happens.

Win10
Boinc 7.20.2
VBox 6.1.36
11) Message boards : ATLAS Application : ATLAS vbox v.1.13 (Message 7505)
Posted 4 Jul 2022 by Profile Ray Murray
Post:
There still appears to be a conflict when running LHC and -dev together. I tried a few LHC tasks over the weekend and THEY ran fine, but new -dev tasks all stopped with a similar error. Setting No New Tasks and allowing the LHC ones to finish, I then exited Boinc, deleted everything in the slots and removed the .vdi from VBox, then closed the VBox window as advised previously. Restarted Boinc and all is well again with only -dev tasks running. (This is with Theory but it seems it might also apply to Atlas if Maeax is running LHC and -dev.)
12) Message boards : Theory Application : New Version 5.30 (Message 7399)
Posted 19 Jun 2022 by Profile Ray Murray
Post:
Ah, ok. I was hoping, perhaps too optimistically, that it would be more elegant, with a return of the partially completed work.
13) Message boards : Theory Application : New Version 5.30 (Message 7396)
Posted 19 Jun 2022 by Profile Ray Murray
Post:
I wasn't able to pay attention yesterday but I had left Boinc running with a 1/3 LHC/dev resource share so there would always be at least 1 dev task attached. They have played well together. This may be because the Theory vdi was never left without a task attached. I have yet to test whether a new task attaches after the vdi is completely released.

@Computezrmle
Both tasks stopped at 2022-06-17 19:29:17 BST and successfully restarted at 2022-06-17 19:29:51 BST.
Do you remember what you did at that point?
I had previously just Aborted postponed tasks, thinking them to be unrecoverable. This successful restart would have been where I removed the vdi from VBox. I may have then closed the VBox window, as suggested earlier.

@Crystal
I'll give that shutdown a try on a Production task first. One of the dev tasks is a 4-day Sherpa, so I'd rather let it run to completion or, if the practice shutdown is successful, return at least partial useful work. I'd rather euthanise it than murder it.

[Later]
☹️ Well, I'm obviously doing something wrong with the graceful shutdown thing. The VMs did shut down, but not gracefully, so all 3 attempts were lost to Computation Error 😒. However, with the VBox window not open, the vdi is removed on shutdown, whereas when I had it open to see what was happening, the vdi was not removed. So I believe the problem is with VBox not allowing the unattached Theory image to be fully released, and thus available for reattachment, while that window is open, rather than a problem with the wrapper itself. The vm_image.vdi does get released even when viewing VBox, so I don't know why the Theory one doesn't 🤔
The easiest solution is, therefore, not to be viewing VBox while the last attached task is ending.
With VBox closed, 2 dev started at the same time and both failed (not postponed, however). Starting one then another seems to be OK.
2 dev and 1 Production currently running happily. 😀

I hope someone can make sense of what's been happening over the past couple of days and it will help in finding a more robust solution.
Thanks C & CP for your assistance
14) Message boards : Theory Application : New Version 5.30 (Message 7384)
Posted 17 Jun 2022 by Profile Ray Murray
Post:
<VirtualBox xmlns="http://www.virtualbox.org/" version="1.12-windows">
Similar to Crystal, I almost always have VBox open so I can see whether the vms are being created and starting.

Both tasks that replaced today's earlier successes stopped Postponed after 40? seconds. When the first successful one finished, Boinc didn't request new work until the 2nd one had finished (possibly due to my app_config limiting the number of running tasks, which I have temporarily removed), so the vdi was left unattached to anything and the replacements failed to attach to it. The app_config contains nothing other than max_concurrent lines but seemed to be causing a blockage. I sacrificed 2 LHC tasks to give -dev a clear run, and so as not to cause myself unnecessary confusion.

I tried a few things to get those 2 postponed tasks running:
Exit Boinc, delete powered-off vms, restart Boinc -- Postponed
Exit Boinc, delete vm, empty each slot, restart Boinc -- Postponed
Exit Boinc delete vm, empty slots, Remove vdi from VBox -- Successful start (I did start them individually just in case)

The removal of the app_config allowed a new task to download on completion of the short-running replacement; it immediately started up and successfully attached to the image, which was still attached to the other running vm.

While I have been writing this, another task has ended and been successfully replaced 😀
This part, at least, is now working for me. I'm still concerned that new tasks might be unable to attach to the image if there is not another one already attached.

I have yet to master how to gracefully end a task so I will have to wait a few hours until these tasks finish to test whether a new task will attach to the vdi when no others are already attached. Crystal said earlier that they do but my original problem was because they didn't.

[ In the job xml there is the line <completion_trigger_file>shutdown</completion_trigger_file>
I created a text file named shutdown in the slot but I have no idea how to implement it.
Oops. I'm more graphical than command line, so I accidentally killed one using fsutil to create a shutdown file, but it did that very much less than gracefully and resulted in a Computation error 😕 ]
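For anyone else wanting to try the trigger without resorting to fsutil, this is all I was attempting, as a tiny Python sketch (the slot path is an assumption for whichever slot the running task is using; whether the wrapper then shuts the VM down gracefully is exactly what I still need to confirm):

from pathlib import Path

slot = Path(r"C:\ProgramData\BOINC\slots\0")   # assumed slot of the running task
(slot / "shutdown").touch()                    # empty completion-trigger file, per <completion_trigger_file>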

It will probably be Sunday before I get another chance to play with it, so I've upped the resource share so it shouldn't run dry, and fired up LHC again in case it does.

Not problems; just learning opportunities.
Anything is possible with the right attitude ... and a hammer 😜
15) Message boards : Theory Application : New Version 5.30 (Message 7369)
Posted 17 Jun 2022 by Profile Ray Murray
Post:
The 2 that I started this morning have completed successfully 8¬) but I won't know if their replacements are ok until I get home.
Both were attached to the vdi, as seen in Media Manager by 2 separate long strings of characters in a dropdown.
16) Message boards : Theory Application : New Version 5.30 (Message 7364)
Posted 17 Jun 2022 by Profile Ray Murray
Post:
Thanks for looking,
I'll look for that when I get home. I don't know what my earliest version of VBox was, but it would have been from the time of the first Theory jobs. There have been many uninstalls and upgrades since then, so I wouldn't think there would be anything of an old version left over, unless there is some fragment lurking somewhere in the registry.

The overnight test didn't work as well as expected, with another Postponed task. I Aborted it and again manually removed the powered-off vm and the image to allow another to start.
I'll report in again when I get home, c.17:00 UTC.
17) Message boards : Theory Application : New Version 5.30 (Message 7360)
Posted 16 Jun 2022 by Profile Ray Murray
Post:
Conjecture awaiting further observation:
A single task registers the .vdi in VBox Media Manager and the vm attaches and starts up successfully. Starting a 2nd task also successfully attaches to the image and runs (all good so far). If one of those tasks finishes while another is still running, the ending one detaches and a new one attaches (again, all good). If there is continuity of at least one vm attached to the image, then there is continued success.

BUT if an ending task is not replaced and the last connected vm detaches, such that there is no vm attached to the image, the image remains in Media Manager but subsequent tasks are unable to attach to it, resulting in the Postponed/cleanup error. Manual removal of the image in VBox before a new task starts allows normal service to resume.

Overnight, I have limited LHC to only one running task to test Part 1 so there should always be at least one -dev task attached, with rolling replacement, and I don't expect any problem.
Part 2 will need closer observation to confirm but I won't be able to do that until Friday evening after work as my other host has died so I'm down to only this one.
18) Message boards : Theory Application : New Version 5.30 (Message 7358)
Posted 16 Jun 2022 by Profile Ray Murray
Post:
My tasks https://lhcathomedev.cern.ch/lhcathome-dev/results.php?userid=196
1 success, then a few "Postponed: environment needs to be cleaned up" (or similar wording).
A Project reset got it working again for a while but, returning home this evening, I found 2 more Postponed. A Boinc restart lets them start again but, although running in Boinc, the VM shows "FATAL: could not read from the boot medium! System halted." I suspect they would do nothing useful until timeout, so I have Aborted them, which leaves behind the Powered-off vm, which has to be manually removed. The Theory_2020_05_08.vdi also needs to be Removed (but kept) to allow the next task to start successfully. (2 instances are attached to it when they are running correctly.)
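That manual "Removed (but kept)" step can also be done from the command line: "VBoxManage closemedium" without --delete unregisters the image but leaves the .vdi file on disk. A small Python sketch of my own, purely illustrative; the path is from my install, and the command will refuse to run if a VM still has the image attached:

import subprocess

vdi = r"C:\ProgramData\BOINC\projects\lhcathomedev.cern.ch_lhcathome-dev\Theory_2020_05_08.vdi"
subprocess.run(["VBoxManage", "closemedium", "disk", vdi], check=True)   # unregisters, keeps the file on disk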
I have 3 cores allocated to Boinc, with a maximum of 2 from either LHC or -dev allowed to run concurrently (2 LHC + 1 -dev, or 1 LHC + 2 -dev), so 5 consecutive successes suggest that it does sometimes clean up on the way out, but the Postponed ones suggest this is not always the case. Maybe sometimes being Multi-attached, sometimes singly, is confusing it?
1 -dev & 2 LHC running just now after manually doing the cleanup.

I don't see others reporting similar issues but I hope this input is helpful.

Win 10
19) Message boards : Theory Application : New Version v5.19 (Message 6946)
Posted 15 Jan 2020 by Profile Ray Murray
Post:
My Linux VBox Theory 5.19s show the same output, so not just Windows.
Otherwise it works fine, it just doesn't show the job in Alt-F1. It's still available from the 1st line of running.log.

5.20 for Windows does indeed fix this.
20) Message boards : ATLAS Application : Testing CentOS 7 vbox image (Message 6802)
Posted 2 Nov 2019 by Profile Ray Murray
Post:
196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED
Quite annoying after 4+ days as it otherwise ran fine.

Logs show
2019-11-01 23:36:09 (1796): Guest Log: HITS file was successfully produced

2019-11-01 23:36:42 (1796): Guest Log: Successfully finished the ATLAS job!
and
2019-11-01 23:36:49 (1796): Guest Log: *** Success! Shutting down the machine. ***
so it would seem the error happened during post-processing.
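If it happens again I might keep an eye on the slot's disk usage during post-processing; a throwaway Python sketch of my own (the slot path is assumed, and the actual limit is whatever rsc_disk_bound the task was sent with):

from pathlib import Path

slot = Path(r"C:\ProgramData\BOINC\slots\1")   # assumed slot of the ATLAS task
used = sum(f.stat().st_size for f in slot.rglob("*") if f.is_file())
print(f"{used / 1024**3:.2f} GiB currently used in {slot}")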

Forgot to mention:
I'm also seeing the full download if there is a break in work. If work is contiguous, it only downloads the stuff to run the new job, but if there is a break between finishing and uploading a job and requesting a new one, the vdi is downloaded again.
I've not checked whether the vdi is being deleted on completion or being overwritten. I'll look tomorrow.

