Message boards : Theory Application : New Version 5.30
Joined: 13 Apr 15 · Posts: 138 · Credit: 2,969,210 · RAC: 0
The 2 that I started this morning have completed successfully 8¬) but I won't know if their replacements are OK until I get home. Both were attached to the vdi, as seen in the Media Manager by 2 separate long strings of characters in a dropdown.
Joined: 28 Jul 16 · Posts: 486 · Credit: 394,839 · RAC: 0
> The 2 that I started this morning have completed successfully 8¬) ...

Typical computer behaviour: it won't run into an error while you investigate.

> ... seen in Media Manager by 2 separate long strings of characters in a dropdown.

An (old) example of how it should look: https://www.virtualbox.org/manual/ch05.html#diffimages

The "long strings" are derived from the unique UUIDs automatically generated by VirtualBox.
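For anyone who prefers the command line over the Media Manager dropdown, `VBoxManage list hdds` shows the same information: the multiattach base image plus one differencing child per running VM. A sketch of what to expect (the output below is illustrative, stitched together from the UUIDs and paths posted later in this thread, not a verbatim capture):

```
"%vbox_msi_install_path%\VBoxManage" list hdds

rem Illustrative output:
rem UUID:           c7cbbeeb-c984-467e-9b6e-0d2e670bed58
rem Parent UUID:    base
rem Type:           multiattach
rem Location:       D:\Boinc1\projects\lhcathomedev.cern.ch_lhcathome-dev\Theory_2020_05_08.vdi
rem
rem UUID:           402794fc-be9e-4a4d-baf8-7050c8d550bc
rem Parent UUID:    c7cbbeeb-c984-467e-9b6e-0d2e670bed58
rem Location:       D:\Boinc1\slots\1\boinc_6711fafe8fc0c562\Snapshots\{402794fc-be9e-4a4d-baf8-7050c8d550bc}.vdi
```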
Joined: 13 Feb 15 · Posts: 1188 · Credit: 878,593 · RAC: 27
Started 2 at once and both running normally:

```
Location: D:\Boinc1\projects\lhcathomedev.cern.ch_lhcathome-dev\Theory_2020_05_08.vdi
Attached: D:\Boinc1\slots\1\boinc_6711fafe8fc0c562\Snapshots\{402794fc-be9e-4a4d-baf8-7050c8d550bc}.vdi
          D:\Boinc1\slots\0\boinc_3ca7f6ac0f658034\Snapshots\{6bb17b9a-4112-46d7-9902-59da1f789ad4}.vdi
```
Joined: 22 Apr 16 · Posts: 677 · Credit: 2,002,766 · RAC: 0
Theory: every fifth task has this grid.cern.ch error; 35 finished so far. One ATLAS task with no successful run. So for me, wrapper 204 in production use is no performance winner.
Joined: 28 Jul 16 · Posts: 486 · Credit: 394,839 · RAC: 0
Missing some examples. Your task lists show only 4 tasks using the new vboxwrapper instead of the 35 you mentioned. Errors regarding any CVMFS repository are not related to the new vboxwrapper.
Joined: 13 Feb 15 · Posts: 1188 · Credit: 878,593 · RAC: 27
The errors from maeax and Ray both contain these lines:

```
Another VirtualBox management application has locked the session for this VM.
BOINC cannot properly monitor this VM and so this job will be aborted.
```
Joined: 28 Jul 16 · Posts: 486 · Credit: 394,839 · RAC: 0
Did anybody run the VirtualBox Manager 1) to monitor the VMs while this error occurred? If so, please close it a while before you start a VM.

The reason why is explained in the BOINC source code:
https://github.com/BOINC/boinc/blob/client_release/7/7.20/samples/vboxwrapper/vbox_common.cpp#L1167-L1177

The error notes from the task logfiles can be found a few lines below that comment:
https://github.com/BOINC/boinc/blob/client_release/7/7.20/samples/vboxwrapper/vbox_common.cpp#L1190-L1192

1) Just a guess: I also had the VirtualBox Manager open, but it might be that Linux releases the lock early enough that it doesn't run into an error.
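If closing the GUI by hand is easy to forget, a tiny batch file can do it before BOINC is allowed to start new VMs. A sketch: the BOINC install path and the 10-second grace period are assumptions; VirtualBox.exe is the GUI's process name on Windows.

```
@echo off
rem Close the VirtualBox Manager GUI so it cannot hold the session lock
rem that vboxwrapper needs when it creates/monitors a VM.
taskkill /IM VirtualBox.exe >NUL 2>&1
rem Give VBoxSVC a moment to release the session lock (duration is a guess).
timeout /t 10 /nobreak >NUL
rem Let the BOINC client resume computation (default install path assumed).
"C:\Program Files\BOINC\boinccmd.exe" --set_run_mode auto
```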
Joined: 13 Feb 15 · Posts: 1188 · Credit: 878,593 · RAC: 27
> 1) Just a guess: I also had the VirtualBox Manager open, but it might be that Linux releases the lock early enough that it doesn't run into an error.

> If a volunteer opens another VirtualBox management application and goes poking around, that application can acquire the session lock and not give it up for some time.

When someone is "poking around", he knows why a task is failing ;)

I almost always have the VirtualBox Manager open before I run VBox tasks. That way I'm able to adjust the process priority (in my case 'lower than normal') for VBoxSVC.exe, which passes that priority along to VBoxHeadless.exe.
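For reference, the same priority change can be scripted instead of clicked in Task Manager. A sketch using the classic wmic tool (16384 is the documented value for "below normal"; child processes started afterwards inherit that class, as described above):

```
rem Set VBoxSVC to "below normal" priority (16384); VBoxHeadless.exe
rem processes it spawns later inherit the lowered priority class.
wmic process where name="VBoxSVC.exe" CALL setpriority 16384
```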
Joined: 22 Apr 16 · Posts: 677 · Credit: 2,002,766 · RAC: 0
> Missing some examples.

The link to my PC is in the General Discussion folder.
Joined: 28 Jul 16 · Posts: 486 · Credit: 394,839 · RAC: 0
That client runs the app_versions from the production server. Although you replaced the vboxwrapper with 26204, those app_versions are not configured to use differencing images. That's why you don't see any performance change:

```
2022-06-17 09:38:52 (13116): Adding virtual disk drive to VM. (vm_image.vdi)
```
Joined: 13 Apr 15 · Posts: 138 · Credit: 2,969,210 · RAC: 0
`<VirtualBox xmlns="http://www.virtualbox.org/" version="1.12-windows">`

Similar to Crystal, I almost always have VBox open so I can see whether the VMs are being created and starting.

Both tasks that replaced today's earlier successes stopped Postponed after 40? seconds. When the first successful one finished, BOINC didn't request new work until the 2nd one had finished (possibly due to my app_config limiting the number of running tasks, which I have temporarily removed), so the vdi was left unattached to anything and the replacements failed to attach to it. The app_config contains nothing other than max_concurrent lines but seemed to be causing a blockage. I sacrificed 2 LHC tasks to give -dev a clear run and so as not to cause myself unnecessary confusion.

I tried a few things to get those 2 postponed tasks running:

Exit BOINC, delete powered-off VMs, restart BOINC -- Postponed
Exit BOINC, delete VM, empty each slot, restart BOINC -- Postponed
Exit BOINC, delete VM, empty slots, remove vdi from VBox -- Successful start (I did start them individually, just in case)

The removal of the app_config allowed a new task to download on completion of the short-running replacement, which immediately started up and successfully attached to the image that was still attached to the other running VM. While I have been writing this, another task has ended and been successfully replaced 😀 This part, at least, is now working for me.

I'm still concerned that new tasks might be unable to attach to the image if there is not another one already attached. I have yet to master how to gracefully end a task, so I will have to wait a few hours until these tasks finish to test whether a new task will attach to the vdi when no others are already attached. Crystal said earlier that they do, but my original problem was because they didn't.

[In the job xml there is the line <completion_trigger_file>shutdown</completion_trigger_file>. I created a text file named shutdown in the slot but I have no idea how to implement it. Oops. I'm more graphical than Command Line, so I killed one, accidentally, using fsutil to create a shutdown file, but it did that very much less than gracefully and resulted in a Computation error 😕]

Probably Sunday before I get another chance to play with it, so I've upped the resource share so it shouldn't run dry, and fired up LHC again in case it does.

Not problems; just learning opportunities. Anything is possible with the right attitude ... and a hammer 😜
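For reference, vboxwrapper watches the slot's shared subdirectory for the trigger file (the batch file Crystal posts below targets the same path), so the empty file has to land there rather than in the slot root. A minimal one-liner; the slot number and data path are examples taken from this thread:

```
rem Creates an empty file named "shutdown" in the shared dir of slot 0
rem (adjust the slot number and BOINC data path to your setup).
copy /y NUL "D:\Boinc1\slots\0\shared\shutdown"
```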
Joined: 28 Jul 16 · Posts: 486 · Credit: 394,839 · RAC: 0
Please look into these logs:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3093090
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3092891

Both tasks stopped at 2022-06-17 19:29:17 BST and successfully restarted at 2022-06-17 19:29:51 BST. Do you remember what you did at that point?
Joined: 13 Feb 15 · Posts: 1188 · Credit: 878,593 · RAC: 27
@Ray: I noticed you are running dev Theory tasks and prod Theory tasks simultaneously. That should not be a problem, but it could be a reason. You could test without running Theory tasks from the production server.

Your graceful-shutdown issue: you could use this batch file to gracefully shut down a task.

```
@echo off
rem Ask which slot the task to be shut down is running in.
set "slotdir="
set /p "slotdir=In which slot-directory is the endless Theory task running you want to kill? "
rem Path to the slot's shared directory watched by vboxwrapper.
set boincpath="D:\Boinc1\slots\%slotdir%\shared"
rem Create the empty trigger file named "shutdown".
copy /y NUL %boincpath%\shutdown >NUL
exit
```

Of course you have to adjust the first part of boincpath.
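To use it, save those lines as e.g. graceful_shutdown.cmd (the name is arbitrary) and run it while the task is still active; on its next poll vboxwrapper should notice shared\shutdown and power the VM off cleanly instead of killing it.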
Joined: 13 Feb 15 · Posts: 1188 · Credit: 878,593 · RAC: 27
I tried to reproduce Ray's error: with only 1 dev Theory running, I loaded 2 Theory tasks from production (max_concurrent for LHC@home: 1). The very 1st one failed:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=358194711
Surely not a memory issue. The 2nd task started after the computation error from task 1 and keeps on running. Meanwhile I loaded 2 extra tasks from dev and a second one started (max_concurrent 2). I'll let it run for a while: 2 tasks from dev and 1 from production.
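For anyone following along, the max_concurrent limits mentioned here and in Ray's posts live in an app_config.xml in the project directory. A minimal sketch; the app name "Theory" is assumed to match the project's internal app name, and the limit value is just an example:

```
<!-- app_config.xml in projects\lhcathomedev.cern.ch_lhcathome-dev -->
<app_config>
    <app>
        <name>Theory</name>                 <!-- internal app name, assumed -->
        <max_concurrent>2</max_concurrent>  <!-- example limit -->
    </app>
</app_config>
```

BOINC reads it at startup or via Options → Read config files in the Manager.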
Joined: 28 Jul 16 · Posts: 486 · Credit: 394,839 · RAC: 0
I suspect lhcathome and lhcathome-dev are both connected to the same BOINC client and are running under the same user account.

For both Theory vdi files (dev and prod directory) please run the following command and post the output:

```
vboxmanage showhdinfo "Theory_2020_05_08.vdi"
```

The UUIDs must not be the same.
Joined: 28 Jul 16 · Posts: 486 · Credit: 394,839 · RAC: 0
Just downloaded a fresh copy of "Theory_2020_05_08.vdi" from the dev server and from the prod server. Both have the same UUID, and even if you change their names or put them in different directories, VirtualBox refuses to register both. To allow this, the UUIDs must be different.

@Laurence
This needs to be done on the server.
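For completeness: VirtualBox can stamp a fresh UUID onto a vdi, but doing that to a BOINC-managed project file changes its bytes and would most likely make the client flag the file on its next integrity check, so the command below only illustrates why the fix belongs on the server side:

```
rem Writes a newly generated UUID into the vdi header. Do NOT run this on
rem a file under BOINC's projects directory; let the server fix the UUIDs.
VBoxManage internalcommands sethduuid "Theory_2020_05_08.vdi"
```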
Joined: 13 Feb 15 · Posts: 1188 · Credit: 878,593 · RAC: 27
The tasks are run by the same BOINC client. (First I wanted to use multiple clients with the same VirtualBox, but thought the same client would be the better test condition.)

The output of the vboxmanage commands:

```
D:\Boinc1\projects\lhcathome.cern.ch_lhcathome>"%vbox_msi_install_path%\VBoxManage" showhdinfo "Theory_2020_05_08.vdi"
VBoxManage.exe: error: Cannot register the hard disk 'D:\Boinc1\projects\lhcathome.cern.ch_lhcathome\Theory_2020_05_08.vdi' {c7cbbeeb-c984-467e-9b6e-0d2e670bed58} because a hard disk 'D:\Boinc1\projects\lhcathomedev.cern.ch_lhcathome-dev\Theory_2020_05_08.vdi' with UUID {c7cbbeeb-c984-467e-9b6e-0d2e670bed58} already exists
VBoxManage.exe: error: Details: code E_INVALIDARG (0x80070057), component VirtualBoxWrap, interface IVirtualBox, callee IUnknown
VBoxManage.exe: error: Context: "OpenMedium(Bstr(pszFilenameOrUuid).raw(), enmDevType, enmAccessMode, fForceNewUuidOnOpen, pMedium.asOutParam())" at line 191 of file VBoxManageDisk.cpp
```

DEV folder:

```
D:\Boinc1\projects\lhcathomedev.cern.ch_lhcathome-dev>"%vbox_msi_install_path%\VBoxManage" showhdinfo "Theory_2020_05_08.vdi"
UUID:           c7cbbeeb-c984-467e-9b6e-0d2e670bed58
Parent UUID:    base
State:          locked read
Type:           multiattach
Location:       D:\Boinc1\projects\lhcathomedev.cern.ch_lhcathome-dev\Theory_2020_05_08.vdi
Storage format: VDI
Format variant: dynamic default
Capacity:       20480 MBytes
Size on disk:   781 MBytes
Encryption:     disabled
Property:       AllocationBlockSize=1048576
Child UUIDs:    4b892d0d-8e6a-4738-ae35-1f385f422135
                06f98cbd-317b-41db-a68e-ca950e58a267
```
Joined: 13 Feb 15 · Posts: 1188 · Credit: 878,593 · RAC: 27
I restarted the VMs this morning after the overnight shutdown and one task from dev failed:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3093141

```
2022-06-19 05:42:25 (1652): Error in start VM for VM: -2147467259
Command: VBoxManage -q startvm "boinc_4efaeb2254e8c83a" --type headless
Output: Waiting for VM "boinc_4efaeb2254e8c83a" to power on...
VBoxManage.exe: error: ahci#0: The target VM is missing a device on port 0. Please make sure the source and target VMs have compatible storage configurations [ver=9 pass=final] (VERR_SSM_LOAD_CONFIG_MISMATCH)
VBoxManage.exe: error: Details: code E_FAIL (0x80004005), component ConsoleWrap, interface IConsole
2022-06-19 05:42:25 (1652): VM failed to start.
2022-06-19 05:42:25 (1652): Could not start
2022-06-19 05:42:25 (1652): ERROR: VM failed to start
```

So it's clear that the dev and prod VMs may not be exact copies of each other.
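VERR_SSM_LOAD_CONFIG_MISMATCH means the saved state no longer matches the storage now attached to the VM. If such a task is not simply aborted, one manual way out is to discard the stale saved state so the VM cold-boots; this throws away the VM's in-memory progress, so the job inside restarts:

```
rem Discards the stale saved state of the named VM
rem (the VM name here is taken from the log above).
VBoxManage discardstate "boinc_4efaeb2254e8c83a"
```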
Joined: 28 Jul 16 · Posts: 486 · Credit: 394,839 · RAC: 0
> So it's clear that the dev and prod VMs may not be exact copies of each other.

Both vdis are identical when you compare fresh downloads from the servers:

```
cmp Theory_2020_05_08_dev.vdi Theory_2020_05_08_prod.vdi
```

Either don't run Theory (or CMS) concurrently from dev and prod until the UUID issue is fixed, or use separate local user accounts for dev and prod. This ensures VirtualBox uses independent configuration sets.

Just using a separate BOINC client under the same user account does not help, as it would point to the same VirtualBox configuration.
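Two supporting details for anyone re-checking this on Windows (the paths are the usual defaults, not verified against every VirtualBox version): the byte-for-byte equivalent of cmp is fc /b, and the per-user medium/UUID registry that causes the collision lives in the user profile:

```
rem Byte-for-byte comparison of the two downloads (Windows equivalent of cmp).
fc /b Theory_2020_05_08_dev.vdi Theory_2020_05_08_prod.vdi

rem Per-user VirtualBox configuration, incl. the media/UUID registry
rem (typical default location; version-dependent):
rem   %USERPROFILE%\.VirtualBox\VirtualBox.xml
```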
Joined: 13 Apr 15 · Posts: 138 · Credit: 2,969,210 · RAC: 0
I wasn't able to pay attention yesterday, but I had left BOINC running with a 1/3 LHC/dev resource share so there would always be at least 1 dev task attached. They have played well together. This may be because the Theory vdi was never left without a task attached. I have yet to test whether a new task attaches after the vdi is completely released.

@Computezrmle
> Both tasks stopped at 2022-06-17 19:29:17 BST and successfully restarted at 2022-06-17 19:29:51 BST.

I had previously just aborted postponed tasks, thinking them to be unrecoverable. This successful restart would have been where I removed the vdi from VBox. I may have then closed the VBox window, as suggested earlier.

@Crystal
I'll give that shutdown a try on a Production task first. 1 of the dev tasks is a 4-day Sherpa, so I'd rather let it run to completion or, if the practice shutdown is successful, return at least partial useful work. I'd rather euthanise it than murder it.

[Later] ☹️ Well, I'm obviously doing something wrong with the graceful shutdown thing. The VMs did shut down, but not gracefully, so all 3 attempts were lost to Computation Error 😢.

However, with the VBox window not open, the vdi is removed on shutdown, whereas when I had it open to see what was happening, the vdi was not removed. So I believe the problem is with VBox not allowing the unattached Theory image to be fully released while that window is open, and thus not available for reattachment, rather than with the wrapper itself. The vm_image.vdi does get released even when viewing VBox, so I don't know why the Theory one doesn't 🤔 The easiest solution is, therefore, to not be viewing VBox while the last attached task is ending.

With VBox closed, 2 dev tasks started at the same time and both failed (not postponed, however). Starting one and then another seems to be OK. 2 dev and 1 Production currently running happily 😀

I hope someone can make sense of what's been happening over the past couple of days and that it will help in finding a more robust solution. Thanks C & CP for your assistance.