Message boards : Theory Application : New Version 5.30
Profile Ray Murray
Joined: 13 Apr 15
Posts: 138
Credit: 2,945,852
RAC: 0
Message 7369 - Posted: 17 Jun 2022, 12:01:37 UTC

The 2 that I started this morning have completed successfully 8¬) but I won't know if their replacements are ok until I get home.
Both were attached to the vdi, as seen in Media Manager by 2 separate long strings of characters in a dropdown.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 475
Credit: 389,411
RAC: 28
Message 7370 - Posted: 17 Jun 2022, 12:19:56 UTC - in response to Message 7369.  

The 2 that I started this morning have completed successfully 8¬) ...

Typical computer behaviour not to run into an error while you investigate.


... seen in Media Manager by 2 separate long strings of characters in a dropdown.

An (old) example of how it should look:
https://www.virtualbox.org/manual/ch05.html#diffimages
The "long strings" are derived from the unique UUIDs and automatically generated by VirtualBox.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1181
Credit: 815,417
RAC: 205
Message 7371 - Posted: 17 Jun 2022, 12:27:00 UTC

Started 2 at once and both are running normally:

Location: D:\Boinc1\projects\lhcathomedev.cern.ch_lhcathome-dev\Theory_2020_05_08.vdi
Attached:
D:\Boinc1\slots\1\boinc_6711fafe8fc0c562\Snapshots\{402794fc-be9e-4a4d-baf8-7050c8d550bc}.vdi
D:\Boinc1\slots\0\boinc_3ca7f6ac0f658034\Snapshots\{6bb17b9a-4112-46d7-9902-59da1f789ad4}.vdi
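
As an illustration of the listing above: each differencing image sits in the Snapshots folder of a slot's VM, and its file name is the UUID in braces (the "long strings" seen in the Media Manager dropdown). A minimal Python sketch, assuming paths shaped exactly like the two above, splits such a path into slot number, VM name and UUID. The name `parse_snapshot_path` is made up for this example and is not part of BOINC or VirtualBox.

```python
import re

# Pattern for paths like the ones listed above:
#   ...\slots\<slot>\<vm name>\Snapshots\{<uuid>}.vdi
# Assumption: the layout matches these examples; other setups may differ.
SNAPSHOT_RE = re.compile(
    r"\\slots\\(?P<slot>\d+)\\(?P<vm>boinc_[0-9a-f]+)\\Snapshots\\"
    r"\{(?P<uuid>[0-9a-fA-F-]{36})\}\.vdi$"
)


def parse_snapshot_path(path):
    """Return (slot, vm_name, uuid) for a differencing image path, or None."""
    match = SNAPSHOT_RE.search(path)
    if match is None:
        return None
    return int(match.group("slot")), match.group("vm"), match.group("uuid")
```

Called on the first attached path above, this would yield slot 1, the VM name `boinc_6711fafe8fc0c562` and the UUID `402794fc-be9e-4a4d-baf8-7050c8d550bc`. The parent image path (Theory_2020_05_08.vdi in the project folder) does not match the pattern and gives `None`.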
maeax

Joined: 22 Apr 16
Posts: 667
Credit: 1,807,614
RAC: 2,394
Message 7372 - Posted: 17 Jun 2022, 12:30:52 UTC - in response to Message 7371.  

Theory: every fifth task has this grid.cern.ch error. 35 have finished so far.
One Atlas task, and it did not run successfully.
So for me, wrapper 204 brings no performance gain when used in production.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 475
Credit: 389,411
RAC: 28
Message 7374 - Posted: 17 Jun 2022, 12:48:00 UTC - in response to Message 7372.  

Missing some examples.
Your task lists show only 4 tasks using the new vboxwrapper instead of 35 you mentioned.


Errors regarding any CVMFS repository are not related to the new vboxwrapper.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1181
Credit: 815,417
RAC: 205
Message 7375 - Posted: 17 Jun 2022, 12:57:56 UTC

The errors from maeax and Ray both have the lines:

Another VirtualBox management application has locked the session for
this VM. BOINC cannot properly monitor this VM
and so this job will be aborted.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 475
Credit: 389,411
RAC: 28
Message 7377 - Posted: 17 Jun 2022, 13:29:02 UTC - in response to Message 7375.  

Did anybody run the vboxmanager 1) to monitor the VMs while this error occurred?
If so, please close it a while before you start a VM.

The reason is explained in the BOINC source code:
https://github.com/BOINC/boinc/blob/client_release/7/7.20/samples/vboxwrapper/vbox_common.cpp#L1167-L1177

The error notes from the task logfiles can be found a few lines below the comment:
https://github.com/BOINC/boinc/blob/client_release/7/7.20/samples/vboxwrapper/vbox_common.cpp#L1190-L1192




1) Just a guess: I also had vboxmanager open but it might be that Linux releases the lock early enough so it doesn't run into an error.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1181
Credit: 815,417
RAC: 205
Message 7378 - Posted: 17 Jun 2022, 13:44:39 UTC - in response to Message 7377.  

1) Just a guess: I also had vboxmanager open but it might be that Linux releases the lock early enough so it doesn't run into an error.
If a volunteer opens another VirtualBox management application and goes poking around
that application can acquire the session lock and not give it up for some time


When someone is "poking around", he knows why a task is failing ;)

I almost always have VirtualBox Manager open before I request VBox tasks.
That way I'm able to adjust the process priority (in my case 'lower than normal') for VBoxSVC.exe to pass along that priority to VBoxHeadless.exe.
maeax

Joined: 22 Apr 16
Posts: 667
Credit: 1,807,614
RAC: 2,394
Message 7379 - Posted: 17 Jun 2022, 13:55:17 UTC - in response to Message 7374.  
Last modified: 17 Jun 2022, 13:56:00 UTC

Missing some examples.
Your task lists show only 4 tasks using the new vboxwrapper instead of 35 you mentioned.


Errors regarding any CVMFS repository are not related to the new vboxwrapper.


The link to my PC is in the General Discussion folder.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 475
Credit: 389,411
RAC: 28
Message 7380 - Posted: 17 Jun 2022, 14:11:27 UTC - in response to Message 7379.  

That client runs the app_versions from the production server.
Although you replaced the vboxwrapper with 26204 those app_versions are not configured to use differencing images.
That's why you don't see any performance change:
2022-06-17 09:38:52 (13116): Adding virtual disk drive to VM. (vm_image.vdi)
Profile Ray Murray
Joined: 13 Apr 15
Posts: 138
Credit: 2,945,852
RAC: 0
Message 7384 - Posted: 17 Jun 2022, 20:11:41 UTC
Last modified: 17 Jun 2022, 20:17:47 UTC

<VirtualBox xmlns="http://www.virtualbox.org/" version="1.12-windows">
Similar to Crystal, I almost always have VBox open so I can see whether the vms are being created and starting.

Both tasks that replaced today's earlier successes stopped Postponed after 40? seconds. When the first successful one finished, Boinc didn't request new work until the 2nd one had finished (possibly due to my app_config limiting the number of running tasks, which I have temporarily removed) so the vdi was left unattached to anything and the replacements failed to attach to it. The app_config contains nothing other than max_concurrent lines but seemed to be causing a blockage. I sacrificed 2 LHC tasks to give -dev a clear run and so as not to cause myself unnecessary confusion.

I tried a few things to get those 2 postponed tasks running:
Exit Boinc, delete powered-off vms, restart Boinc -- Postponed
Exit Boinc, delete vm, empty each slot, restart Boinc -- Postponed
Exit Boinc, delete vm, empty slots, remove vdi from VBox -- Successful start (I did start them individually just in case)

The removal of the app_config allowed a new task to download on completion of the short-running replacement which immediately started up and successfully attached to the image which was still attached to the other running vm.

While I have been writing this, another task has ended and been successfully replaced 😀
This part, at least, is now working for me. I'm still concerned that new tasks might be unable to attach to the image if there is not another one already attached.

I have yet to master how to gracefully end a task so I will have to wait a few hours until these tasks finish to test whether a new task will attach to the vdi when no others are already attached. Crystal said earlier that they do but my original problem was because they didn't.

[ In the job xml there is the line <completion_trigger_file>shutdown</completion_trigger_file>
I created a text file named shutdown in the slot but I have no idea how to implement it
Oops. I'm more graphical than Command Line so I killed one, accidentally, using fsutil to create a shutdown file but it did that very much less than gracefully and resulted in a Computation error 😕]

Probably Sunday before I get another chance to play with it so I've upped the resource share so it shouldn't run dry and fired up LHC again, in case it does.

Not problems; just learning opportunities.
Anything is possible with the right attitude ... and a hammer 😜
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 475
Credit: 389,411
RAC: 28
Message 7385 - Posted: 18 Jun 2022, 5:24:09 UTC - in response to Message 7384.  

Please look into these logs:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3093090
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3092891

Both tasks stopped at 2022-06-17 19:29:17 BST and successfully restarted at 2022-06-17 19:29:51 BST.
Do you remember what you did at that point?
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1181
Credit: 815,417
RAC: 205
Message 7386 - Posted: 18 Jun 2022, 7:57:07 UTC

@Ray:
I noticed you are running dev-Theory's and prod-Theory's simultaneously.
Should not be a problem, but could be a reason. You could test without running Theory's from the production server.

Regarding your graceful shutdown issue: you could use this batch file to gracefully shut down a task.
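@echo off
rem Ask for the number of the slot directory the task is running in.
set "slotdir="
set /p "slotdir=In which slot-directory is the endless Theory task running you want to kill? "
rem Quote the whole assignment so the value itself carries no quotes.
set "boincpath=D:\Boinc1\slots\%slotdir%\shared"
rem Create an empty file named "shutdown" in the slot's shared directory.
copy /y NUL "%boincpath%\shutdown" >NUL
exit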

Of course you have to adjust the first part of boincpath.
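
For anyone who prefers a script that also runs on Linux hosts, the same idea can be sketched in Python. This is only an illustration of what the batch file above does (drop an empty file named shutdown into the slot's shared directory); the function name and its parameters are made up for this example, and whether the VM then ends gracefully still depends on the task picking up the file.

```python
from pathlib import Path


def request_graceful_shutdown(slots_root, slot):
    """Create an empty 'shutdown' file in <slots_root>/<slot>/shared.

    Mirrors the batch file above and returns the path of the trigger file.
    """
    shared = Path(slots_root) / str(slot) / "shared"
    if not shared.is_dir():
        # Refuse to create directories: a missing 'shared' folder suggests
        # a wrong slot number or a slot that is not running a VM task.
        raise FileNotFoundError(f"no shared directory at {shared}")
    trigger = shared / "shutdown"
    # Write zero bytes: creates the file, or empties it if it already
    # exists, like 'copy /y NUL ...\shutdown'.
    trigger.write_bytes(b"")
    return trigger
```

With the paths from the batch file, a call would look like `request_graceful_shutdown(r"D:\Boinc1\slots", 0)`; as there, the first part of the path has to be adjusted to the local BOINC data directory.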
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1181
Credit: 815,417
RAC: 205
Message 7388 - Posted: 18 Jun 2022, 17:56:28 UTC

I tried to reproduce Ray's error: with only 1 dev-Theory running, I loaded 2 Theory tasks from production (max concurrent for LHC@home: 1).
The very 1st one failed. https://lhcathome.cern.ch/lhcathome/result.php?resultid=358194711 Surely not a memory issue.
The 2nd task started after the computation error from task 1 and task 2 keeps on running.
Meanwhile loaded 2 extra tasks from dev and a second one started. (max concurrent 2)
I'll let it run for a while. 2 tasks from dev and 1 from production.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 475
Credit: 389,411
RAC: 28
Message 7389 - Posted: 18 Jun 2022, 18:52:35 UTC - in response to Message 7388.  

I suspect lhcathome and lhcathome-dev are both connected to the same BOINC client and are running under the same user account.

For both Theory vdi files (dev and prod directory) please run the following command and post the output:
vboxmanage showhdinfo "Theory_2020_05_08.vdi"

The UUIDs must not be the same.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 475
Credit: 389,411
RAC: 28
Message 7390 - Posted: 18 Jun 2022, 19:22:07 UTC - in response to Message 7389.  

Just downloaded a fresh copy of "Theory_2020_05_08.vdi" from the dev server and from the prod server.
Both have the same UUID and even if you change their names or put them in different directories VirtualBox refuses to register both.
To allow this the UUIDs must be different.


@Laurence
This needs to be done on the server.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1181
Credit: 815,417
RAC: 205
Message 7392 - Posted: 18 Jun 2022, 20:05:00 UTC - in response to Message 7389.  

The tasks are run by the same BOINC client. (At first I wanted to use multiple clients with the same VBox, but thought the same client would be a better test condition.)

The output of the vboxmanage commands:

D:\Boinc1\projects\lhcathome.cern.ch_lhcathome>"%vbox_msi_install_path%\VBoxManage" showhdinfo "Theory_2020_05_08.vdi"
VBoxManage.exe: error: Cannot register the hard disk 'D:\Boinc1\projects\lhcathome.cern.ch_lhcathome\Theory_2020_05_08.vdi' {c7cbbeeb-c984-467e-9b6e-0d2e670bed58} because a hard disk 'D:\Boinc1\projects\lhcathomedev.cern.ch_lhcathome-dev\Theory_2020_05_08.vdi' with UUID {c7cbbeeb-c984-467e-9b6e-0d2e670bed58} already exists
VBoxManage.exe: error: Details: code E_INVALIDARG (0x80070057), component VirtualBoxWrap, interface IVirtualBox, callee IUnknown
VBoxManage.exe: error: Context: "OpenMedium(Bstr(pszFilenameOrUuid).raw(), enmDevType, enmAccessMode, fForceNewUuidOnOpen, pMedium.asOutParam())" at line 191 of file VBoxManageDisk.cpp


DEV-folder:

D:\Boinc1\projects\lhcathomedev.cern.ch_lhcathome-dev>"%vbox_msi_install_path%\VBoxManage" showhdinfo "Theory_2020_05_08.vdi"
UUID: c7cbbeeb-c984-467e-9b6e-0d2e670bed58
Parent UUID: base
State: locked read
Type: multiattach
Location: D:\Boinc1\projects\lhcathomedev.cern.ch_lhcathome-dev\Theory_2020_05_08.vdi
Storage format: VDI
Format variant: dynamic default
Capacity: 20480 MBytes
Size on disk: 781 MBytes
Encryption: disabled
Property: AllocationBlockSize=1048576
Child UUIDs: 4b892d0d-8e6a-4738-ae35-1f385f422135
06f98cbd-317b-41db-a68e-ca950e58a267
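
To compare the UUIDs without reading the listings by eye, the text printed by showhdinfo can be scanned for its UUID line. The following Python sketch assumes output shaped like the DEV-folder listing above; `disk_uuid` and `uuids_clash` are hypothetical helper names, not VirtualBox functions. In the case shown here the prod file could not be queried at all, so its output yields no UUID line; the clash is already stated in the error message itself.

```python
import re

# Matches only the disk's own "UUID:" line. Because the pattern is anchored
# at the start of a line, "Parent UUID:" and "Child UUIDs:" are skipped.
UUID_LINE = re.compile(r"^UUID:\s*([0-9a-fA-F-]{36})\s*$", re.MULTILINE)


def disk_uuid(showhdinfo_output):
    """Return the disk UUID found in showhdinfo text, or None if absent."""
    match = UUID_LINE.search(showhdinfo_output)
    return match.group(1).lower() if match else None


def uuids_clash(output_a, output_b):
    """True if both outputs report a UUID and the two are identical."""
    a, b = disk_uuid(output_a), disk_uuid(output_b)
    return a is not None and a == b
```

Feeding it the DEV-folder listing returns `c7cbbeeb-c984-467e-9b6e-0d2e670bed58`, the same UUID that appears in braces in the error for the production file.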
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1181
Credit: 815,417
RAC: 205
Message 7393 - Posted: 19 Jun 2022, 4:02:09 UTC

I restarted the VM's this morning after overnight shutdown and one task from dev failed.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3093141

2022-06-19 05:42:25 (1652): Error in start VM for VM: -2147467259
Command:
VBoxManage -q startvm "boinc_4efaeb2254e8c83a" --type headless
Output:
Waiting for VM "boinc_4efaeb2254e8c83a" to power on...
VBoxManage.exe: error: ahci#0: The target VM is missing a device on port 0. Please make sure the source and target VMs have compatible storage configurations [ver=9 pass=final] (VERR_SSM_LOAD_CONFIG_MISMATCH)
VBoxManage.exe: error: Details: code E_FAIL (0x80004005), component ConsoleWrap, interface IConsole

2022-06-19 05:42:25 (1652): VM failed to start.
2022-06-19 05:42:25 (1652): Could not start
2022-06-19 05:42:25 (1652): ERROR: VM failed to start
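
A side note on the numbers in this log: the -2147467259 in the first line and the E_FAIL (0x80004005) in the VBoxManage output are the same value. The first is the 32-bit code printed as a signed integer, the second as unsigned hex. A short Python sketch of the conversion (the helper name is made up for this example):

```python
def to_hresult_hex(code):
    """Format a signed 32-bit status code as unsigned hexadecimal."""
    # Masking with 0xFFFFFFFF reinterprets the negative value as its
    # 32-bit two's complement bit pattern.
    return f"0x{code & 0xFFFFFFFF:08X}"


print(to_hresult_hex(-2147467259))  # 0x80004005
```

This makes it easier to search for a code from a task log in VirtualBox error output, where the hex form is used.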


So it's clear that both VM's (dev and prod) may not be an exact copy of each other.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 475
Credit: 389,411
RAC: 28
Message 7394 - Posted: 19 Jun 2022, 5:09:29 UTC - in response to Message 7393.  

So it's clear that both VM's (dev and prod) may not be an exact copy of each other.

Both vdis are identical when you compare fresh downloads from the servers:
cmp Theory_2020_05_08_dev.vdi Theory_2020_05_08_prod.vdi
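
On Windows, where cmp is usually not available, the same check can be sketched in Python. The file names are the renamed downloads from the command above; the helper names are made up for this example. Since the UUID is stored inside the vdi file, two byte-identical files necessarily carry the same UUID.

```python
import filecmp
import hashlib


def files_identical(path_a, path_b):
    """Byte-for-byte comparison, similar in spirit to 'cmp a b'."""
    # shallow=False forces a content comparison instead of trusting
    # file size and modification time alone.
    return filecmp.cmp(path_a, path_b, shallow=False)


def sha256_of(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large images fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

`files_identical("Theory_2020_05_08_dev.vdi", "Theory_2020_05_08_prod.vdi")` covers the local case; the hash is handy when the two copies sit on different machines and only the checksums can be compared.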


Either don't run Theory (or CMS) concurrently from dev and prod until the UUID issue is fixed,
or use separate local user accounts for dev and prod.
This ensures VirtualBox uses independent configuration sets.
Just using a separate BOINC client under the same user account does not help as it would point to the same VirtualBox configuration.
Profile Ray Murray
Joined: 13 Apr 15
Posts: 138
Credit: 2,945,852
RAC: 0
Message 7396 - Posted: 19 Jun 2022, 15:07:27 UTC
Last modified: 19 Jun 2022, 15:30:19 UTC

I wasn't able to pay attention yesterday but I had left Boinc running with a 1/3, LHC/dev resource share so there would always be at least 1 dev attached. They have played well together. This may be because the Theory vdi was never left without a task attached. I have yet to test whether a new task attaches after the vdi is completely released.

@Computezrmle
Both tasks stopped at 2022-06-17 19:29:17 BST and successfully restarted at 2022-06-17 19:29:51 BST.
Do you remember what you did at that point?
I had previously just Aborted postponed tasks, thinking them to be unrecoverable. This successful restart would have been where I removed the vdi from VBox. I may have then closed the VBox window, as suggested earlier.

@Crystal
I'll give that shutdown a try on a Production task first. 1 of the dev tasks is a 4-day Sherpa so I'd rather let it run to completion or, if the practice shutdown is successful, return at least partial useful work. I'd rather euthanise it than murder it.

[Later]
☹️ Well, I'm obviously doing something wrong with the graceful shutdown thing. The VMs did shut down, but not gracefully, so all 3 attempts were lost to Computation Error 😢. However, with the VBox window not open, the vdi is removed on shutdown, whereas when I had it open to see what was happening, the vdi was not removed. So I believe the problem is with VBox not allowing the unattached Theory image to be fully released while that window is open, leaving it unavailable for reattachment, rather than with the wrapper itself. The vm_image.vdi does get released even when viewing VBox, so I don't know why the Theory one doesn't 🤔
Easiest solution is, therefore, to not be viewing VBox while the last attached task is ending.
With VBox closed, 2 dev started at the same time and both failed (not postponed, however). Starting one then another seems to be OK.
2 dev and 1 Production currently running happily.😀

I hope someone can make sense of what's been happening over the past couple of days and it will help in finding a more robust solution.
Thanks C & CP for your assistance


©2024 CERN