1) Message boards : CMS Application : CMS multi-core (Message 8437)
Posted 1 hour ago by computezrmle
Post:
Ivan (at prod) wrote:
At present we have two workflows running. One is set to run 503,000 events/job (as was the template it was derived from) and takes about 5-6 hours wall-time. The other is set to 50,000 events/job and runs about one hour clock time. If we run out of jobs before the weekend, I'll submit a batch with 100,000 events/job, to match the 2-hour average our previous tasks took. These jobs generate considerably less output per CPU-hour than our previous ones.

This might be a result of the different batches with their different event counts.
Unfortunately, we still can't look into the job details to see what kind of job a VM is currently running.
2) Message boards : CMS Application : CMS multi-core (Message 8435)
Posted 1 hour ago by computezrmle
Post:
Some computers running CMS tasks only run empty envelopes.
Unfortunately, their users did not make those computers visible to others, hence they can't be informed directly.

Please check the following settings:
- recent CMS tasks require a computer reporting at least 4 cores
- computers reporting fewer than 4 cores (e.g. just 1) should not run recent CMS tasks

Examples:
https://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=4980
https://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=4800
https://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=4803
https://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=4810
3) Message boards : General Discussion : Vboxwrapper race mitigation (Message 8434)
Posted 2 hours ago by computezrmle
Post:
The vboxwrapper used for CMS v61.01 and Theory v6.01 introduces a short-lived lock to protect multiattach disk operations against race conditions.
Those races can happen when no task from a given subproject is running and a BOINC client then starts a couple of them concurrently.
The root cause of the race condition is a design decision in VirtualBox that does not allow attaching a 'multiattach' virtual disk in a single step.

The vboxwrapper currently used here reports itself as v26207.
In fact, it is a nightly build from the BOINC repository on GitHub that is based on v26207 but includes the relevant PRs 5571 and 5598.
Once available, the final version will report an updated version number, most likely v26208.

During normal operation there's no difference between v26206 (used before) and v26207+.


Relevant information can be found in stderr.txt.
It looks like this:
2024-04-27 08:53:41 (15636): Adding virtual disk drive to VM. (Theory_2024_04_26_dev.vdi)
2024-04-27 08:53:48 (15636): Attempts: 5

The 'Attempts' line is suppressed if the lock can be acquired on the first attempt.
Otherwise it shows how often a vboxwrapper instance had to go through the lock-acquire loop until it got the lock.
Each vboxwrapper that can't get the lock sleeps for a short period of time before the next attempt.
A timeout (currently 90 s) avoids an endless loop.
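
For illustration, here is a minimal shell sketch of such a lock-acquire loop, using mkdir as an atomic lock primitive. The lock directory name and the sleep interval are assumptions for this sketch; vboxwrapper's actual implementation is C++ and uses its own locking mechanism.

#!/bin/bash
# Minimal sketch of a lock-acquire loop with retry and timeout.
# LOCKDIR is a hypothetical name, not vboxwrapper's real lock.
LOCKDIR="/tmp/multiattach.lock"
TIMEOUT=90    # seconds, matching the timeout mentioned above
SLEEP=1       # retry interval (assumed)
attempts=0
start=$(date +%s)

until mkdir "$LOCKDIR" 2>/dev/null; do
    attempts=$((attempts + 1))
    if (( $(date +%s) - start >= TIMEOUT )); then
        echo "Could not acquire lock within ${TIMEOUT} s" >&2
        exit 1
    fi
    sleep "$SLEEP"
done

# report attempts only if the first try failed, mirroring the behaviour above
if (( attempts > 0 )); then
    echo "Attempts: $((attempts + 1))"
fi

# ... perform the multiattach disk operation here ...

rmdir "$LOCKDIR"    # release the lock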


Messages like the following are still written to vbox_trace.txt so that these events can be identified:
Command: VBoxManage -q storageattach "boinc_41406568c2ab1634" --storagectl "Hard Disk Controller" --port 0 --device 0 --type hdd --mtype multiattach --medium "/home/boinc9/BOINC_TEST/projects/lhcathomedev.cern.ch_lhcathome-dev/Theory_2024_04_26_dev.vdi" 
Exit Code: -2135228409
Output:
VBoxManage: error: Cannot attach medium '/home/boinc9/BOINC_TEST/projects/lhcathomedev.cern.ch_lhcathome-dev/Theory_2024_04_26_dev.vdi': the media type 'MultiAttach' can only be attached to machines that were created with VirtualBox 4.0 or later
VBoxManage: error: Details: code VBOX_E_INVALID_OBJECT_STATE (0x80bb0007), component SessionMachine, interface IMachine, callee nsISupports
VBoxManage: error: Context: "AttachDevice(Bstr(pszCtl).raw(), port, device, DeviceType_HardDisk, pMedium2Mount)" at line 781 of file VBoxManageStorageController.cpp


The cleanup follows immediately and looks like this.
Here, vboxwrapper also automatically removes 3 orphaned child disks:
2024-04-27 08:53:41 (15633): 
Command: VBoxManage -q showhdinfo "/home/boinc9/BOINC_TEST/projects/lhcathomedev.cern.ch_lhcathome-dev/Theory_2024_04_26_dev.vdi" 
Exit Code: 0
Output:
UUID:           09e7e89e-310f-45d3-b402-27d8c420e14e
Parent UUID:    base
State:          created
Type:           multiattach
Location:       /home/boinc9/BOINC_TEST/projects/lhcathomedev.cern.ch_lhcathome-dev/Theory_2024_04_26_dev.vdi
Storage format: VDI
Format variant: dynamic default
Capacity:       20480 MBytes
Size on disk:   781 MBytes
Encryption:     disabled
Property:       AllocationBlockSize=1048576
Child UUIDs:    4a8a5d30-405d-4666-89ee-fd139047c9c5
                d6e4263f-9efb-40ce-9193-c5a2d39cf558
                ea7a4528-2a03-48a8-9de4-280c6eb77d8c

2024-04-27 08:53:41 (15633): 
Command: VBoxManage -q showhdinfo "4a8a5d30-405d-4666-89ee-fd139047c9c5" 
Exit Code: 0
Output:
UUID:           4a8a5d30-405d-4666-89ee-fd139047c9c5
Parent UUID:    09e7e89e-310f-45d3-b402-27d8c420e14e
State:          inaccessible
Access Error:   Could not open the medium '/home/boinc9/BOINC_TEST/slots/1/boinc_foobar0816/Snapshots/{4a8a5d30-405d-4666-89ee-fd139047c9c5}.vdi'.
VD: error VERR_FILE_NOT_FOUND opening image file '/home/boinc9/BOINC_TEST/slots/1/boinc_foobar0816/Snapshots/{4a8a5d30-405d-4666-89ee-fd139047c9c5}.vdi' (VERR_FILE_NOT_FOUND)
Type:           normal (differencing)
Auto-Reset:     off
Location:       /home/boinc9/BOINC_TEST/slots/1/boinc_foobar0816/Snapshots/{4a8a5d30-405d-4666-89ee-fd139047c9c5}.vdi
Storage format: VDI
Format variant: dynamic default
Capacity:       0 MBytes
Size on disk:   0 MBytes
Encryption:     disabled
Property:       AllocationBlockSize=

2024-04-27 08:53:41 (15633): 
Command: VBoxManage -q closemedium disk "/home/boinc9/BOINC_TEST/slots/1/boinc_foobar0816/Snapshots/{4a8a5d30-405d-4666-89ee-fd139047c9c5}.vdi" 
Exit Code: 0
Output:

2024-04-27 08:53:41 (15633): 
Command: VBoxManage -q showhdinfo "d6e4263f-9efb-40ce-9193-c5a2d39cf558" 
Exit Code: 0
Output:
UUID:           d6e4263f-9efb-40ce-9193-c5a2d39cf558
Parent UUID:    09e7e89e-310f-45d3-b402-27d8c420e14e
State:          inaccessible
Access Error:   Could not open the medium '/home/boinc9/BOINC_TEST/slots/2/boinc_foobar0817/Snapshots/{d6e4263f-9efb-40ce-9193-c5a2d39cf558}.vdi'.
VD: error VERR_FILE_NOT_FOUND opening image file '/home/boinc9/BOINC_TEST/slots/2/boinc_foobar0817/Snapshots/{d6e4263f-9efb-40ce-9193-c5a2d39cf558}.vdi' (VERR_FILE_NOT_FOUND)
Type:           normal (differencing)
Auto-Reset:     off
Location:       /home/boinc9/BOINC_TEST/slots/2/boinc_foobar0817/Snapshots/{d6e4263f-9efb-40ce-9193-c5a2d39cf558}.vdi
Storage format: VDI
Format variant: dynamic default
Capacity:       0 MBytes
Size on disk:   0 MBytes
Encryption:     disabled
Property:       AllocationBlockSize=

2024-04-27 08:53:42 (15633): 
Command: VBoxManage -q closemedium disk "/home/boinc9/BOINC_TEST/slots/2/boinc_foobar0817/Snapshots/{d6e4263f-9efb-40ce-9193-c5a2d39cf558}.vdi" 
Exit Code: 0
Output:

2024-04-27 08:53:42 (15633): 
Command: VBoxManage -q showhdinfo "ea7a4528-2a03-48a8-9de4-280c6eb77d8c" 
Exit Code: 0
Output:
UUID:           ea7a4528-2a03-48a8-9de4-280c6eb77d8c
Parent UUID:    09e7e89e-310f-45d3-b402-27d8c420e14e
State:          inaccessible
Access Error:   Could not open the medium '/home/boinc9/BOINC_TEST/slots/3/boinc_foobar0818/Snapshots/{ea7a4528-2a03-48a8-9de4-280c6eb77d8c}.vdi'.
VD: error VERR_FILE_NOT_FOUND opening image file '/home/boinc9/BOINC_TEST/slots/3/boinc_foobar0818/Snapshots/{ea7a4528-2a03-48a8-9de4-280c6eb77d8c}.vdi' (VERR_FILE_NOT_FOUND)
Type:           normal (differencing)
Auto-Reset:     off
Location:       /home/boinc9/BOINC_TEST/slots/3/boinc_foobar0818/Snapshots/{ea7a4528-2a03-48a8-9de4-280c6eb77d8c}.vdi
Storage format: VDI
Format variant: dynamic default
Capacity:       0 MBytes
Size on disk:   0 MBytes
Encryption:     disabled
Property:       AllocationBlockSize=

2024-04-27 08:53:42 (15633): 
Command: VBoxManage -q closemedium disk "/home/boinc9/BOINC_TEST/slots/3/boinc_foobar0818/Snapshots/{ea7a4528-2a03-48a8-9de4-280c6eb77d8c}.vdi" 
Exit Code: 0
Output:

2024-04-27 08:53:42 (15633): 
Command: VBoxManage -q closemedium disk "/home/boinc9/BOINC_TEST/projects/lhcathomedev.cern.ch_lhcathome-dev/Theory_2024_04_26_dev.vdi" 
Exit Code: 0
Output:



After the cleanup, vboxwrapper can attach the vdi and set the 'multiattach' flag:
2024-04-27 08:53:42 (15633): 
Command: VBoxManage -q storageattach "boinc_41406568c2ab1634" --storagectl "Hard Disk Controller" --port 0 --device 0 --type hdd --medium "/home/boinc9/BOINC_TEST/projects/lhcathomedev.cern.ch_lhcathome-dev/Theory_2024_04_26_dev.vdi" 
Exit Code: 0
Output:

2024-04-27 08:53:43 (15633): 
Command: VBoxManage -q storageattach "boinc_41406568c2ab1634" --storagectl "Hard Disk Controller" --port 0 --device 0 --type hdd --medium none 
Exit Code: 0
Output:

2024-04-27 08:53:43 (15633): 
Command: VBoxManage -q storageattach "boinc_41406568c2ab1634" --storagectl "Hard Disk Controller" --port 0 --device 0 --type hdd --mtype multiattach --medium "/home/boinc9/BOINC_TEST/projects/lhcathomedev.cern.ch_lhcathome-dev/Theory_2024_04_26_dev.vdi" 
Exit Code: 0
Output:



The same lock is also set when vboxwrapper deregisters a VM that uses a multiattach disk.
Although concurrent operations are rare at that point, they may show up as an 'Attempts: n' line close to the end of stderr.txt.
4) Message boards : Theory Application : New version v6.00 (Message 8426)
Posted 2 days ago by computezrmle
Post:
Theory 6.00 and 6.01 are there to test vboxwrapper modifications.

The modifications address 2 errors:
1. "the media type 'MultiAttach' can only be attached to machines that were
created with VirtualBox 4.0 or later"
2. "Cannot close medium '/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/
Theory_2023_12_13.vdi' because it has 2 child media"

Other errors, especially those returned from CVMFS or deeper-level scripts, are out of scope.


Details can be found here:
https://github.com/BOINC/boinc/pull/5571
5) Message boards : Theory Application : New version v6.00 (Message 8422)
Posted 2 days ago by computezrmle
Post:
This is wrong in Theory_2024_04_26_dev.xml:
<multiattach_vdi_file>Theory_2024_04_26_dev.xml</multiattach_vdi_file>

Should be:
<multiattach_vdi_file>Theory_2024_04_26_dev.vdi</multiattach_vdi_file>
6) Message boards : CMS Application : CMS multi-core (Message 8412)
Posted 11 days ago by computezrmle
Post:
Unlike VirtualBox, VMware is out of scope for CMS multi-core.
Hence, discussing VirtualBox settings may be useful, while a discussion about VMware vs. VirtualBox is not.
It just shifts the focus away.
7) Message boards : CMS Application : CMS multi-core (Message 8410)
Posted 12 days ago by computezrmle
Post:
Your Ryzen 7 1700 has 8 physical cores and 16 logical cores.
Your 1st log shows that you configured a 15-core VM (in the meantime you have switched to 4-core VMs):
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3314981
2024-04-02 11:41:10 (6376): Setting CPU Count for VM. (15)


A 15-core VM should not be configured on a computer with only 8 physical cores.
Instead, no VM should exceed the number of physical cores.
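
On Linux, physical vs. logical core counts can be checked like this (lscpu is a standard tool; the example output below is what a Ryzen 7 1700 would be expected to report):

lscpu | grep -E '^(CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket|Socket\(s\))'
# CPU(s):              16
# Thread(s) per core:  2
# Core(s) per socket:  8
# Socket(s):           1
# physical cores = Core(s) per socket x Socket(s) = 8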

See a detailed comment about that here:
https://forums.virtualbox.org/viewtopic.php?t=77413

I suggest respecting that limit to avoid introducing issues into any test here that have nothing to do with CERN.
8) Message boards : Theory Application : Suspend/Resume (Message 8399)
Posted 19 days ago by computezrmle
Post:
The following comments are from the original BOINC service file on GitHub:
# The following options prevent setuid root as they imply NoNewPrivileges=true
# Since Atlas requires setuid root, they break Atlas
# In order to improve security, if you're not using Atlas,
# Add these options to the [Service] section of an override file using
# sudo systemctl edit boinc-client.service
#NoNewPrivileges=true
#ProtectKernelModules=true
#ProtectKernelTunables=true
#RestrictRealtime=true
#RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
#RestrictNamespaces=true
#PrivateUsers=true
#CapabilityBoundingSet=
#MemoryDenyWriteExecute=true
#PrivateTmp=true  #Block X11 idle detection


On your system, the security policy is too strict and blocks communication among the services required for ATLAS/Theory native:
ProtectControlGroups=yes
ProtectHome=yes
ProtectSystem=strict


Follow the comments above and create an override file with this content:
[Service]
ProtectControlGroups=no
ProtectHome=no
ProtectSystem=full

Then reboot and try a fresh Theory native task.
Let's see if this solves it.
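
If you're unsure how to apply the override, here is a sketch using standard systemd tooling (the unit name boinc-client.service follows the comments above; adjust it if your distribution names the unit differently):

sudo systemctl edit boinc-client.service    # opens an editor for the override file
# paste the [Service] section shown above, save and exit, then:
sudo systemctl daemon-reload
systemctl --no-pager show boinc-client | grep -i protect    # verify the new values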


As for CVMFS you may look at this post:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5594&postid=48539
9) Message boards : Theory Application : Suspend/Resume (Message 8397)
Posted 20 days ago by computezrmle
Post:
time="2024-04-08T01:09:09Z" level=error msg="runc run failed: fchown fd 7: operation not permitted"

This is primarily an error reported by runc.

Please post the output of the command below to ensure the error is not a side effect of an overly strict systemd setting.
systemctl --no-pager show boinc-client | grep -i protect



In addition:
- You use BOINC version 7.18.1, which was never intended for Linux.
- You upgraded sudo (Ubuntu 22.04 originally came with sudo < 1.9.10).
- Your CVMFS configuration does not follow the latest suggestions (see your logs).
10) Message boards : Theory Application : Suspend/Resume (Message 8394)
Posted 21 days ago by computezrmle
Post:
Your Ubuntu box does not support virtualization, hence you run Theory native.
Theory native doesn't support suspend/resume out of the box.
There are 2 possible methods to enable it:

1.
Use the traditional method together with cgroups v1.
This must be prepared as described in the prod forum.
This method is deprecated, though, as most Linux distributions now use cgroups v2 by default.

2.
Use cgroups v2 (plus sudo >= v1.9.10).
Your sudo version is 1.9.10, but you did not install the required sudoers file '/etc/sudoers.d/50-lhcathome_boinc_theory_native'.
Check the relevant forum posts and your logfiles.
For method 2, settings from method 1 must not be used; disable/uninstall them. A quick check of your setup is shown below.
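
Both prerequisites can be checked with standard commands (the cgroup check reads the filesystem type of the unified hierarchy mount point):

stat -fc %T /sys/fs/cgroup/    # prints 'cgroup2fs' for cgroups v2, 'tmpfs' for v1
sudo --version | head -n1      # method 2 requires sudo >= 1.9.10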


It looks like you are currently running a mix of both methods, but neither of them is correctly set up.
As a result, the tasks ignore the pause/resume signals and keep running even if BOINC shows them as suspended.
11) Message boards : Theory Application : Suspend/Resume (Message 8391)
Posted 22 days ago by computezrmle
Post:
Please make your computers visible to other volunteers here:
https://lhcathomedev.cern.ch/lhcathome-dev/prefs.php?subset=project
12) Message boards : Theory Application : Veeerrrry long Pythia8 (Message 8318)
Posted 1 Feb 2024 by computezrmle
Post:
I don't understand why ... I just removed them and this time I got a result

Well, that's your problem!
You checked for the first line showing "processed" instead of the last line.


This forum is not the place to give basic lessons, neither for Linux nor for Windows.
If you don't want to invest the time to learn the simplest basics, it might be better not to try to "analyse" anything.

Instead of:
sudo grep -m1 'processed' < /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log

run this:
sudo grep 'processed' /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log
13) Message boards : Theory Application : Veeerrrry long Pythia8 (Message 8316)
Posted 1 Feb 2024 by computezrmle
Post:
Why "sudo"?
Since you are already in "/var/lib/boinc-client/slots/4" it looks like you have the necessary access rights to list the dirs/files, don't you?

My 1st command is already explained:
Look for a "runRivet.log" that has not been modified recently (recently can even be 1-x hours), as this might indicate a task that is either in an endless loop or paused.


The 2nd should be self-explanatory, as it uses basic Linux commands.
Make yourself familiar with those commands, as they are widely used.
Here the command is used to locate the last occurrence of the "processed" pattern in "runRivet.log".
The result should look like:
20100 events processed

A typical 1st line in "runRivet.log" looks like:
===> [runRivet] Tue Jan 30 20:52:08 UTC 2024 [boinc pp jets 13000 1500 - pythia8 8.301 CP1-CR1 100000 78]
The second-to-last number in the bracketed list tells you how many events are to be processed.
Here: 100000

So, roughly 20% of the task is done.
Now, look at the task's walltime, say (example): 21 hours.

=> This task has another ~84 hours to go (≈ 105 hours in total).
=> It will finish before the 10-day limit.


If the one-liners I suggested don't work for you (e.g. due to missing access rights), feel free to copy the log to a folder where you have full rights and use an editor to look into the copy.
14) Message boards : Theory Application : Veeerrrry long Pythia8 (Message 8314)
Posted 1 Feb 2024 by computezrmle
Post:
Sherpas sometimes get stuck in endless loops.


Check the "runRivet.log" from that task.

Has the log remained unmodified for more than "mmin" minutes (720 min = 12 h)? The command below lists such logs.
Be aware! It will also show logs from tasks that have merely been suspended for roughly the same time span.
find /path_to_your/BOINC_working_directory/slots -type f -name "runRivet.log" -mmin +720 |xargs -I {} bash -c "head -n1 {}; ls -hal {}"



Now, use the path to the log from above for the next command.
It shows how many events have already been processed (or prints nothing):
grep -m1 'processed' <(tac /path_to_your_logfile)

Compare that number with the task's total events and the actual runtime of the task.
This allows you to calculate the estimated total runtime, as sketched below.
If that estimate is far more than 10 days, the task will not finish within the deadline and should be cancelled.
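
A minimal sketch of the calculation; the values are hypothetical examples, substitute your own:

processed=20100     # events already processed (from runRivet.log)
total=100000        # total events (first line of runRivet.log)
walltime_h=21       # task walltime so far, in hours
# estimated total runtime = walltime * total / processed
echo $(( walltime_h * total / processed ))   # prints 104 (hours; 10 days = 240 h)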

Be aware! Do not use the runtime estimates shown by BOINC.
BOINC doesn't know anything about the internal structure/logs of Theory tasks.
15) Message boards : General Discussion : Fetching configuration file (Message 8294)
Posted 19 Jan 2024 by computezrmle
Post:
I wonder where the URL "https://lhcathomedev.cern.ch/lhc-dev" comes from (it was you who posted it).
Please
- open a console on the computer that fails to connect
- change to the BOINC working directory
- run the following command
- post the output

find . -maxdepth 1 -name "*.xml" |xargs -I {} grep -H "://lhcathomedev.cern.ch/lhc-dev"
16) Message boards : General Discussion : Fetching configuration file (Message 8292)
Posted 18 Jan 2024 by computezrmle
Post:
I'm trying to add LHC-Dev to a Linux machine (on VirtualBox) to test the xtrack app.
But I get this message: "Fetching configuration file from https://lhcathomedev.cern.ch/lhc-dev" and the BOINC Manager gets stuck.

I've tried 3 different distros.
I've tried to connect to other projects without problems.

You might have used the wrong master URL.
Try this one:
https://lhcathomedev.cern.ch/lhcathome-dev/
17) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8269)
Posted 3 Jan 2024 by computezrmle
Post:
Since David Cameron left CERN, there has been no BOINC development for ATLAS.

The tasks you get are created by an automatic loop and contain just a few test events.
Once the queue is empty, the same tasks are created again.
Even if a task returns valid results, those are not used scientifically.

Decide for yourself whether it makes sense to run ATLAS -dev until CERN officially resumes development here.
18) Message boards : Theory Application : All errors (Message 8262)
Posted 23 Dec 2023 by computezrmle
Post:
Looks like the vdi file on -prod is a simple copy of the vdi file on -dev without a fresh UUID.

If -prod and -dev run under the same username (even if that user runs multiple BOINC clients), VirtualBox registers whichever vdi comes first.
Hence, if -dev comes first you get the error at -prod, and vice versa.

A fresh UUID must be set at CERN when the vdi/app moves from -dev to -prod.
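
For reference, VirtualBox provides a command that assigns a fresh UUID to a vdi (the file name below is just an example):

VBoxManage internalcommands sethduuid Theory_2024_04_26_dev.vdi
# writes a new random UUID into the vdi header so it no longer
# collides with an already registered copy of the same file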
19) Message boards : Theory Application : New native version v5.94 (Message 8245)
Posted 5 Dec 2023 by computezrmle
Post:
On the repository cernvm-prod.cern.ch there are cvm3 and cvm4.
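
Assuming CVMFS is installed and autofs is active, this can be verified from a client (the repository is mounted on first access):

ls /cvmfs/cernvm-prod.cern.ch
# expected to list, among others, cvm3 and cvm4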
20) Message boards : Theory Application : New native version v5.94 (Message 8243)
Posted 5 Dec 2023 by computezrmle
Post:
Either run a recent Linux with cgroups v2 plus the sudo modification.
That's the recommended way, as it delegates the cgroup handling used for suspend/resume to systemd.

Or disable cgroups v2 and modify the system according to the old suggestions on -prod.
Theory runs without those modifications, as they are only there to allow suspend/resume independently of systemd.

