1)
Message boards :
CMS Application :
New Version v61.25
(Message 8511)
Posted 6 days ago by computezrmle Post: Looks good on Windows and Linux. |
2)
Message boards :
Theory Application :
New Version v6.02
(Message 8510)
Posted 6 days ago by computezrmle Post: Can't be tested due to missing tasks in the queue. |
3)
Message boards :
CMS Application :
Failures to contact CMS-Factory
(Message 8497)
Posted 11 Sep 2024 by computezrmle Post: The Factory used to be on vocms0205.cern.ch but was recently moved to vocms0204.cern.ch. The CMS VM still includes a check for vocms0205 (called by bootstrap). Laurence created a patch to replace that check without the need to distribute a completely new vdi/app version. It looks like that patch doesn't always work reliably on your machine, reason unknown. The best long-term solution would be to create a fresh vdi with a simplified bootstrap (outsourced network checks).
"I'm having some real problems with a new security regime at my institution, so this could possibly be a manifestation of that."
I don't think so. The test fails simply because vocms0205.cern.ch doesn't exist any more. |
4)
Message boards :
CMS Application :
Problem with upgrade of BOINC server
(Message 8476)
Posted 7 Jun 2024 by computezrmle Post: Calm down. And try to find your way back to reality. You lost contact with it long ago. |
5)
Message boards :
CMS Application :
Problem with upgrade of BOINC server
(Message 8473)
Posted 7 Jun 2024 by computezrmle Post: Complaints about credits are useless as long as the logs show a success like:
2024-06-06 16:23:23 (6020): VM Completion Message: glidein exited with return value 0.
. . .
2024-06-06 16:23:29 (6020): called boinc_finish(0)
For more than a decade, credit calculation has been done by the BOINC server following a well-defined algorithm:
https://boinc.berkeley.edu/trac/wiki/CreditNew
The bad thing is that the algorithm partly uses its output to modify input parameters used for the next calculation. This is a "per computer/per project" property, hence the results can't be compared with other computers of the same type or between projects (like dev/prod).
It's all fine as long as the computer runs series of tasks under the same load with runtimes as close together as possible.
Factors disturbing the balance (examples):
- running the same task type under low/full load
- series of long running tasks followed by series of short running tasks and vice versa
- periods of empty backend queues (since for LHC@home BOINC treats empty envelopes as very short valids)
- a mix of singlecore and multicore tasks
Once a balance has been found, e.g. during a period of an empty queue, a refilled queue can result in credits being way off (even by several orders of magnitude) and it takes lots of returned valids to slowly get back to the mean (but it happens automatically).
Users then tend to complain about low credits but never about "too much" credit. |
6)
Message boards :
General Discussion :
Vbox_image
(Message 8450)
Posted 17 May 2024 by computezrmle Post: LHC@home uses VirtualBox VMs based on "cernvm". See: https://cernvm.cern.ch/appliance/ |
7)
Message boards :
CMS Application :
CMS multi-core
(Message 8437)
Posted 28 Apr 2024 by computezrmle Post: Ivan(at prod) wrote: At present we have two workflows running. One is set to run 503,000 events/job (as was the template it was derived from) and takes about 5-6 hours wall-time. The other is set to 50,000 events/job and runs about one hour clock time. If we run out of jobs before the weekend, I'll submit a batch with 100,000 events/job, to match the 2-hour average our previous tasks took. These jobs generate considerably less output per CPU-hour than our previous ones. Might be a result of the different batches with different event counts. Unfortunately we still can't look into the job details to see what kind of job a VM is currently running. |
8)
Message boards :
CMS Application :
CMS multi-core
(Message 8435)
Posted 28 Apr 2024 by computezrmle Post: Some computers running CMS tasks only run empty envelopes. Unfortunately their user(s) did not make them visible to others, hence they can't be informed directly.
Please check the following settings:
- recent CMS tasks require at least a computer reporting 4 cores
- computers reporting less than 4 cores (e.g. just 1) should not run recent CMS tasks
Examples:
https://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=4980
https://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=4800
https://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=4803
https://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=4810 |
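A quick local check of the 4-core requirement could look like the sketch below. It is only an illustration: the function name is made up, and `nproc` shows what the operating system offers to the current process, which may differ from the core count BOINC reports if that has been overridden in the client configuration.

```shell
#!/usr/bin/env bash
# Sketch only: has_enough_cores is a hypothetical helper, not part of BOINC.

MIN_CORES=4   # from the post: recent CMS tasks need a host reporting at least 4 cores

# Succeeds if the given core count meets the minimum.
has_enough_cores() {
    [ "$1" -ge "$MIN_CORES" ]
}

# nproc (GNU coreutils) prints the number of processing units available
# to the current process. Skip the check quietly if nproc is not installed.
if command -v nproc >/dev/null 2>&1; then
    cores="$(nproc)"
    if has_enough_cores "$cores"; then
        echo "$cores cores: meets the minimum of $MIN_CORES for recent CMS tasks"
    else
        echo "$cores cores: below the minimum of $MIN_CORES, should not run recent CMS tasks"
    fi
fi
```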
9)
Message boards :
General Discussion :
Vboxwrapper race mitigation
(Message 8434)
Posted 28 Apr 2024 by computezrmle Post: The vboxwrapper used for CMS v61.01 and Theory 6.01 introduces a short-lived lock to protect multiattach disk operations against race conditions. Those can happen when no task from a given subproject is running and a BOINC client then starts several of them concurrently.
The root cause of the race condition is a design decision in VirtualBox that does not allow a 'multiattach' virtual disk to be attached in a single step.
The vboxwrapper currently used here reports itself as v26207. In fact, it is a nightly build from BOINC on GitHub that is based on v26207 but includes the relevant PRs 5571 and 5598. Once available, the final version will report an updated version number, most likely 26208.
During normal operation there is no difference between v26206 (used before) and v26207+.
Relevant information can be found in stderr.txt. It looks like this:
2024-04-27 08:53:41 (15636): Adding virtual disk drive to VM. (Theory_2024_04_26_dev.vdi)
2024-04-27 08:53:48 (15636): Attempts: 5
The attempts line is suppressed if the lock can be set at the 1st attempt. Otherwise it shows how often a vboxwrapper instance had to go through the 'lock acquire' loop until it could get the lock. Each vboxwrapper that can't get the lock sleeps for a short period of time until the next attempt. A timeout (currently 90 s) avoids an endless loop.
Messages like this are still written to vbox_trace.txt so that the error can be identified:

Command: VBoxManage -q storageattach "boinc_41406568c2ab1634" --storagectl "Hard Disk Controller" --port 0 --device 0 --type hdd --mtype multiattach --medium "/home/boinc9/BOINC_TEST/projects/lhcathomedev.cern.ch_lhcathome-dev/Theory_2024_04_26_dev.vdi"
Exit Code: -2135228409
Output:
VBoxManage: error: Cannot attach medium '/home/boinc9/BOINC_TEST/projects/lhcathomedev.cern.ch_lhcathome-dev/Theory_2024_04_26_dev.vdi': the media type 'MultiAttach' can only be attached to machines that were created with VirtualBox 4.0 or later
VBoxManage: error: Details: code VBOX_E_INVALID_OBJECT_STATE (0x80bb0007), component SessionMachine, interface IMachine, callee nsISupports
VBoxManage: error: Context: "AttachDevice(Bstr(pszCtl).raw(), port, device, DeviceType_HardDisk, pMedium2Mount)" at line 781 of file VBoxManageStorageController.cpp

The cleanup follows immediately and looks like this. Here, vboxwrapper also automatically removes 3 child disk orphans:

2024-04-27 08:53:41 (15633): Command: VBoxManage -q showhdinfo "/home/boinc9/BOINC_TEST/projects/lhcathomedev.cern.ch_lhcathome-dev/Theory_2024_04_26_dev.vdi"
Exit Code: 0
Output:
UUID: 09e7e89e-310f-45d3-b402-27d8c420e14e
Parent UUID: base
State: created
Type: multiattach
Location: /home/boinc9/BOINC_TEST/projects/lhcathomedev.cern.ch_lhcathome-dev/Theory_2024_04_26_dev.vdi
Storage format: VDI
Format variant: dynamic default
Capacity: 20480 MBytes
Size on disk: 781 MBytes
Encryption: disabled
Property: AllocationBlockSize=1048576
Child UUIDs: 4a8a5d30-405d-4666-89ee-fd139047c9c5
d6e4263f-9efb-40ce-9193-c5a2d39cf558
ea7a4528-2a03-48a8-9de4-280c6eb77d8c

2024-04-27 08:53:41 (15633): Command: VBoxManage -q showhdinfo "4a8a5d30-405d-4666-89ee-fd139047c9c5"
Exit Code: 0
Output:
UUID: 4a8a5d30-405d-4666-89ee-fd139047c9c5
Parent UUID: 09e7e89e-310f-45d3-b402-27d8c420e14e
State: inaccessible
Access Error: Could not open the medium '/home/boinc9/BOINC_TEST/slots/1/boinc_foobar0816/Snapshots/{4a8a5d30-405d-4666-89ee-fd139047c9c5}.vdi'.
VD: error VERR_FILE_NOT_FOUND opening image file '/home/boinc9/BOINC_TEST/slots/1/boinc_foobar0816/Snapshots/{4a8a5d30-405d-4666-89ee-fd139047c9c5}.vdi' (VERR_FILE_NOT_FOUND)
Type: normal (differencing)
Auto-Reset: off
Location: /home/boinc9/BOINC_TEST/slots/1/boinc_foobar0816/Snapshots/{4a8a5d30-405d-4666-89ee-fd139047c9c5}.vdi
Storage format: VDI
Format variant: dynamic default
Capacity: 0 MBytes
Size on disk: 0 MBytes
Encryption: disabled
Property: AllocationBlockSize=

2024-04-27 08:53:41 (15633): Command: VBoxManage -q closemedium disk "/home/boinc9/BOINC_TEST/slots/1/boinc_foobar0816/Snapshots/{4a8a5d30-405d-4666-89ee-fd139047c9c5}.vdi"
Exit Code: 0
Output:

2024-04-27 08:53:41 (15633): Command: VBoxManage -q showhdinfo "d6e4263f-9efb-40ce-9193-c5a2d39cf558"
Exit Code: 0
Output:
UUID: d6e4263f-9efb-40ce-9193-c5a2d39cf558
Parent UUID: 09e7e89e-310f-45d3-b402-27d8c420e14e
State: inaccessible
Access Error: Could not open the medium '/home/boinc9/BOINC_TEST/slots/2/boinc_foobar0817/Snapshots/{d6e4263f-9efb-40ce-9193-c5a2d39cf558}.vdi'.
VD: error VERR_FILE_NOT_FOUND opening image file '/home/boinc9/BOINC_TEST/slots/2/boinc_foobar0817/Snapshots/{d6e4263f-9efb-40ce-9193-c5a2d39cf558}.vdi' (VERR_FILE_NOT_FOUND)
Type: normal (differencing)
Auto-Reset: off
Location: /home/boinc9/BOINC_TEST/slots/2/boinc_foobar0817/Snapshots/{d6e4263f-9efb-40ce-9193-c5a2d39cf558}.vdi
Storage format: VDI
Format variant: dynamic default
Capacity: 0 MBytes
Size on disk: 0 MBytes
Encryption: disabled
Property: AllocationBlockSize=

2024-04-27 08:53:42 (15633): Command: VBoxManage -q closemedium disk "/home/boinc9/BOINC_TEST/slots/2/boinc_foobar0817/Snapshots/{d6e4263f-9efb-40ce-9193-c5a2d39cf558}.vdi"
Exit Code: 0
Output:

2024-04-27 08:53:42 (15633): Command: VBoxManage -q showhdinfo "ea7a4528-2a03-48a8-9de4-280c6eb77d8c"
Exit Code: 0
Output:
UUID: ea7a4528-2a03-48a8-9de4-280c6eb77d8c
Parent UUID: 09e7e89e-310f-45d3-b402-27d8c420e14e
State: inaccessible
Access Error: Could not open the medium '/home/boinc9/BOINC_TEST/slots/3/boinc_foobar0818/Snapshots/{ea7a4528-2a03-48a8-9de4-280c6eb77d8c}.vdi'.
VD: error VERR_FILE_NOT_FOUND opening image file '/home/boinc9/BOINC_TEST/slots/3/boinc_foobar0818/Snapshots/{ea7a4528-2a03-48a8-9de4-280c6eb77d8c}.vdi' (VERR_FILE_NOT_FOUND)
Type: normal (differencing)
Auto-Reset: off
Location: /home/boinc9/BOINC_TEST/slots/3/boinc_foobar0818/Snapshots/{ea7a4528-2a03-48a8-9de4-280c6eb77d8c}.vdi
Storage format: VDI
Format variant: dynamic default
Capacity: 0 MBytes
Size on disk: 0 MBytes
Encryption: disabled
Property: AllocationBlockSize=

2024-04-27 08:53:42 (15633): Command: VBoxManage -q closemedium disk "/home/boinc9/BOINC_TEST/slots/3/boinc_foobar0818/Snapshots/{ea7a4528-2a03-48a8-9de4-280c6eb77d8c}.vdi"
Exit Code: 0
Output:

2024-04-27 08:53:42 (15633): Command: VBoxManage -q closemedium disk "/home/boinc9/BOINC_TEST/projects/lhcathomedev.cern.ch_lhcathome-dev/Theory_2024_04_26_dev.vdi"
Exit Code: 0
Output:

After cleanup vboxwrapper can attach the vdi and set the 'multiattach' flag:

2024-04-27 08:53:42 (15633): Command: VBoxManage -q storageattach "boinc_41406568c2ab1634" --storagectl "Hard Disk Controller" --port 0 --device 0 --type hdd --medium "/home/boinc9/BOINC_TEST/projects/lhcathomedev.cern.ch_lhcathome-dev/Theory_2024_04_26_dev.vdi"
Exit Code: 0
Output:

2024-04-27 08:53:43 (15633): Command: VBoxManage -q storageattach "boinc_41406568c2ab1634" --storagectl "Hard Disk Controller" --port 0 --device 0 --type hdd --medium none
Exit Code: 0
Output:

2024-04-27 08:53:43 (15633): Command: VBoxManage -q storageattach "boinc_41406568c2ab1634" --storagectl "Hard Disk Controller" --port 0 --device 0 --type hdd --mtype multiattach --medium "/home/boinc9/BOINC_TEST/projects/lhcathomedev.cern.ch_lhcathome-dev/Theory_2024_04_26_dev.vdi"
Exit Code: 0
Output:

The same lock is also set when vboxwrapper deregisters a VM that uses a multiattach disk. Although concurrent operations are rare, they may appear as 'Attempts: n' close to the end of stderr.txt. |
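The 'lock acquire' loop described in this post can be pictured with a small shell sketch. This is not vboxwrapper's code (vboxwrapper is written in C++ and its real lock mechanism is not shown in the post). The sketch assumes `mkdir` as the atomic lock primitive, and the function names, lock path and sleep interval are all hypothetical. Only the behaviour is taken from the post: retry after a short sleep, give up after a timeout, and print the attempts line only when more than one attempt was needed.

```shell
#!/usr/bin/env bash
# Illustration only, NOT vboxwrapper's implementation.
# mkdir either creates the directory or fails, which makes it a simple atomic lock.

# acquire_lock LOCK_DIR TIMEOUT_SECONDS [SLEEP_SECONDS]
acquire_lock() {
    lock_dir="$1"
    timeout_s="$2"
    interval_s="${3:-1}"
    attempts=0
    start="$(date +%s)"
    while true; do
        attempts=$((attempts + 1))
        if mkdir "$lock_dir" 2>/dev/null; then
            # As described in the post: stay silent if the 1st attempt succeeds.
            if [ "$attempts" -gt 1 ]; then
                echo "Attempts: $attempts"
            fi
            return 0
        fi
        now="$(date +%s)"
        if [ $((now - start)) -ge "$timeout_s" ]; then
            echo "Timeout after $attempts attempts" >&2
            return 1
        fi
        sleep "$interval_s"
    done
}

# release_lock LOCK_DIR
release_lock() {
    rmdir "$1"
}

# Example (hypothetical lock path, 90 s timeout as mentioned in the post):
#   acquire_lock /tmp/multiattach_demo.lock 90 && {
#       echo "disk operations would run here"
#       release_lock /tmp/multiattach_demo.lock
#   }
```

Several instances started at the same moment would then queue up behind the lock instead of running their disk operations concurrently.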
10)
Message boards :
Theory Application :
New version v6.00
(Message 8426)
Posted 26 Apr 2024 by computezrmle Post: Theory 6.00 and 6.01 are there to test vboxwrapper modifications. The modifications address 2 errors:
1. "the media type 'MultiAttach' can only be attached to machines that were created with VirtualBox 4.0 or later"
2. "Cannot close medium '/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/Theory_2023_12_13.vdi' because it has 2 child media"
Other errors, especially those returned from CVMFS or deeper-level scripts, are out of scope.
Details can be found here: https://github.com/BOINC/boinc/pull/5571 |
11)
Message boards :
Theory Application :
New version v6.00
(Message 8422)
Posted 26 Apr 2024 by computezrmle Post: This is wrong in Theory_2024_04_26_dev.xml:
<multiattach_vdi_file>Theory_2024_04_26_dev.xml</multiattach_vdi_file>
Should be:
<multiattach_vdi_file>Theory_2024_04_26_dev.vdi</multiattach_vdi_file> |
12)
Message boards :
CMS Application :
CMS multi-core
(Message 8412)
Posted 16 Apr 2024 by computezrmle Post: Unlike VirtualBox, VMware is out of scope for CMS multi-core. Hence, discussing VirtualBox settings may be useful, while a discussion about VMware vs. VirtualBox is not. It just shifts the focus away. |
13)
Message boards :
CMS Application :
CMS multi-core
(Message 8410)
Posted 16 Apr 2024 by computezrmle Post: Your Ryzen 7 1700 has 8 physical cores and 16 logical cores. Your 1st log shows that you configured a 15-core VM (meanwhile you use 4-core VMs):
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3314981
2024-04-02 11:41:10 (6376): Setting CPU Count for VM. (15)
A 15-core VM should not be configured on a computer with 8 physical cores. Instead, no VM should be given more cores than the computer has physical cores. See a detailed comment about that here:
https://forums.virtualbox.org/viewtopic.php?t=77413
I suggest respecting that limit to avoid introducing issues into any test here that have nothing to do with CERN. |
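On Linux, the number of physical cores can be estimated from `lscpu`, as in the sketch below. The helper names are hypothetical, and the parsing assumes the English field labels "Core(s) per socket" and "Socket(s)", which is why `LC_ALL=C` is suggested in the usage line. Treat the result as a rough guide rather than an authoritative value.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Reads `lscpu` output on stdin and prints sockets * cores-per-socket.
# Leading blanks are allowed because newer lscpu versions indent these fields.
physical_cores() {
    awk -F: '
        /^[ \t]*Core\(s\) per socket/ { cores = $2 + 0 }
        /^[ \t]*Socket\(s\)/          { sockets = $2 + 0 }
        END { print cores * sockets }'
}

# Succeeds if a VM with $1 CPUs fits within $2 physical cores.
vm_cpus_ok() {
    [ "$1" -le "$2" ]
}

# Example:
#   phys="$(LC_ALL=C lscpu | physical_cores)"
#   vm_cpus_ok 15 "$phys" || echo "15-core VM exceeds $phys physical cores"
```

For the Ryzen 7 1700 from the post (1 socket, 8 cores per socket, 2 threads per core) this yields 8, so a 4-core VM passes the check and a 15-core VM does not.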
14)
Message boards :
Theory Application :
Suspend/Resume
(Message 8399)
Posted 8 Apr 2024 by computezrmle Post: The following comments are from the original BOINC service file on github:
# The following options prevent setuid root as they imply NoNewPrivileges=true
# Since Atlas requires setuid root, they break Atlas
# In order to improve security, if you're not using Atlas,
# Add these options to the [Service] section of an override file using
# sudo systemctl edit boinc-client.service
#NoNewPrivileges=true
#ProtectKernelModules=true
#ProtectKernelTunables=true
#RestrictRealtime=true
#RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
#RestrictNamespaces=true
#PrivateUsers=true
#CapabilityBoundingSet=
#MemoryDenyWriteExecute=true
#PrivateTmp=true #Block X11 idle detection
On your system the security policy is too strict and inhibits communication among relevant services required for ATLAS/Theory native.
ProtectControlGroups=yes
ProtectHome=yes
ProtectSystem=strict
Follow the comments above and create an override file with this content:
[Service]
ProtectControlGroups=no
ProtectHome=no
ProtectSystem=full
Then reboot and try a fresh Theory native task. Let's see if this solves it.
As for CVMFS you may look at this post:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5594&postid=48539 |
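For reference, `sudo systemctl edit boinc-client.service` stores the override as a drop-in file named override.conf in /etc/systemd/system/boinc-client.service.d/. The sketch below writes the same content non-interactively. The function name is hypothetical and the target directory is a parameter so the function can be tried in a scratch directory first.

```shell
#!/usr/bin/env bash
# Sketch only: write_override is a hypothetical helper.
# The file content is exactly the override suggested in the post.

# write_override DROPIN_DIR
write_override() {
    dropin_dir="$1"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
ProtectControlGroups=no
ProtectHome=no
ProtectSystem=full
EOF
}

# Real use needs root, e.g. from a root shell:
#   write_override /etc/systemd/system/boinc-client.service.d
# Then reboot as suggested in the post and verify with:
#   systemctl --no-pager show boinc-client | grep -i protect
```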
15)
Message boards :
Theory Application :
Suspend/Resume
(Message 8397)
Posted 8 Apr 2024 by computezrmle Post: time="2024-04-08T01:09:09Z" level=error msg="runc run failed: fchown fd 7: operation not permitted"
This is primarily an error reported by runc. Please post the output of the command below to make sure it is not a side effect of an overly strict systemd setting.
systemctl --no-pager show boinc-client |grep -i protect
In addition:
- You use BOINC version 7.18.1, which was never intended for Linux.
- You upgraded sudo (Ubuntu 22.04 originally came with sudo < 1.9.10)
- Your CVMFS configuration does not follow the latest suggestions (see your logs) |
16)
Message boards :
Theory Application :
Suspend/Resume
(Message 8394)
Posted 7 Apr 2024 by computezrmle Post: Your Ubuntu box does not support virtualization, hence you run Theory native. Theory native doesn't support suspend/resume out of the box. There are 2 possible methods to enable it:
1. Use the traditional method together with cgroups v1. This must be prepared as described in the prod forum. But this method is deprecated as most Linux distributions now use cgroups v2 as default.
2. Use cgroups v2 (plus sudo >= v1.9.10)
Your sudo version is 1.9.10, but you did not install the required sudoers file '/etc/sudoers.d/50-lhcathome_boinc_theory_native'. Check the relevant forum posts and your logfiles.
For method (2.), settings from method (1.) must not be used. Disable/uninstall them.
It looks like you are currently running a mix of both methods, but neither of them is set up correctly. As a result, the tasks ignore the pause/resume signals and keep running even if BOINC shows them as suspended. |
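The prerequisites for method (2.) can be checked with a short script like the sketch below. The function names are hypothetical. The version comparison relies on GNU `sort -V`, and the cgroups test assumes that `stat -fc %T /sys/fs/cgroup` prints `cgroup2fs` on a host using the unified cgroups v2 hierarchy. It only reports what it finds and does not replace the setup instructions in the prod forum.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical; threshold and path are from the post.

REQUIRED_SUDO="1.9.10"
SUDOERS_FILE="/etc/sudoers.d/50-lhcathome_boinc_theory_native"

# Succeeds if version $1 >= version $2 (GNU sort -V does the version ordering).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Prints the filesystem type mounted at /sys/fs/cgroup ("cgroup2fs" for v2).
cgroup_fs_type() {
    stat -fc %T /sys/fs/cgroup 2>/dev/null
}

report_prereqs() {
    # The first line of `sudo -V` is assumed to read "Sudo version X.Y.Z".
    sudo_version="$(sudo -V 2>/dev/null | awk 'NR==1 {print $3}')"
    if [ -n "$sudo_version" ] && version_ge "$sudo_version" "$REQUIRED_SUDO"; then
        echo "sudo $sudo_version: ok"
    else
        echo "sudo ${sudo_version:-not found}: need >= $REQUIRED_SUDO"
    fi
    if [ "$(cgroup_fs_type)" = "cgroup2fs" ]; then
        echo "cgroups v2: ok"
    else
        echo "cgroups v2: not detected"
    fi
    # Run as root: /etc/sudoers.d may not be readable by ordinary users.
    if [ -e "$SUDOERS_FILE" ]; then
        echo "sudoers file: present"
    else
        echo "sudoers file: missing ($SUDOERS_FILE)"
    fi
}

# Example: report_prereqs
```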
17)
Message boards :
Theory Application :
Suspend/Resume
(Message 8391)
Posted 5 Apr 2024 by computezrmle Post: Please make your computers visible for other volunteers here: https://lhcathomedev.cern.ch/lhcathome-dev/prefs.php?subset=project |
18)
Message boards :
Theory Application :
Veeerrrry long Pythia8
(Message 8318)
Posted 1 Feb 2024 by computezrmle Post: "I don't understand why ... I just removed them and this time I got a result"
Well, that's your problem! You checked for the first line showing "processed" instead of the last line.
This forum is not the place to give basic lessons. Neither for Linux nor for Windows. If you don't want to invest the time to learn the most simple basics, it might be better not to try to "analyse" anything.
Instead of:
sudo grep -m1 'processed' < /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log
run this:
sudo grep 'processed' /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log |
19)
Message boards :
Theory Application :
Veeerrrry long Pythia8
(Message 8316)
Posted 1 Feb 2024 by computezrmle Post: Why "sudo"? Since you are already in "/var/lib/boinc-client/slots/4" it looks like you have the necessary access rights to list the dirs/files, don't you?
My 1st command is already explained: Look for a "runRivet.log" not modified recently (recently can even be 1-x hours) as this might indicate a task being either in an endless loop or pausing.
The 2nd should be self-explanatory as it uses basic Linux commands. Make yourself familiar with those commands as they are widely used. Here the command is used to locate the last entry of the "processed" pattern in "runRivet.log". The result should look like:
20100 events processed
A typical 1st line in "runRivet.log" looks like:
===> [runRivet] Tue Jan 30 20:52:08 UTC 2024 [boinc pp jets 13000 1500 - pythia8 8.301 CP1-CR1 100000 78]
The second-to-last number inside the brackets tells you how many events are to be processed. Here: 100000
So, roughly 20% of the task is done.
Now, look at the task's walltime, say (example): 21 hours
=> This task has another 84 hours to go.
=> It will finish before the 10-days-limit.
If the oneliners I suggested don't work for you (e.g. due to missing access rights) feel free to copy the log to a folder where you have full rights and use an editor to look into the copy. |
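The two numbers used in this post can be pulled out of a log with a few standard tools, as sketched below. The function names are hypothetical, and the sketch assumes the log layout shown above: the total is the second-to-last field of the first line, and progress lines read "N events processed".

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical; assumes the log layout shown in the post.

# Total events: second-to-last field of the first line,
# e.g. "... CP1-CR1 100000 78]" gives 100000.
total_events() {
    head -n1 "$1" | awk '{print $(NF-1)}'
}

# Events done so far: first field of the LAST line containing "processed".
processed_events() {
    grep 'processed' "$1" | tail -n1 | awk '{print $1}'
}

# Percentage done, with one decimal place.
percent_done() {
    awk -v n="$(processed_events "$1")" -v total="$(total_events "$1")" \
        'BEGIN { if (total > 0) printf "%.1f\n", 100 * n / total }'
}

# Example (adjust the path to your own slot directory):
#   percent_done /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log
```

With the example values from the post (20100 of 100000 events) this prints 20.1, which matches the "roughly 20%" above.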
20)
Message boards :
Theory Application :
Veeerrrry long Pythia8
(Message 8314)
Posted 1 Feb 2024 by computezrmle Post: Sherpas sometimes get stuck in endless loops. Check the "runRivet.log" from that task. Has something been written to the log within the last "mmin" minutes (720 min = 12 h)?
Be aware! This command will also show logs from tasks that have been suspended for roughly the same time span.
find /path_to_your/BOINC_working_directory/slots -type f -name "runRivet.log" -mmin +720 |xargs -I {} bash -c "head -n1 {}; ls -hal {}"
Now, use the path to the log from above for the next command. This shows how many events are already processed (or nothing).
grep -m1 'processed' <(tac /path_to_your_logfile)
Compare that number with the task's total events and the actual runtime of the task. This allows you to calculate the estimated total runtime. If that is far more than 10 days, the task will not finish within the deadline and should be cancelled.
Be aware! Do not use the runtime estimates shown by BOINC. BOINC doesn't know anything about the internal structure/logs of Theory tasks. |
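The runtime estimate described here is a simple linear extrapolation: total runtime is roughly elapsed runtime × total events / processed events. A minimal sketch follows, with hypothetical function names, and assuming that events are processed at a constant rate (which real tasks may not do). The 240 hours correspond to the 10-day limit mentioned in the post.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical; assumes a constant event rate.

DEADLINE_HOURS=240   # 10 days

# estimate_total_hours PROCESSED TOTAL ELAPSED_HOURS
# Prints the extrapolated total runtime in hours; fails if nothing was processed yet.
estimate_total_hours() {
    awk -v p="$1" -v t="$2" -v h="$3" \
        'BEGIN { if (p <= 0) exit 1; printf "%.1f\n", h * t / p }'
}

# meets_deadline PROCESSED TOTAL ELAPSED_HOURS
# Succeeds if the extrapolated total runtime is within the deadline.
meets_deadline() {
    est="$(estimate_total_hours "$1" "$2" "$3")" || return 2
    awk -v x="$est" -v d="$DEADLINE_HOURS" 'BEGIN { exit !(x <= d) }'
}

# Example with 20100 of 100000 events done after 21 hours:
#   estimate_total_hours 20100 100000 21
#   meets_deadline 20100 100000 21 && echo "should finish in time"
```

A task that has processed only 1000 of 100000 events after 21 hours would extrapolate to 2100 hours, far beyond the limit, and would be a candidate for cancelling.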
©2024 CERN