41) Message boards : Theory Application : Why is there no native_theory on this project? (Message 8069)
Posted 28 Apr 2023 by computezrmle
Post:
There's currently no need.
42) Message boards : ATLAS Application : Sometimes atlas jobs use more cpu and memory than required. (Message 8064)
Posted 27 Apr 2023 by computezrmle
Post:
Unfortunately David Cameron left CERN to start a new job.
Among the last tests he made here were modifications to find out optimized (RAM) settings for ATLAS v3.x.
Those tests might not yet be completed and may need to be finished once the team has found a successor.

Besides that, you mentioned ATLAS native but linked to ATLAS vbox tasks.
43) Message boards : ATLAS Application : SSL certificate error in atlas tasks (Message 8063)
Posted 27 Apr 2023 by computezrmle
Post:
The same suggestion came up 2 weeks ago at the -prod forum.
I just replied to the post there:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5986&postid=48045
44) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8031)
Posted 22 Mar 2023 by computezrmle
Post:
Same as described here:
https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=614&postid=8026
45) Message boards : ATLAS Application : Huge EVNT Files (Message 8029)
Posted 22 Mar 2023 by computezrmle
Post:
[22/Mar/2023:15:11:31 +0100] "GET http://lhcathome-upload.cern.ch/lhcathome/download//217/bM6LDmhcBz2n9Rq4apoT9bVoABFKDmABFKDmSrVUDm0NXKDmZhlDEn_EVNT.32794564._000002.pool.root.1 HTTP/1.1" 200 1166414003 "-" "BOINC client (x86_64-suse-linux-gnu 7.21.0)" TCP_MISS:HIER_DIRECT

[22/Mar/2023:15:27:33 +0100] "GET http://lhcathome-upload.cern.ch/lhcathome/download//1be/MBeLDmLdBz2nsSi4apGgGQJmABFKDmABFKDmjv4TDmjwcKDmTB0Dkm_EVNT.32794564._000002.pool.root.1 HTTP/1.1" 200 1166414003 "-" "BOINC client (x86_64-suse-linux-gnu 7.21.0)" TCP_MISS:HIER_DIRECT


Just noticed these huge ATLAS EVNT files being downloaded from prod to different clients:
1,166,414,003 bytes => 1.2 GB each


<edit>
Next one:
[22/Mar/2023:15:52:53 +0100] "GET http://lhcathome-upload.cern.ch/lhcathome/download//1a4/eP1NDmDOCz2np2BDcpmwOghnABFKDmABFKDmtdFKDmY3TKDmyMPKRn_EVNT.32794564._000003.pool.root.1 HTTP/1.1" 200 1164691364 "-" "BOINC client (x86_64-suse-linux-gnu 7.21.0)" TCP_MISS:HIER_DIRECT

</edit>
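The size conversion above can be double-checked quickly (a sketch; the byte count is taken from the access log lines):

```shell
# Quick check of the size conversion from the access log above
bytes=1166414003
gb=$(awk -v b="$bytes" 'BEGIN { printf "%.2f", b / 1e9 }')
echo "${bytes} bytes ≈ ${gb} GB"
```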
46) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8028)
Posted 22 Mar 2023 by computezrmle
Post:
One solution is to add the library path to LD_LIBRARY_PATH.
It must point to the lib version pacparser/pactester is linked against.

Example (works for the old ATLAS version):

pactester_bin="/cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/pacparser/1.3.5-a65a3/x86_64-centos7-gcc62-opt/bin/pactester"

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/pacparser/1.3.5-a65a3/x86_64-centos7-gcc62-opt/lib


The export can be made in setup.sh-local for example.
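To avoid mismatches, the lib directory can be derived from the binary path itself (a sketch using the cvmfs paths from the example above):

```shell
# Sketch: derive the matching lib directory from the pactester binary path
# (same cvmfs path as in the example above) and append it to LD_LIBRARY_PATH.
pactester_bin="/cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/pacparser/1.3.5-a65a3/x86_64-centos7-gcc62-opt/bin/pactester"
pacparser_lib="${pactester_bin%/bin/*}/lib"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:${pacparser_lib}"
echo "added: ${pacparser_lib}"
```

On a host with cvmfs mounted, `ldd "${pactester_bin}"` should then resolve libpacparser from that directory.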
47) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8026)
Posted 22 Mar 2023 by computezrmle
Post:
log.EVNTtoHITS starts but hangs after the last line:

12:16:59 Wed Mar 22 12:16:59 CET 2023
12:16:59 Preloading tcmalloc_minimal.so
12:16:59 Preloading /cvmfs/atlas.cern.ch/repo/sw/software/23.0/AthSimulationExternals/23.0.19/InstallArea/x86_64-centos7-gcc11-opt/lib/libintlc.so.5:/cvmfs/atlas.cern.ch/repo/sw/software/23.0/AthSimulationExternals/23.0.19/InstallArea/x86_64-centos7-gcc11-opt/lib/libimf.so
12:17:08 Py:Sim_tf            INFO ****************** STARTING Simulation *****************
12:17:08 Py:Sim_tf            INFO **** Transformation run arguments
12:17:08 Py:Sim_tf            INFO RunArguments:
12:17:08    AMITag = 's4066'
12:17:08    EVNTFileIO = 'input'
12:17:08    concurrentEvents = 4
12:17:08    conditionsTag = 'OFLCOND-MC21-SDR-RUN3-07'
12:17:08    firstEvent = 99501
12:17:08    geometryVersion = 'ATLAS-R3S-2021-03-02-00'
12:17:08    inputEVNTFile = ['EVNT.29838250._000010.pool.root.1']
12:17:08    inputEVNTFileNentries = 10000
12:17:08    inputEVNTFileType = 'EVNT'
12:17:08    jobNumber = 200
12:17:08    maxEvents = 20
12:17:08    nprocs = 0
12:17:08    outputHITSFile = 'HITS.32413688._000229-2536910-1679482957.pool.root.1'
12:17:08    outputHITSFileType = 'HITS'
12:17:08    perfmon = 'fastmonmt'
12:17:08    postInclude = ['PyJobTransforms.UseFrontier']
12:17:08    preInclude = ['Campaigns.MC23aSimulationMultipleIoV']
12:17:08    randomSeed = 200
12:17:08    runNumber = 601229
12:17:08    simulator = 'FullG4MT_QS'
12:17:08    skipEvents = 9500
12:17:08    threads = 4
12:17:08    totalExecutorSteps = 0
12:17:08    trfSubstepName = 'EVNTtoHITS'
12:17:08 Py:Sim_tf            INFO **** Setting-up configuration flags

pilotlog.txt loops, printing lines like these:

2023-03-22 11:44:12,410 | INFO     | 1693.4158494472504s have passed since pilot start
2023-03-22 11:44:18,540 | INFO     | monitor loop #22: job 0:5753961892 is in state 'running'
2023-03-22 11:44:20,310 | INFO     | CPU consumption time for pid=121445: 10.57 (rounded to 11)
2023-03-22 11:44:20,310 | INFO     | executing command: ps -opid --no-headers --ppid 121445
2023-03-22 11:44:20,469 | INFO     | neither /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/memory_monitor_summary.json, nor /home/boinc9/BOINC_TEST/slots/0/memory_monitor_summary.json exist
2023-03-22 11:44:20,469 | INFO     | using path: /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/memory_monitor_output.txt (trf name=prmon)
2023-03-22 11:44:20,470 | INFO     | executing command: ps aux -q 109100
2023-03-22 11:44:20,585 | INFO     | neither /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/memory_monitor_summary.json, nor /home/boinc9/BOINC_TEST/slots/0/memory_monitor_summary.json exist
2023-03-22 11:44:20,585 | INFO     | using path: /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/memory_monitor_output.txt (trf name=prmon)
2023-03-22 11:44:20,586 | INFO     | max memory (maxPSS) used by the payload is within the allowed limit: 627933 B (2 * maxRSS = 131072000 B)
2023-03-22 11:44:20,587 | INFO     | oom_score(pilot) = 666, oom_score(payload) = 666
2023-03-22 11:44:20,587 | INFO     | payload log (log.EVNTtoHITS) within allowed size limit (2147483648 B): 1606 B
2023-03-22 11:44:20,587 | INFO     | payload log (payload.stdout) within allowed size limit (2147483648 B): 9445 B
2023-03-22 11:44:20,587 | INFO     | executing command: df -mP /home/boinc9/BOINC_TEST/slots/0
2023-03-22 11:44:20,628 | INFO     | sufficient remaining disk space (102636716032 B)
2023-03-22 11:44:20,628 | INFO     | work directory size check will use 61362667520 B as a max limit (10% grace limit added)
2023-03-22 11:44:20,629 | INFO     | size of work directory /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892: 27332 B (within 61362667520 B limit)
2023-03-22 11:44:20,629 | INFO     | pfn file=/home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/HITS.32413688._000229-2536910-1679482957.pool.root.1 does not exist (skip from workdir size calculation)
2023-03-22 11:44:20,629 | INFO     | total size of present files: 0 B (workdir size: 27332 B)
2023-03-22 11:44:20,629 | INFO     | output file size check: skipping output file /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/HITS.32413688._000229-2536910-1679482957.pool.root.1 since it does not exist
2023-03-22 11:44:21,871 | INFO     | number of running child processes to parent process 121445: 6
2023-03-22 11:44:21,872 | INFO     | maximum number of monitored processes: 6
48) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8022)
Posted 22 Mar 2023 by computezrmle
Post:
... also setup the squid auto-discovery (Web Proxy Auto Detection (WPAD)) which should find the best squid server to use automatically.

This needs to be tested, especially with complex wpad files.
Frontier clients/pacparser libs run into problems if the server list gets too long.

Please provide some test tasks to check the runtime logs for related Frontier errors/warnings.

<edit>
If possible, not the 500-event ones.
</edit>
49) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8021)
Posted 22 Mar 2023 by computezrmle
Post:
Hmm, I really don't understand this. I checked the log of this task and it looks correct:

2023-03-21 09:54:07,695 [wrapper] Content of /home/atlas/RunAtlas//setup.sh.local
export FRONTIER_SERVER="(serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)(serverurl=http://atlasfrontier-ai.cern.ch:8000/atlr)(proxyurl=http://on....

and this configuration is the same in dev and prod.


setup.sh.local shows the Frontier setup that is intended to be used, but the setup that is actually used is reported in log.EVNTtoHITS.
The old ATLAS version reports the Frontier proxies there, like:
07:10:02 DBReplicaSvc         INFO Frontier server at (serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)...
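That line can be pulled out with a simple grep (a sketch; the sample line below is the one quoted above, written to a temp file for illustration):

```shell
# Sketch: extract the Frontier server the payload actually used from
# a log.EVNTtoHITS-style line (sample line taken from the post above).
logfile=$(mktemp)
echo '07:10:02 DBReplicaSvc         INFO Frontier server at (serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)...' > "$logfile"
frontier=$(grep -m1 -o 'serverurl=[^)]*' "$logfile")
echo "$frontier"
rm -f "$logfile"
```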

Do you have access to the corresponding log.EVNTtoHITS from the example task?


atlasfrontier-ai.cern.ch would be used as the first fallback if atlascern-frontier.openhtc.io doesn't respond.
It's very unlikely that within a couple of days all dev tasks ran into fail-over while all prod tasks didn't.

<edit>
Sorry, didn't notice your recent post.
</edit>
50) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8019)
Posted 22 Mar 2023 by computezrmle
Post:
This is not related to the Frontier issue.
51) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8014)
Posted 21 Mar 2023 by computezrmle
Post:
My test client running ATLAS tasks from -dev today made 0 requests to atlascern-frontier.openhtc.io but 3225 requests to atlasfrontier-ai.cern.ch via my local Squid.
first and last:
[21/Mar/2023:10:56:45 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNoLdvVxdQ5RCHMNCvb091NwC-L3VQgI8ncJdQ6Jd-b3DfD3c-ULiYdJh3u4BrnC5BV8PL1dFdT9ixKTc1IVXBJLEpMSi1NV1QHuABhy HTTP/1.0" 200 1256 "-" "-" TCP_REFRESH_MODIFIED:HIER_DIRECT
.
.
.
[21/Mar/2023:11:00:20 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNqNkL0KwkAQhF9luSZNCCoSsEixZtfLwf2EvRVJlfd-C4NahJxoypn5ZorJ7LlX4MD9iIIhz4SKzUbPjuqCGQcnSVFdimWY0VoXbRm4GFmyx6gvtwTSXdcA3CQFMKgeM5FpzIY3UDj1D-ykaMvK230MLAxfovUdi1zegK46Htr2fKkAI.24r-sz.8EgCbHAddox.QQHvXj6 HTTP/1.0" 200 1517 "-" "-" TCP_REFRESH_UNMODIFIED:HIER_DIRECT

The requests were sent by this task:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3195547
All my ATLAS tests (native/vbox) within the last few days used the standard files sent by the project server after a project reset, except the local app_config.xml where I configured the RAM settings to be tested.


ATLAS tasks from -prod use atlascern-frontier.openhtc.io as usual.
52) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8009)
Posted 21 Mar 2023 by computezrmle
Post:
David Cameron wrote:
We will use the old setting until we stop running the old version. Then, a constant setting can be used since the memory usage of the new software is independent of number of cores. On average it uses around 2.5GB so we'll probably set something like 3.5GB to have a safety factor.

I just ran another VM (3.5 GB) that used 2.2 GB for the scientific app.
It ran fine but had little RAM headroom.

If 2.5 GB is the expected average for the scientific app I'd like to suggest 4 GB for the VM to leave enough headroom for the OS and the VM internal page cache:
2.5 GB (scientific app) + 0.6 GB (OS) + 0.9 GB (page cache + small headroom) = 4 GB (total)
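The suggested budget adds up as a quick sanity check (values from above, in MB):

```shell
# VM RAM budget from the suggestion above, in MB
app=2500     # scientific app (expected average)
os=600       # guest OS
cache=900    # page cache + small headroom
total=$((app + os + cache))
echo "suggested VM size: ${total} MB"
```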
53) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8005)
Posted 21 Mar 2023 by computezrmle
Post:
I'm still getting Frontier data from atlasfrontier-ai.cern.ch.
54) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8004)
Posted 21 Mar 2023 by computezrmle
Post:
KISS (keep it simple, stupid)
There's no need to maintain (and explain) an additional (very long running!) app_version just to pack some more events into a task.
55) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8001)
Posted 21 Mar 2023 by computezrmle
Post:
Not sure if I correctly understand the new RAM setting for ATLAS VMs.

Old:
RAM is set according to 3000 MB + 900 MB * #cores

New:
Fixed RAM setting, currently 2241 MB, may become 4000 MB (?)

Since the new VM should be able to run either the old or the new ATLAS version, how will the RAM setting be configured?
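The difference can be illustrated with the old formula (numbers taken from above; the new fixed value is still open):

```shell
# Old per-core RAM formula vs. the new fixed setting, values from the post
cores=4
old_mb=$((3000 + 900 * cores))
new_mb=2241            # current fixed value; may become 4000
echo "old: ${old_mb} MB, new: ${new_mb} MB"
```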
56) Message boards : ATLAS Application : vbox console monitoring with 3.01 (Message 7999)
Posted 20 Mar 2023 by computezrmle
Post:
since we'll have to run both in parallel for some time

Did the old version change its logging behaviour?
If not, we would need to keep the old monitoring branch and call it if an old ATLAS version is processed.
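If the old logging didn't change, the monitoring could dispatch on the app version, e.g. (a sketch; how the version would actually be detected is left open):

```shell
# Hypothetical dispatch between old and new log parsing, keyed on the
# ATLAS app version (version detection itself is not shown here).
atlas_version="3.01"
case "${atlas_version}" in
    3.*) parser="new" ;;   # merged multi-worker log format
    *)   parser="old" ;;   # per-worker logs, lines in order
esac
echo "using ${parser} monitoring branch"
```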
57) Message boards : ATLAS Application : vbox console monitoring with 3.01 (Message 7998)
Posted 20 Mar 2023 by computezrmle
Post:
CP's screenshots show what I described as "messed up" logfile lines.
See: "worker 1..."

As a result the monitoring script can't extract the runtime (here: 338 s).
This leads to the missing values now reported as "N/A".
A side effect is the "flashing text" CP already reported.
58) Message boards : ATLAS Application : vbox console monitoring with 3.01 (Message 7997)
Posted 20 Mar 2023 by computezrmle
Post:
I thought that the new tasks would all behave the same as old single-core tasks, writing to a single file. I changed the code searching for the event times to handle the old and new message format, but maybe something else needs changed. I will look deeper.

@ David
It is not only the search pattern.
An additional point is that the old monitoring expects the result lines in order.
This was guaranteed in singlecore mode within the main log as well as in multicore mode within each of the worker logs.

Now the log entries are no longer in order, and the monitoring has to account for this.
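One way to handle this is to restore the order before parsing, e.g. by sorting on the event number (the log format below is made up for illustration):

```shell
# Sketch: with several workers writing to one log, the per-event result
# lines arrive interleaved; sorting on the event number restores order.
# (log format below is made up for illustration)
sorted=$(printf '%s\n' \
    "worker 2 event 4 took 410 s" \
    "worker 1 event 3 took 338 s" \
    "worker 1 event 5 took 352 s" |
    sort -k4,4n)
echo "$sorted"
```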

Another point that needs to be checked:
The timing averages reported by the workers refer only to the worker thread they come from.
Hence, they are not valid for calculating the total average.
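A correct overall average would have to weight each worker's average by its event count, e.g. (event counts and averages below are illustration values, not from a real task):

```shell
# Sketch: the overall average weights each worker's average by its
# event count; a plain mean of the per-worker averages would be wrong.
overall=$(awk 'BEGIN {
    n[1] = 12; avg[1] = 300   # worker 1: 12 events, 300 s average
    n[2] = 8;  avg[2] = 450   # worker 2: 8 events, 450 s average
    for (w in n) { t += n[w] * avg[w]; e += n[w] }
    printf "%.1f", t / e
}')
echo "overall average: ${overall} s"
```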
59) Message boards : ATLAS Application : vbox console monitoring with 3.01 (Message 7994)
Posted 20 Mar 2023 by computezrmle
Post:
Total number of events is displayed (50)
already finished, mean, min, max, estimated time left is not displayed and every 60 seconds several lines of text are flashing (not readable) and cleared.
Worker 1 Event showing showed 2nd, 4th and then 3th event ?? for this worker took ### s (### is changing now and then e.g. 421)


Edit picture and no swap used
Flashing text:...

Same here
60) Message boards : ATLAS Application : vbox console monitoring with 3.01 (Message 7993)
Posted 20 Mar 2023 by computezrmle
Post:
Another 4-core task is in progress.
Manually set the VM's RAM size to 3900 MB (the default used for older 1-core VMs).

Top now shows 776 kB swap being used.
The differencing image uses around 880 MB and grows slowly while the events are being processed.
The main python process uses 2.2 GB RAM and close to 400% CPU which corresponds to the 4-core setup.


©2024 CERN