1) Message boards : Theory Application : All errors (Message 7912)
Posted 25 days ago by computezrmle
Post:
You may have removed the VM entries but you obviously forgot to remove the disk entries.
Use the VirtualBox Media Manager to do this.
2) Message boards : Theory Application : All errors (Message 7910)
Posted 25 days ago by computezrmle
Post:
You need to manually clean the VirtualBox media registry.


1. Ensure no Theory task from -dev is running or paused
2. Stop BOINC
3. Open the VirtualBox GUI and start the Media Manager
4. Remove all child vdis connected to parent a9d19666-9f42-47d2-9b06-f58d52e3215c
5. Remove all files/folders from the slots where Theory -dev tasks had been.
6. Remove the parent vdi a9d19666-9f42-47d2-9b06-f58d52e3215c (but not its file within the projects folder).
7. Restart BOINC

Avoid starting too many tasks concurrently that try to initialize the same vdi as a multiattach parent, as this may cause a race condition.
The race is caused by a workaround vboxwrapper must use to correctly register a vdi in multiattach mode.
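
If you want to see what is registered before clicking through the Media Manager, the following is a minimal read-only Python sketch that lists the child disks below a given parent UUID. It assumes that VirtualBox stores child disks as nested HardDisk elements under their parent in the media registry XML. The sample XML, the child UUIDs and the function names are made up for illustration; the script never modifies anything.

```python
import xml.etree.ElementTree as ET

PARENT_UUID = "a9d19666-9f42-47d2-9b06-f58d52e3215c"

# Made-up sample for illustration; a real VirtualBox.xml holds much more.
SAMPLE = """<VirtualBox xmlns="http://www.virtualbox.org/">
  <Global>
    <MediaRegistry>
      <HardDisks>
        <HardDisk uuid="{a9d19666-9f42-47d2-9b06-f58d52e3215c}" location="parent.vdi" format="VDI" type="MultiAttach">
          <HardDisk uuid="{11111111-1111-1111-1111-111111111111}" location="child1.vdi" format="VDI"/>
          <HardDisk uuid="{22222222-2222-2222-2222-222222222222}" location="child2.vdi" format="VDI"/>
        </HardDisk>
      </HardDisks>
    </MediaRegistry>
  </Global>
</VirtualBox>"""


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}HardDisk'."""
    return tag.rsplit("}", 1)[-1]


def find_children(xml_text, parent_uuid):
    """Return (uuid, location) for every HardDisk nested below the parent."""
    root = ET.fromstring(xml_text)
    wanted = parent_uuid.strip("{}").lower()
    result = []
    for elem in root.iter():
        if local_name(elem.tag) != "HardDisk":
            continue
        if elem.get("uuid", "").strip("{}").lower() != wanted:
            continue
        # Collect all descendants of the parent, not only direct children.
        for child in elem.iter():
            if child is not elem and local_name(child.tag) == "HardDisk":
                result.append((child.get("uuid", "").strip("{}"),
                               child.get("location")))
    return result


for uuid, location in find_children(SAMPLE, PARENT_UUID):
    print(uuid, location)
```

To run it against a real registry, read the file content into a string and pass it in place of SAMPLE. Removing entries should still be done through the Media Manager as described in the steps.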
3) Message boards : ATLAS Application : ATLAS vbox v.1.27 (Message 7908)
Posted 8 Dec 2022 by computezrmle
Post:
This will happen occasionally.
It is caused by VirtualBox's vdi handling which moves the media entry back and forth between different configuration files.
Version 7.x makes it even worse compared to v6.1 as it violates its own rule not to allow the multiattach attribute within the global media registry.

So far the only solution is to manually clear orphaned media entries.
4) Message boards : News : Server Release 1.4.0 (Message 7906)
Posted 3 Dec 2022 by computezrmle
Post:
Theory validator is not running.
https://lhcathomedev.cern.ch/lhcathome-dev/server_status.php
5) Message boards : ATLAS Application : ATLAS vbox v.1.26 (Message 7893)
Posted 25 Nov 2022 by computezrmle
Post:
It's just an info message, not even a warning.
And it's written by the VirtualBox core.

It should go away if the host computer's RTC and every OS (host and guests) are set to use UTC.
More details can be found in this PR:
https://github.com/BOINC/boinc/pull/4631

Vboxwrapper 26206 used here includes that patch, but the clock and the regkey must be set correctly once by the user.
6) Message boards : General Discussion : Got 0 new tasks (Message 7880)
Posted 11 Nov 2022 by computezrmle
Post:
As for ATLAS:
On the prefs page (https://lhcathomedev.cern.ch/lhcathome-dev/prefs.php?subset=project) disable "Run native if available?"

As for GPU work:
The xtrack queue is empty.
The AMD/ATI message is an old and well-known server issue: it is sent whenever none of the subprojects sends any work.
7) Message boards : News : Server Release 1.4.0 (Message 7876)
Posted 10 Nov 2022 by computezrmle
Post:
Did you set this in your BOINC client's cc_config.xml?
<dont_check_file_sizes>1</dont_check_file_sizes>
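
A minimal Python sketch to check whether the option is really set. As far as I know the entry belongs inside the options section of cc_config.xml; the sample content and the helper name option_enabled are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up minimal cc_config.xml for illustration.
SAMPLE_CC_CONFIG = """<cc_config>
  <options>
    <dont_check_file_sizes>1</dont_check_file_sizes>
  </options>
</cc_config>"""


def option_enabled(xml_text, name):
    """Return True if <options><name> exists and holds a non-zero integer."""
    root = ET.fromstring(xml_text)
    value = root.findtext(f"options/{name}")
    if value is None:
        return False
    try:
        return int(value.strip()) != 0
    except ValueError:
        return False


print(option_enabled(SAMPLE_CC_CONFIG, "dont_check_file_sizes"))
```

After editing the file the client has to re-read it, for example by restarting BOINC.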
8) Message boards : ATLAS Application : ATLAS vbox v.1.17 (Message 7868)
Posted 8 Nov 2022 by computezrmle
Post:
Theory and CMS are already updated.
ATLAS may follow next week I guess.
9) Message boards : CMS Application : New Version 60.70 (Message 7866)
Posted 7 Nov 2022 by computezrmle
Post:
7.x might work, but since it is rather new you may stumble over unexpected issues.

I'm already aware of a modification that affects the media manager.
So far it's not a show stopper here but it needs a closer look.
10) Message boards : CMS Application : New Version 60.70 (Message 7865)
Posted 7 Nov 2022 by computezrmle
Post:
Some kind of a "bootstrap preloader" that does this:
1. mount the shared folder
2. parse init_data.xml from there (it tells you whether you are in dev or prod)
3. modify the link to the main bootstrap script in /sbin according to (2.)
4. mount grid.cern.ch (the link points to this repo; unlike ATLAS which gets its boot script from atlas.cern.ch)
5. execute the main bootstrap script on CVMFS


This is very close to the ATLAS script.
I'll prepare a suggestion.
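
The decision in steps 2 and 3 could be sketched like this. Python is used only for illustration, since the real preloader would be a shell script inside the VM. The marker string, the sample init data and the function name detect_branch are assumptions, not taken from the ATLAS script.

```python
# Assumption: init_data.xml mentions the project directory or URL, and the
# dev project can be recognised by this substring.
DEV_MARKER = "lhcathome-dev"

# Made-up, heavily shortened samples for illustration.
SAMPLE_DEV = (
    "<app_init_data><project_dir>"
    "/var/lib/boinc/projects/lhcathomedev.cern.ch_lhcathome-dev"
    "</project_dir></app_init_data>"
)
SAMPLE_PROD = (
    "<app_init_data><project_dir>"
    "/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome"
    "</project_dir></app_init_data>"
)


def detect_branch(init_data_text):
    """Return 'dev' if the init data points to the dev project, else 'prod'."""
    return "dev" if DEV_MARKER in init_data_text else "prod"


for sample in (SAMPLE_DEV, SAMPLE_PROD):
    # The preloader would repoint the bootstrap link according to this value.
    print(detect_branch(sample))
```

Defaulting to prod when the marker is missing is a design choice of this sketch; a real script may prefer to stop with an error instead.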
11) Message boards : CMS Application : New Version 60.70 (Message 7863)
Posted 7 Nov 2022 by computezrmle
Post:
Your CVMFS requests from Theory tasks are also sent to *.openhtc.io.
Surely to the same Cloudflare datacenter.
Could even be that CVMFS and frontier requests are processed by the very same Cloudflare Squid instance there.

Those Squids get their data from backend systems at CERN, RAL, Fermilab ... and do an automatic fail-over.
Very unlikely that all backend systems are down at the same moment.
Especially since this would crash nearly all CMS tasks worldwide.


Please upgrade VirtualBox to the recent v6.1.
The new vboxwrapper 26206 does not have the .com interface any more which was responsible for problems in the past.
12) Message boards : CMS Application : New Version 60.70 (Message 7860)
Posted 7 Nov 2022 by computezrmle
Post:
*.openhtc.io responds to both IPv4 and IPv6.

from your screenshot -> 188.114.96.1
At least this one should have worked.

It is clearly reported by the frontier client.
Was it a transient error (1 task only) or do all tasks report it?

Could you check whether Cloudflare is blocked in your firewall (maybe only for this box)?
13) Message boards : CMS Application : New Version 60.70 (Message 7859)
Posted 7 Nov 2022 by computezrmle
Post:
@Laurence
The CMS vdi currently in use sets this link in /usr/sbin/bootstrap:
/cvmfs/grid.cern.ch/vc/vm-qa/sbin/bootstrap-idtoken

bootstrap-idtoken sets this variable:
branch=qa

If the intention was to implement a dev/prod switch, it will not work.


I recently implemented a switch for ATLAS that serves the same purpose:
https://github.com/davidgcameron/boinc-scripts/blob/master/vbox/ATLASbootstrap.sh#L41-L45

It could easily be rewritten and tested for CMS.
Just give me a "go".
14) Message boards : CMS Application : New Version 60.70 (Message 7854)
Posted 7 Nov 2022 by computezrmle
Post:
As for vboxwrapper 26206 this task is running fine:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3141537

I'm still missing the modifications of the vdi's boot partition that should make CVMFS fail-over and load balancing more robust.
Are there plans to implement them before this vdi is used on prod?
15) Message boards : CMS Application : New Version 60.67 (Message 7849)
Posted 3 Nov 2022 by computezrmle
Post:
After a project reset on my test client I ran a CMS v60.67 task (reduced to singlecore) last night.
The task finished successfully.

Today I got another task that runs as 2-core.
checked:
- pausing/resuming the task
- pausing -> BOINC shutdown -> waiting a while -> restarting BOINC -> resuming the task

So far the task passed all tests and runs fine.
It will take some hours until it is done.
16) Message boards : CMS Application : New Version 60.67 (Message 7848)
Posted 3 Nov 2022 by computezrmle
Post:
...I noticed that Vbox7 keeps the used harddisks in the VirtualBox.xml file even after a reboot and with no BOINC VMs in use.

    <MediaRegistry>
      <HardDisks>
        <HardDisk uuid="{6f08958e-7bfd-4804-8dd7-c7b4408cb126}" location="D:/Boinc1/projects/lhcathome.cern.ch_lhcathome/ATLAS_vbox_2.02_image.vdi" format="VDI" type="MultiAttach"/>
        <HardDisk uuid="{997e0796-142b-4278-9763-0bceb3ac71bc}" location="D:/Boinc1/projects/lhcathomedev.cern.ch_lhcathome-dev/ATLAS_vbox_1.17_image.vdi" format="VDI" type="MultiAttach"/>
        <HardDisk uuid="{dae25e8f-de18-4971-b11c-eca764ede402}" location="D:/Boinc1/projects/lhcathome.cern.ch_lhcathome/CMS_2022_09_07_prod.vdi" format="VDI" type="MultiAttach"/>
        <HardDisk uuid="{8fb925ef-3497-4bfb-88e3-bbab2930787f}" location="D:/Boinc1/projects/lhcathomedev.cern.ch_lhcathome-dev/CMS_2022_09_07.vdi" format="VDI" type="MultiAttach"/>
      </HardDisks>

Indeed, v7.x writes the multiattach parents to the global store while older versions write them to the description files of the VMs using the parent.
This may make it possible to skip the workaround in future vboxwrapper versions when they detect VirtualBox >= v7.

Since the workaround is only executed in case of an error and we have done lots of tests to get it stable, I'd like to avoid any change for now.
But I'll keep that in mind.
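
A quick way to see which multiattach parents ended up in the global store is a small read-only script. This is a minimal Python sketch; the sample reuses two entries from the quoted snippet, wrapped so that it is well-formed, and the function name multiattach_parents is made up.

```python
import xml.etree.ElementTree as ET

# Two entries copied from the quoted snippet, wrapped to be well-formed.
SAMPLE_REGISTRY = """<MediaRegistry>
  <HardDisks>
    <HardDisk uuid="{6f08958e-7bfd-4804-8dd7-c7b4408cb126}" location="D:/Boinc1/projects/lhcathome.cern.ch_lhcathome/ATLAS_vbox_2.02_image.vdi" format="VDI" type="MultiAttach"/>
    <HardDisk uuid="{8fb925ef-3497-4bfb-88e3-bbab2930787f}" location="D:/Boinc1/projects/lhcathomedev.cern.ch_lhcathome-dev/CMS_2022_09_07.vdi" format="VDI" type="MultiAttach"/>
  </HardDisks>
</MediaRegistry>"""


def multiattach_parents(xml_text):
    """Return the locations of all HardDisk entries of type MultiAttach."""
    root = ET.fromstring(xml_text)
    found = []
    for elem in root.iter():
        # Ignore a possible XML namespace prefix on the tag name.
        name = elem.tag.rsplit("}", 1)[-1]
        if name == "HardDisk" and elem.get("type", "").lower() == "multiattach":
            found.append(elem.get("location"))
    return found


for location in multiattach_parents(SAMPLE_REGISTRY):
    print(location)
```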
17) Message boards : CMS Application : New Version 60.67 (Message 7843)
Posted 2 Nov 2022 by computezrmle
Post:
At the beginning of the VM's "MasterLog" there are lots of error messages like:
...Failed to decode JWT in keyfile...

The same messages are printed to the StartdLog.

Should be forwarded to the developers of the scientific app.

11/02/22 17:53:59 (pid:15609) ******************************************************
11/02/22 17:53:59 (pid:15609) ** condor_master (CONDOR_MASTER) STARTING UP
11/02/22 17:53:59 (pid:15609) ** /tmp/glide_4xHWR4/main/condor/usr/sbin/condor_master
11/02/22 17:53:59 (pid:15609) ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
11/02/22 17:53:59 (pid:15609) ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
11/02/22 17:53:59 (pid:15609) ** $CondorVersion: 9.0.11 Mar 12 2022 BuildID: 578027 PackageID: 9.0.11-1 $
11/02/22 17:53:59 (pid:15609) ** $CondorPlatform: x86_64_CentOS7 $
11/02/22 17:53:59 (pid:15609) ** PID = 15609
11/02/22 17:53:59 (pid:15609) ** Log last touched time unavailable (No such file or directory)
11/02/22 17:53:59 (pid:15609) ******************************************************
11/02/22 17:53:59 (pid:15609) Using config source: /tmp/glide_4xHWR4/condor_config
11/02/22 17:53:59 (pid:15609) config Macros = 326, Sorted = 326, StringBytes = 21599, TablesBytes = 11776
11/02/22 17:53:59 (pid:15609) CLASSAD_CACHING is OFF
11/02/22 17:53:59 (pid:15609) Daemon Log is logging: D_ALWAYS D_ERROR
11/02/22 17:53:59 (pid:15609) Daemoncore: Listening at <0.0.0.0:44699> on TCP (ReliSock).
11/02/22 17:53:59 (pid:15609) DaemonCore: command socket at <10.0.2.15:44699?addrs=10.0.2.15-44699&alias=408-3406-14250&noUDP>
11/02/22 17:53:59 (pid:15609) DaemonCore: private command socket at <10.0.2.15:44699?addrs=10.0.2.15-44699&alias=408-3406-14250>
11/02/22 17:54:00 (pid:15609) Failed to decode JWT in keyfile '/tmp/glide_4xHWR4/ticket/myproxy'; ignoring.
((repeats this message about 80 times))


11/02/22 17:54:00 (pid:15609) CCBListener: registered with CCB server vocms0840.cern.ch as ccbid 137.138.156.85:9618?addrs=[2001-1458-d00-14--b3]-9618+137.138.156.85-9618&alias=vocms0840.cern.ch#7702380
11/02/22 17:54:00 (pid:15609) Master restart (GRACEFUL) is watching /tmp/glide_4xHWR4/main/condor/sbin/condor_master (mtime:1667407914)
11/02/22 17:54:00 (pid:15609) Started DaemonCore process "/tmp/glide_4xHWR4/main/condor/sbin/condor_startd", pid and pgroup = 15612
11/02/22 17:54:00 (pid:15609) Daemons::StartAllDaemons all daemons were started
11/02/22 17:54:02 (pid:15609) Setting ready state 'Ready' for STARTD
11/02/22 17:54:05 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute GLIDEIN_Resource_Slots = Iotokens,80,,type=main.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:54:05 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_JOB_ATTRS =  x509userproxysubject x509UserProxyFQAN x509UserProxyVOName x509UserProxyEmail x509UserProxyExpiration,MemoryUsage,ResidentSetSize,ProportionalSetSizeKb.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:54:05 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:54:05 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute GLIDEIN_Resource_Slots = Iotokens,80,,type=main.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:54:05 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_JOB_ATTRS =  x509userproxysubject x509UserProxyFQAN x509UserProxyVOName x509UserProxyEmail x509UserProxyExpiration,MemoryUsage,ResidentSetSize,ProportionalSetSizeKb.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:54:05 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
((repeats this message about 80 times))


11/02/22 17:58:43 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute GLIDEIN_Resource_Slots = Iotokens,80,,type=main.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:58:43 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_JOB_ATTRS =  x509userproxysubject x509UserProxyFQAN x509UserProxyVOName x509UserProxyEmail x509UserProxyExpiration,MemoryUsage,ResidentSetSize,ProportionalSetSizeKb.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:58:43 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:58:43 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute GLIDEIN_Resource_Slots = Iotokens,80,,type=main.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:58:43 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_JOB_ATTRS =  x509userproxysubject x509UserProxyFQAN x509UserProxyVOName x509UserProxyEmail x509UserProxyExpiration,MemoryUsage,ResidentSetSize,ProportionalSetSizeKb.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:58:43 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
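
To get an overview of such a log without scrolling through the repeats, the messages can be counted after stripping the timestamp and pid prefix. This is a minimal Python sketch; the prefix pattern is derived from the lines shown above, the sample is a shortened copy of them, and the function name count_messages is made up.

```python
import re
from collections import Counter

# Matches a prefix like "11/02/22 17:54:00 (pid:15609) ".
PREFIX = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} \(pid:\d+\) ")

# Shortened sample taken from the log above (the JWT line repeats there).
SAMPLE_LOG = [
    "11/02/22 17:54:00 (pid:15609) Failed to decode JWT in keyfile '/tmp/glide_4xHWR4/ticket/myproxy'; ignoring.",
    "11/02/22 17:54:00 (pid:15609) Failed to decode JWT in keyfile '/tmp/glide_4xHWR4/ticket/myproxy'; ignoring.",
    "11/02/22 17:54:02 (pid:15609) Setting ready state 'Ready' for STARTD",
]


def count_messages(lines):
    """Count identical messages, ignoring the timestamp and pid prefix."""
    counts = Counter()
    for line in lines:
        message, replaced = PREFIX.subn("", line.rstrip("\n"))
        if replaced:  # skip lines that lack the expected prefix
            counts[message] += 1
    return counts


for message, number in count_messages(SAMPLE_LOG).most_common():
    print(number, message)
```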
18) Message boards : CMS Application : New Version 60.67 (Message 7842)
Posted 2 Nov 2022 by computezrmle
Post:
Got a task that started fine.

The good thing from stderr.txt:
It has no issues getting X509 credentials.
2022-11-02 17:51:33 (91056): Guest Log: [INFO] Reading volunteer information
2022-11-02 17:51:35 (91056): Guest Log: [INFO] Requesting an X509 credential from LHC@home
2022-11-02 17:51:36 (91056): Guest Log: [INFO] Requesting an X509 credential from vLHC@home-dev
2022-11-02 17:51:37 (91056): Guest Log: [INFO] CMS application starting. Check log files.


@Laurence
This might be helpful to solve the X509 issue on prod.
19) Message boards : CMS Application : New Version 60.67 (Message 7841)
Posted 2 Nov 2022 by computezrmle
Post:
OK.
This looks like freshly compiled executables from the known source code.
I agree, they should be tested since there might have been changes on GitHub affecting the compilers/development tools.
20) Message boards : CMS Application : New Version 60.67 (Message 7839)
Posted 2 Nov 2022 by computezrmle
Post:
That vboxwrapper reports its version as 26205 but BOINC does not yet offer it for download.
They offer vboxwrapper up to 26204.
https://boinc.berkeley.edu/dl/


ATLAS and CMS on prod already use a 26205 taken from a github artefact (compiled by the BOINC CI).
That artefact will become an official 26205 once the BOINC process owners put it on the download page.

Where is the version used for this CMS app taken from?



