81) Message boards : News : Server Release 1.4.0 (Message 7876)
Posted 10 Nov 2022 by computezrmle
Post:
Did you set this in your BOINC client's cc_config.xml?
<dont_check_file_sizes>1</dont_check_file_sizes>
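For reference, the option belongs in the <options> section of cc_config.xml in the BOINC data directory; a minimal example:

    <cc_config>
      <options>
        <dont_check_file_sizes>1</dont_check_file_sizes>
      </options>
    </cc_config>

The client reads it at startup or via "Options -> Read config files" in the BOINC Manager.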
82) Message boards : ATLAS Application : ATLAS vbox v.1.17 (Message 7868)
Posted 8 Nov 2022 by computezrmle
Post:
Theory and CMS are already updated.
ATLAS may follow next week, I guess.
83) Message boards : CMS Application : New Version 60.70 (Message 7866)
Posted 7 Nov 2022 by computezrmle
Post:
VirtualBox 7.x might work, but since it is rather new you may run into unexpected issues.

I'm already aware of a modification that affects the media manager.
So far it's not a show stopper here but it needs a closer look.
84) Message boards : CMS Application : New Version 60.70 (Message 7865)
Posted 7 Nov 2022 by computezrmle
Post:
Some kind of a "bootstrap preloader" that does this:
1. mount the shared folder
2. parse init_data.xml from there (it tells you whether you are in dev or prod)
3. modify the link to the main bootstrap script in /sbin according to (2)
4. mount grid.cern.ch (the link points to this repo; unlike ATLAS, which gets its boot script from atlas.cern.ch)
5. execute the main bootstrap script on CVMFS


This is very close to the ATLAS script.
I'll prepare a suggestion.
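A rough sketch of such a preloader; the shared folder name "shared", the mount point /root/shared, and the vm-${branch} repo layout (generalized from the vm-qa path in the current vdi) are all assumptions, not the final implementation:

    #!/bin/bash
    # bootstrap preloader - sketch only

    # (1) mount the shared folder exported by vboxwrapper
    mkdir -p /root/shared
    mount -t vboxsf shared /root/shared

    # (2) parse init_data.xml; the dev project URL contains 'lhcathomedev'
    if grep -q 'lhcathomedev' /root/shared/init_data.xml; then
        branch=qa
    else
        branch=prod
    fi

    # (3) point the link in /sbin to the matching main bootstrap script
    ln -sf "/cvmfs/grid.cern.ch/vc/vm-${branch}/sbin/bootstrap" /sbin/bootstrap-main

    # (4) the first access below triggers the autofs mount of grid.cern.ch
    # (5) execute the main bootstrap script on CVMFS
    exec /sbin/bootstrap-main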
85) Message boards : CMS Application : New Version 60.70 (Message 7863)
Posted 7 Nov 2022 by computezrmle
Post:
Your CVMFS requests from Theory tasks are also sent to *.openhtc.io.
Certainly to the same Cloudflare datacenter.
It could even be that CVMFS and Frontier requests are processed by the very same Cloudflare Squid instance there.

Those Squids get their data from backend systems at CERN, RAL, Fermilab ... and do an automatic fail-over.
It is very unlikely that all backend systems are down at the same moment,
especially since this would crash nearly all CMS tasks worldwide.


Please upgrade VirtualBox to the current v6.1 release.
The new vboxwrapper 26206 no longer uses the COM interface that was responsible for problems in the past.
86) Message boards : CMS Application : New Version 60.70 (Message 7860)
Posted 7 Nov 2022 by computezrmle
Post:
*.openhtc.io responds to both IPv4 and IPv6.

From your screenshot -> 188.114.96.1
At least this one should have worked.

The error is clearly reported by the Frontier client.
Was it a transient error (1 task only) or do all tasks report it?

Could you check whether Cloudflare is blocked in your firewall (maybe only for this box)?
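A quick check from that box could look like this (the Frontier port is an assumption; the exact URL is printed by the frontier client):

    # can the box reach Cloudflare at all?
    ping -c 3 188.114.96.1

    # does the Frontier endpoint resolve and respond?
    curl -v --max-time 10 http://cms-frontier.openhtc.io:8080/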
87) Message boards : CMS Application : New Version 60.70 (Message 7859)
Posted 7 Nov 2022 by computezrmle
Post:
@Laurence
The CMS vdi currently in use sets this link in /usr/sbin/bootstrap:
/cvmfs/grid.cern.ch/vc/vm-qa/sbin/bootstrap-idtoken

bootstrap-idtoken sets this variable:
branch=qa

If the intention was to implement a dev/prod switch, it will not work this way.


I recently implemented a switch for ATLAS that addresses the same objective:
https://github.com/davidgcameron/boinc-scripts/blob/master/vbox/ATLASbootstrap.sh#L41-L45

It could easily be rewritten and tested for CMS.
Just give me a "go".
88) Message boards : CMS Application : New Version 60.70 (Message 7854)
Posted 7 Nov 2022 by computezrmle
Post:
As for vboxwrapper 26206, this task is running fine:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3141537

I'm still missing the modifications to the vdi's boot partition that should make CVMFS fail-over and load balancing more robust.
Are there plans to implement them before this vdi is used on prod?
89) Message boards : CMS Application : New Version 60.67 (Message 7849)
Posted 3 Nov 2022 by computezrmle
Post:
After a project reset on my test client I ran a CMS v60.67 task (reduced to singlecore) last night.
The task finished successfully.

Today I got another task that runs as 2-core.
checked:
- pausing/resuming the task
- pausing -> BOINC shutdown -> waiting a while -> restarting BOINC -> resuming the task

So far the task passed all tests and runs fine.
It will take some hours until it is done.
90) Message boards : CMS Application : New Version 60.67 (Message 7848)
Posted 3 Nov 2022 by computezrmle
Post:
...I noticed that VBox 7 keeps the used hard disks in the VirtualBox.xml file even after a reboot with no BOINC VMs in use.

    <MediaRegistry>
      <HardDisks>
        <HardDisk uuid="{6f08958e-7bfd-4804-8dd7-c7b4408cb126}" location="D:/Boinc1/projects/lhcathome.cern.ch_lhcathome/ATLAS_vbox_2.02_image.vdi" format="VDI" type="MultiAttach"/>
        <HardDisk uuid="{997e0796-142b-4278-9763-0bceb3ac71bc}" location="D:/Boinc1/projects/lhcathomedev.cern.ch_lhcathome-dev/ATLAS_vbox_1.17_image.vdi" format="VDI" type="MultiAttach"/>
        <HardDisk uuid="{dae25e8f-de18-4971-b11c-eca764ede402}" location="D:/Boinc1/projects/lhcathome.cern.ch_lhcathome/CMS_2022_09_07_prod.vdi" format="VDI" type="MultiAttach"/>
        <HardDisk uuid="{8fb925ef-3497-4bfb-88e3-bbab2930787f}" location="D:/Boinc1/projects/lhcathomedev.cern.ch_lhcathome-dev/CMS_2022_09_07.vdi" format="VDI" type="MultiAttach"/>
      </HardDisks>
    </MediaRegistry>

Indeed, v7.x writes the multiattach parents to the global registry, while older versions write them to the description files of the VMs that use the parent.
This may allow future vboxwrapper versions to skip the workaround when they detect VirtualBox >= v7.

Since the workaround is only executed in case of an error, and we have done lots of tests to get it stable, I'd like to avoid any change for now.
But I'll keep that in mind.
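For manual inspection the global registry can be queried with VBoxManage; the UUID below is just the first example from the snippet above:

    # list the hard disks known to the global media registry
    VBoxManage list hdds

    # remove a stale entry from the registry without deleting the vdi file
    VBoxManage closemedium disk 6f08958e-7bfd-4804-8dd7-c7b4408cb126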
91) Message boards : CMS Application : New Version 60.67 (Message 7843)
Posted 2 Nov 2022 by computezrmle
Post:
At the beginning of the VM's "MasterLog" there are lots of error messages like:
...Failed to decode JWT in keyfile...

The same messages are printed to the StartdLog.

This should be forwarded to the developers of the scientific app.

11/02/22 17:53:59 (pid:15609) ******************************************************
11/02/22 17:53:59 (pid:15609) ** condor_master (CONDOR_MASTER) STARTING UP
11/02/22 17:53:59 (pid:15609) ** /tmp/glide_4xHWR4/main/condor/usr/sbin/condor_master
11/02/22 17:53:59 (pid:15609) ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
11/02/22 17:53:59 (pid:15609) ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
11/02/22 17:53:59 (pid:15609) ** $CondorVersion: 9.0.11 Mar 12 2022 BuildID: 578027 PackageID: 9.0.11-1 $
11/02/22 17:53:59 (pid:15609) ** $CondorPlatform: x86_64_CentOS7 $
11/02/22 17:53:59 (pid:15609) ** PID = 15609
11/02/22 17:53:59 (pid:15609) ** Log last touched time unavailable (No such file or directory)
11/02/22 17:53:59 (pid:15609) ******************************************************
11/02/22 17:53:59 (pid:15609) Using config source: /tmp/glide_4xHWR4/condor_config
11/02/22 17:53:59 (pid:15609) config Macros = 326, Sorted = 326, StringBytes = 21599, TablesBytes = 11776
11/02/22 17:53:59 (pid:15609) CLASSAD_CACHING is OFF
11/02/22 17:53:59 (pid:15609) Daemon Log is logging: D_ALWAYS D_ERROR
11/02/22 17:53:59 (pid:15609) Daemoncore: Listening at <0.0.0.0:44699> on TCP (ReliSock).
11/02/22 17:53:59 (pid:15609) DaemonCore: command socket at <10.0.2.15:44699?addrs=10.0.2.15-44699&alias=408-3406-14250&noUDP>
11/02/22 17:53:59 (pid:15609) DaemonCore: private command socket at <10.0.2.15:44699?addrs=10.0.2.15-44699&alias=408-3406-14250>
11/02/22 17:54:00 (pid:15609) Failed to decode JWT in keyfile '/tmp/glide_4xHWR4/ticket/myproxy'; ignoring.
((repeats this message about 80 times))


11/02/22 17:54:00 (pid:15609) CCBListener: registered with CCB server vocms0840.cern.ch as ccbid 137.138.156.85:9618?addrs=[2001-1458-d00-14--b3]-9618+137.138.156.85-9618&alias=vocms0840.cern.ch#7702380
11/02/22 17:54:00 (pid:15609) Master restart (GRACEFUL) is watching /tmp/glide_4xHWR4/main/condor/sbin/condor_master (mtime:1667407914)
11/02/22 17:54:00 (pid:15609) Started DaemonCore process "/tmp/glide_4xHWR4/main/condor/sbin/condor_startd", pid and pgroup = 15612
11/02/22 17:54:00 (pid:15609) Daemons::StartAllDaemons all daemons were started
11/02/22 17:54:02 (pid:15609) Setting ready state 'Ready' for STARTD
11/02/22 17:54:05 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute GLIDEIN_Resource_Slots = Iotokens,80,,type=main.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:54:05 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_JOB_ATTRS =  x509userproxysubject x509UserProxyFQAN x509UserProxyVOName x509UserProxyEmail x509UserProxyExpiration,MemoryUsage,ResidentSetSize,ProportionalSetSizeKb.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:54:05 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:54:05 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute GLIDEIN_Resource_Slots = Iotokens,80,,type=main.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:54:05 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_JOB_ATTRS =  x509userproxysubject x509UserProxyFQAN x509UserProxyVOName x509UserProxyEmail x509UserProxyExpiration,MemoryUsage,ResidentSetSize,ProportionalSetSizeKb.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:54:05 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
((repeats this message about 80 times))


11/02/22 17:58:43 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute GLIDEIN_Resource_Slots = Iotokens,80,,type=main.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:58:43 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_JOB_ATTRS =  x509userproxysubject x509UserProxyFQAN x509UserProxyVOName x509UserProxyEmail x509UserProxyExpiration,MemoryUsage,ResidentSetSize,ProportionalSetSizeKb.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:58:43 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:58:43 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute GLIDEIN_Resource_Slots = Iotokens,80,,type=main.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:58:43 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_JOB_ATTRS =  x509userproxysubject x509UserProxyFQAN x509UserProxyVOName x509UserProxyEmail x509UserProxyExpiration,MemoryUsage,ResidentSetSize,ProportionalSetSizeKb.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:58:43 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb.  The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
92) Message boards : CMS Application : New Version 60.67 (Message 7842)
Posted 2 Nov 2022 by computezrmle
Post:
Got a task that started fine.

The good thing from stderr.txt:
It has no issues getting X509 credentials.
2022-11-02 17:51:33 (91056): Guest Log: [INFO] Reading volunteer information
2022-11-02 17:51:35 (91056): Guest Log: [INFO] Requesting an X509 credential from LHC@home
2022-11-02 17:51:36 (91056): Guest Log: [INFO] Requesting an X509 credential from vLHC@home-dev
2022-11-02 17:51:37 (91056): Guest Log: [INFO] CMS application starting. Check log files.


@Laurence
This might be helpful to solve the X509 issue on prod.
93) Message boards : CMS Application : New Version 60.67 (Message 7841)
Posted 2 Nov 2022 by computezrmle
Post:
OK.
This looks like freshly compiled executables built from the known source code.
I agree they should be tested, since on GitHub there might have been changes affecting the compilers/development tools.
94) Message boards : CMS Application : New Version 60.67 (Message 7839)
Posted 2 Nov 2022 by computezrmle
Post:
That vboxwrapper reports its version as 26205, but BOINC does not yet offer it for download.
They offer vboxwrapper up to 26204.
https://boinc.berkeley.edu/dl/


ATLAS and CMS on prod already use a 26205 taken from a GitHub artefact (compiled by the BOINC CI).
That artefact will become an official 26205 once the BOINC process owners put it on the download page.

Where is the version used for this CMS app taken from?
95) Message boards : ATLAS Application : ATLAS vbox v.1.17 (Message 7830)
Posted 20 Oct 2022 by computezrmle
Post:
Not sure what problem should be solved.

1.
The starting order of the systemd services was undefined, hence random.
Now it is explicitly defined that an ATLAS job doesn't start before autofs (= the CVMFS mounts) and the shared folder mount are available (see the sketch below).

2.
Some quoting issues were fixed that affected the CVMFS fail-over.
old: fail-over didn't work when the backend behind s1cern-cvmfs.openhtc.io was down
new: CVMFS now correctly switches to a fail-over server and automatically back once the closest server is available again
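For illustration, the ordering from (1.) could be expressed with a systemd drop-in like the following; the unit names are assumptions, not necessarily the ones used in the vdi:

    # /etc/systemd/system/atlas-job.service.d/ordering.conf
    [Unit]
    # don't start the job before autofs (CVMFS) and the shared folder are up
    Requires=autofs.service shared.mount
    After=autofs.service shared.mount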
96) Message boards : ATLAS Application : ATLAS vbox v.1.17 (Message 7828)
Posted 19 Oct 2022 by computezrmle
Post:
Looks fine.
All changes seem to work as expected.
97) Message boards : CMS Application : New Version 60.66 (Message 7806)
Posted 16 Sep 2022 by computezrmle
Post:
Even with a local Squid there are a couple of Frontier fail-overs.
These are my numbers from the last week (requests / data transferred):

default server
cms-frontier.openhtc.io    2,877,375   10.35 GB

fail-overs
cms1-frontier.openhtc.io       4,064   18.12 MB
cms2-frontier.openhtc.io           7   33.13 KB
cms3-frontier.openhtc.io           6   31.03 KB
cms4-frontier.openhtc.io           6   30.91 KB


default server
atlascern-frontier.openhtc.io   1,688,154     9.07 GB

fail-overs
atlascern1-frontier.openhtc.io      3,643    20.55 MB
atlascern4-frontier.openhtc.io        498     2.72 MB
atlascern2-frontier.openhtc.io        285   751.86 KB
atlascern3-frontier.openhtc.io        196   520.31 KB
98) Message boards : CMS Application : New Version 60.66 (Message 7804)
Posted 16 Sep 2022 by computezrmle
Post:
The warnings are reported by the Frontier client inside the VM.
Since the jobs are running in the end, the fail-over obviously works.

The issue can be, e.g.:
- one (or more) Frontier servers temporarily not responding => the client contacts the next one
- a local network overload (mostly at the router): too many open connections

If it's the latter, it is caused by a high peak load from Frontier.
Frontier sends data in chunks (16 kB each IIRC) which concurrently opens lots of TCP connections.
This works fine until the router's resources are fully used.
New connections are not accepted until older (but idle) connections time out.

A local Squid solves this since
- in case of CMS, up to 98 % of the Frontier requests can be answered by Squid
- a local Squid usually doesn't use chunks
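For volunteers who want to try this, a minimal squid.conf sketch (the LAN range and cache sizes are assumptions; LHC@home's documentation describes a recommended setup in detail):

    # accept requests from BOINC clients on the LAN only
    http_port 3128
    acl localnet src 192.168.0.0/16
    http_access allow localnet
    http_access deny all

    # modest memory and disk caches
    cache_mem 256 MB
    cache_dir ufs /var/spool/squid 5000 16 256

Each BOINC client then needs this proxy set in its network preferences so the VMs can pick it up.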
99) Message boards : CMS Application : New Version 60.66 (Message 7793)
Posted 15 Sep 2022 by computezrmle
Post:
... And we run a different version of CMS here too.

Not really.
The vdi is different, but this is just the 'envelope'.
The CMS payload comes from the same backend systems (WMAgent/HTCondor).
If their queue runs dry, -dev and -prod are both affected.
100) Message boards : CMS Application : New Version 60.65 (Message 7777)
Posted 2 Sep 2022 by computezrmle
Post:
Would be better not to run CMS ATM since the subtask queue is empty.
Looks like somebody just refilled the queue.

