New Version 60.67
Author | Message |
---|---|
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Testing the upstream version of the vboxwrapper. |
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 0 |
That vboxwrapper reports its version as 26205, but BOINC does not yet offer it for download. The download page offers vboxwrapper only up to 26204: https://boinc.berkeley.edu/dl/ ATLAS and CMS on prod already use a 26205 taken from a GitHub artefact (compiled by the BOINC CI). That artefact will become an official 26205 once the BOINC process owners put it on the download page. Where is the version used for this CMS app taken from? |
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
This is pre-release testing.
- Linux: https://github.com/BOINC/boinc/releases/download/vboxwrapper%2F26205/vboxwrapper_26205_x86_64-pc-linux-gnu.zip
- Windows: https://github.com/BOINC/boinc/releases/download/vboxwrapper%2F26205/vboxwrapper_26205_windows_x86_64.exe.zip |
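To confirm which wrapper build a download link refers to, the version number can be read from the asset filename. This is only a minimal sketch: the `vboxwrapper_<version>_<platform>` pattern is inferred from the two links above and is an assumption, and `wrapper_version` is a hypothetical helper, not part of BOINC.

```python
import re
from urllib.parse import unquote, urlparse

# Assumed naming pattern, inferred only from the two release links in this thread.
ASSET_RE = re.compile(r"^vboxwrapper_(\d+)_(.+?)(?:\.exe)?\.zip$")


def wrapper_version(url):
    """Return (version, platform) parsed from a release asset URL, or None."""
    # Decode "%2F" and keep only the last path component (the file name).
    name = unquote(urlparse(url).path).rsplit("/", 1)[-1]
    match = ASSET_RE.match(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)
```

Comparing the returned number with the 26204 currently on the download page shows at a glance that a link points to a newer build.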
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 0 |
OK. These look like freshly compiled executables from the known source code. I agree, they should be tested, since there might have been changes on GitHub affecting the compilers/development tools. |
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 0 |
Got a task that started fine. The good thing from stderr.txt: It has no issues getting X509 credentials.
2022-11-02 17:51:33 (91056): Guest Log: [INFO] Reading volunteer information
2022-11-02 17:51:35 (91056): Guest Log: [INFO] Requesting an X509 credential from LHC@home
2022-11-02 17:51:36 (91056): Guest Log: [INFO] Requesting an X509 credential from vLHC@home-dev
2022-11-02 17:51:37 (91056): Guest Log: [INFO] CMS application starting. Check log files.
@Laurence This might be helpful to solve the X509 issue on prod. |
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 0 |
At the beginning of the VM's "MasterLog" there are lots of error messages like: ...Failed to decode JWT in keyfile... The same messages are printed to the StartdLog. This should be forwarded to the developers of the scientific app.
11/02/22 17:53:59 (pid:15609) ******************************************************
11/02/22 17:53:59 (pid:15609) ** condor_master (CONDOR_MASTER) STARTING UP
11/02/22 17:53:59 (pid:15609) ** /tmp/glide_4xHWR4/main/condor/usr/sbin/condor_master
11/02/22 17:53:59 (pid:15609) ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
11/02/22 17:53:59 (pid:15609) ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
11/02/22 17:53:59 (pid:15609) ** $CondorVersion: 9.0.11 Mar 12 2022 BuildID: 578027 PackageID: 9.0.11-1 $
11/02/22 17:53:59 (pid:15609) ** $CondorPlatform: x86_64_CentOS7 $
11/02/22 17:53:59 (pid:15609) ** PID = 15609
11/02/22 17:53:59 (pid:15609) ** Log last touched time unavailable (No such file or directory)
11/02/22 17:53:59 (pid:15609) ******************************************************
11/02/22 17:53:59 (pid:15609) Using config source: /tmp/glide_4xHWR4/condor_config
11/02/22 17:53:59 (pid:15609) config Macros = 326, Sorted = 326, StringBytes = 21599, TablesBytes = 11776
11/02/22 17:53:59 (pid:15609) CLASSAD_CACHING is OFF
11/02/22 17:53:59 (pid:15609) Daemon Log is logging: D_ALWAYS D_ERROR
11/02/22 17:53:59 (pid:15609) Daemoncore: Listening at <0.0.0.0:44699> on TCP (ReliSock).
11/02/22 17:53:59 (pid:15609) DaemonCore: command socket at <10.0.2.15:44699?addrs=10.0.2.15-44699&alias=408-3406-14250&noUDP>
11/02/22 17:53:59 (pid:15609) DaemonCore: private command socket at <10.0.2.15:44699?addrs=10.0.2.15-44699&alias=408-3406-14250>
11/02/22 17:54:00 (pid:15609) Failed to decode JWT in keyfile '/tmp/glide_4xHWR4/ticket/myproxy'; ignoring.
((repeats this message about 80 times))
11/02/22 17:54:00 (pid:15609) CCBListener: registered with CCB server vocms0840.cern.ch as ccbid 137.138.156.85:9618?addrs=[2001-1458-d00-14--b3]-9618+137.138.156.85-9618&alias=vocms0840.cern.ch#7702380
11/02/22 17:54:00 (pid:15609) Master restart (GRACEFUL) is watching /tmp/glide_4xHWR4/main/condor/sbin/condor_master (mtime:1667407914)
11/02/22 17:54:00 (pid:15609) Started DaemonCore process "/tmp/glide_4xHWR4/main/condor/sbin/condor_startd", pid and pgroup = 15612
11/02/22 17:54:00 (pid:15609) Daemons::StartAllDaemons all daemons were started
11/02/22 17:54:02 (pid:15609) Setting ready state 'Ready' for STARTD
11/02/22 17:54:05 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute GLIDEIN_Resource_Slots = Iotokens,80,,type=main. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:54:05 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_JOB_ATTRS = x509userproxysubject x509UserProxyFQAN x509UserProxyVOName x509UserProxyEmail x509UserProxyExpiration,MemoryUsage,ResidentSetSize,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:54:05 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:54:05 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute GLIDEIN_Resource_Slots = Iotokens,80,,type=main. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:54:05 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_JOB_ATTRS = x509userproxysubject x509UserProxyFQAN x509UserProxyVOName x509UserProxyEmail x509UserProxyExpiration,MemoryUsage,ResidentSetSize,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:54:05 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
((repeats this message about 80 times))
11/02/22 17:58:43 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute GLIDEIN_Resource_Slots = Iotokens,80,,type=main. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:58:43 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_JOB_ATTRS = x509userproxysubject x509UserProxyFQAN x509UserProxyVOName x509UserProxyEmail x509UserProxyExpiration,MemoryUsage,ResidentSetSize,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:58:43 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:58:43 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute GLIDEIN_Resource_Slots = Iotokens,80,,type=main. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:58:43 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_JOB_ATTRS = x509userproxysubject x509UserProxyFQAN x509UserProxyVOName x509UserProxyEmail x509UserProxyExpiration,MemoryUsage,ResidentSetSize,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad.
11/02/22 17:58:43 (pid:15609) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the MASTER ad. |
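When forwarding a log like this to the developers, a per-message count is easier to read than hundreds of repeated lines. The following is a rough sketch only: it assumes every line follows the `MM/DD/YY HH:MM:SS (pid:N) message` layout seen in the excerpt above, and `tally_messages` is a hypothetical helper, not an HTCondor tool.

```python
import re
from collections import Counter

# Assumed line layout, taken from the excerpt: "MM/DD/YY HH:MM:SS (pid:N) message".
LINE_RE = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} \(pid:\d+\) (.*)$")


def tally_messages(lines):
    """Count identical messages, ignoring the timestamp and pid prefix."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # lines that do not fit the assumed layout are skipped
            counts[match.group(1)] += 1
    return counts


def summarize(lines, width=80):
    """Return 'count  message' strings, most frequent message first."""
    return [f"{n:5d}  {msg[:width]}" for msg, n in tally_messages(lines).most_common()]
```

Calling `summarize` on the lines of a saved copy of the MasterLog would list each distinct message once with its count. Where that file lives inside the VM is not stated in this thread.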
Joined: 8 Apr 15 Posts: 781 Credit: 12,422,653 RAC: 3,337 |
Testing the upstream version of the vboxwrapper.
Thanks for the warning, Laurence, and good luck. Here are the 2 I have running so far: they began a few hours ago and have had no problems. I can update the other 2 in less than 6 hours from now, so I can leave these running without checking them while doing the update.
Begin processing the 7110th record. Run 1, Event 78597110, LumiSection 157195 on stream 0 at 02-Nov-2022 22:59:47.425 CET
Begin processing the 7111th record. Run 1, Event 78597111, LumiSection 157195 on stream 0 at 02-Nov-2022 22:59:49.140 CET
Begin processing the 7112th record. Run 1, Event 78597112, LumiSection 157195 on stream 0 at 02-Nov-2022 22:59:51.755 CET
Begin processing the 7113th record. Run 1, Event 78597113, LumiSection 157195 on stream 0 at 02-Nov-2022 22:59:53.421 CET
Begin processing the 7114th record. Run 1, Event 78597114, LumiSection 157195 on stream 0 at 02-Nov-2022 22:59:55.036 CET |
Joined: 8 Apr 15 Posts: 781 Credit: 12,422,653 RAC: 3,337 |
At the beginning of the VM's "MasterLog" there are lots of error messages like:
They always say that in the MasterLog. Remind you of anything? |
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 25 |
I tested 1 CMS task v60.67 and it returned fine. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3136136
Combination: BOINC 7.20.2 and VBox 7.0.2.
Somewhere in the middle of the task, I suspended it for a few minutes, during which the state was saved to disk. Towards the end I did one task suspend with "keep in memory" active.
Nothing to do with this vboxwrapper, but I noticed that VBox 7 keeps the used hard disks in the VirtualBox.xml file even after a reboot and with no BOINC VMs in use.
<MediaRegistry>
  <HardDisks>
    <HardDisk uuid="{6f08958e-7bfd-4804-8dd7-c7b4408cb126}" location="D:/Boinc1/projects/lhcathome.cern.ch_lhcathome/ATLAS_vbox_2.02_image.vdi" format="VDI" type="MultiAttach"/>
    <HardDisk uuid="{997e0796-142b-4278-9763-0bceb3ac71bc}" location="D:/Boinc1/projects/lhcathomedev.cern.ch_lhcathome-dev/ATLAS_vbox_1.17_image.vdi" format="VDI" type="MultiAttach"/>
    <HardDisk uuid="{dae25e8f-de18-4971-b11c-eca764ede402}" location="D:/Boinc1/projects/lhcathome.cern.ch_lhcathome/CMS_2022_09_07_prod.vdi" format="VDI" type="MultiAttach"/>
    <HardDisk uuid="{8fb925ef-3497-4bfb-88e3-bbab2930787f}" location="D:/Boinc1/projects/lhcathomedev.cern.ch_lhcathome-dev/CMS_2022_09_07.vdi" format="VDI" type="MultiAttach"/>
  </HardDisks> |
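To see which disks are still registered without reading the XML by hand, the `HardDisk` entries can be listed with a few lines of Python. This is a sketch under assumptions: it expects a complete, well-formed XML document (the excerpt above is only a fragment), it ignores any XML namespace on the tags, and `list_hard_disks` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def list_hard_disks(xml_text):
    """Return (uuid, location, type) for every HardDisk element in the text."""
    root = ET.fromstring(xml_text)
    disks = []
    for elem in root.iter():
        # Drop a namespace prefix of the form "{...}HardDisk", if one is present.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "HardDisk":
            disks.append((elem.get("uuid"), elem.get("location"), elem.get("type")))
    return disks
```

Passing the contents of the configuration file to `list_hard_disks` and filtering on `type == "MultiAttach"` would show the registered multiattach parents. Nested child disks, if any exist, are listed too because `iter()` walks the whole tree.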
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 0 |
...I noticed that Vbox7 keeps the used harddisks in the VirtualBox.xml file even after a reboot and no BOINC-VMs in use.
Indeed, v7.x writes the multiattach parents to the global store, while older versions write them to the description files of the VMs using the parent. This may make it possible to skip the workaround in future vboxwrapper versions when VirtualBox >= v7 is detected. Since the workaround is only executed in case of an error, and we have done lots of tests to get it stable, I'd like to avoid any change for now. But I'll keep that in mind. |
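The version gate mentioned above could look roughly like this. It is purely illustrative: vboxwrapper's real code is not shown in this thread, both function names are hypothetical, and the only fact taken from the post is that the behaviour changes with VirtualBox 7.

```python
def vbox_major(version):
    """Return the major number of a version string such as "7.0.2" or "v6.1.40"."""
    return int(version.strip().lstrip("vV").split(".", 1)[0])


def parents_in_global_store(version):
    """True if this VirtualBox version is >= 7, i.e. it is described above as
    writing multiattach parents to the global store."""
    return vbox_major(version) >= 7
```

With the combination reported earlier in the thread, `parents_in_global_store("7.0.2")` is true, while a 6.x version string yields false and would still need the existing workaround.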
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 0 |
After a project reset on my test client I ran a CMS v60.67 task (reduced to single core) last night. The task finished successfully. Today I got another task that runs as 2-core.
Checked:
- pausing/resuming the task
- pausing -> BOINC shutdown -> waiting a while -> restarting BOINC -> resuming the task
So far the task has passed all tests and runs fine. It will take some hours until it is done. |
©2024 CERN