Testing CentOS 7 vbox image
Author | Message |
---|---|
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 30 |
I suspended the first task with v0.83 several times (LAIM off). The task was processed sequentially by 4 different BOINC processes. The snapshot save times were 35s, 34s and 24s (the last one with fewer than 4 athena processes running). The unexpected duplicate lines in the result after a resume do not always appear. See my result. Windows RDP ALT-F2 (event processing) was OK. Top on ALT-F3 you have improved in the meantime. |
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
@maeax does this host run production ATLAS jobs ok?
"I guess this URL will change to an official host when you move it to the production environment:"
Yes, I put the script here for now so I can make quick changes. In production it will be taken from CVMFS, as is done currently in the production project.
"Port 25085 is closed at my main firewall."
This is due to the way the test jobs are defined; I will fix them so that, like the production jobs, they do not use this server.
"Be so kind as to post what method you have currently implemented and why."
You can find it in the above script - it's not pretty...
    # tty3: top
    cat > /home/atlas/top.sh << EOF
    while true; do sleep 5; top -b -n1 | head -36 >/dev/tty3 2>/dev/null; done
    EOF
    sudo sh /home/atlas/top.sh &
I remember this was the only way to get it to work after trying several different ideas, but I'm sure there is a better way that I was not able to find before. |
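For anyone who wants to experiment with the tty3 loop outside the VM, here is a minimal sketch of the same idea with the target, interval and number of refreshes as parameters. It is not the script used by the project; `snapshot_loop` and `SNAPSHOT_CMD` are hypothetical names introduced only for this illustration.

```shell
#!/bin/sh
# Hypothetical variant of the tty3 loop from the post above.
# snapshot_loop TARGET INTERVAL ROUNDS
#   TARGET   file or terminal to write to (e.g. /dev/tty3 inside the VM)
#   INTERVAL seconds to sleep between refreshes
#   ROUNDS   number of refreshes; 0 means loop forever
# SNAPSHOT_CMD (assumed name) overrides the command whose output is shown;
# it defaults to the batch-mode top call used in the original script.
snapshot_loop() {
    target=$1
    interval=$2
    rounds=$3
    count=0
    while [ "$rounds" -eq 0 ] || [ "$count" -lt "$rounds" ]; do
        # Keep only the first 36 lines, as the original script does.
        ${SNAPSHOT_CMD:-top -b -n1} 2>/dev/null | head -n 36 > "$target" 2>/dev/null
        count=$((count + 1))
        sleep "$interval"
    done
}

# Example (inside the VM, as root): snapshot_loop /dev/tty3 5 0 &
```

Making the command overridable means the loop can be tried against a plain file with any command, without needing a free virtual terminal.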
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 689 |
"@maeax does this host run production ATLAS jobs ok?" Yes, always two at a time with 6 CPUs. I will test it today when only -dev is running. |
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 30 |
Among all the successful tasks I have had one error so far. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2821975 The machine was running while I was asleep. |
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 689 |
"@maeax does this host run production ATLAS jobs ok?" I ran this task alone. Same "Hypervisor failed" error. These are the last lines of the BOINC vbox.log:
00:00:11.941361 VBVA: InfoScreen: [0] @0,0 800x600, line 0xc80, BPP 32, flags 0x1
00:00:11.941381 Display::handleDisplayResize: uScreenId=0 pvVRAM=000000000be90000 w=800 h=600 bpp=32 cbLine=0xC80 flags=0x1
00:00:13.522348 NAT: IPv6 not supported
00:00:13.649943 NAT: DHCP offered IP address 10.0.2.15
00:00:14.612321 VMMDev: Guest Log: Checking CVMFS...
00:00:21.520632 VMMDev: Guest Log: CVMFS is ok
00:00:21.682587 VMMDev: Guest Log: VBoxService 5.2.32 r132073 (verbosity: 0) linux.amd64 (Jul 12 2019 10:32:28) release log
00:00:21.682607 VMMDev: Guest Log: 00:00:00.000279 main Log opened 2019-09-13T10:09:20.377676000Z
00:00:21.682675 VMMDev: Guest Log: 00:00:00.000407 main OS Product: Linux
00:00:21.682719 VMMDev: Guest Log: 00:00:00.000451 main OS Release: 3.10.0-957.27.2.el7.x86_64
00:00:21.682751 VMMDev: Guest Log: 00:00:00.000487 main OS Version: #1 SMP Mon Jul 29 17:46:05 UTC 2019
00:00:21.682786 VMMDev: Guest Log: 00:00:00.000520 main Executable: /opt/VBoxGuestAdditions-5.2.32/sbin/VBoxService
00:00:21.682795 VMMDev: Guest Log: 00:00:00.000520 main Process ID: 1814
00:00:21.682801 VMMDev: Guest Log: 00:00:00.000521 main Package type: LINUX_64BITS_GENERIC
00:00:21.693430 VMMDev: Guest Log: 00:00:00.011151 main 5.2.32 r132073 started. Verbose level = 0
00:00:21.693948 VMMDev: Guest Log: Mounting shared directory
00:00:21.694902 Guest Control: GUEST_MSG_REPORT_FEATURES: 0x1, 0x8000000000000000
00:00:21.721161 VMMDev: Guest Log: 00:00:00.038848 automount vbsvcAutoMountWorker: Shared folder 'shared' was mounted to '/media/sf_shared'
00:00:21.808752 VMMDev: Guest Log: Copying input files
00:00:24.177400 VMMDev: Guest Log: Copied input files into RunAtlas.
00:00:24.841091 VMMDev: Guest Log: copied the webapp to /var/www
00:00:24.908933 VMMDev: Guest Log: This vm does not need to setup an http proxy
00:00:24.911599 VMMDev: Guest Log: ATHENA_PROC_NUMBER=2
00:00:25.084285 VMMDev: Guest Log: *** Starting ATLAS job. (PandaID=4002876565 taskID=000649-2) ***
00:00:31.702868 VMMDev: Guest Log: 00:00:10.020597 timesync vgsvcTimeSyncWorker: Radical guest time change: -7 189 420 333 000ns (GuestNow=1 568 362 170 970 325 000 ns GuestLast=1 568 369 360 390 658 000 ns fSetTimeLastLoop=true )
00:08:51.735051 VMMDev: SetVideoModeHint: Got a video mode hint (800x600x32)@(0x0),(1;0) at 0
00:09:01.665921 TM: Giving up catch-up attempt at a 60 000 001 604 ns lag; new total: 60 000 001 604 ns
00:10:08.666614 VBVA: InfoScreen: [0] @0,0 800x600, line 0xc80, BPP 0, flags 0x5
00:10:08.666646 Display::handleDisplayResize: uScreenId=0 pvVRAM=000000000be90000 w=800 h=600 bpp=0 cbLine=0xC80 flags=0x5
These are the ATLAS tasks from production for this host: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10548292 |
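As a side note on the "Radical guest time change" entry in the log above, the reported jump can be reproduced from the two timestamps in that line. The sketch below only redoes the arithmetic; `ns_to_hms` is a helper name made up for this example, and it assumes a shell with 64-bit arithmetic.

```shell
#!/bin/sh
# Values copied from the timesync log line, with the digit-group spaces removed.
guest_now=1568362170970325000    # GuestNow, in ns
guest_last=1568369360390658000   # GuestLast, in ns

# ns_to_hms (hypothetical helper): print a non-negative nanosecond count
# as whole hours, minutes and seconds.
ns_to_hms() {
    secs=$(( $1 / 1000000000 ))
    printf '%dh %dm %ds\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

delta_ns=$((guest_now - guest_last))
echo "delta: ${delta_ns} ns"        # matches the -7 189 420 333 000 ns in the log
ns_to_hms $((0 - delta_ns))         # size of the backwards step
```

The guest clock is stepped back by 1h 59m 49s, so almost exactly two hours. That would be consistent with a local-time versus UTC mismatch at VM start, but the log itself does not state the cause.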
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 30 |
I upgraded VirtualBox from 6.0.10 to 6.0.12 on my host. Tasks returned after 09:15 UTC were processed with this newest VBox version (Windows). On tty1 I got this, although 4 athena processes were running: |
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 689 |
In the VirtualBox preferences, Screen is set to automatic. Does the CentOS ATLAS image have no default screen size, maybe 640x480? |
Joined: 15 Apr 15 Posts: 38 Credit: 227,251 RAC: 0 |
I keep getting this failure message: 9/13/2019 10:51:20 AM | lhcathome-dev | Task 8ukMDmqecSvnShfckohDCDFpABFKDmABFKDm9l7ZDmABFKDmVXE4Wo_1 postponed for 86400 seconds: VM Hypervisor failed to enter an online state in a timely fashion. |
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 30 |
"I keep getting this failure message:" This often happens on busy systems. Maybe you have more projects running, possibly also using virtual machines. See my post: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4899&postid=37578 If you don't want to wait a day, you can restart BOINC. |
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 689 |
Crystal, I have vboxsvc.exe on lower priority, but it must be something else. I ran this CentOS ATLAS task alone and got the same message as rbpeake. I googled it: it could be a timer with a timeout of more than 60 seconds, or something else. |
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 689 |
I started -dev ATLAS on a second computer where production is running. Same message: VM Hypervisor failed... https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=3958 production: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10567798 |
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 30 |
The queue has dried up. I got one other error. According to the log, the HITS file was produced and copied to the shared folder. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2822135 The tty1 error I mentioned before has appeared only once. |
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
"The queue has dried."
Indeed it looks like the task was successful, but the results file could not be found. I saw a few failures like this for another volunteer (also with vbox 6) here who had also run many successful tasks. I will update my client back to version 6 and see if I also get the problem.
"The tty1-error, I mentioned before, has appeared only once."
I have seen this one myself too, but could not find out what the problem was. It didn't seem to affect the task. |
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 30 |
Tasks returned after 16 Sep 2019, 16:00 UTC were processed by a different vboxwrapper. The ATLAS default vboxwrapper (v26202) uses the VirtualBox COM interface. The wrapper I'm using at the moment uses the VirtualBox VboxManage interface. That wrapper was built some (longer) time ago by Laurence Field, if I remember correctly, and is known at LHC as vboxwrapper_26198ab7_windows_x86_64.exe. There are several differences, but I will mention a few:
- BOINC client detection is wrong (minor issue; the first line shows the right version).
- I no longer see the 'fake' lines in stderr suggesting that a task is started from the beginning after resuming a suspended VM (which corresponds to what is really happening).
- Before the suspend, the VM is first paused and then saved:
00:10:23.870304 Console: Machine state changed to 'Paused'
00:10:23.871832 Console: Machine state changed to 'Saving'
00:10:23.874439 Changing the VM state from 'SUSPENDED' to 'SAVING'
- At the end of the task there is a grace period of about 5 minutes before the task is really ended and cleaned up by this wrapper and BOINC (minor issue: long-running tasks lose some time).
Btw, independent of the wrapper used: I had another overly long saving period, which caused an aborted VM on a busy system. Using a 2-core VM instead of a 4-core one, the saving time is about 10 seconds shorter and the save set is about 30% smaller. |
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 689 |
Crystal, are you using only Win7 Pro? I have now tried a third computer, with Win10 Pro, and it also got "Hypervisor failed". Graphics and RDP in BOINC Manager are not shown during these 10 minutes. (Always 2 cores.) |
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 30 |
"Crystal, are you only Win7pro using?"
For VBox tasks, only Win7 64-bit Home edition. I had Win10 32-bit running for Theory, but I stopped using it because tasks suddenly failed after hundreds of successes.
"Have now the third Computer Win10pro and Hypervisor failed."
Is something wrong with VirtualBox on this machine https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2822072 or do you have Hyper-V running on your Win10 machines? Hyper-V and VirtualBox conflict with each other, so turn off Hyper-V. My RDP and Graphics via BOINC Manager are fine, although there is no access to the VM logs via Graphics, so Graphics is not very useful. |
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 689 |
Thanks Crystal, production is running well on all three PCs; see my earlier post. Maybe it is something with the wrapper you wrote about. I am waiting for the CentOS image with version 6.0.x from David. BTW: for a 2-CPU task, 4800 MB is shown in VirtualBox. I now have an app_config, with no better result. |
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 30 |
This task survived an overnight suspension.
2019-09-17 22:42:51 (6284): Stopping VM.
2019-09-18 07:08:58 (5272): Detected: vboxwrapper 26197
2019-09-18 07:08:58 (5272): Detected: BOINC client v7.7
2019-09-18 07:08:59 (5272): Detected: VirtualBox VboxManage Interface (Version: 6.0.12)
2019-09-18 07:08:59 (5272): Starting VM using VBoxManage interface. (boinc_ad3cabd059e69b2a, slot#0)
2019-09-18 07:09:32 (5272): Successfully started VM. (PID = '5476')
. . .
2019-09-18 07:54:20 (5272): Guest Log: -rw-------. 1 atlas atlas 9090837 Sep 18 05:51 /home/atlas/RunAtlas/HITS.000649-401055-14830._078090.pool.root.1
2019-09-18 07:54:20 (5272): Guest Log: Successfully finished the ATLAS job!
2019-09-18 07:54:20 (5272): Guest Log: Copying the results back to the shared directory! |
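To put numbers on the suspension in the log above, the gaps between the wrapper timestamps can be computed as follows. This is only a sketch: it assumes GNU `date` (for the `-d` option), and `elapsed_between` is a name invented for the example.

```shell
#!/bin/sh
# Assumes GNU date. Timestamps are copied from the wrapper log above.
# Both are parsed with -u; only their difference is used, so the real
# time zone of the log does not matter as long as it did not change.
elapsed_between() {
    start=$(date -u -d "$1" +%s)
    end=$(date -u -d "$2" +%s)
    secs=$((end - start))
    printf '%dh %dm %ds\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

elapsed_between '2019-09-17 22:42:51' '2019-09-18 07:08:58'   # "Stopping VM" to restart
elapsed_between '2019-09-18 07:08:58' '2019-09-18 07:54:20'   # restart to "Successfully finished"
```

With those inputs the VM was stopped for 8h 26m 7s, and the job finished 45m 22s after the wrapper restarted.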
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
"Tasks returned after 16 Sep 2019, 16:00 UTC are processed by a different vboxwrapper."
Are the problems you list in the 26202 wrapper or the LHC one? And do you suggest going back to the LHC one? On this web page it says that 26202 is the one that works with vbox 6: https://boinc.berkeley.edu/trac/wiki/VboxApps |
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 30 |
"Are the problems you list in the 26202 wrapper or the LHC one?"
ATLAS default vboxwrapper (v26202) using the VirtualBox COM interface:
- Rather often, inexplicable 'fake' lines in stderr suggesting that a task is started from the beginning after resuming a suspended VM (work actually continues from the saved state, as it should).
- Rarely, a result is not uploaded although the HITS file was produced (very annoying with 200-event tasks).
LHC vboxwrapper (v26198ab7) using the VboxManage interface:
- BOINC client detection is wrong (minor issue; the first line in the result shows the right BOINC version).
- At the end of the task there is a grace period of about 5 minutes before the task is really ended and cleaned up by this wrapper and BOINC (minor issue: long-running tasks lose some time).
"And do you suggest going back to the LHC one? On this web page it says that 26202 is the one that works with vbox 6: https://boinc.berkeley.edu/trac/wiki/VboxApps"
That's up to you. I'm testing with Windows 7 and VBox 6.0.12. I have no idea how Linux and Windows 10 will do. On LHC production, Theory and CMS have also been using vboxwrapper v26198ab7 for a long time, and ATLAS uses vboxwrapper v26196. As you can see in my results list, I have had no errors so far with the LHC wrapper. |
©2024 CERN