Message boards : ATLAS Application : Testing CentOS 7 vbox image
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1146
Credit: 750,252
RAC: 1,445
Message 6622 - Posted: 12 Sep 2019, 16:31:41 UTC
Last modified: 12 Sep 2019, 16:32:14 UTC

I suspended (LAIM off) the first task with v0.83 several times. The task was processed sequentially by 4 different BOINC processes.
The snapshot save times were 35s, 34s and 24s (the last one with fewer than 4 athena processes).
The unexpected duplicate lines in the result after a resume do not always appear. See my result.

Windows RDP ALT-F2 (event processing) was OK. The top display on ALT-F3 you have improved in the meantime.
ID: 6622 · Rating: 0
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 154
Credit: 1,332,099
RAC: 499
Message 6623 - Posted: 12 Sep 2019, 19:10:38 UTC

@maeax does this host run production ATLAS jobs ok?

I guess this URL will change to an official host when you move it to the production environment:


Yes, I put the script here for now so I can make quick changes. In production it will be taken from CVMFS as is done currently in the production project.

Port 25085 is closed at my main firewall.
This causes lots of failed requests like this:


This is due to the way the test jobs are defined. I will fix them so that, like the production jobs, they do not use this server.

Be so kind as to post what method you have currently implemented and why.


You can find it in the above script - it's not pretty...

# tty3: top
cat > /home/atlas/top.sh << EOF
while true; do sleep 5; top -b -n1 | head -36 >/dev/tty3 2>/dev/null; done
EOF
sudo sh /home/atlas/top.sh &


I remember this was the only way to get it to work after trying several different ideas, but I'm sure there is a better way that I was not able to find before.
ID: 6623 · Rating: 0
maeax

Joined: 22 Apr 16
Posts: 601
Credit: 1,368,400
RAC: 2,202
Message 6624 - Posted: 13 Sep 2019, 3:06:53 UTC - in response to Message 6623.  

@maeax does this host run production ATLAS jobs ok?

Yes, always two at a time with 6 CPUs.
Will test it today when only -dev is running.
ID: 6624 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1146
Credit: 750,252
RAC: 1,445
Message 6625 - Posted: 13 Sep 2019, 5:29:05 UTC
Last modified: 13 Sep 2019, 5:57:56 UTC

Among all the successful tasks I have had one error so far. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2821975
Machine running, me sleeping.
ID: 6625 · Rating: 0
maeax

Joined: 22 Apr 16
Posts: 601
Credit: 1,368,400
RAC: 2,202
Message 6626 - Posted: 13 Sep 2019, 8:28:02 UTC - in response to Message 6624.  
Last modified: 13 Sep 2019, 9:10:50 UTC

@maeax does this host run production ATLAS jobs ok?

Yes, always two at a time with 6 CPUs.
Will test it today when only -dev is running.

I have run this task on its own. Same "Hypervisor failed" error.
These are the last lines of the BOINC vbox.log:
00:00:11.941361 VBVA: InfoScreen: [0] @0,0 800x600, line 0xc80, BPP 32, flags 0x1
00:00:11.941381 Display::handleDisplayResize: uScreenId=0 pvVRAM=000000000be90000 w=800 h=600 bpp=32 cbLine=0xC80 flags=0x1
00:00:13.522348 NAT: IPv6 not supported
00:00:13.649943 NAT: DHCP offered IP address 10.0.2.15
00:00:14.612321 VMMDev: Guest Log: Checking CVMFS...
00:00:21.520632 VMMDev: Guest Log: CVMFS is ok
00:00:21.682587 VMMDev: Guest Log: VBoxService 5.2.32 r132073 (verbosity: 0) linux.amd64 (Jul 12 2019 10:32:28) release log
00:00:21.682607 VMMDev: Guest Log: 00:00:00.000279 main Log opened 2019-09-13T10:09:20.377676000Z
00:00:21.682675 VMMDev: Guest Log: 00:00:00.000407 main OS Product: Linux
00:00:21.682719 VMMDev: Guest Log: 00:00:00.000451 main OS Release: 3.10.0-957.27.2.el7.x86_64
00:00:21.682751 VMMDev: Guest Log: 00:00:00.000487 main OS Version: #1 SMP Mon Jul 29 17:46:05 UTC 2019
00:00:21.682786 VMMDev: Guest Log: 00:00:00.000520 main Executable: /opt/VBoxGuestAdditions-5.2.32/sbin/VBoxService
00:00:21.682795 VMMDev: Guest Log: 00:00:00.000520 main Process ID: 1814
00:00:21.682801 VMMDev: Guest Log: 00:00:00.000521 main Package type: LINUX_64BITS_GENERIC
00:00:21.693430 VMMDev: Guest Log: 00:00:00.011151 main 5.2.32 r132073 started. Verbose level = 0
00:00:21.693948 VMMDev: Guest Log: Mounting shared directory
00:00:21.694902 Guest Control: GUEST_MSG_REPORT_FEATURES: 0x1, 0x8000000000000000
00:00:21.721161 VMMDev: Guest Log: 00:00:00.038848 automount vbsvcAutoMountWorker: Shared folder 'shared' was mounted to '/media/sf_shared'
00:00:21.808752 VMMDev: Guest Log: Copying input files
00:00:24.177400 VMMDev: Guest Log: Copied input files into RunAtlas.
00:00:24.841091 VMMDev: Guest Log: copied the webapp to /var/www
00:00:24.908933 VMMDev: Guest Log: This vm does not need to setup an http proxy
00:00:24.911599 VMMDev: Guest Log: ATHENA_PROC_NUMBER=2
00:00:25.084285 VMMDev: Guest Log: *** Starting ATLAS job. (PandaID=4002876565 taskID=000649-2) ***
00:00:31.702868 VMMDev: Guest Log: 00:00:10.020597 timesync vgsvcTimeSyncWorker: Radical guest time change: -7 189 420 333 000ns (GuestNow=1 568 362 170 970 325 000 ns GuestLast=1 568 369 360 390 658 000 ns fSetTimeLastLoop=true )
00:08:51.735051 VMMDev: SetVideoModeHint: Got a video mode hint (800x600x32)@(0x0),(1;0) at 0
00:09:01.665921 TM: Giving up catch-up attempt at a 60 000 001 604 ns lag; new total: 60 000 001 604 ns
00:10:08.666614 VBVA: InfoScreen: [0] @0,0 800x600, line 0xc80, BPP 0, flags 0x5

00:10:08.666646 Display::handleDisplayResize: uScreenId=0 pvVRAM=000000000be90000 w=800 h=600 bpp=0 cbLine=0xC80 flags=0x5
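
As a rough cross-check of the "Radical guest time change" entry in the log above, the jump can be recomputed from the two timestamps it reports. This is only a sketch: `ns_delta_seconds` is a hypothetical helper name, and it assumes a shell whose integer arithmetic is 64-bit.

```shell
#!/bin/sh
# Hypothetical helper: difference between two nanosecond timestamps, in whole seconds.
# Assumes $(( )) arithmetic is 64-bit, which holds for common sh/bash builds.
ns_delta_seconds() {
    echo $(( ($1 - $2) / 1000000000 ))
}

# Values copied from the timesync line in the log (GuestNow, GuestLast).
guest_now=1568362170970325000
guest_last=1568369360390658000

delta=$(ns_delta_seconds "$guest_now" "$guest_last")
echo "guest clock jumped by ${delta} s (about $(( delta / 60 )) min)"
```

With the numbers from the log this comes out at roughly two hours backwards, which matches the -7 189 420 333 000 ns reported by VBoxService.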

These are the ATLAS tasks from production for this host:
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10548292
ID: 6626 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1146
Credit: 750,252
RAC: 1,445
Message 6627 - Posted: 13 Sep 2019, 9:17:40 UTC
Last modified: 13 Sep 2019, 9:28:49 UTC

I upgraded VirtualBox from 6.0.10 to 6.0.12 on my host.
Tasks returned after 09:15 UTC were processed with this newest VBox version (Windows).

On tty1 I got this, although 4 athena processes were running:
ID: 6627 · Rating: 0
maeax

Joined: 22 Apr 16
Posts: 601
Credit: 1,368,400
RAC: 2,202
Message 6628 - Posted: 13 Sep 2019, 9:36:14 UTC

In the VirtualBox preferences the screen is set to automatic.
Does the CentOS ATLAS image have no default screen size, maybe 640x480?
ID: 6628 · Rating: 0
rbpeake

Joined: 15 Apr 15
Posts: 37
Credit: 226,771
RAC: 20
Message 6629 - Posted: 13 Sep 2019, 14:55:23 UTC

I keep getting this failure message:

9/13/2019 10:51:20 AM | lhcathome-dev | Task 8ukMDmqecSvnShfckohDCDFpABFKDmABFKDm9l7ZDmABFKDmVXE4Wo_1 postponed for 86400 seconds: VM Hypervisor failed to enter an online state in a timely fashion.
ID: 6629 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1146
Credit: 750,252
RAC: 1,445
Message 6630 - Posted: 13 Sep 2019, 17:34:02 UTC - in response to Message 6629.  

I keep getting this failure message:

9/13/2019 10:51:20 AM | lhcathome-dev | Task 8ukMDmqecSvnShfckohDCDFpABFKDmABFKDm9l7ZDmABFKDmVXE4Wo_1 postponed for 86400 seconds: VM Hypervisor failed to enter an online state in a timely fashion.

This often happens on busy systems. Maybe you have more projects running, possibly also using virtual machines.
See my post: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4899&postid=37578
If you don't want to wait a day, you can restart BOINC.
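
To see at a glance which tasks BOINC has postponed and for how long, lines like the one quoted above can be filtered out of the event log. The following is a minimal sketch; `postponed_tasks` is a made-up helper name, and it assumes the log text arrives on standard input in the format of the quoted message.

```shell
#!/bin/sh
# Hypothetical helper: read BOINC event log text on stdin and print
# "<task> postponed for <hours> h" for every postponed task.
postponed_tasks() {
    sed -n 's/.*Task \([^ ]*\) postponed for \([0-9][0-9]*\) seconds.*/\1 \2/p' |
    while read -r task secs; do
        echo "$task postponed for $(( secs / 3600 )) h"
    done
}

# Example using the line quoted in this thread.
postponed_tasks <<'EOF'
9/13/2019 10:51:20 AM | lhcathome-dev | Task 8ukMDmqecSvnShfckohDCDFpABFKDmABFKDm9l7ZDmABFKDmVXE4Wo_1 postponed for 86400 seconds: VM Hypervisor failed to enter an online state in a timely fashion.
EOF
```

The 86400 seconds in the message are the full day of waiting mentioned above.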
ID: 6630 · Rating: 0
maeax

Joined: 22 Apr 16
Posts: 601
Credit: 1,368,400
RAC: 2,202
Message 6631 - Posted: 13 Sep 2019, 17:40:36 UTC

Crystal,
I have vboxsvc.exe at a lower priority, but it must be something else.
I ran this CentOS ATLAS task alone and got the same message as rbpeake.
I have googled it. It could be a timer with a timeout of more than 60 seconds, or something else.
ID: 6631 · Rating: 0
maeax

Joined: 22 Apr 16
Posts: 601
Credit: 1,368,400
RAC: 2,202
Message 6632 - Posted: 14 Sep 2019, 4:36:34 UTC - in response to Message 6631.  

I have started -dev ATLAS on a second computer, where production is running.
Same message: VM Hypervisor failed...
https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=3958
production: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10567798
ID: 6632 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1146
Credit: 750,252
RAC: 1,445
Message 6633 - Posted: 14 Sep 2019, 5:57:11 UTC

The queue has dried up.

I got one other error. According to the log, the HITS file was produced and copied to the shared folder.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2822135

The tty1 error I mentioned before has appeared only once.
ID: 6633 · Rating: 0
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 154
Credit: 1,332,099
RAC: 499
Message 6634 - Posted: 16 Sep 2019, 10:03:21 UTC - in response to Message 6633.  

The queue has dried up.

I got one other error. According to the log, the HITS file was produced and copied to the shared folder.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2822135


Indeed it looks like the task was successful, but the results file could not be found. I saw a few failures like this for another volunteer (also with vbox 6) here who had also run many successful tasks. I will update my client back to version 6 and see if I also get the problem.

The tty1 error I mentioned before has appeared only once.


I have seen this one myself too, but could not find out what the problem was. It didn't seem to affect the task.
ID: 6634 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1146
Credit: 750,252
RAC: 1,445
Message 6636 - Posted: 17 Sep 2019, 9:33:36 UTC

Tasks returned after 16 Sep 2019, 16:00 UTC are processed by a different vboxwrapper.
ATLAS default vboxwrapper (v26202) uses VirtualBox COM Interface. The wrapper I'm using at the moment uses VirtualBox VboxManage Interface.
That wrapper was built quite some time ago by Laurence Field, if I remember correctly, and is known at LHC as vboxwrapper_26198ab7_windows_x86_64.exe.
There are several differences; I will mention a few.

- BOINC client detection is wrong (minor issue, the first line shows the right version)
- I no longer see the 'fake' lines in stderr suggesting that a task is started from the beginning after resuming a suspended VM (this corresponds to what is really happening).
- Before the suspend, the VM is first paused and then saved.
 00:10:23.870304 Console: Machine state changed to 'Paused'
 00:10:23.871832 Console: Machine state changed to 'Saving'
 00:10:23.874439 Changing the VM state from 'SUSPENDED' to 'SAVING'
- At the end of the task there is a grace period of about 5 minutes before the task is really ended and cleaned up by this wrapper and BOINC (minor issue for long-running tasks, which lose some time).

Btw, independent of the wrapper used:
I had another overly long saving period on a busy system, which caused the VM to be aborted.
Using a 2-core VM instead of a 4-core one, the saving time is about 10 seconds shorter and the save set is about 30% smaller.
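
The pause-then-save sequence can be pulled out of a VBox.log excerpt to check the order and timing of the state changes. This is only a sketch: `vm_state_changes` is a hypothetical helper, and it assumes the "Console: Machine state changed to" wording shown in the log lines above.

```shell
#!/bin/sh
# Hypothetical helper: read VBox.log text on stdin and print
# "<timestamp> <state>" for each console state change.
vm_state_changes() {
    sed -n "s/^ *\([0-9:.]*\) Console: Machine state changed to '\([^']*\)'.*/\1 \2/p"
}

# Example using the three log lines quoted in this post.
vm_state_changes <<'EOF'
 00:10:23.870304 Console: Machine state changed to 'Paused'
 00:10:23.871832 Console: Machine state changed to 'Saving'
 00:10:23.874439 Changing the VM state from 'SUSPENDED' to 'SAVING'
EOF
```

Only the two "Console:" lines match; the internal "Changing the VM state" line is skipped on purpose, so each console transition shows up once.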
ID: 6636 · Rating: 0
maeax

Joined: 22 Apr 16
Posts: 601
Credit: 1,368,400
RAC: 2,202
Message 6637 - Posted: 17 Sep 2019, 11:59:15 UTC

Crystal, are you only using Win7 Pro?
I now have a third computer, with Win10 Pro, and the hypervisor failed there too.
The graphics and RDP buttons in BOINC Manager are not shown during these 10 minutes. (Always 2 cores.)
ID: 6637 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1146
Credit: 750,252
RAC: 1,445
Message 6638 - Posted: 17 Sep 2019, 13:00:25 UTC - in response to Message 6637.  

Crystal, are you only using Win7 Pro?
For VBox tasks, only Win7 64-bit Home edition.
I had Win10 32-bit running for Theory, but I stopped using it because tasks suddenly failed after hundreds of successes.

I now have a third computer, with Win10 Pro, and the hypervisor failed there too.
The graphics and RDP buttons in BOINC Manager are not shown during these 10 minutes. (Always 2 cores.)
Is something wrong with VirtualBox on this machine (https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2822072), or do you have Hyper-V running on your Win10 machines? Hyper-V and VirtualBox conflict with each other, so turn off Hyper-V.
My RDP and Graphics via BOINC Manager are fine, although there is no access to the VM logs via Graphics, so Graphics is not very useful.
ID: 6638 · Rating: 0
maeax

Joined: 22 Apr 16
Posts: 601
Credit: 1,368,400
RAC: 2,202
Message 6639 - Posted: 17 Sep 2019, 13:46:15 UTC
Last modified: 17 Sep 2019, 13:49:04 UTC

Thanks Crystal,
production is running well on all three PCs, see my earlier post.
Maybe it has something to do with the wrapper you wrote about.
I am waiting for the CentOS image with version 6.0.x from David.
BTW: for a 2-CPU task, 4800 MB is shown in VirtualBox.
I now have an app_config, with no better result.
ID: 6639 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1146
Credit: 750,252
RAC: 1,445
Message 6640 - Posted: 18 Sep 2019, 7:21:18 UTC

This task survived an overnight suspension.
2019-09-17 22:42:51 (6284): Stopping VM.
2019-09-18 07:08:58 (5272): Detected: vboxwrapper 26197
2019-09-18 07:08:58 (5272): Detected: BOINC client v7.7
2019-09-18 07:08:59 (5272): Detected: VirtualBox VboxManage Interface (Version: 6.0.12)
2019-09-18 07:08:59 (5272): Starting VM using VBoxManage interface. (boinc_ad3cabd059e69b2a, slot#0)
2019-09-18 07:09:32 (5272): Successfully started VM. (PID = '5476')
.
.
.
2019-09-18 07:54:20 (5272): Guest Log: -rw-------. 1 atlas atlas 9090837 Sep 18 05:51 /home/atlas/RunAtlas/HITS.000649-401055-14830._078090.pool.root.1

2019-09-18 07:54:20 (5272): Guest Log: Successfully finished the ATLAS job!

2019-09-18 07:54:20 (5272): Guest Log: Copying the results back to the shared directory!
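
How long the VM stayed suspended can be worked out from the two wrapper timestamps above. A small sketch, assuming GNU `date` (for the `-d` option) and that both timestamps are in the same time zone; `seconds_between` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: seconds between two "YYYY-MM-DD HH:MM:SS" timestamps.
# Assumes GNU date. Both values are read as UTC (-u); only the difference is used.
seconds_between() {
    start=$(date -u -d "$1" +%s) || return 1
    end=$(date -u -d "$2" +%s) || return 1
    echo $(( end - start ))
}

# "Stopping VM." and the first "Detected:" line from the log above.
gap=$(seconds_between "2019-09-17 22:42:51" "2019-09-18 07:08:58")
echo "suspended for $(( gap / 3600 )) h $(( gap % 3600 / 60 )) min"
```

For this task the gap is a little under eight and a half hours, after which the job still finished and copied its results back.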
ID: 6640 · Rating: 0
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 154
Credit: 1,332,099
RAC: 499
Message 6641 - Posted: 18 Sep 2019, 8:09:40 UTC - in response to Message 6636.  

Tasks returned after 16 Sep 2019, 16:00 UTC are processed by a different vboxwrapper.
ATLAS default vboxwrapper (v26202) uses VirtualBox COM Interface. The wrapper I'm using at the moment uses VirtualBox VboxManage Interface.


Are the problems you list in the 26202 wrapper or the LHC one? And do you suggest going back to the LHC one? On this web page it says that 26202 is the one that works with vbox 6: https://boinc.berkeley.edu/trac/wiki/VboxApps
ID: 6641 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1146
Credit: 750,252
RAC: 1,445
Message 6642 - Posted: 18 Sep 2019, 9:07:20 UTC - in response to Message 6641.  
Last modified: 18 Sep 2019, 9:24:00 UTC

Are the problems you list in the 26202 wrapper or the LHC one?
ATLAS default vboxwrapper (v26202) using VirtualBox COM Interface:

- Rather often there are inexplicable 'fake' lines in stderr, suggesting that a task is started from the beginning after resuming a suspended VM (the work actually continues from the saved state, as it should).
- Occasionally a result is not uploaded although the HITS file was produced (very annoying with 200-event tasks).

LHC vboxwrapper (v26198ab7) using VboxManage Interface:

- BOINC client detection is wrong (minor issue, first line in result shows the right BOINC-version)
- At the end of the task there is a grace period of about 5 minutes before the task is really ended and cleaned up by this wrapper and BOINC (minor issue for long-running tasks, which lose some time).

And do you suggest going back to the LHC one? On this web page it says that 26202 is the one that works with vbox 6: https://boinc.berkeley.edu/trac/wiki/VboxApps
That's up to you. I'm testing with Windows 7 and VBox 6.0.12. No idea how Linux and Windows 10 will do.
On LHC production, Theory and CMS have also been using vboxwrapper v26198ab7 for a long time, and ATLAS uses vboxwrapper v26196.
As you can see in my results list, I have had no errors so far with the LHC wrapper.
ID: 6642 · Rating: 0


©2022 CERN