Message boards : ATLAS Application : Testing CentOS 7 vbox image

maeax

Send message
Joined: 22 Apr 16
Posts: 454
Credit: 1,270,941
RAC: 0
Message 6596 - Posted: 4 Sep 2019, 14:41:12 UTC

I have to wait with the CentOS 7 -dev task until the two ATLAS production tasks, each using 6 CPUs, have finished.
Then I can save BOINC and hopefully start the CentOS 7 task again. I will not test a suspend because of your experience, Crystal.
ID: 6596 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
maeax

Send message
Joined: 22 Apr 16
Posts: 454
Credit: 1,270,941
RAC: 0
Message 6597 - Posted: 4 Sep 2019, 17:45:00 UTC - in response to Message 6594.  

Edit: RAM 4,800 MB. The RDP console is not usable in BOINC.
VM Hypervisor failed to enter an online state in a timely fashion.

I have installed an app_config. I restarted four times, and the hypervisor failed every time.
ID: 6597 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
maeax

Send message
Joined: 22 Apr 16
Posts: 454
Credit: 1,270,941
RAC: 0
Message 6598 - Posted: 5 Sep 2019, 15:36:06 UTC - in response to Message 6597.  

Is this VirtualBox error code the cause when the hypervisor fails?
VERR_SVM_IN_USE (For AMD).
ID: 6598 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
computezrmle
Avatar

Send message
Joined: 28 Jul 16
Posts: 287
Credit: 353,387
RAC: 0
Message 6599 - Posted: 5 Sep 2019, 18:30:27 UTC

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2813576
This ATLAS VM started until the login screen appeared on all consoles.
Then it remained idle.
ID: 6599 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
computezrmle
Avatar

Send message
Joined: 28 Jul 16
Posts: 287
Credit: 353,387
RAC: 0
Message 6601 - Posted: 6 Sep 2019, 16:03:35 UTC - in response to Message 6590.  

Compacting the ATLAS vdi file

David Cameron wrote:
I've released version 0.81 which is a little smaller (3.4GB). The problem is that VirtualBox has a feature where if you write to disk it doesn't actually free space when files are deleted. So while "df" inside the VM reports 2.2GB used, the vdi is still 3.4GB, even after compacting.


To check the partition layout of the vdi file I attached it as /dev/sdb to a self made VM and found the following partitions:

/dev/sdb1: xfs, 1.00 GiB
/dev/sdb2: lvm2 pv (centos), 18.05 GiB
/dev/sdb3: extended
/dev/sdb5: linux-swap, 972.00 MiB

The reason why this vdi can't be compacted might be that /dev/sdb2 is part of a logical volume.
Hence I tried to reformat the partition with ext4.


Step 1:
Clone ATLAS_vbox_0.81_image.vdi as ATLAS_vbox_0.81_image2.vdi and attach the clone as /dev/sdc to the VM.

Step 2:
Reformat /dev/sdc2 with ext4 and reboot the VM.

Step 3:
Copy all files from /dev/sdb2 to /dev/sdc2 (rsync).

Step 4:
Zero the unused space on /dev/sdc1 and /dev/sdc2 (required for "VBoxManage ... --compact"!)
To do this use temporary mountpoints /sdc/sdc1 and /sdc/sdc2
Then run:
cat /dev/zero >/sdc/sdc1/tmpzero
rm /sdc/sdc1/tmpzero
cat /dev/zero >/sdc/sdc2/tmpzero
rm /sdc/sdc2/tmpzero

Step 5:
Compact ATLAS_vbox_0.81_image2.vdi
VBoxManage modifyhd ATLAS_vbox_0.81_image2.vdi --compact
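
Steps 4 and 5 could be wrapped into two small shell functions, roughly as sketched below. The function names and the optional size cap (second argument, in MiB) are assumptions added for illustration and are not part of the procedure above. The cap only exists so the function can be tried without filling a disk. The zeroing runs inside the helper VM on the temporary mountpoints, while the compact call runs on the host.

```shell
#!/bin/sh
# Zero the free space below a mountpoint so that the zeroed blocks can be
# discarded later by "VBoxManage ... --compact".
# $1 = mountpoint, $2 = optional cap in MiB (hypothetical testing aid).
zero_free_space() {
    dir=$1
    max_mib=${2:-}
    if [ -n "$max_mib" ]; then
        dd if=/dev/zero of="$dir/tmpzero" bs=1M count="$max_mib" 2>/dev/null || true
    else
        # Without a cap, dd writes until the filesystem is full and then
        # exits with an error, which is expected here.
        dd if=/dev/zero of="$dir/tmpzero" bs=1M 2>/dev/null || true
    fi
    sync
    rm -f "$dir/tmpzero"
}

# Compact the image on the host once the VM using it has been shut down.
compact_vdi() {
    VBoxManage modifyhd "$1" --compact
}

# Usage, following the steps above:
#   zero_free_space /sdc/sdc1
#   zero_free_space /sdc/sdc2
#   compact_vdi ATLAS_vbox_0.81_image2.vdi
```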


Results:
Original size: 3.1 GiB
Compacted (including the CVMFS cache): 2.3 GiB
Compacted (CVMFS cache removed): 897 MiB
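
To put the results in relative terms, a small helper like the one below can be used. It is only a sketch; the function name is made up for illustration, and 3.1 GiB and 2.3 GiB are converted to MiB (3174.4 and 2355.2) so that both arguments share a unit. With the numbers above it reports savings of roughly 26 % with the CVMFS cache and roughly 72 % without it.

```shell
#!/bin/sh
# Percentage saved when an image shrinks from $1 to $2.
# Both sizes must be given in the same unit.
pct_saved() {
    LC_ALL=C awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", 100 * (1 - after / before) }'
}

# Usage with the sizes from the results above (in MiB):
#   pct_saved 3174.4 2355.2   # including the CVMFS cache
#   pct_saved 3174.4 897      # CVMFS cache removed
```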


Conclusion

A vdi partition formatted as part of an lvm volume seems to be overkill for a small standalone VM, especially as the vdi can't be compacted any more. Instead a reliable filesystem like ext4 (or XFS) should be used as default.

The method above is not intended to be used by the volunteers but (starting with step 4) should be used by the vdi maintainers before the vdi is released for production.
This would result in faster downloads as well as in a faster VM setup on the client computer since the vdi has to be copied from the project folder to the slots folder every time a new task starts.
ID: 6601 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
maeax

Send message
Joined: 22 Apr 16
Posts: 454
Credit: 1,270,941
RAC: 0
Message 6602 - Posted: 8 Sep 2019, 8:11:40 UTC - in response to Message 6598.  

Is this VirtualBox error code the cause when the hypervisor fails?
VERR_SVM_IN_USE (For AMD).

The deadline was reached yesterday, so I canceled it.
The hypervisor failed every time:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2812272
ID: 6602 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Send message
Joined: 20 Apr 16
Posts: 116
Credit: 1,188,675
RAC: 117
Message 6603 - Posted: 10 Sep 2019, 10:51:00 UTC - in response to Message 6601.  

The method above is not intended to be used by the volunteers but (starting with step 4) should be used by the vdi maintainers before the vdi is released for production.
This would result in faster downloads as well as in a faster VM setup on the client computer since the vdi has to be copied from the project folder to the slots folder every time a new task starts.


I always do these last steps just before releasing a new vdi, but it could be as you say that the logical volume doesn't allow compacting. I will try to create a new VM with ext4 formatting to see if it helps. Thanks for the tips.

In the meantime I tried creating the image with Vbox version 6, and released 0.82 which has a 2.8GB image. This one should also fix the strange error you mentioned above where the CVMFS check failed and the WU got stuck doing nothing.
ID: 6603 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Send message
Joined: 20 Apr 16
Posts: 116
Credit: 1,188,675
RAC: 117
Message 6604 - Posted: 10 Sep 2019, 14:54:12 UTC - in response to Message 6603.  

I made a new image with an xfs file system for version 0.83. It seems it saves a little space but not that much (~100MB).
ID: 6604 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1022
Credit: 621,104
RAC: 39
Message 6605 - Posted: 10 Sep 2019, 19:27:20 UTC - in response to Message 6604.  

David, is it possible to add the size of the HITS-file when moving to the shared folder?
ID: 6605 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Send message
Joined: 20 Apr 16
Posts: 116
Credit: 1,188,675
RAC: 117
Message 6608 - Posted: 11 Sep 2019, 9:26:15 UTC - in response to Message 6605.  

David, is it possible to add the size of the HITS-file when moving to the shared folder?


You have the size in the log messages:

2019-09-10 17:12:54 (166330): Guest Log: Looking for outputfile HITS.000649-2086812-31151._078090.pool.root.1
2019-09-10 17:12:54 (166330): Guest Log: HITS file was successfully produced
2019-09-10 17:12:54 (166330): Guest Log: -rw-------. 1 atlas atlas 9091177 Sep 10 15:12 /home/atlas/RunAtlas/HITS.000649-2086812-31151._078090.pool.root.1
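
For anyone who wants to pull that number out of a saved stderr log automatically, a sketch along these lines should work. It assumes the "ls -l" style line shown above, where the size is the fifth field counted from the end of the line. The function name is hypothetical.

```shell
#!/bin/sh
# Print the size in bytes of every HITS file listed in a BOINC stderr log
# read from standard input. Only the "ls -l" style line is matched, not
# the "Looking for outputfile" or "successfully produced" lines.
hits_size() {
    awk '/Guest Log: -/ && /HITS\./ { print $(NF-4) }'
}

# Usage:
#   hits_size < stderr.txt
```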


"9091177" is the size of the HITS file. These test WU only process 10 events so it's much smaller than the real WU results.
ID: 6608 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1022
Credit: 621,104
RAC: 39
Message 6610 - Posted: 11 Sep 2019, 11:24:10 UTC - in response to Message 6608.  

Sorry David, I must have mixed up the results of vbox versus native version.
Thus wrong thread. I don't see the size of the HITS-file in the native version.
One of your results: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2819384

Your test workunits for v0.82 and v0.83 were quickly distributed, I suppose.
ID: 6610 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
maeax

Send message
Joined: 22 Apr 16
Posts: 454
Credit: 1,270,941
RAC: 0
Message 6613 - Posted: 12 Sep 2019, 8:02:44 UTC - in response to Message 6602.  
Last modified: 12 Sep 2019, 8:21:03 UTC

The hypervisor failed every time:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2821665

David,
is it possible to show a notice when the task starts that only VirtualBox 6.0.x should be used?
With 5.2 the hypervisor fails after 10 minutes, and you need to delete the BOINC VM in VirtualBox manually.
ID: 6613 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
computezrmle
Avatar

Send message
Joined: 28 Jul 16
Posts: 287
Credit: 353,387
RAC: 0
Message 6614 - Posted: 12 Sep 2019, 9:47:16 UTC

I manually downloaded ATLAS_vbox_0.83_image.vdi and as the dev server has no tasks available I used the vdi to replace the original ATLAS-vdi from a test client that is attached to the production server.

That way the test client got a task:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=245678240

The task is processing hits but as it is a 1-core setup it will take a while until it is finished.

The stderr.txt already shows some lines that should be investigated:
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=81
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=81
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=82
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=82
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=83
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=83
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=84
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=84
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=85
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=85
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=86
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=86
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=87
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=87
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=88
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=88
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=89
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=89
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8a
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8a
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8b
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8b
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8c
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8c
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8d
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8d
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8e
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8e
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8f
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8f


2019-09-12 10:51:03 (15611): Guest Log: 10:51:00.680177 main     Executable: /opt/VBoxGuestAdditions-6.0.12/sbin/VBoxService
2019-09-12 10:51:03 (15611): Guest Log: 10:51:00.680183 main     Process ID: 2151
2019-09-12 10:51:03 (15611): Guest Log: 10:51:00.680183 main     Package type: LINUX_64BITS_GENERIC
2019-09-12 10:51:03 (15611): Guest Log: 10:51:00.779481 main     6.0.12 r133076 started. Verbose level = 0

The latter section shows that the VM uses VirtualBox Guest Additions 6.0.12.
While this seems to work on my hosts (currently VirtualBox 6.0.10), it might cause problems if the host runs an older VirtualBox release, as maeax mentioned.
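
A quick way to compare the two versions on a given host is sketched below. It reads the version out of the VBoxService "Executable:" line quoted above and compares it with a second version string. The function names are made up for illustration, "sort -V" assumes GNU coreutils, and the way the host version is trimmed from "VBoxManage --version" in the usage comment is an assumption about that command's output format.

```shell
#!/bin/sh
# Extract the Guest Additions version from a BOINC stderr log on stdin,
# based on the path in the VBoxService "Executable:" line.
guest_additions_version() {
    sed -n 's|.*VBoxGuestAdditions-\([0-9][0-9.]*\)/.*|\1|p' | head -n 1
}

# Succeed if version $1 is strictly older than version $2.
# Relies on "sort -V" (version sort) from GNU coreutils.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Usage (assumes output such as "6.0.10r132072" from VBoxManage):
#   host=$(VBoxManage --version | sed 's/[^0-9.].*//')
#   guest=$(guest_additions_version < stderr.txt)
#   version_lt "$host" "$guest" && echo "host is older than the Guest Additions"
```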


BTW:
The vdi size is 2.4 GiB (uncompressed) and since the CVMFS cache is prefilled with roughly 1.5 GiB I couldn't compact it to a smaller size.
ID: 6614 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Send message
Joined: 20 Apr 16
Posts: 116
Credit: 1,188,675
RAC: 117
Message 6615 - Posted: 12 Sep 2019, 11:43:14 UTC - in response to Message 6614.  

I have put a lot more tasks in here now. It seems that some hosts take a lot of tasks at the same time, which means there are none left for others.

I wonder if I should go back to VBox 5.2 if there are issues with version 6. I see some results look ok with 5.2 though.

Those int13_harddisk lines in the stderr have been there in every WU since I started this test, I tried to find out what causes it but couldn't find any information anywhere. I don't think it affects the tasks though.

Do the consoles work for you? With the images produced by VBox 6 I couldn't get them to work (using rdesktop-vrdp on linux).
ID: 6615 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
maeax

Send message
Joined: 22 Apr 16
Posts: 454
Credit: 1,270,941
RAC: 0
Message 6616 - Posted: 12 Sep 2019, 12:08:21 UTC - in response to Message 6615.  
Last modified: 12 Sep 2019, 12:15:09 UTC

No David,
don't go back to 5.2.x, but add a note for those who have version 5.2.
Hmm, I will take a look at why others are running 5.2.x fine.
I have other ATLAS tasks from production running in parallel and saw this line:
Is this VirtualBox error code the cause when the hypervisor fails?
VERR_SVM_IN_USE (For AMD).
ID: 6616 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
computezrmle
Avatar

Send message
Joined: 28 Jul 16
Posts: 287
Credit: 353,387
RAC: 0
Message 6617 - Posted: 12 Sep 2019, 12:24:34 UTC - in response to Message 6615.  

Those int13_harddisk lines in the stderr have been there in every WU since I started this test, I tried to find out what causes it but couldn't find any information anywhere.

It usually indicates a missing device (harddisk) so I first thought it was caused by the logical volume layout.
Since the partitions are now XFS, it must have another cause.

Did you (or centos) configure some kind of raid when you initially started to set up the VM?

I don't think it affects the tasks though.

Right.


Do the consoles work for you? With the images produced by VBox 6 I couldn't get them to work (using rdesktop-vrdp on linux).

The consoles look very unfamiliar, especially the top output.
It looks like a tailed logfile and since it updates after a few seconds it is hard to read.
The process output at console 2 is much better since it updates less frequently.

Part of the problem might be that I monitor the tasks from a remote machine which always makes the consoles a bit sluggish.



The task has now been running for 3.5 h and has just finished event no. 15.
ID: 6617 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Send message
Joined: 20 Apr 16
Posts: 116
Credit: 1,188,675
RAC: 117
Message 6618 - Posted: 12 Sep 2019, 14:39:47 UTC - in response to Message 6617.  

Ok, I went back to VirtualBox 5.2.32 and made a new vdi for version 0.84. I hope it fixes the issues with Vbox version 5. It's even slightly smaller than the previous version but maybe that's because I'm getting more efficient at creating these images :)

I think the "top" output is messed up because the default console size is larger than before, so I've fixed it to fit the larger window.

As for the device errors, the disk setup is just the default from CentOS 7 except I now use xfs instead of LVM.
ID: 6618 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
maeax

Send message
Joined: 22 Apr 16
Posts: 454
Credit: 1,270,941
RAC: 0
Message 6619 - Posted: 12 Sep 2019, 15:44:39 UTC
Last modified: 12 Sep 2019, 16:35:39 UTC

F3 in the console scrolls through the screen every 5 seconds.
The console is from VirtualBox (Show VM); in BOINC Manager, graphics and RDP are not shown.
The hypervisor failed again after 10 minutes.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2821809
ID: 6619 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
computezrmle
Avatar

Send message
Joined: 28 Jul 16
Posts: 287
Credit: 353,387
RAC: 0
Message 6620 - Posted: 12 Sep 2019, 15:47:04 UTC - in response to Message 6618.  

This time I got a task from the dev server.

I guess this URL will change to an official host when you move it to the production environment:
<hostname censored> 3180 - - [12/Sep/2019:17:16:58 +0200] "GET http://dcameron.web.cern.ch/dcameron/dev/ATLASJobWrapper-test.sh HTTP/1.1" 200 7280 "-" "curl/7.29.0" TCP_REFRESH_MODIFIED:HIER_DIRECT


Port 25085 is closed at my main firewall.
This causes lots of failed requests like this:
<hostname censored> 3128 - - [12/Sep/2019:17:20:00 +0200] "GET http://pandaserver.cern.ch:25085/cache/schedconfig/BOINC-TEST.all.json HTTP/1.1" 503 4346 "-" "Python-urllib/2.7" TCP_MISS:HIER_DIRECT

Somehow the task got a job from another source and is now processing events.
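
To see how many requests fail and for which URLs, the proxy log can be summarised with a short awk filter. This is only a sketch that assumes access-log lines in the format quoted above (request in the first pair of double quotes, status code right after it). The function name is hypothetical.

```shell
#!/bin/sh
# Count proxy requests that ended with a 5xx status, grouped by URL.
# Reads access-log lines from standard input and prints "count URL".
count_failed_requests() {
    awk -F'"' '
        {
            split($2, req, " ")      # req[2] is the requested URL
            split($3, res, " ")      # res[1] is the HTTP status code
            if (res[1] + 0 >= 500) failed[req[2]]++
        }
        END { for (url in failed) print failed[url], url }
    '
}

# Usage:
#   count_failed_requests < access.log
```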


Top output at console 3 is a bit better than before but still hard to read.
ID: 6620 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
computezrmle
Avatar

Send message
Joined: 28 Jul 16
Posts: 287
Credit: 353,387
RAC: 0
Message 6621 - Posted: 12 Sep 2019, 16:29:53 UTC - in response to Message 6620.  

Top output at console 3 is a bit better than before but still hard to read.

@David

I would like to work out a method that makes the top output less sluggish.
Would you be so kind as to post which method you have currently implemented, and why?
ID: 6621 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote