Testing CentOS 7 vbox image
Author | Message |
---|---|
Send message Joined: 22 Apr 16 Posts: 677 Credit: 1,990,358 RAC: 297 |
I have to wait with the CentOS 7 -dev task until the two ATLAS production tasks, which always use 6 CPUs, are finished. Then I can save BOINC and hopefully start the CentOS 7 task again. I will not test a suspend because of your experience, Crystal. |
Send message Joined: 22 Apr 16 Posts: 677 Credit: 1,990,358 RAC: 297 |
Edit: RAM 4,800 MByte. The RDP console is not usable in BOINC. I have installed an app_config. I started it anew 4 times; the result was always "Hypervisor failed". |
Send message Joined: 22 Apr 16 Posts: 677 Credit: 1,990,358 RAC: 297 |
Is this VirtualBox error code the cause when "Hypervisor failed" appears? VERR_SVM_IN_USE (for AMD). |
Send message Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0 |
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2813576 This ATLAS VM started until the login screen appeared on all consoles. Then it remained idle. |
Send message Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0 |
Compacting the ATLAS vdi file

David Cameron wrote:
I've released version 0.81 which is a little smaller (3.4GB). The problem is that VirtualBox has a feature where if you write to disk it doesn't actually free space when files are deleted. So while "df" inside the VM reports 2.2GB used, the vdi is still 3.4GB, even after compacting.

To check the partition layout of the vdi file I attached it as /dev/sdb to a self made VM and found the following partitions:
/dev/sdb1: xfs, 1.00 GiB
/dev/sdb2: lvm2 pv (centos), 18.05 GiB
/dev/sdb3: extended
/dev/sdb5: linux-swap, 972.00 MiB

The reason why this vdi can't be compacted might be that /dev/sdb2 is part of a logical volume. Hence I tried to reformat the partition with ext4.

Step 1: Clone ATLAS_vbox_0.81_image.vdi as ATLAS_vbox_0.81_image2.vdi and attach the clone as /dev/sdc to the VM.
Step 2: Reformat /dev/sdc2 with ext4 and reboot the VM.
Step 3: Copy all files from /dev/sdb2 to /dev/sdc2 (rsync).
Step 4: Zero non used space on /dev/sdc1 and /dev/sdc2 (required for "VBoxManage ... --compact"!)
To do this use temporary mountpoints /sdc/sdc1 and /sdc/sdc2
Then run:
cat /dev/zero >/sdc/sdc1/tmpzero
rm /sdc/sdc1/tmpzero
cat /dev/zero >/sdc/sdc2/tmpzero
rm /sdc/sdc2/tmpzero
Step 5: Compact ATLAS_vbox_0.81_image2.vdi
VBoxManage modifyhd ATLAS_vbox_0.81_image2.vdi --compact

Results:
Original size: 3.1 GiB
Compacted (including the CVMFS cache): 2.3 GiB
Compacted (CVMFS cache removed): 897 MiB

Conclusion
A vdi partition formatted as part of an lvm volume seems to be overkill for a small standalone VM, especially as the vdi can't be compacted any more. Instead a reliable filesystem like ext4 (or XFS) should be used as default.

The method above is not intended to be used by the volunteers but (starting with step 4) should be used by the vdi maintainers before the vdi is released for production. This would result in faster downloads as well as in a faster VM setup on the client computer since the vdi has to be copied from the project folder to the slots folder every time a new task starts. |
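Steps 4 and 5 of the post above could be wrapped in a small script along these lines. This is only a sketch under assumptions: the function names `run`, `zero_free_space` and `compact_vdi` and the `DRY_RUN` switch are made up for illustration, and `dd` is swapped in for the `cat /dev/zero > file` form used in the post so that the command can be printed in dry-run mode. The mount points and the vdi name are the ones from the post.

```shell
# Sketch of steps 4 and 5. With DRY_RUN=1 (the default here) the commands
# are only printed; set DRY_RUN=0 to really execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Step 4: fill the free space of a mounted partition with zeros, then
# delete the filler file. $1 is the mount point, e.g. /sdc/sdc1.
zero_free_space() {
    # dd ends with "No space left on device" once the partition is full;
    # that is the expected outcome, so the error is ignored.
    run dd if=/dev/zero of="$1/tmpzero" bs=1M || true
    run sync
    run rm -f "$1/tmpzero"
}

# Step 5: compact the vdi on the host. $1 is the vdi file.
compact_vdi() {
    run VBoxManage modifyhd "$1" --compact
}
```

A maintainer would call `zero_free_space /sdc/sdc1` and `zero_free_space /sdc/sdc2` inside the helper VM, shut the VM down, and then run `compact_vdi ATLAS_vbox_0.81_image2.vdi` on the host. Trying it with `DRY_RUN=1` first shows the exact commands without touching any disk.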
Send message Joined: 22 Apr 16 Posts: 677 Credit: 1,990,358 RAC: 297 |
"Is this VirtualBox error code the cause when 'Hypervisor failed' appears?" The deadline was reached yesterday, so I cancelled the task. It always ended with "Hypervisor failed": https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2812272 |
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
"The method above is not intended to be used by the volunteers but (starting with step 4) should be used by the vdi maintainers before the vdi is released for production."

I always do these last steps just before releasing a new vdi, but it could be as you say that the logical volume doesn't allow compacting. I will try to create a new VM with ext4 formatting to see if it helps. Thanks for the tips.

In the meantime I tried creating the image with Vbox version 6, and released 0.82 which has a 2.8GB image. This one should also fix the strange error you mentioned above where the CVMFS check failed and the WU got stuck doing nothing. |
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
I made a new image with an xfs file system for version 0.83. It seems it saves a little space but not that much (~100MB). |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 857,600 RAC: 20 |
David, is it possible to add the size of the HITS-file when moving to the shared folder? |
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
"David, is it possible to add the size of the HITS-file when moving to the shared folder?"

You have the size in the log messages:
2019-09-10 17:12:54 (166330): Guest Log: Looking for outputfile HITS.000649-2086812-31151._078090.pool.root.1
2019-09-10 17:12:54 (166330): Guest Log: HITS file was successfully produced
2019-09-10 17:12:54 (166330): Guest Log: -rw-------. 1 atlas atlas 9091177 Sep 10 15:12 /home/atlas/RunAtlas/HITS.000649-2086812-31151._078090.pool.root.1

"9091177" is the size of the HITS file. These test WU only process 10 events so it's much smaller than the real WU results. |
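For anyone who wants to pull that number out of a task's stderr automatically, a small filter could look like the sketch below. It assumes the log keeps the `ls -l` style line shown in the post above; the function name `hits_size` is made up for illustration.

```shell
# Reads a BOINC stderr log on stdin and prints the size in bytes of the
# first HITS file listed in "ls -l" format.
hits_size() {
    awk '/Guest Log: -/ && /HITS\./ {
        # Find the permission field (e.g. "-rw-------."); the size is
        # four fields further on: links, owner, group, size.
        for (i = 1; i <= NF; i++)
            if ($i ~ /^-[rwx-]+\.?$/) { print $(i + 4); exit }
    }'
}
```

With a saved log this would be used as `hits_size < stderr.txt`; for the line quoted above it prints 9091177.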
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 857,600 RAC: 20 |
Sorry David, I must have mixed up the results of vbox versus native version. Thus wrong thread. I don't see the size of the HITS-file in the native version. One of your results: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2819384 Your test workunits for v0.82 and v0.83 were quickly distributed, I suppose. |
Send message Joined: 22 Apr 16 Posts: 677 Credit: 1,990,358 RAC: 297 |
"Always Hypervisor failed:" David, would it be possible to show a message when the task starts, telling users to use only VirtualBox 6.0.x? With 5.2 the task ends with "Hypervisor failed" after 10 minutes, and you then need to delete the BOINC VM in VirtualBox manually. |
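Until such a message exists, a volunteer could check the installed series by hand. The sketch below is not part of the ATLAS application; the function names `vbox_series` and `check_vbox_series` are made up, and the 6.0 requirement is only the example from the post above. It assumes `VBoxManage --version` prints a string such as `6.0.12r133076`.

```shell
# Reduce a VirtualBox version string to its major.minor series.
vbox_series() {
    printf '%s\n' "$1" | sed -n 's/^\([0-9][0-9]*\.[0-9][0-9]*\)\..*$/\1/p'
}

# $1: version string, $2: wanted series (for example 6.0).
# Prints a short note and returns non-zero on a mismatch.
check_vbox_series() {
    if [ "$(vbox_series "$1")" = "$2" ]; then
        echo "VirtualBox $1: OK"
    else
        echo "VirtualBox $1: expected series $2.x"
        return 1
    fi
}
```

On a host with VirtualBox installed this would be called as `check_vbox_series "$(VBoxManage --version)" 6.0`.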
Send message Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0 |
I manually downloaded ATLAS_vbox_0.83_image.vdi and as the dev server has no tasks available I used the vdi to replace the original ATLAS-vdi from a test client that is attached to the production server. That way the test client got a task:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=245678240

The task is processing hits but as it is a 1-core setup it will take a while until it is finished.

The stderr.txt already shows some lines that should be investigated:
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=81
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=81
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=82
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=82
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=83
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=83
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=84
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=84
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=85
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=85
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=86
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=86
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=87
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=87
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=88
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=88
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=89
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=89
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8a
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8a
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8b
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8b
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8c
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8c
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8d
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8d
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8e
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8e
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8f
2019-09-12 10:50:24 (15611): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8f
2019-09-12 10:51:03 (15611): Guest Log: 10:51:00.680177 main Executable: /opt/VBoxGuestAdditions-6.0.12/sbin/VBoxService
2019-09-12 10:51:03 (15611): Guest Log: 10:51:00.680183 main Process ID: 2151
2019-09-12 10:51:03 (15611): Guest Log: 10:51:00.680183 main Package type: LINUX_64BITS_GENERIC
2019-09-12 10:51:03 (15611): Guest Log: 10:51:00.779481 main 6.0.12 r133076 started. Verbose level = 0

The latter section points out that the VM uses the VirtualBox Guest Additions 6.0.12. While this seems to work on my hosts (currently version 6.0.10) it might cause problems if the host runs an older VirtualBox version, as maeax mentioned.

BTW: The vdi size is 2.4 GiB (uncompressed) and since the CVMFS cache is prefilled with roughly 1.5 GiB I couldn't compact it to a smaller size. |
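Both observations from that stderr.txt can be extracted with two short filters. This is a sketch that assumes the log format quoted in the post above; the function names are made up for illustration.

```shell
# Reads a BOINC stderr log on stdin and prints the Guest Additions version
# reported by VBoxService ("... main 6.0.12 r133076 started. ...").
guest_additions_version() {
    sed -n 's/^.*Guest Log: .* main *\([0-9][0-9.]*\) r[0-9]* started\..*$/\1/p' | head -n 1
}

# Reads a BOINC stderr log on stdin and prints how many lines mention
# int13_harddisk (both the int13_harddisk and int13_harddisk_ext forms).
count_int13_lines() {
    grep -c 'int13_harddisk'
}
```

Usage would be `guest_additions_version < stderr.txt` and `count_int13_lines < stderr.txt`. Comparing the first value with the host's VirtualBox version makes a mismatch like the one discussed here easy to spot.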
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
I have put a lot more tasks in here now, it seems that some hosts take a lot of tasks at the same time which means there are none left for others. I wonder if I should go back to VBox 5.2 if there are issues with version 6. I see some results look ok with 5.2 though. Those int13_harddisk lines in the stderr have been there in every WU since I started this test, I tried to find out what causes it but couldn't find any information anywhere. I don't think it affects the tasks though. Do the consoles work for you? With the images produced by VBox 6 I couldn't get them to work (using rdesktop-vrdp on linux). |
Send message Joined: 22 Apr 16 Posts: 677 Credit: 1,990,358 RAC: 297 |
No David, don't go back to 5.2.x; just add a note for those who run version 5.2. Hmm, I will take a look at why others run 5.2.x without problems. I have other ATLAS tasks from production running in parallel and saw this line: Is this VirtualBox error code the cause when "Hypervisor failed" appears? VERR_SVM_IN_USE (for AMD). |
Send message Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0 |
"Those int13_harddisk lines in the stderr have been there in every WU since I started this test, I tried to find out what causes it but couldn't find any information anywhere."

It usually indicates a missing device (harddisk) so I first thought it was caused by the logical volume layout. Since the partitions are now XFS there must be another reason. Did you (or CentOS) configure some kind of RAID when you initially started to set up the VM?

"I don't think it affects the tasks though."

Right.

"Do the consoles work for you? With the images produced by VBox 6 I couldn't get them to work (using rdesktop-vrdp on linux)."

The consoles look very unfamiliar, especially the top output. It looks like a tailed logfile and since it updates every few seconds it is hard to read. The process output at console 2 is much better since it updates less frequently. Part of the problem might be that I monitor the tasks from a remote machine, which always makes the consoles a bit sluggish.

The task has now been running for 3.5 h and has just finished event no. 15. |
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
Ok, I went back to VirtualBox 5.2.32 and made a new vdi for version 0.84. I hope it fixes the issues with Vbox version 5. It's even slightly smaller than the previous version but maybe that's because I'm getting more efficient at creating these images :) I think the "top" output is messed up because the default console size is larger than before, so I've fixed it to fit the larger window. As for the device errors, the disk setup is just the default from CentOS 7 except I now use xfs instead of LVM. |
Send message Joined: 22 Apr 16 Posts: 677 Credit: 1,990,358 RAC: 297 |
F3 in the console scrolls through the screen every 5 seconds. The console is the one from VirtualBox (Show VM); in BOINC Manager, graphics and RDP are not shown. "Hypervisor failed" occurred again after 10 minutes. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2821809 |
Send message Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0 |
This time I got a task from the dev server.

I guess this URL will change to an official host when you move it to the production environment:
<hostname censored> 3180 - - [12/Sep/2019:17:16:58 +0200] "GET http://dcameron.web.cern.ch/dcameron/dev/ATLASJobWrapper-test.sh HTTP/1.1" 200 7280 "-" "curl/7.29.0" TCP_REFRESH_MODIFIED:HIER_DIRECT

Port 25085 is closed at my main firewall. This causes lots of failed requests like this:
<hostname censored> 3128 - - [12/Sep/2019:17:20:00 +0200] "GET http://pandaserver.cern.ch:25085/cache/schedconfig/BOINC-TEST.all.json HTTP/1.1" 503 4346 "-" "Python-urllib/2.7" TCP_MISS:HIER_DIRECT

Somehow the task got a job from another source and is now processing events.

Top output at console 3 is a bit better than before but still hard to read. |
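Failed requests like the one above can be listed from a proxy access log with a short filter. This is a sketch that assumes the combined-style log format of the two lines quoted in the post (request in the first quoted field, status code right after it); the function name `failed_requests` is made up for illustration.

```shell
# Reads proxy access log lines on stdin and prints "status URL" for every
# request that was answered with a 5xx status code.
failed_requests() {
    awk -F'"' '{
        split($2, req, " ")   # req[2] is the requested URL
        split($3, res, " ")   # res[1] is the HTTP status code
        if (res[1] >= 500) print res[1], req[2]
    }'
}
```

Piping the output through `sort | uniq -c` would show how often each URL failed, which helps to tell a single closed port apart from a wider network problem.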
Send message Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0 |
"Top output at console 3 is a bit better than before but still hard to read."

@David: To help work out a method that makes the top output less sluggish, would you be so kind as to post which method you have currently implemented, and why? |
©2024 CERN