Message boards :
ATLAS Application :
Testing CentOS 7 vbox image
Message board moderation
Author | Message |
---|---|
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
As mentioned on the main site boards we plan to update the vbox image to CentOS 7. I have created a CentOS 7 image for app version 0.80 and will send some test WU here shortly. These will be short tasks and we won't use the data so don't worry about failures. Please test it out and let us know of any issues. Unfortunately the vdi is rather large... around 4GB (1.7GB compressed download) because the CVMFS cache is filled with all the required software for running the task. This should hopefully mean network traffic for a running WU is minimised. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 857,561 RAC: 22 |
I'm testing this atm and have set it up for a 4-core VM. After the VM booted, the 4 athena.py's were running after 5.5 minutes up time. Considering the large vdi. You did not increased the <rsc_disk_bound>8000000000.000000</rsc_disk_bound>. After the first few events were processed, I suspended the task forcing to save the VM to disk. The size of the slot occupied grew to 8,470,429,696 bytes, so in normal circumstances the task would have error-ed out due to DISK_LIMIT_EXCEEDED. So please double the <rsc_disk_bound> to 16000000000.000000 Result: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2804982 |
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
Thanks for testing. I've increased rsc_disk_bound to 16GB. I notice that your WU restarted from the beginning after resuming the VM, so I'll look into why it didn't restart from its previous state. |
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
I notice that your WU restarted from the beginning after resuming the VM, so I'll look into why it didn't restart from its previous state. Actually I couldn't recreate this problem myself. I suspended this WU and on resuming it picked up from where it left off: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2807397 I have BOINC configured not to keep suspended tasks in memory. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 857,561 RAC: 22 |
I notice that your WU restarted from the beginning after resuming the VM, so I'll look into why it didn't restart from its previous state. That task was almost 3 days ago, but as far as I remember it restored from the saved state after the resume, but I see what you mean: copied input files twice. In the result however you do not see a 'restoring from snapshot', probably it's not a real snapshot, but a saved state ?? After the resume it used 13 minutes to complete, so I think it was not a complete restart with boot etc etc. A task without interruption lasted longer: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2805079 |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 857,561 RAC: 22 |
Another task where I see the same. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2805090 Will test again... |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 857,561 RAC: 22 |
Will test again... No success. Task is running, but 2019-08-26 12:53:22 (3928): Guest Log: Failed to check CVMFS: 2019-08-26 12:53:22 (3928): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed! 2019-08-26 12:53:26 (3928): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... OK 2019-08-26 12:53:27 (3928): Guest Log: Probing /cvmfs/grid.cern.ch... OK Edit: Forced a new boot and CVMFS check was OK this time. Will suspend after the first checkpoint. I set the checkpoints to 10 minutes. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 857,561 RAC: 22 |
Finished result with several suspends en resumes: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2807398 I reproduced in mind that Friday task. After suspend (LAIM off), the task is saved to disk. I checked now that the saving of that huge VM is lasting about 30 - 35 seconds. BOINC has a hard coded time limit of 60 seconds. The task(s) that started from scratch didn't make a save within that 60 seconds and the VM-state was not saved, but stopped. In that case after a resume the VM is booted from scratch and all events have to be processed again. With 200 events a pity :( Edit: With a next task saving to disk becomes more critical. From VBox.log: 00:12:11.009959 Console: Machine state changed to 'Saving' 00:12:52.665204 Console: Machine state changed to 'Saved' 00:04:04.917323 Console: Machine state changed to 'Saving' 00:04:46.895925 Console: Machine state changed to 'Saved' |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 857,561 RAC: 22 |
I've tested several tasks and suspended the task with "Leave application in memory" set to off. All saved VM's were saved OK with a 'Saved state' and no one had a Stopped or another state. After a resume however it happens rather often, that in the result log it looks like the VM has booted and input files are copied, but in fact the processing goes on where it was left off before the suspend. Example task: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2807395 |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 857,561 RAC: 22 |
After the first real boot and after the resume several log lines seem like repeated. Look at the Log time in this line twice the same time at different moments. 2019-08-26 18:44:11 (4380): Guest Log: 00:00:00.000932 main Log opened 2019-08-26T18:44:09.602254000Z 2019-08-26 19:11:36 (2788): Guest Log: 00:00:00.000932 main Log opened 2019-08-26T18:44:09.602254000Z and even more wondering 2019-08-26 18:44:21 (4380): Guest Log: 00:00:10.016066 timesync vgsvcTimeSyncWorker: Radical guest time change: -7 190 129 137 000ns (GuestNow=1 566 837 859 483 282 000 ns GuestLast=1 566 845 049 612 419 000 ns fSetTimeLastLoop=true ) 2019-08-26 19:11:36 (2788): Guest Log: 00:00:10.016066 timesync vgsvcTimeSyncWorker: Radical guest time change: -7 190 129 137 000ns (GuestNow=1 566 837 859 483 282 000 ns GuestLast=1 566 845 049 612 419 000 ns fSetTimeLastLoop=true ) |
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
Thanks a lot for these investigations. If the size of the image is the problem I will look into reducing it. However I would have thought that a regular task that is several hours into processing would also have a large disk image to save. For those tasks where you see the log repeated can you check if the task was started again, i.e. a second task running in parallel to the first? You can check this from the top output on the console for example, if there are more athena.py processes than expected. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 857,561 RAC: 22 |
Thanks a lot for these investigations. If the size of the image is the problem I will look into reducing it. However I would have thought that a regular task that is several hours into processing would also have a large disk image to save.The size itself is not a problem since you doubled the rsc_disk_bound. For some users the download of 1745879040 bytes of the gzipped-vdi could be a problem. Saving to disk was for me always within the mentioned BOINC 60 seconds hard coded limit with other threads busy. Noted up to 42 seconds. Theory needs only a few seconds. I'm using a secondary 7200rpm hard-disk for BOINC and data. The OS is on another disk. However it's always a good idea to make the image as clean/small as possible. Think about old log files, error dumps, really unneeded software etc etc. I also ran a few tasks single core and then the saved image was much smaller. But for me a 4-core ATLAS is the best setup. When going into production, advice the users to run BOINC with 'Keep applications in memory', when enough RAM available. The other advice still is to run a ATLAS-task without interruption. Reducing the number of events to e.g. 100 could help a bit. There could be a negative surprise in saving time when suspending a task that already had done much more events . . .. Maybe we should test that here first. For those tasks where you see the log repeated can you check if the task was started again, i.e. a second task running in parallel to the first? You can check this from the top output on the console for example, if there are more athena.py processes than expected.I can be clear about that. The processes were not restarted and also no new tasks/processes were added. Mostly saw 4, sometimes 5 athena.py's. Except for the last 3, 2 or 1 event(s) of course. Also the run times for those active athena.py processes was adding to the times before the suspend. |
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
I'm trying to get the smallest image possible, but: The "minimal" CentOS 7 installation takes 1.2GB After removing everything possible I got it down to 600MB Installing CVMFS and dependencies takes it back up to 800MB Running a task adds around 1GB to the CVMFS cache
|
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 857,561 RAC: 22 |
So realistically ~2GB is the lowest we can go. I'll make a new release soon with a smaller image.That would be a great improvement. The ATLAS_vbox_0.80_image.vdi now is 4.05GB without running a job. |
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
I've released version 0.81 which is a little smaller (3.4GB). The problem is that VirtualBox has a feature where if you write to disk it doesn't actually free space when files are deleted. So while "df" inside the VM reports 2.2GB used, the vdi is still 3.4GB, even after compacting. Another radical option is to not pre-fill the CVMFS cache in the VM which would allow a relatively small vdi. But this would lead to large network traffic and poor efficiency for every WU unless people set up local squid servers, which I feel is a bit too much to ask the average volunteer. |
Send message Joined: 22 Apr 16 Posts: 677 Credit: 1,990,358 RAC: 362 |
The problem is that VirtualBox has a feature where if you write to disk it doesn't actually free space when files are deleted. Have 5.2.32 and testing it with this Version on Windows - 6.0.10 is now avalaible for Virtualbox. No tasks are avalaible for Atlas Simulation. :-( Have also made a reset of the Project. |
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
I can't get any tasks either... even though there are a few unsent in the queue. I don't see anything in the server logs and I also restarted the daemons, but it didn't help. I'll ask the admins. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 857,561 RAC: 22 |
I can't get any tasks either... even though there are a few unsent in the queue.To verify the scheduler, I requested a Theory native and got one. |
Send message Joined: 22 Apr 16 Posts: 677 Credit: 1,990,358 RAC: 362 |
The 0.81 .vdi is 1,17 Gbyte in Download. The task have 348,3 Mbyte. Virtualbox 5.2.32 in use therefore. One Task with 2 CPU's. Edit: RAM 4.800 MByte. RDP-Console is in Boinc not useable. VM Hypervisor failed to enter an online state in a timely fashion. https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=3452 This are the last lines in the log: 00:00:13.627233 VMMDev: Guest Log: 00:00:00.031357 automount vbsvcAutoMountWorker: Shared folder 'shared' was mounted to '/media/sf_shared' 00:00:13.909407 VMMDev: Guest Log: Checking CVMFS... 00:00:22.547133 VMMDev: Guest Log: CVMFS is ok 00:00:22.732485 VMMDev: Guest Log: Mounting shared directory 00:00:22.887241 VMMDev: Guest Log: Copying input files 00:00:23.616670 VMMDev: Guest Log: 00:00:10.020797 timesync vgsvcTimeSyncWorker: Radical guest time change: -7 188 080 583 000ns (GuestNow=1 567 580 676 779 799 000 ns GuestLast=1 567 587 864 860 382 000 ns fSetTimeLastLoop=true ) 00:00:28.848450 VMMDev: Guest Log: Copied input files into RunAtlas. 00:00:30.157209 VMMDev: Guest Log: copied the webapp to /var/www 00:00:30.246053 VMMDev: Guest Log: This vm does not need to setup an http proxy 00:00:30.251200 VMMDev: Guest Log: ATHENA_PROC_NUMBER=2 00:00:30.415911 VMMDev: Guest Log: *** Starting ATLAS job. (PandaID=4002876565 taskID=000649-3) *** 00:02:17.732640 VMMDev: SetVideoModeHint: Got a video mode hint (800x600x0)@(0x0),(1;0) at 0 |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 857,561 RAC: 22 |
I picked up 2 tasks. The first one was a disaster. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2812406 I had 8 Theory's running with 90% executing cap. To start the 4-core ATLAS task, I suspended 4 of the Theory´s. So far so good. ATLAS started running and could see ´top´ and ALT F2 events processing. Meanwhile I resumed the 4 Theory´s (other 4 still running). To test the ATLAS suspend behaviour, I suspended the task. The 4 Theory´s started running immediately, but The ATLAS task had to create a save set in the background. It lasted too long 00:28:18.645496 Changing the VM state from 'RUNNING' to 'SUSPENDING' 00:28:18.922246 PDMR3Suspend: 276 703 318 ns run time 00:28:18.922269 Changing the VM state from 'SUSPENDING' to 'SUSPENDED' 00:28:18.922300 Console: Machine state changed to 'Paused' 00:28:39.638447 Console: Machine state changed to 'Saving' 00:28:39.640394 Changing the VM state from 'SUSPENDED' to 'SAVING' 00:29:34.186831 SSM: Successfully saved the VM state to 'D:\Boinc1\slots\16\boinc_a33779c40ba690f5\Snapshots\2019-09-04T10-32-47-743120300Z.sav' 00:29:34.186860 Changing the VM state from 'SAVING' to 'SUSPENDED' 00:29:34.186909 Console::powerDown(): A request to power off the VM has been issued (mMachineState=Saving, InUninit=0) 00:29:34.608809 VRDP: Logoff: AH1 (127.0.0.1) build 9600. User: [] Domain: [] Reason 0x0001. 00:29:34.608975 VRDP: Connection closed: 1 00:29:34.613116 VMMDevNotifyGuest: fAddEvents=0x100 ignored because enmVMState=15 00:29:34.613148 VBVA: VRDP acceleration has been disabled. 00:29:34.991018 VRDP: TCP server closed. 00:29:35.467539 Changing the VM state from 'SUSPENDED' to 'POWERING_OFF' 00:29:36.400340 Changing the VM state from 'POWERING_OFF' to 'OFF' Although it´s written that a save set is created, it wasn´t there. Probably the wrapper cleaned it because the save was lasting longer than 60 seconds. The VM had not a saved state, but an aborted state. After resume it looks like it started with a boot, but at the end no HITS file was created. Despite of all your work to shrink the image, saving an image is not faster. The second task https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2812254, I tested with only 4 Theory´s running. The save to disk lasted 42 seconds and it produced a HITS file at the end. |
©2024 CERN