Message boards : ATLAS Application : Testing CentOS 7 vbox image

David Cameron
Project administrator
Project developer
Project tester
Project scientist

Message 6569 - Posted: 23 Aug 2019, 9:28:40 UTC

As mentioned on the main site boards, we plan to update the vbox image to CentOS 7. I have created a CentOS 7 image for app version 0.80 and will send some test WUs here shortly. These will be short tasks and we won't use the data, so don't worry about failures. Please test it out and let us know of any issues.

Unfortunately the vdi is rather large... around 4GB (1.7GB compressed download), because the CVMFS cache is pre-filled with all the software required to run the task. This should hopefully mean that network traffic for a running WU is minimised.
ID: 6569
Crystal Pellet
Volunteer tester

Message 6570 - Posted: 23 Aug 2019, 16:57:27 UTC - in response to Message 6569.  

I'm testing this at the moment and have set it up as a 4-core VM.
After the VM booted, all four athena.py processes were running within 5.5 minutes of uptime.
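For anyone who wants to pin the core count on the client side, an app_config.xml along these lines should work (the app_name and plan_class here are only examples; check client_state.xml for the exact names, and the core count can also be set in the project preferences):

    <app_config>
      <app_version>
        <app_name>ATLAS</app_name>
        <plan_class>vbox64_mt_mcore_atlas</plan_class>
        <avg_ncpus>4.0</avg_ncpus>
      </app_version>
    </app_config>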

Considering the large vdi: you did not increase <rsc_disk_bound>, which is still 8000000000.000000.

After the first few events were processed, I suspended the task, forcing the VM to be saved to disk.
The occupied slot directory grew to 8,470,429,696 bytes, so under normal circumstances the task would have errored out with DISK_LIMIT_EXCEEDED.

So please double <rsc_disk_bound> to 16000000000.000000.

Result: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2804982
ID: 6570
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Message 6572 - Posted: 26 Aug 2019, 9:52:18 UTC - in response to Message 6570.  

Thanks for testing. I've increased rsc_disk_bound to 16GB.
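For reference, that is just the disk bound in the workunit template (or the equivalent create_work argument), i.e. roughly:

    <rsc_disk_bound>16000000000.000000</rsc_disk_bound>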

I notice that your WU restarted from the beginning after resuming the VM, so I'll look into why it didn't restart from its previous state.
ID: 6572
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Message 6573 - Posted: 26 Aug 2019, 10:19:45 UTC - in response to Message 6572.  

I notice that your WU restarted from the beginning after resuming the VM, so I'll look into why it didn't restart from its previous state.


Actually I couldn't recreate this problem myself. I suspended this WU and on resuming it picked up from where it left off: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2807397

I have BOINC configured not to keep suspended tasks in memory.
ID: 6573
Crystal Pellet
Volunteer tester

Message 6574 - Posted: 26 Aug 2019, 10:37:48 UTC - in response to Message 6573.  
Last modified: 26 Aug 2019, 10:42:37 UTC

I notice that your WU restarted from the beginning after resuming the VM, so I'll look into why it didn't restart from its previous state.


Actually I couldn't recreate this problem myself. I suspended this WU and on resuming it picked up from where it left off: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2807397

I have BOINC configured not to keep suspended tasks in memory.

That task was almost 3 days ago, but as far as I remember it restored from the saved state after the resume. I do see what you mean, though: the input files were copied twice.
In the result, however, you do not see a 'restoring from snapshot' line; probably it is not a real snapshot but a saved state?
After the resume it took only 13 minutes to complete, so I don't think it was a complete restart with a full boot and so on.

A task without interruption lasted longer: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2805079
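One way to see what state a suspended task's VM is really in is to ask VirtualBox directly, for example (the VM name below is a placeholder; 'VBoxManage list vms' shows the real boinc_... name for the slot):

    VBoxManage showvminfo "boinc_xxxxxxxxxxxxxxxx" | grep -iE "State|Snapshots"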
ID: 6574
Crystal Pellet
Volunteer tester

Message 6575 - Posted: 26 Aug 2019, 10:47:43 UTC
Last modified: 26 Aug 2019, 10:58:00 UTC

Another task where I see the same behaviour:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2805090
Will test again...
ID: 6575
Crystal Pellet
Volunteer tester

Message 6576 - Posted: 26 Aug 2019, 11:04:38 UTC - in response to Message 6575.  
Last modified: 26 Aug 2019, 11:13:05 UTC

Will test again...

No success. The task is running, but:
2019-08-26 12:53:22 (3928): Guest Log: Failed to check CVMFS:
2019-08-26 12:53:22 (3928): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2019-08-26 12:53:26 (3928): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... OK
2019-08-26 12:53:27 (3928): Guest Log: Probing /cvmfs/grid.cern.ch... OK

Edit: I forced a new boot and the CVMFS check was OK this time. I will suspend after the first checkpoint; I have set the checkpoint interval to 10 minutes.
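For anyone reproducing this: the checkpoint interval is the standard BOINC computing preference; in a global_prefs_override.xml it would be roughly this (600 seconds = 10 minutes):

    <global_preferences>
      <disk_interval>600.000000</disk_interval>
    </global_preferences>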
ID: 6576
Crystal Pellet
Volunteer tester

Message 6577 - Posted: 26 Aug 2019, 12:12:22 UTC
Last modified: 26 Aug 2019, 12:56:35 UTC

Finished result with several suspends and resumes: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2807398

Thinking back over that Friday task: after a suspend (with 'Leave applications in memory' off), the VM is saved to disk.
I have now checked that saving this huge VM takes about 30 - 35 seconds, while BOINC has a hard-coded time limit of 60 seconds.
The task(s) that started from scratch did not complete the save within those 60 seconds, so the VM state was not saved but stopped.
In that case, after a resume the VM is booted from scratch and all events have to be processed again. With 200 events, that's a pity :(

Edit: With the next task, saving to disk became even more critical. From VBox.log:

00:12:11.009959 Console: Machine state changed to 'Saving'
00:12:52.665204 Console: Machine state changed to 'Saved'

00:04:04.917323 Console: Machine state changed to 'Saving'
00:04:46.895925 Console: Machine state changed to 'Saved'
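A quick way to pull these state transitions out of the VM's VBox.log (the exact path under the slot directory depends on your setup):

    grep -E "Machine state changed to '(Saving|Saved)'" VBox.log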
ID: 6577
Crystal Pellet
Volunteer tester

Message 6578 - Posted: 26 Aug 2019, 17:46:30 UTC

I've tested several tasks, suspending each one with 'Leave applications in memory' set to off.
All VMs were saved properly with a 'Saved' state; none ended up in a Stopped or other state.
After a resume, however, it happens rather often that the result log looks as if the VM has booted and the input files are copied again, but in fact processing continues from where it left off before the suspend.

Example task: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2807395
ID: 6578
Crystal Pellet
Volunteer tester

Message 6579 - Posted: 27 Aug 2019, 5:31:59 UTC - in response to Message 6578.  
Last modified: 27 Aug 2019, 5:37:25 UTC

After the first real boot and again after the resume, several log lines appear to be repeated.

Look at the guest log time in the following lines: exactly the same timestamp appears twice, reported at two different host times.

2019-08-26 18:44:11 (4380): Guest Log: 00:00:00.000932 main Log opened 2019-08-26T18:44:09.602254000Z
2019-08-26 19:11:36 (2788): Guest Log: 00:00:00.000932 main Log opened 2019-08-26T18:44:09.602254000Z

and, even more puzzling:

2019-08-26 18:44:21 (4380): Guest Log: 00:00:10.016066 timesync vgsvcTimeSyncWorker: Radical guest time change: -7 190 129 137 000ns (GuestNow=1 566 837 859 483 282 000 ns GuestLast=1 566 845 049 612 419 000 ns fSetTimeLastLoop=true )
2019-08-26 19:11:36 (2788): Guest Log: 00:00:10.016066 timesync vgsvcTimeSyncWorker: Radical guest time change: -7 190 129 137 000ns (GuestNow=1 566 837 859 483 282 000 ns GuestLast=1 566 845 049 612 419 000 ns fSetTimeLastLoop=true )
ID: 6579
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Message 6581 - Posted: 27 Aug 2019, 14:32:01 UTC - in response to Message 6579.  

Thanks a lot for these investigations. If the size of the image is the problem I will look into reducing it. However I would have thought that a regular task that is several hours into processing would also have a large disk image to save.

For those tasks where you see the log repeated can you check if the task was started again, i.e. a second task running in parallel to the first? You can check this from the top output on the console for example, if there are more athena.py processes than expected.
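For example, something along these lines run on the console counts the athena.py processes (the [a] stops grep from matching itself); it should normally stay close to the number of configured cores:

    ps -ef | grep -c "[a]thena.py"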
ID: 6581
Crystal Pellet
Volunteer tester

Message 6582 - Posted: 27 Aug 2019, 17:02:44 UTC - in response to Message 6581.  

Thanks a lot for these investigations. If the size of the image is the problem I will look into reducing it. However I would have thought that a regular task that is several hours into processing would also have a large disk image to save.
The size itself is not a problem since you doubled rsc_disk_bound. For some users, though, the 1,745,879,040-byte download of the gzipped vdi could be an issue.
Saving to disk has, for me, always stayed within BOINC's hard-coded 60-second limit, even with other threads busy; I noted up to 42 seconds. Theory needs only a few seconds.
I'm using a secondary 7200 rpm hard disk for BOINC and its data; the OS is on another disk.
Still, it's always a good idea to make the image as clean and small as possible: old log files, error dumps, genuinely unneeded software, and so on.
I also ran a few tasks single-core and the saved image was then much smaller, but for me a 4-core ATLAS setup is best.
When this goes into production, advise users to run BOINC with 'Leave applications in memory' enabled when enough RAM is available (see the sketch below).
The other advice is still to run an ATLAS task without interruption. Reducing the number of events to e.g. 100 could also help a bit.

There could be a nasty surprise in the save time when suspending a task that has already processed many more events... Maybe we should test that here first.
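For reference, the 'Leave applications in memory' preference can be set on the website or in a global_prefs_override.xml, roughly:

    <global_preferences>
      <leave_apps_in_memory>1</leave_apps_in_memory>
    </global_preferences>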

For those tasks where you see the log repeated can you check if the task was started again, i.e. a second task running in parallel to the first? You can check this from the top output on the console for example, if there are more athena.py processes than expected.
I can be clear about that.
The processes were not restarted and no new tasks or processes were added. I mostly saw 4, sometimes 5, athena.py processes (except during the last 3, 2 or 1 event(s), of course).
Also, the run times of the active athena.py processes kept adding to the times from before the suspend.
ID: 6582
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Message 6583 - Posted: 30 Aug 2019, 12:12:03 UTC - in response to Message 6582.  

I'm trying to get the smallest image possible, but:

    - The "minimal" CentOS 7 installation takes 1.2GB
    - After removing everything possible I got it down to 600MB
    - Installing CVMFS and dependencies takes it back up to 800MB
    - Running a task adds around 1GB to the CVMFS cache

So realistically ~2GB is the lowest we can go. I'll make a new release soon with a smaller image.

ID: 6583
Crystal Pellet
Volunteer tester

Message 6584 - Posted: 30 Aug 2019, 13:12:58 UTC - in response to Message 6583.  

So realistically ~2GB is the lowest we can go. I'll make a new release soon with a smaller image.
That would be a great improvement.
The ATLAS_vbox_0.80_image.vdi is now 4.05GB without having run a job.
ID: 6584
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Message 6590 - Posted: 2 Sep 2019, 9:06:09 UTC - in response to Message 6584.  

I've released version 0.81, which is a little smaller (3.4GB). The problem is that VirtualBox does not actually give space back in the vdi when files are deleted inside the guest, so while "df" inside the VM reports 2.2GB used, the vdi is still 3.4GB, even after compacting.
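As far as I understand, compacting only reclaims blocks that have been zeroed, so the deleted-file space only comes back if the free space is zero-filled inside the guest before compacting, along these lines (filename illustrative; the VM must be powered off for the compact step):

    # inside the guest: fill the free space with zeros, then remove the file
    dd if=/dev/zero of=/zerofile bs=1M; rm -f /zerofile
    # on the host, with the VM powered off:
    VBoxManage modifymedium disk ATLAS_vbox_0.81_image.vdi --compact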

A more radical option is not to pre-fill the CVMFS cache in the VM at all, which would allow a relatively small vdi. But that would mean heavy network traffic and poor efficiency for every WU unless people set up local Squid servers, which I feel is a bit too much to ask of the average volunteer.
ID: 6590
maeax

Message 6591 - Posted: 2 Sep 2019, 9:20:27 UTC - in response to Message 6590.  
Last modified: 2 Sep 2019, 9:52:19 UTC

The problem is that VirtualBox has a feature where if you write to disk it doesn't actually free space when files are deleted.

I have 5.2.32 and am testing with that version on Windows; VirtualBox 6.0.10 is now available.
No tasks are available for ATLAS Simulation. :-(
I have also done a reset of the project.
ID: 6591
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Message 6592 - Posted: 2 Sep 2019, 13:24:31 UTC - in response to Message 6591.  

I can't get any tasks either... even though there are a few unsent in the queue.

I don't see anything in the server logs and I also restarted the daemons, but it didn't help. I'll ask the admins.
ID: 6592
Crystal Pellet
Volunteer tester

Message 6593 - Posted: 2 Sep 2019, 16:25:42 UTC - in response to Message 6592.  

I can't get any tasks either... even though there are a few unsent in the queue.
To verify the scheduler, I requested a Theory Native task and got one.
ID: 6593
maeax

Message 6594 - Posted: 4 Sep 2019, 7:02:19 UTC
Last modified: 4 Sep 2019, 7:27:22 UTC

The 0.81 vdi is a 1.17 GB download; the task itself is 348.3 MB.
VirtualBox 5.2.32 is in use for this, with one task on 2 CPUs.
Edit: RAM is 4,800 MB. The RDP console is not usable in BOINC.
VM Hypervisor failed to enter an online state in a timely fashion.
https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=3452
These are the last lines in the log:
00:00:13.627233 VMMDev: Guest Log: 00:00:00.031357 automount vbsvcAutoMountWorker: Shared folder 'shared' was mounted to '/media/sf_shared'
00:00:13.909407 VMMDev: Guest Log: Checking CVMFS...
00:00:22.547133 VMMDev: Guest Log: CVMFS is ok
00:00:22.732485 VMMDev: Guest Log: Mounting shared directory
00:00:22.887241 VMMDev: Guest Log: Copying input files
00:00:23.616670 VMMDev: Guest Log: 00:00:10.020797 timesync vgsvcTimeSyncWorker: Radical guest time change: -7 188 080 583 000ns (GuestNow=1 567 580 676 779 799 000 ns GuestLast=1 567 587 864 860 382 000 ns fSetTimeLastLoop=true )
00:00:28.848450 VMMDev: Guest Log: Copied input files into RunAtlas.
00:00:30.157209 VMMDev: Guest Log: copied the webapp to /var/www
00:00:30.246053 VMMDev: Guest Log: This vm does not need to setup an http proxy
00:00:30.251200 VMMDev: Guest Log: ATHENA_PROC_NUMBER=2
00:00:30.415911 VMMDev: Guest Log: *** Starting ATLAS job. (PandaID=4002876565 taskID=000649-3) ***
00:02:17.732640 VMMDev: SetVideoModeHint: Got a video mode hint (800x600x0)@(0x0),(1;0) at 0
ID: 6594
Crystal Pellet
Volunteer tester

Message 6595 - Posted: 4 Sep 2019, 14:26:54 UTC

I picked up 2 tasks. The first one was a disaster: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2812406
I had 8 Theory tasks running with a 90% CPU-time cap.
To start the 4-core ATLAS task, I suspended 4 of the Theory tasks. So far so good: ATLAS started running and I could see events being processed in 'top' and on the ALT-F2 console.
Meanwhile I resumed those 4 Theory tasks (the other 4 were still running). To test the ATLAS suspend behaviour, I then suspended the ATLAS task.
The 4 Theory tasks started running again immediately, but the ATLAS task had to create a saved state in the background, and that took too long:
00:28:18.645496 Changing the VM state from 'RUNNING' to 'SUSPENDING'
00:28:18.922246 PDMR3Suspend: 276 703 318 ns run time
00:28:18.922269 Changing the VM state from 'SUSPENDING' to 'SUSPENDED'
00:28:18.922300 Console: Machine state changed to 'Paused'
00:28:39.638447 Console: Machine state changed to 'Saving'
00:28:39.640394 Changing the VM state from 'SUSPENDED' to 'SAVING'

00:29:34.186831 SSM: Successfully saved the VM state to 'D:\Boinc1\slots\16\boinc_a33779c40ba690f5\Snapshots\2019-09-04T10-32-47-743120300Z.sav'
00:29:34.186860 Changing the VM state from 'SAVING' to 'SUSPENDED'
00:29:34.186909 Console::powerDown(): A request to power off the VM has been issued (mMachineState=Saving, InUninit=0)
00:29:34.608809 VRDP: Logoff: AH1 (127.0.0.1) build 9600. User: [] Domain: [] Reason 0x0001.
00:29:34.608975 VRDP: Connection closed: 1
00:29:34.613116 VMMDevNotifyGuest: fAddEvents=0x100 ignored because enmVMState=15
00:29:34.613148 VBVA: VRDP acceleration has been disabled.
00:29:34.991018 VRDP: TCP server closed.
00:29:35.467539 Changing the VM state from 'SUSPENDED' to 'POWERING_OFF'

00:29:36.400340 Changing the VM state from 'POWERING_OFF' to 'OFF'

Although the log says a saved state was created, it wasn't there. Probably the wrapper cleaned it up because the save took longer than 60 seconds.
The VM was not left in a Saved state, but in an Aborted state.
After the resume it looks as if it started with a fresh boot, but at the end no HITS file was created.
So despite all your work to shrink the image, saving the VM is not any faster.

The second task, https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2812254, I tested with only 4 Theory tasks running.
The save to disk took 42 seconds and it produced a HITS file at the end.
ID: 6595