Message boards : Number crunching : An Alternative Approach For VM applications

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 249
Message 5546 - Posted: 27 Sep 2018, 14:41:13 UTC
Last modified: 27 Sep 2018, 14:43:55 UTC

Over the past couple of years we have consolidated towards a common platform comprising a single project for LHC that uses HTCondor to furnish the VMs with jobs. The approach is outlined in this paper for those who wish to know more details. Our boundary condition is that we require Linux and CVMFS. A recent experiment with OpenHTC.io, described in this presentation (and soon in a paper), improves the performance of CVMFS by placing squid caches closer to the volunteer. One of the issues with the current approach is that as the VMs are rebooted, the local CVMFS cache is destroyed, and over time the cache in the image degrades. Another issue is that the potentially long boot time of the VM, together with the 10-minute grace period before shutting down, can in some cases result in 20 minutes of idle time per task. It is for this reason, and to balance the need for regular feedback in terms of both BOINC credit and reliability, that the Tasks are 12-18 hours long and run multiple jobs. ATLAS has shown that it is possible to provide a native application if CVMFS is installed and configured as a prerequisite. This approach results in a persistent CVMFS cache, and jobs are pushed through the boinc server rather than via HTCondor. The job is run as a Singularity container, so this could become a generic approach like the boinc2docker application.
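To put the idle time in proportion, here is a rough sketch of the arithmetic. It assumes the worst case quoted above (20 minutes idle per task) and that the idle time counts as part of the task's wall-clock length; the function name and the 1-hour comparison case are illustrative only.

```python
def idle_fraction(task_hours: float, idle_minutes: float = 20.0) -> float:
    """Fraction of a task's wall-clock time lost to VM boot and the shutdown grace period."""
    return idle_minutes / (task_hours * 60.0)


# Compare a hypothetical short task with the 12-18 hour Tasks mentioned above.
for hours in (1, 12, 18):
    print(f"{hours:>2} h task: {idle_fraction(hours):.1%} idle")
```

Under these assumptions the fixed 20 minutes is a third of a 1-hour task but under 3% of a 12-hour one, which is the trade-off behind the long Tasks.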

The proposal would be that we focus on providing native applications for Linux built upon containers. The purpose of virtualization would therefore be to provide the Linux environment for running the boinc client, CVMFS and the container. For dedicated resources, a VM could be started using any hypervisor. For shared resources, the boinc client on the host could be used to start and manage the VM but the jobs would be managed natively in the guest by another instance of the boinc client.

What are your views on the ATLAS native application and this potential new direction?

maeax
Joined: 22 Apr 16
Posts: 660
Credit: 1,720,123
RAC: 3,061
Message 5547 - Posted: 27 Sep 2018, 17:36:07 UTC

I have SL69 running inside VirtualBox for ATLAS native, for the last 2-3 weeks with openhtc.io.
Since starting ATLAS native there have been some small errors, but I would say more than 95% of tasks succeed on 4 computers.
Very good stability. ATLAS needs a lot of network traffic for download/upload.

It would be a good idea to transfer this know-how to Theory, LHCb or CMS and, if possible, ALICE.
Singularity is needed for Linux distributions other than SL69 or CentOS.

It is a great step to find one way for all LHC-projects.

Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 738
Credit: 11,558,798
RAC: 1,847
Message 5548 - Posted: 27 Sep 2018, 21:56:25 UTC - in response to Message 5547.  

What I, and pretty much every other member, would like is a testing version and a public version that can actually be trusted to work and doesn't get updated and changed over and over in as little as 10 days.

If it was me I would make sure any of these actually worked before sending out hundreds of tasks, only to see hundreds of hours of nothing but Invalids and "computer errors". Then, once it is public, everyone has questions about why the tasks didn't work and whether it was their own mistake, and ends up waiting for a factual response.

We had a version working fine here, and as far as I could tell from the logs and the basic stderr's the tasks were running *jobs*. The next thing we know there are several vdi updates that eventually lose computers here and at LHC, and then members just wait to see if anyone gets some Valid tasks before even trying again.

I have been running VB tasks since March 1st 2011, Sixtracks since 2004 and GPUs at Einstein for years, and I only want us to use a project version that works, since MOST members won't want to sit around staring at logs, stderr's and computers as much as I have since 2004, and even before that with seti classic, which I admit was just a waste of time (other than learning everything about computers and programming and building my own).

Let's just get back to work
Mad Scientist For Life

maeax
Joined: 22 Apr 16
Posts: 660
Credit: 1,720,123
RAC: 3,061
Message 5549 - Posted: 28 Sep 2018, 13:12:02 UTC - in response to Message 5546.  


Laurence asked this question, Magic:

What are your views on the ATLAS native application and this potential new direction?

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 5550 - Posted: 28 Sep 2018, 16:08:46 UTC - in response to Message 5546.  

...The proposal would be that we focus on providing native applications for Linux built upon containers. The purpose of virtualization would therefore be to provide the Linux environment for running the boinc client, CVMFS and the container. For dedicated resources, a VM could be started using any hypervisor. For shared resources, the boinc client on the host could be used to start and manage the VM but the jobs would be managed natively in the guest by another instance of the boinc client.

What are your views on the ATLAS native application and this potential new direction?

It is not 100% clear what you meant. I suppose you would not allow the user to log in to your provided VM, else...
In my opinion you could provide a VM with the best/lightest Linux OS for CERN purposes, with all the most-needed components installed (CVMFS, Singularity and BOINC) and no projects attached.
There would be no need for a user without much Linux knowledge to install and configure packages. The VM image can be used with several host OSes and hypervisors.
There is no need to control this VM from the host BOINC, because the VM should be managed by the users themselves, who only have to add the LHC project with their username/password.
They could even add other BOINC projects that only have Linux apps, or where the Linux apps outperform the Windows/Mac apps.
I have a Mint VM and, although I followed the instructions, was not able to get native ATLAS to run, so I gave up.

... you have to run the inside BOINC with the user credentials.

Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 738
Credit: 11,558,798
RAC: 1,847
Message 5551 - Posted: 29 Sep 2018, 1:49:23 UTC

As before, any new idea can be tested at the Atlas Alpha site first and, if found to work, moved here for testing by more people and then on to LHC. As before, I would be testing the Windows version at the Atlas Alpha site, instead of us just getting everyone running versions that may or may not even work and, as I said, ending up with fewer cores working for CERN.

It doesn't take much to get members to just go to a project such as Einstein, where there is rarely any problem and work is done 24/7 by twice as many people and computers.

The main problem with VB is that it can never be trusted to work without problems all the time and, as I already said (many times), that is all it takes for computers and members to go to a project that can be trusted to actually work without being watched all the time.

Right now none of the CERN projects can be trusted to run without problems like the ones we have right now at LHC
Mad Scientist For Life

Ray Murray
Joined: 13 Apr 15
Posts: 138
Credit: 2,945,852
RAC: 1
Message 5552 - Posted: 30 Sep 2018, 18:20:09 UTC

I'm not 100% clear on what's proposed either. [thinking it through while I type so might take a couple of edits]
Not tried the Native ATLAS as I don't run Linux, and anyway my machines really just aren't powerful enough for what ATLAS requires.
I did briefly sample Linux to try some earlier apps here (VMs inside a VM) but couldn't get my head around the command line stuff.

Let's see if I've got this right:
Windows hosts (probably the majority of private users) run Boinc, which controls the VM, inside of which runs the computation, which downloads the relevant science package (Pythia, Sherpa etc.) as required; all of this gets destroyed each time Boinc finishes a Task.
Is the new approach for the host to support a Linux guest VM which contains Boinc, which would download AND STORE the packages and keep them for whenever they are needed next, with Boinc then controlling the Native apps?

Seems there is potential for reduced bandwidth requirement after the initial setup.
Would the user need to set up Boinc within the VM?
How to earn credits if there isn't a requirement for Boinc to end and report Tasks? (Like the Christmas Challenges of a few years ago where we gathered MCPlots but no Boincs.) Might be a turnoff for some.
Would there be issues with the user pausing or shutting down the Guest VM?
One of my machines has a noisy fan so I turn Boinc down to 50% overnight. Could I still do that without getting too Linuxy?

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 249
Message 5553 - Posted: 2 Oct 2018, 13:32:45 UTC - in response to Message 5547.  

Maeax

I have SL69 running inside VirtualBox for ATLAS native, for the last 2-3 weeks with openhtc.io.
Since starting ATLAS native there have been some small errors, but I would say more than 95% of tasks succeed on 4 computers.
Very good stability. ATLAS needs a lot of network traffic for download/upload.

It would be a good idea to transfer this know-how to Theory, LHCb or CMS and, if possible, ALICE.
Singularity is needed for Linux distributions other than SL69 or CentOS.

It is a great step to find one way for all LHC-projects.


This is what we will try to do, most probably starting with Theory.

Regards,

Laurence

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 249
Message 5554 - Posted: 2 Oct 2018, 13:41:32 UTC - in response to Message 5550.  

CP,

It is not 100% clear what you meant. I suppose you would not allow the user to log in to your provided VM, else...
In my opinion you could provide a VM with the best/lightest Linux OS for CERN purposes, with all the most-needed components installed (CVMFS, Singularity and BOINC) and no projects attached.
There would be no need for a user without much Linux knowledge to install and configure packages. The VM image can be used with several host OSes and hypervisors.
There is no need to control this VM from the host BOINC, because the VM should be managed by the users themselves, who only have to add the LHC project with their username/password.
They could even add other BOINC projects that only have Linux apps, or where the Linux apps outperform the Windows/Mac apps.
I have a Mint VM and, although I followed the instructions, was not able to get native ATLAS to run, so I gave up.

... you have to run the inside BOINC with the user credentials.


The idea would be that we start with the Native app for those users who already have Linux and are able to set up CVMFS. Anyone who would like to run Linux in a VM on Windows or Mac can also run this way. We could provide a preconfigured image to simplify this and yes, BOINC inside the VM would always run with the user's credentials. Using the same trick we did in this pull request, we could also offer a Web-based GUI for managing the internal BOINC client without entering the VM.

Regards,

Laurence

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 249
Message 5555 - Posted: 2 Oct 2018, 13:44:55 UTC - in response to Message 5551.  

Hi Magic,

As before, any new idea can be tested at the Atlas Alpha site first and, if found to work, moved here for testing by more people and then on to LHC. As before, I would be testing the Windows version at the Atlas Alpha site, instead of us just getting everyone running versions that may or may not even work and, as I said, ending up with fewer cores working for CERN.

It doesn't take much to get members to just go to a project such as Einstein, where there is rarely any problem and work is done 24/7 by twice as many people and computers.

The main problem with VB is that it can never be trusted to work without problems all the time and, as I already said (many times), that is all it takes for computers and members to go to a project that can be trusted to actually work without being watched all the time.

Right now none of the CERN projects can be trusted to run without problems like the ones we have right now at LHC


Yes, this is one of the issues we are trying to address. It should work without any problems and be easier than it is at the moment. Only by doing this will we attract more volunteers. Sixtrack runs well and we regularly get over 400K parallel tasks, but for Theory this is about 4K.

Regards,

Laurence

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 249
Message 5556 - Posted: 2 Oct 2018, 13:49:15 UTC - in response to Message 5552.  

Ray,

I'm not 100% clear on what's proposed either. [thinking it through while I type so might take a couple of edits]
Not tried the Native ATLAS as I don't run Linux, and anyway my machines really just aren't powerful enough for what ATLAS requires.
I did briefly sample Linux to try some earlier apps here (VMs inside a VM) but couldn't get my head around the command line stuff.

Let's see if I've got this right:
Windows hosts (probably the majority of private users) run Boinc, which controls the VM, inside of which runs the computation, which downloads the relevant science package (Pythia, Sherpa etc.) as required; all of this gets destroyed each time Boinc finishes a Task.
Is the new approach for the host to support a Linux guest VM which contains Boinc, which would download AND STORE the packages and keep them for whenever they are needed next, with Boinc then controlling the Native apps?
Basically yes.

Seems there is potential for reduced bandwidth requirement after the initial setup.
Would the user need to set up Boinc within the VM?
No, this should be automated.

How to earn credits if there isn't a requirement for Boinc to end and report Tasks? (Like the Christmas Challenges of a few years ago where we gathered MCPlots but no Boincs.) Might be a turnoff for some.
Credit comes from the boinc client running in the VM.

Would there be issues with the user pausing or shutting down the Guest VM?
No

One of my machines has a noisy fan so I turn Boinc down to 50% overnight. Could I still do that without getting too Linuxy?

Yes, the BOINC on your host would still be able to control resource utilization.
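For reference, a limit like the 50% Ray describes is normally set through the BOINC Manager, but it corresponds to a local preferences override file. The sketch below builds such a file body. It assumes "50%" means "use at most 50% of the CPUs" (the `max_ncpus_pct` preference); the helper name is made up for illustration, and how a host-side limit would be passed on to the guest is not specified in this thread.

```python
import xml.etree.ElementTree as ET


def prefs_override_xml(max_cpus_pct: float) -> str:
    """Build the body of a global_prefs_override.xml limiting the share of CPUs BOINC may use."""
    root = ET.Element("global_preferences")
    # max_ncpus_pct: use at most this percentage of the available CPUs
    ET.SubElement(root, "max_ncpus_pct").text = f"{max_cpus_pct:.1f}"
    return ET.tostring(root, encoding="unicode")


print(prefs_override_xml(50))
```

The resulting text would go into `global_prefs_override.xml` in the BOINC data directory, and the client can be asked to re-read it with `boinccmd --read_global_prefs_override`.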

Regards,

Laurence

gyllic
Joined: 10 Mar 17
Posts: 40
Credit: 108,345
RAC: 0
Message 5557 - Posted: 2 Oct 2018, 18:17:10 UTC - in response to Message 5546.  

I still have one question:

    - Would you like to change the way jobs are distributed as well? So switch from the HTCondor approach to one similar to ATLAS, where the boinc server is responsible for distributing the jobs?


One of the issues with the current approach is that as the VMs are rebooted, the local CVMFS cache is destroyed and over time the cache in the image degrades.
I just had an idea (don't know if that would work): Why not place the CVMFS cache directory within one of the directories that is shared with the host system and does not get deleted (maybe the scratch directory within the project's folder within the Boinc Data directory) and configure CVMFS to use this directory as cache? When a new VM is started, include this directory and the already existing cache can be used again. This way the cache won't get deleted once the VM has shut down.
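As a minimal sketch of what this idea would mean on the CVMFS side: the client reads its cache location from `CVMFS_CACHE_BASE` (and its size limit in MB from `CVMFS_QUOTA_LIMIT`) in `/etc/cvmfs/default.local`. The mount point below is a hypothetical path for the shared directory inside the guest, and whether a hypervisor shared folder supports everything the CVMFS cache needs is an open question that this does not answer.

```python
def render_cvmfs_local(cache_dir: str, quota_mb: int = 4000) -> str:
    """Render the lines that would point the CVMFS client cache at cache_dir."""
    lines = [
        f"CVMFS_CACHE_BASE={cache_dir}",   # where the client keeps its local cache
        f"CVMFS_QUOTA_LIMIT={quota_mb}",   # soft cache size limit in megabytes
    ]
    return "\n".join(lines) + "\n"


# Hypothetical location of the host-shared directory as seen inside the VM.
SHARED_CACHE = "/mnt/shared/cvmfs-cache"
print(render_cvmfs_local(SHARED_CACHE), end="")
```

The quota value is only a placeholder; it would have to fit within whatever space the shared directory is allowed to use on the host.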

What are your views on the ATLAS native application and this potential new direction?
I really like the native ATLAS app. I think the direction you proposed is good since everything that makes it easier for volunteers to set everything up is good (if that is the case). Keeping it as simple as possible for the volunteers is the most important thing!

Have you considered a similar approach to boinc2docker, so something like boinc2singularity? Again, no idea if that approach would work: provide a very small VM image that runs CVMFS (again, place the CVMFS cache directory into a shared directory if that's possible) and is able to execute a Singularity image. Then download and execute the prebuilt Singularity image, which then does the actual calculations.
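To make the "execute a Singularity image" step concrete, here is a rough sketch of the command such a wrapper might assemble. It only builds and prints the command line; the image path and job script are hypothetical placeholders, and this is not taken from the ATLAS native app or from boinc2docker.

```python
import shlex


def singularity_argv(image: str, command: list[str], cvmfs_mount: str = "/cvmfs") -> list[str]:
    """Build a 'singularity exec' command that makes the host's CVMFS mount visible in the container."""
    return ["singularity", "exec", "--bind", cvmfs_mount, image, *command]


# Hypothetical image and job script, for illustration only.
argv = singularity_argv("/path/to/job-image.sif", ["./run_job.sh"])
print(shlex.join(argv))
```

Binding `/cvmfs` from the VM into the container is what would let the job see the software repositories while the cache stays outside the container.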

Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 738
Credit: 11,558,798
RAC: 1,847
Message 5558 - Posted: 3 Oct 2018, 8:03:49 UTC

Thanks for your reply Laurence
Mad Scientist For Life

Thund3rb1rd
Joined: 20 Jun 16
Posts: 20
Credit: 1,590,002
RAC: 88
Message 5560 - Posted: 4 Oct 2018, 2:51:58 UTC

I don't/won't run LINUX so this great idea sounds like a really efficient way to lose volunteers like me. I've already trashed CPDN because it has become an unstable mess and no one over there seems to be able to fix it. Looks to me like LHC is reading their roadmap to oblivion.

Why is it that projects reach a point of reliable stability, then someone gets a wild hare to start dinking around with the architecture?

I am really sorry to read you will start messing with Windows after this LINUX thing is in place.

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 249
Message 5563 - Posted: 4 Oct 2018, 8:34:31 UTC - in response to Message 5557.  

Hi gyllic,

I still have one question:

    - Would you like to change the way jobs are distributed as well? So switch from the HTCondor approach to one similar to ATLAS, where the boinc server is responsible for distributing the jobs?


Yes.

I just had an idea (don't know if that would work): Why not place the CVMFS cache directory within one of the directories that is shared with the host system and does not get deleted (maybe the scratch directory within the project's folder within the Boinc Data directory) and configure CVMFS to use this directory as cache? When a new VM is started, include this directory and the already existing cache can be used again. This way the cache won't get deleted once the VM has shut down.
Great idea! I am not 100% sure that will work but will check with the developers. I considered using a separate image for the cache, but multiple VMs cannot share the same image file and currently the slot directories are wiped after each job. The problem is therefore the same as trying to reuse the image file that we have now. There is still the overhead of the time to boot the VM.
I really like the native ATLAS app. I think the direction you proposed is good since everything that makes it easier for volunteers to set everything up is good (if that is the case). Keeping it as simple as possible for the volunteers is the most important thing!

Have you considered a similar approach to boinc2docker, so something like boinc2singularity? Again, no idea if that approach would work: provide a very small VM image that runs CVMFS (again, place the CVMFS cache directory into a shared directory if that's possible) and is able to execute a Singularity image. Then download and execute the prebuilt Singularity image, which then does the actual calculations.

Yes, boinc2docker is an interesting idea but our boundary condition is that we are tied to CVMFS. Something like boinc2singularity is what we are after.

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 249
Message 5564 - Posted: 4 Oct 2018, 8:40:35 UTC - in response to Message 5560.  
Last modified: 4 Oct 2018, 9:56:02 UTC

I don't/won't run LINUX so this great idea sounds like a really efficient way to lose volunteers like me. I've already trashed CPDN because it has become an unstable mess and no one over there seems to be able to fix it. Looks to me like LHC is reading their roadmap to oblivion.

Why is it that projects reach a point of reliable stability, then someone gets a wild hare to start dinking around with the architecture?

I am really sorry to read you will start messing with Windows after this LINUX thing is in place.


Our view is that improvements are needed. The evidence is that Sixtrack gets 100 times more resources than the VM apps, Yeti wrote a check list and Magic watches his machines. Today people's expectations are that it should work like buying an app from an app store. If we lose you we have done something wrong.

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 249
Message 5565 - Posted: 5 Oct 2018, 9:29:58 UTC - in response to Message 5563.  

Something like boinc2singularity is what we are after.


Actually more like boinc2runc

gyllic
Joined: 10 Mar 17
Posts: 40
Credit: 108,345
RAC: 0
Message 5567 - Posted: 7 Oct 2018, 8:51:42 UTC - in response to Message 5563.  

There is still the overhead of the time to boot the VM.
Obviously, there will always be the overhead of booting and shutting down the VM as long as you have to use one. Since boinc2docker, for example, also uses VMs, it has the same overhead.

Actually more like boinc2runc
Nice!

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 5568 - Posted: 9 Oct 2018, 13:02:59 UTC

It seems that the recent CMS beta app makes use of an additional layer, which makes the whole system more complex instead of less complex.
How does this fit together?


Will the new native apps be flexible enough to run on user-provided VMs (like maeax's ATLAS setup) or on plain Linux installations?

gyllic
Joined: 10 Mar 17
Posts: 40
Credit: 108,345
RAC: 0
Message 5633 - Posted: 12 Nov 2018, 11:29:33 UTC

Any news regarding this topic?