Message boards : News : Migrating to vLHC@home

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 1653 - Posted: 27 Jan 2016, 14:59:44 UTC

As most of you already know, the aim of this project was to get the CMS application to a point where it was mature enough to be added to the vLHC@home project as a beta app. We believe that we have now reached that point and would like to migrate our activity to that project.

The CMS beta application, which should be identical to this one, is now available in vLHC@home. Of the 190 volunteers who have credit, 89% already have a vLHC@home account. Could everyone who is running here please try out the CMS beta application from vLHC@home? To do this, you will need to go to the vLHCathome preferences and enable CMS Simulation.

http://lhcathome2.cern.ch/vLHCathome/prefs.php?subset=project

If the beta app is working for you, please stop running here. Once most have migrated, no new tasks will be created and the accumulated credit can be migrated from here to vLHC@home.

Please post any comments or issues relating to the migration in this thread.

Thanks to everyone who has supported this project and enabled us to get to where we are today.

Laurence
ID: 1653 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 965
Credit: 1,201,381
RAC: 0
Message 1654 - Posted: 27 Jan 2016, 15:16:07 UTC - in response to Message 1653.  
Last modified: 27 Jan 2016, 15:20:41 UTC

The major issue with this project is that hosts can produce failure after failure without any notification.
How is this addressed in vLHC?
ID: 1654 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 1655 - Posted: 27 Jan 2016, 15:21:54 UTC - in response to Message 1654.  

We are working on improving the shutdown mechanism and communication with the BOINC client. A new version of the app will be created that uses the completion_trigger_file. It will hopefully:

    * Gracefully shut down the VM (no more killing of the last job)
    * Shut down on error and report an error message
    * Shut down when there are no more jobs
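
As a rough illustration of the three points above, here is a minimal sketch of how a script inside the VM might signal the wrapper by writing a trigger file. It assumes the file layout commonly described for vboxwrapper's completion_trigger_file (first line an exit code, second line a flag saying whether the remaining text is a notice for the volunteer, then free text). The helper names, the default file name and the shared directory are hypothetical, and the layout should be checked against the wrapper documentation before relying on it.

```python
from pathlib import Path


def format_trigger_text(exit_code, message="", is_notice=False):
    """Build the trigger file content.

    Assumed layout (not confirmed in this thread): line 1 is the exit code,
    line 2 is 1 if the remaining text is a notice for the volunteer and 0
    otherwise, and any further lines are free text for the task output.
    """
    lines = [str(int(exit_code)), "1" if is_notice else "0"]
    if message:
        lines.append(message)
    return "\n".join(lines) + "\n"


def write_trigger_file(shared_dir, exit_code, message="", is_notice=False,
                       filename="completion_trigger"):
    """Write the trigger file into the directory shared with the host.

    The default file name is hypothetical; it has to match whatever name the
    project configures as completion_trigger_file.
    """
    path = Path(shared_dir) / filename
    path.write_text(format_trigger_text(exit_code, message, is_notice))
    return path
```

Under these assumptions, an exit code of 0 with a message such as "No more jobs in the queue" would cover the last point, and a non-zero exit code with an error message would cover the shutdown-on-error case.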

ID: 1655 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 965
Credit: 1,201,381
RAC: 0
Message 1656 - Posted: 27 Jan 2016, 15:29:33 UTC - in response to Message 1655.  

Thanks, Laurence.
Would it not be wiser to test it here first?
Maybe with a smaller batch?

Here you have people who pay a bit more attention if something goes wrong.
ID: 1656 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1093
Credit: 6,893,316
RAC: 0
Message 1657 - Posted: 27 Jan 2016, 15:30:09 UTC

A problem I had was that vLHC@Home downloaded and started two CMS tasks while my CMS-dev task was running. I didn't notice any problem at the time. The CMS-dev job finished shortly after and BOINC fetched a new one, which waited until a SETI@Home job finished before it started. Then everything locked up -- the machine only has 6 GB of RAM. I spent some painful time trying to get a mouse-click in edgeways while the machine was spending 100% CPU in kswapd.
Eventually one of the VMs died, and things became almost free. Luckily I'd set no-new-tasks so it didn't download and start another task! I'm now running with two VMs and two S@H CPU jobs, on the very limit of free RAM.
Subsequent reading of the log files showed that the second VM had a CMS job die in the middle of processing, but it started -- and subsequently finished -- another one. I got no Condor log from the failed job, but it later ran successfully on another host.
So I guess I'd recommend setting NNT in CMS-dev and letting any task time out before activating vLHC jobs. And make sure you have lots of RAM (there is a way of limiting yourself to one task if you do have RAM problems).
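
One way to limit a host to a single task is BOINC's app_config.xml, placed in the project's directory and picked up when the client re-reads its config files. The sketch below only generates the XML text; the application name "CMS" is a placeholder and has to be replaced with the short application name the project actually uses.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, max_concurrent=1):
    """Return app_config.xml text that caps concurrent tasks for one app."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")


# "CMS" is a placeholder, not a confirmed application name.
print(build_app_config("CMS", max_concurrent=1))
```

The printed text can be saved as app_config.xml; with max_concurrent set to 1 the client should run at most one task of that application at a time, which keeps the RAM needed for VMs bounded.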
ID: 1657 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 1658 - Posted: 27 Jan 2016, 15:49:17 UTC - in response to Message 1656.  

Yes, we can continue development here with a few testers but the real jobs and operational focus should shift to vLHC@home.
ID: 1658 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 965
Credit: 1,201,381
RAC: 0
Message 1659 - Posted: 27 Jan 2016, 16:01:38 UTC - in response to Message 1658.  

"Yes, we can continue development here with a few testers but the real jobs and operational focus should shift to vLHC@home."


I understand, but in my opinion, it is too early for that.
Still too many bugs.

But, of course, it is your decision.
ID: 1659 · Rating: 0
Bill Michael
Joined: 21 Sep 15
Posts: 89
Credit: 339,021
RAC: 0
Message 1660 - Posted: 27 Jan 2016, 16:41:10 UTC - in response to Message 1653.  

"If the beta app is working for you, please stop running here. Once most have migrated, no new tasks will be created and the accumulated credit can be migrated from here to vLHC@home."


A) I have no interest in running vLHC, as they have the same (self-created) problems as CMS-Dev, but are "production", and don't appear to even be trying to solve the issues. There are a LOT of issues still here that haven't been solved, so I guess you're giving up on them. Sorry, I'll put up with problems for a Beta project, hoping to help solve them, but I'm not going to spend all the time it takes to deal with VBox, poor task management, network and memory and disk overload, etc., there.

B) "Credit migrated" bothers me a LOT. I hope you rethink this, and quickly. For one thing, that is pretty much totally against the "rules" of BOINC as I understand them (and I was there when they were written in 2003-5, in fact I was tasked by Paul with documenting them for the Wiki). Because 1) if someone (like me) does NOT go over to vLHC, what happens to our credit? We earned it here, so I would hope that it would stay in the statistics sites and not just vanish... there is, after all, a "retired project" category for just this situation, and that is exactly what I naturally assumed would happen when the beta was completed, as it has for many other beta projects - but 2) if someone DOES move to vLHC, does that mean they get credit in BOTH places? That is SURE to attract some attention when it becomes common knowledge, especially from Dr. A... Give a CMS-Dev badge at vLHC if you wish, sure. But credit? No way. Different project. Credit is NOT transferrable! SETI didn't even do it. If the credit was to eventually wind up at vLHC, then CMS should have been a beta application within vLHC, NOT set up as a separate project.

I have absolutely no desire to have a single credit at vLHC, as this would totally screw up my stats, which I have carefully nurtured for over 10 years. Nor would I be happy to see the effort I've given here be lost.
ID: 1660 · Rating: 0
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 538
Credit: 7,547,069
RAC: 1,614
Message 1663 - Posted: 27 Jan 2016, 19:38:33 UTC

All of mine were already set to allow CMS over at vLHC, not to mention that I have been running those 24/7 for almost 5 years and have had no problems there for the last several years.

http://lhcathome2.cern.ch/vLHCathome/top_hosts.php?sort_by=total_credit

I am in the process right now of aborting all the CMS-dev tasks I have left (just for you).
ID: 1663 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 1666 - Posted: 27 Jan 2016, 22:05:04 UTC - in response to Message 1659.  

Hi Rasputin,

Please can you let me know which bugs you think need fixing.

Thanks,

Laurence
ID: 1666 · Rating: 0
[AF>Belgique] MK
Joined: 5 Jan 16
Posts: 3
Credit: 1,385
RAC: 0
Message 1667 - Posted: 27 Jan 2016, 22:14:22 UTC

CMS uses a lot of bandwidth, much more than T4T. It also requires a lot more RAM, which I don't have (or which I need for other programs).

So, is it possible to ask to have only one CMS task at most?
ID: 1667 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 1668 - Posted: 27 Jan 2016, 22:17:47 UTC - in response to Message 1660.  

Hi Bill,

A) vLHC is a project which we hope will run multiple LHC-related applications. If you have any issues with it today, they will be related to the Test4Theory application. Let us know what your main issues are and we will see if we can address them.

B) On this we are happy to be guided by the community. The badge idea is a good one.
ID: 1668 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 1669 - Posted: 27 Jan 2016, 22:21:49 UTC - in response to Message 1667.  

Hi MK,

I think that this is already the case. For those who have more powerful and dedicated setups, we can create multicore versions of the app, e.g. 2 cores, 4 cores, etc.
ID: 1669 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 965
Credit: 1,201,381
RAC: 0
Message 1670 - Posted: 27 Jan 2016, 22:42:46 UTC
Last modified: 27 Jan 2016, 22:44:37 UTC

"We are working on improving the shutdown mechanism and communication with the BOINC client. A new version of the app will be created that uses the completion_trigger_file. It will hopefully:

    * Gracefully shut down the VM (no more killing of the last job)
    * Shut down on error and report an error message
    * Shut down when there are no more jobs"


These points need to be implemented, agreed.
It is also important that suspending and resuming a task for extended periods of time, maybe up to 1 day, works (including a proper resume after an internet interruption).

(Credits should somewhat reflect the work done, not just the time connected to the server.) Not essential, but nice.

The shutdown on error has to be addressed carefully so as not to shut down the task prematurely, as starting a new BOINC task involves a substantial amount of download bandwidth and time.
Individual job failures can and will occur, but if they happen repeatedly they should cause the BOINC task to fail.
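
To make the last point concrete, here is a minimal sketch of one possible rule: tolerate isolated job failures, but fail the task after several failures in a row. The class name and the threshold of 3 are illustrative assumptions, not anything the project has announced.

```python
class JobFailureTracker:
    """Decide when repeated job failures should fail the whole BOINC task."""

    def __init__(self, max_consecutive_failures=3):
        # The default threshold is an arbitrary illustration, not a project setting.
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def record(self, job_succeeded):
        """Record one job result; return True if the task should now be failed."""
        if job_succeeded:
            # Any success resets the streak, so isolated failures are tolerated.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.max_consecutive_failures
```

Counting consecutive rather than total failures means a long-running task is not killed just because a handful of jobs failed over many hours, which avoids the premature shutdown and the costly re-download.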

There should be sufficient support on the message boards for people who have issues; this has been poor on most CERN-related forums (from my observation).

That is all I can think of for now.
Thanks.
ID: 1670 · Rating: 0
Bill Michael
Joined: 21 Sep 15
Posts: 89
Credit: 339,021
RAC: 0
Message 1671 - Posted: 28 Jan 2016, 2:23:53 UTC - in response to Message 1670.  

The main problem is the whole task-versus-job design. Constant (as in once per second, in my experience) polling of a server looking for work, then processing it and returning it, outside of the BOINC framework, which already handles downloads, trickles, and uploads (including following the user's network preferences, automatic retries, etc.), while appearing to BOINC to be one "task", creates innumerable problems. This is in addition to the failure handling (nonexistent AFAIK, since you don't use BOINC's), the credit-for-slot-rental-not-work-done issue, the SSD thrashing which appears to be endemic to VBox, VBox itself (which seems to get 'corrupted' and require project resets way too often), and others. You really need to ASSUME that a user is on dial-up, and only has (slow) internet access for a few minutes a day (or week)! If you can't work with that, then you need to be very explicit in the requirements of the project. On the home page, NOT buried somewhere in the message boards. CPDN manages to move huge amounts of data (including trickles), all within the BOINC framework...

First, REQUIRED fix, before I would even consider running it elsewhere, is to decide how much actual work (number of jobs) a task is going to have, and send it with that many already embedded in it. Forget the 24-hour thing. If you want to reduce load on the server, fine, include "x" jobs in the task, such that the average host would take 24 hours to complete them. If you can handle more load and want shorter tasks, then include "x/4", or whatever. When an individual job is finished, if you wish, use the BOINC trickle mechanism to send back the results. Otherwise, upload them all at task completion. This one change would repair at least 75% of the networking, swap-out, error handling, time estimation, benchmarking, etc. concerns, AND solve any credit calculation issues. (Which apparently never did get dealt with here, in spite of repeated "we'll look at that later" - um, it's well past "later"...)
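
The sizing arithmetic is simple enough to sketch. The average job duration of 90 minutes below is an assumed figure purely for illustration; the real number would have to come from the project's own statistics.

```python
import math


def jobs_per_task(target_task_hours, avg_job_minutes):
    """Number of jobs to embed so an average host needs about target_task_hours."""
    if avg_job_minutes <= 0:
        raise ValueError("avg_job_minutes must be positive")
    # Round down so the task does not overshoot the target, but send at least one job.
    return max(1, math.floor(target_task_hours * 60 / avg_job_minutes))


# Assumed figure for illustration only: an average job takes 90 minutes.
x = jobs_per_task(24, 90)       # jobs in a 24-hour task
shorter = jobs_per_task(6, 90)  # roughly x/4 for a 6-hour task
print(x, shorter)
```

With a fixed job count per task, the time estimate and the credit can both be derived from the number of jobs actually completed.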

The issues that remain are probably VBox related, and I have no idea how to deal with those. Those are, however, enough to ensure that I won't be signing up for any future projects that require VBox, unless I see on the boards that they've all been worked out. When my Linux box told me I had to recompile something in order to get VBox working again after a simple OS update, and CMS was saying VBox wasn't there, well, when I finished laughing hysterically, I did it BECAUSE it was for CMS-Dev (a beta project). Same when I first set up my MacBook and CMS couldn't "see" VBox until I'd updated it like three times to new versions. If any of this had been for a "production" project, I would have detached and never come back. Or bitched on the message boards until the PROJECT solved the problem (probably by providing a link to the needed, precompiled, software, or at least an installer script and package - NOT by providing instructions on how to find some source code on the web somewhere and then compile something).

I haven't hit any memory problems with CMS, but most of my hosts have plenty - I have seen tasks from other projects on rare occasion in a "waiting for memory" state instead of "waiting to run", so BOINC obviously has some method of dealing with this. (New since I last looked at the code, which was 10 years ago, or I just didn't see it.) I'm sure it's related to the "use x% memory when computer is in use/not in use" setting, as the times I've seen the message are when I just walked up and moved the mouse. I've never seen any project lock up a machine as Ivan described vLHC-CMS doing to him. It _had_ to ignore the user preferences to be able to do that. (Which VBox does anyway on thread priorities, but that's another subject.)
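
As a back-of-the-envelope check (not BOINC's actual scheduling logic), the sketch below estimates how many tasks fit under a "use at most x% of memory" preference. The 2048 MB per VM is an assumed figure for illustration; the 6 GB host matches the machine Ivan described.

```python
def tasks_that_fit(total_ram_mb, max_ram_percent, task_ram_mb):
    """Return how many tasks of a given size fit under a RAM percentage limit."""
    budget_mb = total_ram_mb * max_ram_percent / 100
    return int(budget_mb // task_ram_mb)


# Assumed figures: a 6 GB host and 2048 MB per VM.
print(tasks_that_fit(6144, 50, 2048))  # limit while the computer is in use
print(tasks_that_fit(6144, 90, 2048))  # limit while the computer is idle
```

Under these assumptions a 6 GB host has room for one VM at a 50% limit and two at 90%, so three VMs plus other work would not fit if the preference were honoured.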

I had the discussion with Ivan on the host failures issue, and the fact that he had to ask the volunteer (which was me several times, I freely admit) to reset, disconnect, desist, bypass security software, run fewer hosts per bandwidth, etc. - these are all things that in a production environment must be fixed on YOUR end, NOT ours! It's fine in a beta project to run into problems that require us to do something - it isn't tolerable in a production project. That's where vLHC has fallen down, from what I can tell; the volunteers are responsible for "doing something" way too often, other than just signing up for the project and watching credits roll in. Some, it is true, have no problems. Others aren't so lucky, and getting help on the message boards is iffy. Yes, I am a programmer. Have been since FORTRAN was state-of-the-art. But on BOINC, at least for the non-beta projects, I try NOT to be that, I try to be a "typical user". That means complaining a lot. :-) And THAT is something I know I'm good at! (Looked around the various boards the other day and realized that combined, I have several THOUSAND postings... and that's with an 8-year hiatus from BOINC!)

I really do wish you luck. And applaud Ivan for his patience, diligence, and support on the boards. With two or three more of him, you might be okay.
ID: 1671 · Rating: 0
[AF>Belgique] MK
Joined: 5 Jan 16
Posts: 3
Credit: 1,385
RAC: 0
Message 1672 - Posted: 28 Jan 2016, 2:35:48 UTC - in response to Message 1669.  

Great, thanks. I will add CMS in vLHC then.
ID: 1672 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1010
Credit: 591,548
RAC: 0
Message 1673 - Posted: 28 Jan 2016, 6:38:33 UTC - in response to Message 1671.  

"... the SSD thrashing which appears to be endemic to VBox, VBox itself (which seems to get 'corrupted' and require project resets way too often) ..."

Using the recommended BOINC version 7.6.22 will substantially decrease the number of problems here.
For your Win10 machine: there is a known issue with the combination of Win10 and VBox 5.0.10. Upgrade VBox on your Win10 machine to 5.0.12.
ID: 1673 · Rating: 0
tullio
Joined: 17 Aug 15
Posts: 62
Credit: 296,695
RAC: 0
Message 1675 - Posted: 28 Jan 2016, 8:16:14 UTC - in response to Message 1673.  

I am using VBox 5.0.14 on a Windows 10 PC. I have 6 virtual machines, plus the Challenge. On one of the six, running SuSE Linux Leap 42.1, I have installed a second-tier VirtualBox to run vLHC@home.
Tullio
ID: 1675 · Rating: 0
[AF>Le_Pommier] Jerome_C2005
Joined: 17 Mar 15
Posts: 28
Credit: 278,244
RAC: 0
Message 1681 - Posted: 28 Jan 2016, 11:54:04 UTC
Last modified: 28 Jan 2016, 11:54:26 UTC

Comments regarding the virulent comments above:

- Other projects, like SETI or Einstein, have dedicated and separate beta projects, so why not vLHC?

- Other projects have "migrated credits": CSG did so from 3 different BOINC projects. Nobody died.
ID: 1681 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 1682 - Posted: 28 Jan 2016, 12:27:30 UTC - in response to Message 1681.  

Jerome,

Yes, we should have a dev project alongside the production project, but we would like to have no more than two projects. As highlighted, communication and support are very important, and by focusing on one forum/project we can hopefully do a better job than if we had to monitor many forums/projects.
ID: 1682 · Rating: 0

©2020 CERN