Ready For Production
Author | Message |
---|---|
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 87 |
Over the past few weeks there have been a number of improvements made to the CMS application. With the exception of the suspend/resume issue, what needs to be addressed before this is ready for production? Ready for production means that the quality/experience of the CMS application is comparable to the Theory application that is currently running in vLHC@home. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
The most important thing, I believe, is the host backoff, which has not been addressed. There were also a number of "stage-out errors", where the CMS run and Condor exit statuses were 0 but the server listed stage-out errors. We would need a "clean run", where everybody is running the same version and there is no "fiddling" with the server until a whole batch completes, to get some realistic error figures. It is also important to error out a BOINC task if too many job failures occur. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,946,836 RAC: 2,933 |
"It is also important to error out a BOINC task if too many job failures occur." I agree, and I can see a few ways of doing it, depending on what facilities are available client-side. One would be, at the end of a glide-in "run", to count the number of non-zero exit statuses and terminate the task if a threshold is exceeded (I'd argue for threshold=1). You'd need to make sure that credit was still issued for CPU time spent (I've noticed that if I abort a task, I get no credit for it even if I'd run several jobs successfully before then). |
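The threshold check suggested above could be sketched roughly as follows. This is only an illustration, not the project's actual implementation: the function names and the sample exit statuses are hypothetical, and "threshold exceeded" is assumed to mean "number of failed jobs is at least the threshold", so that threshold=1 ends the task on the first failure.

```python
from typing import Iterable


def count_failures(exit_statuses: Iterable[int]) -> int:
    """Count the jobs in one glide-in run that ended with a non-zero exit status."""
    return sum(1 for status in exit_statuses if status != 0)


def should_terminate(exit_statuses: Iterable[int], threshold: int = 1) -> bool:
    """Return True if the task should be ended after this run.

    Assumption: "threshold exceeded" is read as failures >= threshold,
    so threshold=1 ends the task as soon as one job has failed.
    """
    return count_failures(exit_statuses) >= threshold


# Hypothetical exit statuses collected at the end of one glide-in run.
run_statuses = [0, 0, 1, 0]
print(should_terminate(run_statuses))               # one failure, threshold 1
print(should_terminate(run_statuses, threshold=2))  # one failure, threshold 2
```

How the task is actually ended, and how credit for the CPU time already spent is preserved, is outside this sketch.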
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 87 |
Hopefully the host backoff and too-many-job-failures protections have now been implemented. Whether they work as expected in the wild, with all the random situations, remains to be seen. |
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 87 |
It is Friday afternoon so I will keep my hands off for a while. Let's see how stable it is over the weekend. |
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 87 |
The efficiency over the weekend was 92% both in terms of jobs and wall-clock time. |
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 87 |
The efficiency over the weekend was 92.7% in terms of jobs and 94.34% in terms of wall-clock time, with stage-out errors accounting for 75% of the inefficiency. |
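For readers wondering how figures like these are derived, here is a minimal sketch of the arithmetic. It assumes that efficiency means successful divided by total (jobs or wall-clock seconds) and that a cause's share of the inefficiency is its losses divided by all losses; the post does not say which of the two measures the 75% refers to. The function names and all the numbers below are made up for illustration and are not the project's data.

```python
def efficiency(successful: float, total: float) -> float:
    """Percentage of the total (jobs or wall-clock seconds) that was successful."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successful / total


def share_of_inefficiency(lost_to_cause: float, successful: float, total: float) -> float:
    """Percentage of everything that was lost which is attributed to one cause."""
    lost = total - successful
    if lost <= 0:
        raise ValueError("nothing was lost, so there is no inefficiency to split")
    return 100.0 * lost_to_cause / lost


# Hypothetical batch: 200 jobs, 190 successful, 6 of the 10 failures were stage-out errors.
print(efficiency(190, 200))                # job efficiency in percent
print(share_of_inefficiency(6, 190, 200))  # stage-out share of the failures in percent
```

The same two functions apply to wall-clock time by passing seconds instead of job counts.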
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 87 |
The efficiency over the past 24 hours was 96.67% in terms of jobs and 99.35% in terms of wall-clock time, with no stage-out errors reported. |
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 87 |
The efficiency over the past 24 hours was 96.58% in terms of jobs and 99.4% in terms of wall-clock time, with no stage-out errors reported. Will re-enable the beta tasks. |
Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,116 |
"Will re-enable the beta tasks." The number of running jobs is rising rapidly. Ivan has to feed the monster more often or with bigger portions ;) |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
By default, the beta tasks run 2 at a time. I find that questionable, as they consume a lot of memory. Why is that done, if it was not even possible here? The issues here have not been resolved, yet they are put on production again. Déjà vu? I think we had this before. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,946,836 RAC: 2,933 |
"Will re-enable the beta tasks." I'll do my best... :-0! |
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 87 |
The number of tasks is part of the project configuration and I am not sure if you can set this per app. Please post this request in the respective project forum: http://lhcathome2.cern.ch/vLHCathome/forum_forum.php?id=28 As far as I am aware, the only major issue outstanding is the suspend/resume issue. Looking at how well it was running over the weekend, it seems to be in good shape now. |
Joined: 20 May 15 Posts: 217 Credit: 5,876,910 RAC: 16,233 |
Laurence, has the problem of insufficient upload bandwidth been sorted out now? If the upload fails once, does the run end (along with the BOINC task), so that the PC has to start a new BOINC task? Does work already completed before that upload failure get credited? Sorry, I have not been following as closely and just wanted to know if this is now the case. |
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 87 |
I believe that Ivan is sending jobs with smaller output files. Since this change we have not seen many stage-out errors. Credit is given for failed jobs. Note that over the weekend 99.4% of walltime used was successful. |
Joined: 20 May 15 Posts: 217 Credit: 5,876,910 RAC: 16,233 |
"I believe that Ivan is sending jobs with smaller output files. Since this change we have not seen many stage-out errors. Credit is given for failed jobs. Note that over the weekend 99.4% of walltime used was successful." But no credit if no work is actually done (nothing uploaded)? |
Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,116 |
"I believe that Ivan is sending jobs with smaller output files." As far as I have noticed, it makes no difference measured over a whole day. Current jobs have about 67 MB to upload as the major file, but when a job is twice as long the upload file is also twice as big. I prefer the shorter ones because less is wasted when a job doesn't finish for whatever reason. |
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 87 |
Credit will still be given if the upload fails. |
Joined: 20 May 15 Posts: 217 Credit: 5,876,910 RAC: 16,233 |
"Credit will still be given if the upload fails." To be pedantic (for a change): if no work is done and nothing is uploaded, will credit still be given when a BOINC task completes? |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Credits are so minimal that they are not worth talking about. 3-5 credits per 6-hour BOINC task? |