Message boards : CMS Application : Ready For Production

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 2193 - Posted: 3 Mar 2016, 20:18:27 UTC

Over the past few weeks there have been a number of improvements made to the CMS application. With the exception of the suspend/resume issue, what needs to be addressed before this is ready for production? Ready for production means that the quality/experience of the CMS application is comparable to the Theory application that is currently running in vLHC@home.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 965
Credit: 1,201,500
RAC: 5
Message 2195 - Posted: 3 Mar 2016, 20:47:47 UTC
Last modified: 3 Mar 2016, 20:50:37 UTC

The most important thing, I believe, is the host backoff, which has not been addressed.

There were also a number of "stage-out errors", where the cms-run and Condor exit statuses were 0 but the server listed stage-out errors.

We would need a "clean run", where everybody is running the same version and there is no "fiddling" with the server until a whole batch completes, to get some realistic error figures.

It is also important to error out a BOINC task if too many job failures occur.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1093
Credit: 6,893,316
RAC: 0
Message 2210 - Posted: 4 Mar 2016, 8:47:55 UTC - in response to Message 2195.  

> It is also important to error out a BOINC task if too many job failures occur.

I agree, and I can see a few ways of doing it, depending on what facilities are available client-side. One would be, at the end of a glide-in "run", to count the number of non-zero exit statuses and terminate the task if a threshold is exceeded (I'd argue for threshold=1). You'd need to make sure that credit was still issued for CPU time spent (I've noticed that if I abort a task, I get no credit for it, even if I'd run several jobs successfully before then).
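
The check described in this post could look roughly like the sketch below. It is only an illustration, not the project's actual implementation: the names `RunVerdict` and `check_run` are hypothetical, and "threshold=1" is read here as "terminate on the first failed job", which is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunVerdict:
    failed_jobs: int      # number of jobs with a non-zero exit status
    terminate_task: bool  # True if the BOINC task should be ended


def check_run(exit_statuses: Iterable[int], threshold: int = 1) -> RunVerdict:
    """Count non-zero exit statuses from one glide-in run and decide whether to stop."""
    failed = sum(1 for status in exit_statuses if status != 0)
    # Assumption: reaching the threshold is enough to terminate,
    # so threshold=1 stops the task after a single failed job.
    return RunVerdict(failed_jobs=failed, terminate_task=failed >= threshold)


# Hypothetical exit statuses collected at the end of a run.
print(check_run([0, 0, 1, 0]))
```

Crediting the CPU time already spent would have to be handled separately on the server side; the sketch does not cover that.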

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 2220 - Posted: 4 Mar 2016, 12:36:14 UTC - in response to Message 2195.  

Hopefully the host backoff and too-many-job-failures protections have now been implemented. Whether they work as expected in the wild, with all the random situations, is still to be seen.

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 2227 - Posted: 4 Mar 2016, 15:14:40 UTC - in response to Message 2220.  

It is Friday afternoon so I will keep my hands off for a while. Let's see how stable it is over the weekend.

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 2259 - Posted: 7 Mar 2016, 15:09:48 UTC - in response to Message 2227.  

The efficiency over the weekend was 92% both in terms of jobs and wall-clock time.

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 2383 - Posted: 14 Mar 2016, 8:45:12 UTC - in response to Message 2259.  

The efficiency over the weekend was 92.7% in terms of jobs and 94.34% in terms of wall-clock time with stage-out errors accounting for 75% of the inefficiency.
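
For readers unfamiliar with the two figures, the sketch below shows one plausible way they could be computed. It assumes job efficiency is successful jobs divided by all jobs, and wall-clock efficiency is wall time spent in successful jobs divided by total wall time. The `Job` type, the `efficiencies` function and the sample numbers are hypothetical, not taken from the project's monitoring.

```python
from typing import NamedTuple, Sequence, Tuple


class Job(NamedTuple):
    succeeded: bool
    wall_seconds: float


def efficiencies(jobs: Sequence[Job]) -> Tuple[float, float]:
    """Return (job efficiency, wall-clock efficiency) as fractions between 0 and 1."""
    total_wall = sum(job.wall_seconds for job in jobs)
    if not jobs or total_wall == 0:
        raise ValueError("need at least one job with non-zero wall time")
    good_jobs = sum(1 for job in jobs if job.succeeded)
    good_wall = sum(job.wall_seconds for job in jobs if job.succeeded)
    return good_jobs / len(jobs), good_wall / total_wall


# Made-up sample: three one-hour successes and one job that failed after 20 minutes.
sample = [Job(True, 3600), Job(True, 3600), Job(True, 3600), Job(False, 1200)]
job_eff, wall_eff = efficiencies(sample)
print(f"jobs: {job_eff:.2%}, wall-clock: {wall_eff:.2%}")
```

In this made-up sample the wall-clock figure is higher than the job figure because the failed job ran for less time than the successful ones. That is one possible reason for a gap between the two numbers, not a confirmed explanation of the figures reported here.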

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 2399 - Posted: 15 Mar 2016, 23:36:53 UTC - in response to Message 2383.  

The efficiency over the past 24 hours was 96.67% in terms of jobs and 99.35% in terms of wall-clock time with no stage-out errors reported.

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 2458 - Posted: 20 Mar 2016, 22:48:53 UTC - in response to Message 2399.  

The efficiency over the past 24 hours was 96.58% in terms of jobs and 99.4% in terms of wall-clock time with no stage-out errors reported. Will re-enable the beta tasks.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1010
Credit: 591,653
RAC: 2
Message 2460 - Posted: 21 Mar 2016, 8:34:59 UTC - in response to Message 2458.  

> Will re-enable the beta tasks.

The number of running jobs is rising rapidly. Ivan has to feed the monster more often or in bigger portions ;)

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 965
Credit: 1,201,500
RAC: 5
Message 2461 - Posted: 21 Mar 2016, 8:55:13 UTC - in response to Message 2458.  

By default, the beta tasks run two at a time.
I find that questionable, as they consume a lot of memory.
Why is that done, if it was not even possible here?

The issues here have not been resolved, yet the tasks are being put into production again.

Déjà vu? I think we have had this before.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1093
Credit: 6,893,316
RAC: 0
Message 2462 - Posted: 21 Mar 2016, 9:38:33 UTC - in response to Message 2460.  

>> Will re-enable the beta tasks.

> The number of running jobs is rising rapidly. Ivan has to feed the monster more often or in bigger portions ;)

I'll do my best... :-0!

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 2463 - Posted: 21 Mar 2016, 10:10:40 UTC - in response to Message 2461.  

The number of tasks is part of the project configuration, and I am not sure whether you can set this per app. Please post this request in the respective project forum.

http://lhcathome2.cern.ch/vLHCathome/forum_forum.php?id=28

As far as I am aware the only major issue outstanding is the suspend/resume issue. Looking at how well it was running over the weekend, it seems to be in good shape now.

PDW
Joined: 20 May 15
Posts: 215
Credit: 2,113,209
RAC: 1,048
Message 2465 - Posted: 21 Mar 2016, 10:21:26 UTC - in response to Message 2463.  

Laurence, has the problem of insufficient upload bandwidth been sorted out now?

If the upload fails once, does the run end (along with the BOINC task), so that the PC has to start a new BOINC task?

Does work already completed before that upload failure get credited?

Sorry, I have not been following as closely; I just wanted to know if this was now the case.

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 2466 - Posted: 21 Mar 2016, 10:37:27 UTC - in response to Message 2465.  

I believe that Ivan is sending jobs with smaller output files. Since this change we have not seen many stage-out errors. Credit is given for failed jobs. Note that over the weekend 99.4% of walltime used was successful.

PDW
Joined: 20 May 15
Posts: 215
Credit: 2,113,209
RAC: 1,048
Message 2467 - Posted: 21 Mar 2016, 10:46:08 UTC - in response to Message 2466.  

> I believe that Ivan is sending jobs with smaller output files. Since this change we have not seen many stage-out errors. Credit is given for failed jobs. Note that over the weekend 99.4% of walltime used was successful.

But is no credit given if no work is actually done (nothing uploaded)?

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1010
Credit: 591,653
RAC: 2
Message 2468 - Posted: 21 Mar 2016, 10:50:45 UTC - in response to Message 2466.  

> I believe that Ivan is sending jobs with smaller output files.

As far as I have noticed, it makes no difference when measured over a whole day.

Current jobs have about 67 MB to upload as the major file, but when a job is twice as long, the upload file is also twice as big.
I prefer the shorter ones because there is less waste when a job doesn't finish for whatever reason.
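
The arithmetic behind this point can be sketched as follows. The 67 MB figure is from the post; the one-hour job length is a purely hypothetical assumption (the real job duration is not stated here), as are the function names.

```python
def daily_upload_mb(job_hours: float, mb_per_job_hour: float, hours_per_day: float = 24.0) -> float:
    """Upload volume per day if jobs run back to back and output scales with job length."""
    jobs_per_day = hours_per_day / job_hours
    mb_per_job = mb_per_job_hour * job_hours
    return jobs_per_day * mb_per_job


def worst_case_lost_hours(job_hours: float) -> float:
    """Compute time lost if a job fails just before it would have finished."""
    return job_hours


# Assumption: a 67 MB job takes one hour, so a job twice as long uploads 134 MB.
for hours in (1.0, 2.0):
    print(hours, daily_upload_mb(hours, 67.0), worst_case_lost_hours(hours))
```

Under these assumptions the daily upload volume is the same for both job lengths, while the work lost to a single unfinished job doubles with the job length, which matches the preference for shorter jobs.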

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 2474 - Posted: 21 Mar 2016, 12:44:39 UTC - in response to Message 2467.  

Credit will still be given if the upload fails.

PDW
Joined: 20 May 15
Posts: 215
Credit: 2,113,209
RAC: 1,048
Message 2475 - Posted: 21 Mar 2016, 12:57:43 UTC - in response to Message 2474.  

> Credit will still be given if the upload fails.

To be pedantic (for a change): if no work is done and nothing is uploaded, will credit still be given when a BOINC task completes?

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 965
Credit: 1,201,500
RAC: 5
Message 2477 - Posted: 21 Mar 2016, 14:28:26 UTC - in response to Message 2475.  

Credits are so minimal that they are not worth talking about.
3-5 credits per 6-hour BOINC task?


©2020 CERN