Message boards : Theory Application : Native Theory Application in Production
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 327,073
RAC: 133
Message 6349 - Posted: 3 May 2019, 21:55:51 UTC - in response to Message 6337.  

delay_bound for vbox tasks seems to be 2592000 (30 days).
If nobody disagrees, I would suggest setting it to 432000 (5 days) or 600000 (a bit less than a week).
Are there other limits we are not aware of in this thread (see: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4028&postid=38669)?


May be obsolete due to CP's post.


It was set short because of our experience with looping jobs, but the feedback suggests that the Sherpa jobs may simply be long.
ID: 6349 · Rating: 0
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 473
Credit: 389,411
RAC: 62
Message 6350 - Posted: 4 May 2019, 7:22:28 UTC - in response to Message 6347.  


Needs more knowledge/testing.


From you or us?

On my side first, to ensure there are no typos in my patches and to get enough experience to describe the issue more precisely.
It shouldn't be a showstopper for this version.
ID: 6350 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 431
Message 6351 - Posted: 4 May 2019, 8:37:06 UTC - in response to Message 6348.  

The longest running successful job I could find was 9.85 days, but out of 1,584.735 jobs only 3 were running longer than a week.
That is surprising. Were those jobs fine otherwise?
Those are all successful jobs.

Correction of the thousands separator in my post: 1,584.735 should of course be 1,584,735, because there are no partial jobs.
Meanwhile, 1,593,805 jobs from batch 2279 have succeeded.

A delay_bound of 864000 (10 days) would catch them all. No guarantee for jobs of future batches.
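
As a quick sanity check of the figures in this thread, here is a minimal Python sketch that converts the delay_bound values mentioned above from seconds to days and compares each with the 9.85-day job. The helper names are made up for illustration, and the comparison assumes a task starts running as soon as it is sent, which will not hold on a host with a full queue.

```python
SECONDS_PER_DAY = 86_400

# Longest successful job reported in this thread, in days.
LONGEST_JOB_DAYS = 9.85

# delay_bound values (in seconds) that came up in the discussion.
CANDIDATE_BOUNDS = (432_000, 600_000, 864_000, 2_592_000)


def to_days(seconds):
    """Convert a delay_bound given in seconds to days."""
    return seconds / SECONDS_PER_DAY


def covers(delay_bound_seconds, runtime_days):
    """True if a job of the given runtime fits inside the deadline.

    Assumption: the job starts immediately after it is sent, so the
    whole deadline is available as runtime.
    """
    return to_days(delay_bound_seconds) >= runtime_days


for bound in CANDIDATE_BOUNDS:
    fits = covers(bound, LONGEST_JOB_DAYS)
    print(f"{bound:>9} s = {to_days(bound):5.2f} days, fits 9.85-day job: {fits}")
```

Under that assumption only the 10-day and 30-day values leave room for the longest job seen so far.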

We still have some Sherpa errors (never ending and exceeding the disk limit), but Peter Skands could decide to use only the Sherpa version with a success rate higher than zero. This is under investigation.
ID: 6351 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 327,073
RAC: 133
Message 6354 - Posted: 6 May 2019, 8:38:52 UTC - in response to Message 6319.  

... Is there anything that needs to be addressed before we take it out of beta?

Yes.

1. Task deadlines should be extended to give longrunners a better chance to finish.

The deadline has been increased to 10 days on both dev and prod.

2. A method to write the job parameters to stderr.txt early should be included.
A possible solution can be found here:
https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=467

A new version has been released on dev and will be put on prod once it has been tested on dev.

3. Credits given for Theory native are still very poor, especially compared to Theory vbox.
This may result in a bad acceptance.


This is still not understood.
ID: 6354 · Rating: 0
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 486
Message 6360 - Posted: 9 May 2019, 22:58:32 UTC - in response to Message 6326.  
Last modified: 9 May 2019, 23:00:29 UTC

This is how the scheduling is working at the moment.

(... from the production server). The server will only send Atlas if it can't send Theory, either because of preference settings or because there are no Theory tasks available (as happened a few weeks ago)
This is a "VBox free" host where there have been >50 Theory tasks in a row without the server sending an Atlas task. I thought that the server might be trying to equalise the credit between the sub-projects, but this is greater than the difference in credit, unless it's trying to make up the backlog, which, for me, will take a long time unless the credit for Theory increases considerably.


The scheduling is a bit of a black box.

The current availability of SixTrack work has enabled a wider check.

Of the three LHC subprojects available to this host, the server will only send work for one. The priority order, from highest to lowest, is Theory Native, Atlas, SixTrack.
I.e., if all three are enabled and available, only Theory tasks are sent; if Atlas and SixTrack are enabled and available, only Atlas tasks are sent. SixTrack tasks are only sent if available when only SixTrack is enabled.

Hosts running Theory VBox and SixTrack work as expected (they don't run Atlas).
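
The behaviour observed on this host can be written down as a small Python model. This is only a sketch of what was seen from the client side, not the server's scheduler code; the function name and the set-based inputs are invented for illustration. The fall-through from Theory to Atlas when Theory has no work was observed; whether SixTrack is also sent when Atlas is enabled but has no work was not tested, so that part of the model is an assumption.

```python
# Observed priority order on a host without VBox, highest first.
PRIORITY = ("Theory Native", "Atlas", "SixTrack")


def subproject_sent(enabled, available):
    """Return the one subproject the server appears to send work for.

    enabled   -- subprojects switched on in the host's preferences
    available -- subprojects that currently have tasks on the server

    Hypothetical model: pick the highest-priority subproject that is
    both enabled and available, or None if there is no such subproject.
    """
    for name in PRIORITY:
        if name in enabled and name in available:
            return name
    return None


all_three = {"Theory Native", "Atlas", "SixTrack"}

# All three enabled and available: only Theory tasks arrive.
print(subproject_sent(all_three, all_three))

# Atlas and SixTrack enabled and available: only Atlas tasks arrive.
print(subproject_sent({"Atlas", "SixTrack"}, all_three))

# Only SixTrack enabled: SixTrack tasks arrive.
print(subproject_sent({"SixTrack"}, all_three))
```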
ID: 6360 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 327,073
RAC: 133
Message 6361 - Posted: 10 May 2019, 12:35:30 UTC - in response to Message 6354.  

3. Credits given for Theory native are still very poor, especially compared to Theory vbox.
This may result in a bad acceptance.


This is still not understood.


I have switched to runtime credit for this application. The situation seems to have improved.
ID: 6361 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 327,073
RAC: 133
Message 6362 - Posted: 10 May 2019, 12:39:06 UTC - in response to Message 6360.  


The current availability of SixTrack work has enabled a wider check.

Of the three LHC subprojects available to this host, the server will only send work for one. The priority order, from highest to lowest, is Theory Native, Atlas, SixTrack.
I.e., if all three are enabled and available, only Theory tasks are sent; if Atlas and SixTrack are enabled and available, only Atlas tasks are sent. SixTrack tasks are only sent if available when only SixTrack is enabled.

Hosts running Theory VBox and SixTrack work as expected (they don't run Atlas).


This is a project issue rather than an issue with the Native Theory app. Nothing will change until after the Pentathlon (19 May).
ID: 6362 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 327,073
RAC: 133
Message 6363 - Posted: 10 May 2019, 12:40:04 UTC - in response to Message 6319.  

... Is there anything that needs to be addressed before we take it out of beta?

Yes.

1. Task deadlines should be extended to give longrunners a better chance to finish.

2. A method to write the job parameters to stderr.txt early should be included.
A possible solution can be found here:
https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=467

3. Credits given for Theory native are still very poor, especially compared to Theory vbox.
This may result in a bad acceptance.


I believe that all these issues are now resolved. Unless there is anything else, I will take it out of Beta on Monday.
ID: 6363 · Rating: 0


©2024 CERN