Message boards : ATLAS Application : ATLAS Monitoring v3
computezrmle
Joined: 28 Jul 16
Posts: 404
Credit: 374,791
RAC: 0
Message 6805 - Posted: 8 Nov 2019, 21:47:12 UTC

To prepare ATLAS Monitoring v3, please find below a simulation of the modified screenshot.
While terms like "standard deviation" or "arithmetic mean" are well defined in scientific environments, other terms and phrases, e.g. "uncertainty", may not be.
Hence comments from native speakers (English) are welcome (Are there terms/phrases that fit better?).
Of course, others' comments are also welcome.

Uncertainty in this context should give an impression of how "good" the time-left estimate might be. It is derived from:
- standard deviation of event runtimes
- number of events not yet finished
- number of workers
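
The thread does not give the formula, so the following Python sketch is only one plausible reading of the three inputs listed above. It assumes that the remaining events are spread evenly across the workers, that event runtimes are independent, and that the spread of a sum of n runtimes grows with sqrt(n). The function name `estimate_time_left` and its parameters are hypothetical.

```python
import math
import statistics


def estimate_time_left(runtimes_s, events_total, workers):
    """Return (time_left_s, uncertainty_s) from finished event runtimes.

    Assumption (not from the real script): each worker still has to
    process about remaining / workers events one after another, and
    the runtimes are independent of each other.
    """
    finished = len(runtimes_s)
    remaining = max(events_total - finished, 0)
    if finished == 0 or workers <= 0:
        return None, None  # nothing to base an estimate on yet

    mean = statistics.fmean(runtimes_s)
    sd = statistics.pstdev(runtimes_s)  # 0.0 when only one event finished

    per_worker = remaining / workers
    time_left = mean * per_worker
    uncertainty = sd * math.sqrt(per_worker)
    return round(time_left), round(uncertainty)
```

With more unfinished events or fewer workers, both values grow, which matches the idea that the estimate gets tighter as the task nears its end.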

Event status lists the different task phases:
worker 3: N/A # directly after a task has started
worker 4: ... # the logfile shows that event 1 has been started but is not yet finished
worker x: ... # values from the last finished event
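
As a rough illustration of the three phases listed above, the Python sketch below maps a worker's state to a display line. The function `worker_status_line` and its arguments are hypothetical; the real script derives this information from the logfile in a way the thread does not describe.

```python
def worker_status_line(worker, event_nr=None, runtime_s=None):
    """Format one status line for a worker thread.

    event_nr is None    -> the task has just started, nothing logged yet
    runtime_s is None   -> the event has started but is not yet finished
    both values present -> values from the last finished event
    """
    if event_nr is None:
        return f"worker {worker}: N/A"
    if runtime_s is None:
        return f"worker {worker}: Event nr. {event_nr} currently processing"
    return f"worker {worker}: Event nr. {event_nr} took {round(runtime_s)} s"
```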

Time values are rounded to integer (s, m) as it makes no sense to be more precise.


Simulated screenshot:
*********************************************************************
*                  ATLAS Event Progress Monitoring                  *
*                              v3.y.z                               *
*              last display update (VM time): 16:10:23              *
*********************************************************************
Number of events                  :
   to be processed                :                               200
   already finished               :                               119
Event runtimes                    :
   arithmetic mean                :                            1033 s
   standard deviation             :                              16 s
Estimated time left               :
   total                          :      overdue      0 d   5 h  44 m
   uncertainty                    :                   0 d   0 h   5 m
---------------------------------------------------------------------
Event status per worker thread:
worker 1: Event nr. 30 took 1385 s
worker 2: Event nr. 29 took 978 s
worker 3: N/A
worker 4: Event nr. 1 currently processing
worker 5: Event nr. 30 took 807 s
worker 6: Event nr. 30 took 1131 s
---------------------------------------------------------------------
Calculation completed. Preparing HITS file ...

maeax
Joined: 22 Apr 16
Posts: 601
Credit: 1,451,312
RAC: 1,368
Message 6807 - Posted: 9 Nov 2019, 6:38:03 UTC - in response to Message 6805.  

Hence comments from native speakers (English) are welcome (Are there terms/phrases that fit better?).
Of course, others' comments are also welcome.

Less is more. v2.2.0 is simple, readable and more than we need.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1147
Credit: 754,546
RAC: 10
Message 6808 - Posted: 9 Nov 2019, 10:55:57 UTC - in response to Message 6805.  

Time values are rounded to integer (s, m) as it makes no sense to be more precise.
That's an improvement.

Estimated time left               :
   total                          :      overdue      0 d   5 h  44 m
   uncertainty                    :                   0 d   0 h   5 m
overdue is just for the example, I suppose.

Make it simple is my comment.

Just call 'arithmetic mean' average and replace 'standard deviation' by 2 lines: minimum and maximum event run time of all done tasks.

The runtime during a task may and will vary a lot by itself and by the load of the host too, so it makes no sense to be very precise.
Is the host always under 100% load, or are there periods when only the ATLAS task is running?
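
A minimal Python sketch of the summary suggested in this post (average plus minimum and maximum instead of standard deviation). The function name `summarize_runtimes` is hypothetical and the sample values are invented for illustration only.

```python
import statistics


def summarize_runtimes(runtimes_s):
    """Return (mean, minimum, maximum) of finished event runtimes in whole seconds."""
    if not runtimes_s:
        return None  # no event has finished yet
    return (
        round(statistics.fmean(runtimes_s)),
        round(min(runtimes_s)),
        round(max(runtimes_s)),
    )


summary = summarize_runtimes([678, 945, 1035, 1482])  # invented sample data
if summary is not None:
    mean, low, high = summary
    print(f"arithmetic mean  : {mean} s")
    print(f"min / max        : {low} / {high} s")
```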

computezrmle
Joined: 28 Jul 16
Posts: 404
Credit: 374,791
RAC: 0
Message 6810 - Posted: 10 Nov 2019, 20:23:46 UTC

Thanks for the comments made so far.
Over the weekend I modified the monitoring script so it will produce the output below.

Explanation

arithmetic mean
I'd prefer this term instead of average as it explains exactly which average is used here.
But since it is only a string it can easily be replaced.

standard deviation
I replaced it with "min / max" following CP's request.
I agree that the latter may be more meaningful.
In addition sd is used to calculate "uncertainty", hence it would be redundant.

overdue
This term appears only when time left becomes <0, typically when the last event is a longrunner.
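
The behaviour described under "overdue" can be sketched as follows in Python. This is an illustration only: the function name `format_time_left` and the exact spacing are assumptions, not the script's actual code.

```python
def format_time_left(seconds):
    """Format a time-left value as whole hours and minutes.

    A negative value means the estimate has been exceeded, so the
    absolute value is shown with an "overdue" label in front.
    """
    label = "overdue " if seconds < 0 else ""
    minutes_total = round(abs(seconds) / 60)
    hours, minutes = divmod(minutes_total, 60)
    if hours:
        return f"{label}{hours} h {minutes} min"
    return f"{label}{minutes} min"
```

Rounding to whole minutes follows the earlier remark that more precision makes no sense for an estimate like this.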

Simulated screenshot:
***************************************************
*         ATLAS Event Progress Monitoring         *
*                     v3.0.0                      *
*     last display update (VM time): 20:59:12     *
***************************************************
Number of events
   to be processed  :                           200
   already finished :                           125
Event runtimes
   arithmetic mean  :                        1035 s
   min / max        :                  678 / 1482 s
Estimated time left
   total            :                   10 h 46 min
   uncertainty      :                        10 min
---------------------------------------------------
Status of last event per worker thread:
worker 1: Event nr. 62 took 1073 s
worker 2: Event nr. 63 took 1207 s

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1147
Credit: 754,546
RAC: 10
Message 6811 - Posted: 10 Nov 2019, 21:16:58 UTC - in response to Message 6810.  

Over the weekend I modified the monitoring script so it will produce the output below.
Can't wait to see this version going live ;)

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Apr 16
Posts: 154
Credit: 1,352,539
RAC: 84
Message 6818 - Posted: 14 Nov 2019, 8:31:36 UTC - in response to Message 6811.  
Last modified: 14 Nov 2019, 8:33:02 UTC

This version is now live!

Some minor comments on the English:

"to be processed: 200" sounds like this is the number of remaining events, it should be made clearer that this is the total so I would just say "total" here

"overdue" might make people worry that there is something wrong with the task, I'm not sure it's a good idea to have this, or maybe there can be a clearer message what it means

"Event status per worker thread" -> "Processing status per worker thread"

EDIT: Ignore the last comment since this phrase was already fixed on the newest version.

computezrmle
Joined: 28 Jul 16
Posts: 404
Credit: 374,791
RAC: 0
Message 6819 - Posted: 14 Nov 2019, 8:59:15 UTC - in response to Message 6818.  

"to be processed: 200" sounds like this is the number of remaining events, it should be made clearer that this is the total so I would just say "total" here

OK
Will postpone the change until it's decided what to do with the next one.


"overdue" might make people worry that there is something wrong with the task, I'm not sure it's a good idea to have this, or maybe there can be a clearer message what it means

What about "overtime" or "extra time"?
Most people may know that from sports.

Just to mention it, nothing to worry about:
This text field has a fixed width.
If we extend it the whole layout will be a bit wider.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1147
Credit: 754,546
RAC: 10
Message 6820 - Posted: 14 Nov 2019, 9:30:21 UTC - in response to Message 6819.  

What about "overtime" or "extra time"?
Most people may know that from sports.
"overtime" sounds good to me. It's also known as 'overwork', which is the case when the job needs more time.

computezrmle
Joined: 28 Jul 16
Posts: 404
Credit: 374,791
RAC: 0
Message 6822 - Posted: 14 Nov 2019, 10:05:23 UTC

Overtime Message

Forgot to mention:
The same term is shown while the HITS file is in preparation.
Would you prefer to
- leave it blank and just count the remaining time upwards (confusing?)
- display the same term
- display a different term

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1147
Credit: 754,546
RAC: 10
Message 6823 - Posted: 14 Nov 2019, 10:59:25 UTC - in response to Message 6822.  

Overtime Message

Forgot to mention:
The same term is shown while the HITS file is in preparation.
Would you prefer to
- leave it blank and just count the remaining time upwards (confusing?)
- display the same term
- display a different term
When the HITS-file is in preparation, just leave it (overdue) blank and display 0 (zero) for time left and uncertainty. My opinion ;)

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Apr 16
Posts: 154
Credit: 1,352,539
RAC: 84
Message 6824 - Posted: 14 Nov 2019, 13:19:53 UTC - in response to Message 6823.  

Overtime Message

Forgot to mention:
The same term is shown while the HITS file is in preparation.
Would you prefer to
- leave it blank and just count the remaining time upwards (confusing?)
- display the same term
- display a different term
When the HITS-file is in preparation, just leave it (overdue) blank and display 0 (zero) for time left and uncertainty. My opinion ;)


That sounds good. If possible, I would like to avoid alarming people with scary-looking messages when the task is proceeding normally.

computezrmle
Joined: 28 Jul 16
Posts: 404
Credit: 374,791
RAC: 0
Message 6825 - Posted: 14 Nov 2019, 14:59:53 UTC

Just created a pull request on GitHub to update to v3.1.0.

changes

Output string
- "Number of events to be processed" to "Number of events total"
- "overdue" to "overtime"

While HITS file generation is in progress
- "overtime" will not appear any more
- "Time left total" will remain 0 (zero)
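
The v3.1.0 rules listed above could look roughly like this in Python. The function `time_left_display` and the flag `hits_in_preparation` are hypothetical names chosen for this sketch; the thread does not show the real implementation.

```python
def time_left_display(time_left_s, hits_in_preparation):
    """Return (label, seconds_to_show) for the 'Estimated time left' line.

    While the HITS file is being prepared, no label is shown and the
    value stays at 0. Otherwise a negative estimate is shown as its
    absolute value with the label "overtime".
    """
    if hits_in_preparation:
        return "", 0
    if time_left_s < 0:
        return "overtime", -time_left_s
    return "", time_left_s
```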

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Apr 16
Posts: 154
Credit: 1,352,539
RAC: 84
Message 6828 - Posted: 15 Nov 2019, 8:25:33 UTC - in response to Message 6825.  

3.1.0 is now live.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1147
Credit: 754,546
RAC: 10
Message 6829 - Posted: 15 Nov 2019, 9:45:08 UTC - in response to Message 6828.  

3.1.0 is now live.
That's looking good:


computezrmle
Joined: 28 Jul 16
Posts: 404
Credit: 374,791
RAC: 0
Message 6834 - Posted: 16 Nov 2019, 19:41:08 UTC

Last minute change request.


v3.1.0 uses the standard deviation to calculate uncertainty.
Assuming the runtime values are normally distributed, this covers 68.27% of them (mean ± 1 sigma, i.e. a band 2 sigma wide).

To cover 99.73% (mean ± 3 sigma, i.e. a band 6 sigma wide), the standard deviation must be multiplied by 3.

Did a few local simulations with the 6 sigma band and it looks like the resulting values make more "sense" when compared to other values on the display, e.g. arithmetic mean or min/max.
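
The coverage figures above can be checked with the error function: for a normal distribution, the share of values within ± k standard deviations of the mean is erf(k / sqrt(2)). A short Python check (the helper name `coverage` exists only in this sketch):

```python
import math


def coverage(k):
    """Fraction of a normal distribution within +/- k standard deviations."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"+/- {k} sigma: {coverage(k) * 100:.2f} %")
```

For k = 1 and k = 3 this reproduces the 68.27% and 99.73% quoted in the post. The check only holds to the extent that event runtimes are close to normally distributed, which the thread does not verify.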

Sent a GitHub pull request to David that includes the change to v3.2.0.

Besides that, the recent version runs fine and I suggest deploying it on the production server.

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Apr 16
Posts: 154
Credit: 1,352,539
RAC: 84
Message 6835 - Posted: 18 Nov 2019, 9:32:42 UTC - in response to Message 6834.  

3.2.0 is now live here. Let's wait for a few test WU here (I just submitted a few more) then I will deploy in production.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1147
Credit: 754,546
RAC: 10
Message 6836 - Posted: 18 Nov 2019, 9:56:16 UTC

v3.2.0 running here with an older task, that was 'In progress' for the server and on my client 'Ready to start' waiting for the new Monitoring version.
I'll make screen captures every minute to hopefully find nothing weird.

computezrmle
Joined: 28 Jul 16
Posts: 404
Credit: 374,791
RAC: 0
Message 6837 - Posted: 18 Nov 2019, 12:38:12 UTC

Finished a few 3-core and a few 1-core tasks with monitoring v3.2.0.
All of them ran like a Swiss clockwork. ;-)

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1147
Credit: 754,546
RAC: 10
Message 6838 - Posted: 18 Nov 2019, 13:01:03 UTC - in response to Message 6837.  
Last modified: 18 Nov 2019, 13:01:46 UTC

Finished a few 3-core and a few 1-core tasks with monitoring v3.2.0.
All of them ran like a Swiss clockwork. ;-)

You have to look quick, but when the HITS-line is deleted (almost) at the end of the task, just before the VM is stopped and RDP closed
'uncertainty' shows -1; must be a leap second and not a real issue ;)

computezrmle
Joined: 28 Jul 16
Posts: 404
Credit: 374,791
RAC: 0
Message 6839 - Posted: 18 Nov 2019, 13:22:40 UTC - in response to Message 6838.  

Finished a few 3-core and a few 1-core tasks with monitoring v3.2.0.
All of them ran like a Swiss clockwork. ;-)

You have to look quick, but when the HITS-line is deleted (almost) at the end of the task, just before the VM is stopped and RDP closed
'uncertainty' shows -1; must be a leap second and not a real issue ;)

Saw this only once during all the tests.
It happens while the VM is about to shut down all processes and the monitoring can't get valid data any more.
I suggest ignoring it since the VM shuts down a few seconds later anyway.
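
If hiding the transient value were ever wanted, one option would be to treat negative results as "no valid data". A hedged Python sketch; the function name `display_uncertainty` and the choice of "N/A" as placeholder are assumptions, not behaviour of the actual script:

```python
def display_uncertainty(value_min):
    """Show the uncertainty in minutes, or a placeholder when data is invalid."""
    if value_min is None or value_min < 0:
        return "N/A"  # monitoring could not read valid data
    return f"{round(value_min)} min"
```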


©2023 CERN