Message boards : ATLAS Application : ATLAS Monitoring v2.0.0

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 6793 - Posted: 31 Oct 2019, 8:15:30 UTC

An improved version of the ALT-F2 and ALT-F3 monitoring is now online for testing.
David provided some tasks with 10 events each.

Comments are welcome.


Changes

ALT-F3
Minor changes to ensure the top console doesn't activate power saving mode or screensaver.



ALT-F2
- modified layout
- avoids flicker while display updates
- modified (smoother) "time left" algorithm
- a bit more verbose when calculation has finished
- background part (logfile dumping) is now a systemd service that is tied to the foreground service.
Pressing CTRL-C on tty2 now cleans up the monitoring directories and restarts both services from scratch.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 6794 - Posted: 31 Oct 2019, 9:32:41 UTC - in response to Message 6793.  

My first test failed -> https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2833924

Although task manager said 9GB available, I'll stop my Linux VM to free up memory.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 6795 - Posted: 31 Oct 2019, 11:55:32 UTC - in response to Message 6793.  

An improved version of the ALT-F2 and ALT-F3 monitoring is now online for testing.
David provided some tasks with 10 events each.

Comments are welcome.
Comments to ALT-F2:

- I once saw a difference between 'Number of events already finished' and the total of the worker events: in my case, 2 already finished but 3 listed for the workers.
... Probably a worker finished just after the sum was counted. Not a real issue. (I have an image of it.)
- The first line would read better with (vbox) placed at the end: ATLAS Event Progress Monitoring (vbox). Not a real issue either; alternatively, drop (vbox) altogether.
- When the new averages are calculated after 2 or more events have finished in one worker, the time of the first event is not included in the sum.
... So after the 2nd event has finished, the average is that of the 2nd event only (the same value), and after 3 events the average shown is that of events 2 and 3.
- When the last event has been processed, the display is cleared and the last events and their averages are no longer visible. Would it be possible to keep the last worker events on display and
... show 'Calculation completed. Preparing HITS file ...' below it?
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 6796 - Posted: 31 Oct 2019, 15:29:29 UTC - in response to Message 6795.  

Thanks for testing.
Great to have a discussion here!

- I once saw a difference between 'Number of events already finished' and the total of the worker events: in my case, 2 already finished but 3 listed for the workers.
... Probably a worker finished just after the sum was counted. Not a real issue. (I have an image of it.)

Right.
That's due to "lazy" programming @^-^@ .
I expected that to happen, but very very rarely.
Should be fixed to avoid confusion.
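
One way to avoid this kind of mismatch is to read the worker data once and derive both the headline counter and the per-worker lines from that single read. The following is only a minimal sketch of that idea, not the monitoring script itself: `Snapshot`, `take_snapshot`, `finished_total` and the in-memory `worker_logs` dictionary are hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Finished-event lines per worker, all captured in the same pass."""
    per_worker: dict


def take_snapshot(worker_logs):
    # Copy the lists so later appends by the workers cannot change the snapshot.
    return Snapshot({worker: list(lines) for worker, lines in worker_logs.items()})


def finished_total(snapshot):
    # Derive the headline counter from the snapshot, not from a second read.
    return sum(len(lines) for lines in snapshot.per_worker.values())


# Hypothetical in-memory stand-in for the worker logfiles.
worker_logs = {1: ["Event nr. 1"], 2: ["Event nr. 1"]}
snap = take_snapshot(worker_logs)
worker_logs[1].append("Event nr. 2")  # a worker finishes right after the snapshot

displayed_lines = sum(len(lines) for lines in snap.per_worker.values())
print(finished_total(snap), displayed_lines)
```

Because the counter and the displayed lines come from the same snapshot, an event that finishes between two reads shows up in both on the next refresh instead of in only one of them.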




- The first line would read better with (vbox) placed at the end: ATLAS Event Progress Monitoring (vbox). Not a real issue either; alternatively, drop (vbox) altogether.

Will remove (vbox) if nobody disagrees.




- When the new averages are calculated after 2 or more events have finished in one worker, the time of the first event is not included in the sum.
... So after the 2nd event has finished, the average is that of the 2nd event only (the same value), and after 3 events the average shown is that of events 2 and 3.

Aah! Right!
Didn't notice that before but just checked an ATLAS native logfile - it's the very same there:
2019-10-31 07:27:11,349 ISFG4SimSvc          INFO 	 Event nr. 1 took 588.7 s. New average 588.7 +- 0
2019-10-31 07:41:34,960 ISFG4SimSvc          INFO 	 Event nr. 2 took 744.5 s. New average 744.5 +- 0
2019-10-31 07:55:30,289 ISFG4SimSvc          INFO 	 Event nr. 3 took 770.4 s. New average 757.4 +- 12.94
2019-10-31 08:05:18,345 ISFG4SimSvc          INFO 	 Event nr. 4 took 546.1 s. New average 687 +- 70.82
2019-10-31 08:17:09,359 ISFG4SimSvc          INFO 	 Event nr. 5 took 612.2 s. New average 668.3 +- 53.45

So it looks like you stumbled over an old (?) error inside the scientific app.

@David Cameron
Be so kind as to forward that issue to the ATLAS developers.

Regarding the monitoring there are 2 options
- not to change anything and to live with the wrong average until the app is corrected
- calculate the averages from the runtimes in the worker logs
Will most likely implement the latter option.
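
To illustrate the second option, here is a minimal sketch that extracts the "took ... s" runtimes from the five log lines quoted above and computes the running average over all events. It also checks that the logged "New average" values match an average that skips event 1. The regular expression and the function names are assumptions made for this example; they are not taken from the monitoring script.

```python
import re

# The five log lines quoted above (whitespace normalised).
LOG = """\
2019-10-31 07:27:11,349 ISFG4SimSvc INFO Event nr. 1 took 588.7 s. New average 588.7 +- 0
2019-10-31 07:41:34,960 ISFG4SimSvc INFO Event nr. 2 took 744.5 s. New average 744.5 +- 0
2019-10-31 07:55:30,289 ISFG4SimSvc INFO Event nr. 3 took 770.4 s. New average 757.4 +- 12.94
2019-10-31 08:05:18,345 ISFG4SimSvc INFO Event nr. 4 took 546.1 s. New average 687 +- 70.82
2019-10-31 08:17:09,359 ISFG4SimSvc INFO Event nr. 5 took 612.2 s. New average 668.3 +- 53.45
"""

# Hypothetical pattern: event number, runtime in seconds, logged average.
EVENT_RE = re.compile(r"Event nr\. (\d+) took ([0-9.]+) s\. New average ([0-9.]+)")


def parse_events(text):
    """Return (event number, runtime, logged average) for every matching line."""
    return [(int(m.group(1)), float(m.group(2)), float(m.group(3)))
            for m in EVENT_RE.finditer(text)]


def running_means(runtimes):
    """Mean of the first 1, 2, ..., n runtimes."""
    means, total = [], 0.0
    for count, runtime in enumerate(runtimes, start=1):
        total += runtime
        means.append(total / count)
    return means


events = parse_events(LOG)
runtimes = [runtime for _, runtime, _ in events]
logged = [average for _, _, average in events]

all_events = running_means(runtimes)           # what the monitoring could show
without_first = running_means(runtimes[1:])    # what the app appears to log

for number, full_mean in enumerate(all_events, start=1):
    print(f"event {number}: average over all events = {full_mean:.1f} s")
```

For these five lines, `without_first` agrees with the logged averages from event 2 onwards to within rounding, which supports the observation that the app leaves the first event out of the sum.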



- When the last event has been processed, the display is cleared and the last events and their averages are no longer visible. Would it be possible to keep the last worker events on display and
... show 'Calculation completed. Preparing HITS file ...' below it?

Not much effort.
Can do that if nobody disagrees.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 6798 - Posted: 2 Nov 2019, 9:03:15 UTC - in response to Message 6796.  

Regarding the monitoring there are 2 options
- not to change anything and to live with the wrong average until the app is corrected
- calculate the averages from the runtimes in the worker logs
Will most likely implement the latter option.

I'm not sure that option is the best solution.
What if a task is suspended (several times)? I think with your option, the suspensions will influence the event times and averages.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 6799 - Posted: 2 Nov 2019, 11:37:26 UTC - in response to Message 6798.  

Regarding the monitoring there are 2 options
- not to change anything and to live with the wrong average until the app is corrected
- calculate the averages from the runtimes in the worker logs
Will most likely implement the latter option.
I'm not sure that option is the best solution.
What if a task is suspended (several times)? I think with your option, the suspensions will influence the event times and averages.

Did you already test v2.2.0?
Unlike v2.1.0, v2.2.0 uses the modified "time left" algorithm, which is based on event runtimes instead of the averages from the logfiles.
If you pause the task via BOINC manager, even for a longer period, this has no effect on the ATLAS runtime logging.
Hence, after a resume, the monitoring shows the estimated remaining uptime, ignoring all suspend/resume cycles.

Be aware that the lines showing the finished events are verbatim quotes from the original logfiles (except for the worker number).
Hence they must not be modified.

example task:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2834058
2019-11-02 10:41:51 (23691): VM state change detected. (old = 'running', new = 'paused')
2019-11-02 11:57:56 (23691): VM state change detected. (old = 'paused', new = 'running')
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 6800 - Posted: 2 Nov 2019, 12:03:36 UTC - in response to Message 6799.  

Did you already test v2.2.0?

I just returned one task and saw v2.2.0 in the info box: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2833918
I had not realized it was a new version. I did not notice the ending, because it was lunchtime ;)
I just concentrated on the averages and the ±-values, but they make no real sense to me.
That is not very important to me, though. The most important information is the count of finished events.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 6801 - Posted: 2 Nov 2019, 13:14:11 UTC

Any showstoppers?
If not I'd like to ask David to deploy v2.2.0 on the prod server after the weekend.



@CP
All of your suggestions are implemented in v2.2.0.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 6804 - Posted: 3 Nov 2019, 10:32:55 UTC - in response to Message 6801.  

Let the show go on!

I tried to figure out the meaning of the 'New average' and the '+-' values.
The 'New average' is clear. It's the average of all finished events except the very first one.
The '+-' value is not clear to me. Only the first value displayed, after the 3rd event, matches the difference from the previous average;
after that, I don't see any correlation between the values shown and the 'New averages', except that they go up or down.
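
A possible reading of the '+-' column, inferred only from the five log lines quoted earlier in this thread and not confirmed against the ATLAS source, is that it is the standard error of the mean of the same events that enter the 'New average' (events 2 to n). The sketch below checks that assumption; `mean_and_stderr` is a hypothetical helper and the logged values are copied from that excerpt.

```python
from math import sqrt
from statistics import mean, stdev

# Runtimes in seconds of events 1..5, copied from the quoted log lines.
runtimes = [588.7, 744.5, 770.4, 546.1, 612.2]

# Logged (New average, +-) after events 3, 4 and 5.
logged = {3: (757.4, 12.94), 4: (687.0, 70.82), 5: (668.3, 53.45)}


def mean_and_stderr(values):
    """Mean and standard error of the mean (sample std dev / sqrt(n))."""
    if len(values) < 2:
        return mean(values), 0.0  # a single value has no spread
    return mean(values), stdev(values) / sqrt(len(values))


for n, (log_avg, log_err) in logged.items():
    # Assumption under test: event 1 is skipped, as it is for the average.
    avg, err = mean_and_stderr(runtimes[1:n])
    print(f"after event {n}: computed {avg:.2f} +- {err:.2f}, "
          f"logged {log_avg} +- {log_err}")
```

Under this assumption the computed values agree with the logged ones to within about 0.03 s. It would also explain why only the first '+-' value looks like a difference of averages: with exactly two values, the standard error of the mean equals half the distance between them, which is the same as the distance from either value to the new average.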
