ATLAS Monitoring v2.0.0
Author | Message |
---|---|
Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0 |
An improved version of the ALT-F2 and ALT-F3 monitoring is now online for testing. David provided some tasks with 10 events each. Comments are welcome.

Changes

ALT-F3
- Minor changes to ensure the top console doesn't activate power-saving mode or the screensaver.

ALT-F2
- modified layout
- avoids flicker while the display updates
- modified (smoother) "time left" algorithm
- a bit more verbose when the calculation has finished
- the background part (logfile dumping) is now a systemd service that is tied to the foreground service. Hitting CTRL-C at tty2 will now clean up the monitoring directories and restart both services from scratch. |
Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 46 |
My first test failed -> https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2833924
Although Task Manager reported 9 GB available, I'll stop my Linux VM to free up memory. |
Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 46 |
> An improved version of the ALT-F2 and ALT-F3 monitoring is now online for testing.

Comments on ALT-F2:
- I once saw a difference between 'Number of events already finished' and the total of the worker events. In my case 2 already finished and 3 in the workers. ... The worker probably finished just after the sum was counted. Not a real issue. (I have an image of it.)
- The first line would read better with (vbox) placed at the end: ATLAS Event Progress Monitoring (vbox). Not a real issue either; alternatively, drop (vbox) altogether.
- When the new averages are calculated after 2 or more events in one worker have finished, the time of the first event is not included in the sum. ... So after the 2nd event has finished, the average is that of the 2nd event only (the same value), and after 3 events the average shown is the average of events 2 and 3.
- When the last event is processed, the display is cleared and the last events and their averages are no longer visible. Would it be possible to keep the last worker events on display and ... show 'Calculation completed. Preparing HITS file ...' below them? |
Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0 |
Thanks for testing. Great to have a discussion here!

> - I saw once a difference between 'Number of events already finished' and total of the worker events. In my case 2 already finished and 3 in the workers.

Right. That's due to "lazy" programming @^-^@ . I expected that to happen, but only very rarely. It should be fixed to avoid confusion.

> - Better reading when (vbox) in first line is placed towards end: ATLAS Event Progress Monitoring (vbox). Not a real issue too or skip (vbox) at all.

Will remove (vbox) if nobody disagrees.

> - When calculating the new averages after 2 or more events in one worker have finished the time of the first event is not taken into the sum.

Aah! Right! I didn't notice that before, but I just checked an ATLAS native logfile and it's exactly the same there:

2019-10-31 07:27:11,349 ISFG4SimSvc INFO Event nr. 1 took 588.7 s. New average 588.7 +- 0
2019-10-31 07:41:34,960 ISFG4SimSvc INFO Event nr. 2 took 744.5 s. New average 744.5 +- 0
2019-10-31 07:55:30,289 ISFG4SimSvc INFO Event nr. 3 took 770.4 s. New average 757.4 +- 12.94
2019-10-31 08:05:18,345 ISFG4SimSvc INFO Event nr. 4 took 546.1 s. New average 687 +- 70.82
2019-10-31 08:17:09,359 ISFG4SimSvc INFO Event nr. 5 took 612.2 s. New average 668.3 +- 53.45

So it looks like you stumbled over an old (?) error inside the scientific app.
@David Cameron
Please be so kind as to forward that issue to the ATLAS developers.

Regarding the monitoring, there are 2 options:
- change nothing and live with the wrong average until the app is corrected
- calculate the averages from the runtimes in the worker logs

I will most likely implement the latter option.

> - When the last event is processed the display is cleared and the last events and their averages are not longer visible. Would it be possible to keep the last worker events in display and

Not much effort. Can do that if nobody disagrees. |
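The second option, calculating the averages from the runtimes in the worker logs, could look roughly like the following. This is only a minimal Python sketch, not the monitoring's actual code: the regular expression and the helper name `running_averages` are assumptions derived solely from the log lines quoted above.

```python
import re

# Matches the runtime part of lines such as
# "... ISFG4SimSvc INFO Event nr. 2 took 744.5 s. New average 744.5 +- 0".
# The pattern is an assumption based on the excerpt quoted in this thread.
EVENT_RE = re.compile(r"Event nr\. (\d+) took ([0-9.]+) s\.")

def running_averages(log_lines):
    """Return (event_nr, runtime, average over ALL events so far) per event line."""
    runtimes = []
    result = []
    for line in log_lines:
        match = EVENT_RE.search(line)
        if match is None:
            continue  # skip lines that do not report a finished event
        runtimes.append(float(match.group(2)))
        average = sum(runtimes) / len(runtimes)  # first event is included
        result.append((int(match.group(1)), runtimes[-1], average))
    return result

LOG = """\
2019-10-31 07:27:11,349 ISFG4SimSvc INFO Event nr. 1 took 588.7 s. New average 588.7 +- 0
2019-10-31 07:41:34,960 ISFG4SimSvc INFO Event nr. 2 took 744.5 s. New average 744.5 +- 0
2019-10-31 07:55:30,289 ISFG4SimSvc INFO Event nr. 3 took 770.4 s. New average 757.4 +- 12.94
2019-10-31 08:05:18,345 ISFG4SimSvc INFO Event nr. 4 took 546.1 s. New average 687 +- 70.82
2019-10-31 08:17:09,359 ISFG4SimSvc INFO Event nr. 5 took 612.2 s. New average 668.3 +- 53.45
""".splitlines()

for nr, runtime, average in running_averages(LOG):
    print(f"Event nr. {nr} took {runtime} s. Average incl. event 1: {average:.1f}")
```

With the five events above, the recomputed average after event 2 is about 666.6 s instead of the logged 744.5 s, because the first event is no longer dropped.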
Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 46 |
> Regarding the monitoring there are 2 options

I'm not sure that option is the best solution. What if a task is suspended (several times)? I think with your option that will influence the event times and averages. |
Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0 |
> > Regarding the monitoring there are 2 options
>
> I'm not sure if that option is the best solution.

Did you already test v2.2.0? Unlike v2.1.0, v2.2.0 uses the modified "time left" algorithm, which is based on event runtimes instead of averages from the logfiles.

If you pause the task via BOINC Manager, even for a longer period, this has no effect on the ATLAS runtime logging. Hence, after a resume the monitoring shows the estimated remaining uptime, ignoring all suspends/resumes.

Be aware that the lines showing the finished events are citations from the original logfiles (except the worker number). Hence they must not be modified.

Example task: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2834058
2019-11-02 10:41:51 (23691): VM state change detected. (old = 'running', new = 'paused')
2019-11-02 11:57:56 (23691): VM state change detected. (old = 'paused', new = 'running') |
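To illustrate what a "time left" estimate based on event runtimes might look like, here is a hedged Python sketch. The actual v2.2.0 algorithm is not shown in this thread; the formula (remaining events times mean runtime, divided by the number of workers) and the name `estimate_time_left` are assumptions made purely for illustration.

```python
def estimate_time_left(finished_runtimes, total_events, workers):
    """Rough remaining time in seconds, or None if no event has finished yet.

    Hypothetical formula: remaining events * mean logged runtime / workers.
    It uses only the runtimes the app logs per event, so time spent
    suspended never enters the estimate.
    """
    if not finished_runtimes:
        return None  # nothing to base an estimate on yet
    remaining = max(total_events - len(finished_runtimes), 0)
    mean_runtime = sum(finished_runtimes) / len(finished_runtimes)
    return remaining * mean_runtime / workers

# Example: 2 of 10 events finished, 2 workers running in parallel.
print(estimate_time_left([600.0, 700.0], total_events=10, workers=2))
```

Because the inputs are the per-event runtimes reported by the app rather than wall-clock differences, pausing and resuming the VM would not distort such an estimate.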
Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 46 |
> Did you already test v2.2.0?

I just returned one task and saw v2.2.0 in the info box.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2833918
I did not realize it was a new version. I did not notice the ending, because it was lunchtime ;)
I just concentrated on the averages and the ± values, but they make no real sense to me. That is not very important to me, though. The most important information is the count of finished events. |
Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0 |
Any showstoppers? If not, I'd like to ask David to deploy v2.2.0 on the prod server after the weekend.

@CP
All of your suggestions are implemented in v2.2.0. |
Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 46 |
Let the show go on!

I tried to figure out the meaning of the 'New average' and the '+-' values. The 'New average' is clear: it's the average of all finished events except the very first one. The '+-' value is not clear to me. Only the first displayed value, after the 3rd event, matches the difference from the previous average. After that I don't see any correlation between the shown values and the 'New averages', except that the value goes up or down. |
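One possible reading of the '+-' value: the numbers in the logfile quoted earlier in this thread are consistent with the standard error of the mean (sample standard deviation divided by the square root of the number of events), computed without the first event. This is an inference from those five log lines only, not something confirmed by the ATLAS code. A minimal Python sketch, with `mean_and_stderr` as a hypothetical helper name:

```python
import math

def mean_and_stderr(runtimes):
    """Mean and standard error of the mean, skipping the first event.

    Skipping the first event mirrors the behaviour seen in the log;
    with fewer than two remaining values the error is reported as 0.
    """
    sample = runtimes[1:] if len(runtimes) > 1 else runtimes
    n = len(sample)
    mean = sum(sample) / n
    if n < 2:
        return mean, 0.0
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)  # sample variance
    return mean, math.sqrt(variance / n)  # standard error of the mean

# Runtimes taken from the five log lines quoted in this thread.
RUNTIMES = [588.7, 744.5, 770.4, 546.1, 612.2]
for k in range(1, len(RUNTIMES) + 1):
    mean, err = mean_and_stderr(RUNTIMES[:k])
    print(f"after event {k}: {mean:.1f} +- {err:.2f}")
```

For events 3, 4 and 5 this gives errors of roughly 12.95, 70.85 and 53.47, against the logged 12.94, 70.82 and 53.45. The small differences may come from the runtimes being rounded in the log. With only two values in the sample, the standard error equals half their difference, which would explain why only the value after the 3rd event matches the change from the previous average.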
©2024 CERN