Message boards : ATLAS Application : vbox console monitoring with 3.01

David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 7986 - Posted: 20 Mar 2023, 13:07:14 UTC

The event-processing monitoring in console 2 was not working with the new tasks we were trying here. This was due to a slightly different logging format in the new ATLAS software version. I have changed the monitoring tool to work with both the old and new formats, and submitted some new tasks (shorter, with 50 events each). It would be great if someone could confirm whether the console works now (it's hard for me to test it myself).
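A minimal sketch of what matching both formats could look like. The two line layouts are invented for illustration only; the real ATLAS log lines may look quite different, and this is not the project's actual monitoring code.

```python
import re

# Hypothetical line layouts, made up for this sketch (not the real ATLAS formats).
OLD_FORMAT = re.compile(r"Event nr\. (?P<event>\d+) took (?P<secs>\d+(?:\.\d+)?) s")
NEW_FORMAT = re.compile(r"event=(?P<event>\d+) .*?time=(?P<secs>\d+(?:\.\d+)?)s")


def parse_event_time(line):
    """Return (event_number, seconds), or None if the line matches neither format."""
    for pattern in (OLD_FORMAT, NEW_FORMAT):
        match = pattern.search(line)
        if match:
            return int(match.group("event")), float(match.group("secs"))
    return None
```

Trying the patterns in turn keeps one parser usable for old and new tasks, which matters if both versions run side by side.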
ID: 7986 · Rating: 0
Richie_unstable

Joined: 31 Aug 21
Posts: 13
Credit: 1,118,469
RAC: 0
Message 7988 - Posted: 20 Mar 2023, 13:39:33 UTC

I would do that, but I'm not sure where or how. Which keys do I press, or where do I point and click?
ID: 7988 · Rating: 0
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 1
Message 7989 - Posted: 20 Mar 2023, 13:56:16 UTC - in response to Message 7986.  

As already mentioned here
https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=614&postid=7964#7964
it's not that easy.

The monitoring scripts assume a single-core task, although I'm running a 4-core VM.
Besides that, the log entries written to log.EVNTtoHITS look a bit "messed", meaning they don't follow a single consistent pattern (recently seen in a native task).
This needs to be sorted out, but I'd prefer to have stable event processing first.


BTW
The dev task I'm currently running still requests Frontier data from atlasfrontier-ai.cern.ch instead of atlascern-frontier.openhtc.io.
ID: 7989 · Rating: 0
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 1
Message 7990 - Posted: 20 Mar 2023, 14:11:44 UTC - in response to Message 7988.  

1. Open your VirtualBox Manager.
2. Click on the VM you want to look into.
3. Click the "Show" button (and wait).
4. Once the console window is open, use ALT + Fn to switch between the consoles:
   ALT + F2: Progress Monitoring
   ALT + F3: top

To close the console window, use "Detach GUI" from the "Machine" menu.
Never use other methods, since those would tell BOINC to suspend/resume the VM, which puts a high load on the computer.
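If it is unclear which VM in the Manager is the running task, the list of running VMs can also be read from the command line. A small sketch, assuming `VBoxManage` is on the PATH and is run as the same user that owns the VMs; it only lists VMs and does not open a console.

```python
import re
import subprocess


def parse_running_vms(output):
    """Parse lines of the form '"name" {uuid}' into (name, uuid) tuples."""
    return re.findall(r'^"(.+)" \{([0-9a-fA-F-]+)\}$', output, flags=re.MULTILINE)


def list_running_vms():
    """Ask VirtualBox for its running VMs via 'VBoxManage list runningvms'."""
    result = subprocess.run(
        ["VBoxManage", "list", "runningvms"],
        capture_output=True, text=True, check=True,
    )
    return parse_running_vms(result.stdout)
```

Calling `list_running_vms()` returns the names to look for in step 2.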
ID: 7990 · Rating: 0
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 1
Message 7991 - Posted: 20 Mar 2023, 14:34:34 UTC

Other findings

The differencing image grows to at least 1.7 GB.
825 MB of that is caused by swap usage, since the VM has only 2241 MB RAM.


Not sure if the VM's CVMFS cache has been cleaned and refilled with ATLAS v3 data.
This is required to
- reduce the initial VDI size
- keep the differencing image small
- get a faster startup phase
ID: 7991 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 861,475
RAC: 3
Message 7992 - Posted: 20 Mar 2023, 15:26:16 UTC
Last modified: 20 Mar 2023, 15:55:11 UTC

The monitoring is not working very well.
I'm running a dual-core VM with 4800 MB RAM.
Only 1 worker is displayed (I also see only 1 python process, at almost 200% CPU).
The total number of events is displayed (50).
Already finished, mean, min, max and estimated time left are not displayed, and every 60 seconds several lines of text flash up (not readable) and are cleared.
The "Worker 1" line showed the 2nd, 4th and then the 3rd event (??); the "for this worker took ### s" value changes now and then (e.g. 421).

My differencing file is 900 MB.

Edit: added a picture; no swap is used, but the initial setup etc. lasted over 30 minutes.
Flashing text: [screenshot not preserved]
ID: 7992 · Rating: 0
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 1
Message 7993 - Posted: 20 Mar 2023, 15:36:32 UTC

Another 4-core task is in progress.
I manually set the VM's RAM size to 3900 MB (the default used for older 1-core VMs).

top now shows 776 kB of swap being used.
The differencing image uses around 880 MB and grows slowly while the events are being processed.
The main python process uses 2.2 GB RAM and close to 400% CPU, which corresponds to the 4-core setup.
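A quick cross-check against the earlier run with 2241 MB RAM, treating "at least 1.7 GB" as roughly 1700 MB (an approximation): taking the 825 MB of swap out of that image leaves about the size seen now.

```python
earlier_image_mb = 1700  # "at least 1.7 GB", rounded; an approximation
earlier_swap_mb = 825    # swap share reported for the 2241 MB RAM run
current_image_mb = 880   # differencing image seen with 3900 MB RAM

# Image size the earlier run would have had without the swap share.
image_without_swap_mb = earlier_image_mb - earlier_swap_mb
difference_mb = current_image_mb - image_without_swap_mb

print(image_without_swap_mb, difference_mb)
```

The two runs differ by only a few MB once swap is excluded, which fits the idea that swap was the main reason for the larger image.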
ID: 7993 · Rating: 0
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 1
Message 7994 - Posted: 20 Mar 2023, 15:40:54 UTC - in response to Message 7992.  

> The total number of events is displayed (50).
> Already finished, mean, min, max and estimated time left are not displayed, and every 60 seconds several lines of text flash up (not readable) and are cleared.
> The "Worker 1" line showed the 2nd, 4th and then the 3rd event (??); the "for this worker took ### s" value changes now and then (e.g. 421).
>
> Edit: added a picture; no swap is used
> Flashing text:...

Same here.
ID: 7994 · Rating: 0
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 7995 - Posted: 20 Mar 2023, 15:56:01 UTC

> Not sure if the VM's CVMFS cache has been cleaned and refilled with ATLAS v3 data.

I ran a v3 task inside the existing prod image (since we'll have to run both in parallel for some time) and used that image here, so the software should be cached. This is why the image is much larger than before (4.4 GB vs 2.8 GB).

> 825 MB are caused by swap usage since the VM has only 2241 MB RAM.

This is not intentional; something is not configured correctly after removing the per-core memory scaling. I am checking it.

> The monitoring scripts assume a singlecore task although I'm running a 4-core VM.

I thought that the new tasks would all behave the same as the old single-core tasks, writing to a single file. I changed the code that searches for the event times to handle the old and new message formats, but maybe something else needs to be changed. I will look deeper.

> The dev task I'm currently running still requests Frontier data from atlasfrontier-ai.cern.ch instead of atlascern-frontier.openhtc.io.

This was changed last week, but I forgot to restart one service to pick up the changes. New tasks should have the correct Frontier server.
ID: 7995 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 861,475
RAC: 3
Message 7996 - Posted: 20 Mar 2023, 16:03:18 UTC - in response to Message 7995.  

The readable text is showing this: [screenshot not preserved]
ID: 7996 · Rating: 0
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 1
Message 7997 - Posted: 20 Mar 2023, 16:08:06 UTC - in response to Message 7995.  

> I thought that the new tasks would all behave the same as the old single-core tasks, writing to a single file. I changed the code that searches for the event times to handle the old and new message formats, but maybe something else needs to be changed. I will look deeper.

@David
It is not only the search pattern.
An additional point is that the old monitoring expects the result lines to be in order.
This was guaranteed in single-core mode within the main log, as well as in multi-core mode within each of the worker logs.

Now the log entries are no longer in order, and the monitoring has to take this into account.
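One way a monitor could cope with unordered entries is to key every parsed result by its event number instead of relying on the last line seen. A minimal sketch; the (event, seconds) pairs stand in for already-parsed log lines and the timing values are purely illustrative.

```python
def summarize_progress(parsed_entries, total_events):
    """Summarize progress from (event_number, seconds) pairs given in any order."""
    times_by_event = {}
    for event_number, seconds in parsed_entries:
        # Keying by event number makes arrival order and duplicates irrelevant.
        times_by_event[event_number] = seconds
    finished = len(times_by_event)
    return {
        "finished": finished,
        "remaining": total_events - finished,
        "highest_event_seen": max(times_by_event, default=None),
    }


# Events reported as 2nd, 4th, then 3rd; the seconds are illustrative values.
entries = [(2, 421.0), (4, 338.0), (3, 390.0)]
summary = summarize_progress(entries, total_events=50)
```

Counting distinct events gives the "already finished" figure without assuming that the newest line carries the highest event number.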

Another point that needs to be checked:
The timing averages reported by the workers refer only to the worker thread they come from.
Hence, they are not valid for calculating the total average.
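One reading of this point, as a sketch with made-up numbers: averaging the per-worker averages gives the wrong total as soon as the workers have processed different numbers of events, so the total should be computed from the individual event times.

```python
def overall_mean(times_per_worker):
    """Mean over all events, computed from the raw per-event times of every worker."""
    all_times = [t for times in times_per_worker.values() for t in times]
    return sum(all_times) / len(all_times) if all_times else None


def mean_of_worker_means(times_per_worker):
    """Naive average of per-worker averages; biased when event counts differ."""
    means = [sum(t) / len(t) for t in times_per_worker.values() if t]
    return sum(means) / len(means) if means else None


# Illustrative values only: worker 1 finished three events, worker 2 one.
times = {"worker 1": [300.0, 300.0, 300.0], "worker 2": [500.0]}
correct = overall_mean(times)
naive = mean_of_worker_means(times)
```

Here the naive value is 400 s while the mean over all four events is 350 s.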
ID: 7997 · Rating: 0
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 1
Message 7998 - Posted: 20 Mar 2023, 16:15:37 UTC

CP's screenshots show what I described as "messed" logfile lines.
See: "worker 1..."

As a result, the monitoring script can't extract the runtime (here: 338 s).
This leads to the missing values, now reported as "N/A".
A side effect is the "flashing text" CP already reported.
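For illustration, this is how a display routine might end up printing "N/A": if no runtime could be extracted, there is nothing to compute mean, min and max from. A hypothetical sketch, not the actual monitoring script.

```python
def format_stats(event_times):
    """Format mean/min/max, falling back to 'N/A' when no runtime was extracted."""
    if not event_times:
        return {"mean": "N/A", "min": "N/A", "max": "N/A"}
    mean = sum(event_times) / len(event_times)
    return {
        "mean": f"{mean:.0f} s",
        "min": f"{min(event_times):.0f} s",
        "max": f"{max(event_times):.0f} s",
    }
```

With the 338 s runtime extracted, `format_stats([338.0])` would show "338 s" in all three fields instead of "N/A".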
ID: 7998 · Rating: 0
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 1
Message 7999 - Posted: 20 Mar 2023, 16:25:00 UTC - in response to Message 7995.  

> since we'll have to run both in parallel for some time

Did the old version change its logging behaviour?
If not, we would need to keep the old monitoring branch and call it when an old ATLAS version is processed.
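A sketch of such a dispatch, assuming the task's ATLAS version is available as a string like "3.01" and that old tasks carry a major version below 3. Both are assumptions, and the two monitor functions are placeholders.

```python
def monitor_old(log_lines):
    """Placeholder for the existing, order-dependent monitoring branch."""
    return "old monitoring"


def monitor_new(log_lines):
    """Placeholder for a branch that tolerates unordered log entries."""
    return "new monitoring"


def select_monitor(atlas_version):
    """Pick a monitoring branch from a version string such as '3.01' (format assumed)."""
    major = int(atlas_version.split(".")[0])
    return monitor_new if major >= 3 else monitor_old
```

Usage would be along the lines of `select_monitor("3.01")(log_lines)`.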
ID: 7999 · Rating: 0
maeax

Joined: 22 Apr 16
Posts: 677
Credit: 2,002,766
RAC: 3
Message 8007 - Posted: 21 Mar 2023, 10:05:11 UTC - in response to Message 7996.  
Last modified: 21 Mar 2023, 10:06:55 UTC

Crystal, I'm seeing the same in console F2.
Now 50 collisions instead of 500.
ID: 8007 · Rating: 0



©2024 CERN