61) Message boards : ATLAS Application : vbox console monitoring with 3.01 (Message 7997)
Posted 20 Mar 2023 by computezrmle
Post:
I thought that the new tasks would all behave the same as old single-core tasks, writing to a single file. I changed the code searching for the event times to handle the old and new message format, but maybe something else needs to be changed. I will look deeper.

@ David
It is not only the search pattern.
An additional point is that the old monitoring expects the result lines in order.
This was guaranteed in singlecore mode within the main log as well as in multicore mode within each of the worker logs.

Now the log entries are no longer in order and this fact has to be respected by the monitoring.

Another point that needs to be checked:
The timing averages reported by the workers refer to the worker thread they are coming from.
Hence, they are not valid to calculate the total average.
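
Both points can be illustrated with a small sketch. It assumes the per-event durations have already been extracted from the log; the function name `summarize`, the sample records and the time-left formula (remaining events times mean duration, divided by the number of threads) are assumptions for illustration, not the actual monitoring code.

```python
from statistics import mean


def summarize(durations_by_event, total_events, n_threads):
    """Summarize per-event durations (seconds) independent of log order.

    durations_by_event maps an event id to the time that event took.
    The per-worker "New average" values are deliberately ignored.
    """
    times = list(durations_by_event.values())
    finished = len(times)
    if finished == 0:
        return {"finished": 0, "mean": None, "min": None, "max": None, "eta_s": None}
    avg = mean(times)
    # Rough estimate: assumes all threads stay busy until the end.
    eta_s = (total_events - finished) * avg / n_threads
    return {"finished": finished, "mean": avg, "min": min(times),
            "max": max(times), "eta_s": eta_s}


# Hypothetical records as (event id, seconds), arriving out of order.
records = [(99609, 239.9), (99601, 421.0), (99605, 310.1)]
durations = {}
for event_id, seconds in records:
    durations[event_id] = seconds  # keyed by event, so arrival order does not matter

print(summarize(durations, total_events=50, n_threads=4))
```

Keying the durations by event id means a late or repeated line simply overwrites its own entry, so the totals do not depend on the order in which the threads write to the log.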
62) Message boards : ATLAS Application : vbox console monitoring with 3.01 (Message 7994)
Posted 20 Mar 2023 by computezrmle
Post:
Total number of events is displayed (50)
Already finished, mean, min, max and estimated time left are not displayed, and every 60 seconds several lines of text flash (not readable) and are cleared.
The "Worker 1 Event" line showed the 2nd, 4th and then the 3rd event (??) "for this worker took ### s" (### changes now and then, e.g. 421)


Edit picture and no swap used
Flashing text:...

Same here
63) Message boards : ATLAS Application : vbox console monitoring with 3.01 (Message 7993)
Posted 20 Mar 2023 by computezrmle
Post:
Another 4-core task is in progress.
Manually set the VM's RAM size to 3900 MB (the default used for older 1-core VMs).

Top now shows 776 kB swap being used.
The differencing image uses around 880 MB and grows slowly while the events are being processed.
The main python process uses 2.2 GB RAM and close to 400% CPU which corresponds to the 4-core setup.
64) Message boards : ATLAS Application : vbox console monitoring with 3.01 (Message 7991)
Posted 20 Mar 2023 by computezrmle
Post:
Other findings

The differencing image grows to at least 1.7 GB.
825 MB are caused by swap usage since the VM has only 2241 MB RAM.


Not sure if the VM's CVMFS cache has been cleaned and refilled with ATLAS v3 data.
This is required to
- reduce the initial VDI size
- keep the differencing image small
- get a faster startup phase
65) Message boards : ATLAS Application : vbox console monitoring with 3.01 (Message 7990)
Posted 20 Mar 2023 by computezrmle
Post:
1. open your VirtualBox Manager
2. click on the VM you want to look into
3. click the "show" button (and wait)
4. once the console window is open, use ALT + Fn to switch between the consoles
F2: Progress Monitoring
F3: top

To close the console window use "Detach GUI" from the "Machine" Menu.
Never use other methods since those would tell BOINC to suspend/resume the VM which puts a high load on the computer.
66) Message boards : ATLAS Application : vbox console monitoring with 3.01 (Message 7989)
Posted 20 Mar 2023 by computezrmle
Post:
As already mentioned here
https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=614&postid=7964#7964
it's not that easy.

The monitoring scripts assume a singlecore task although I'm running a 4-core VM.
Besides that, the log entries written to log.EVNTtoHITS look a bit "messy", meaning they don't follow a unique pattern (recently seen in a native task).
This needs to be sorted, but I'd prefer to have a stable event processing first.


BTW
The dev task I'm currently running still requests Frontier data from atlasfrontier-ai.cern.ch instead of atlascern-frontier.openhtc.io.
67) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 7983)
Posted 20 Mar 2023 by computezrmle
Post:
I would prefer quick tasks.

Agree, even on Linux running native 24/7.
As long as ATLAS native doesn't allow snapshots, shorter runtimes would make it easier to plan maintenance/upgrade windows.
68) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 7980)
Posted 18 Mar 2023 by computezrmle
Post:
It works

Some considerations:
- very very fast at the beginning of calculation, very very slow at the end
- over 6hrs on 4 cores mobile cpu
- 1200 points

Not so bad!

Total runtime/CPU time doesn't look too bad.
Usual tasks (on -prod) process 200 events while this one processed 500 events.
=> CPU time <180 s/event.
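
The figure can be cross-checked roughly from the numbers in this post. This sketch assumes the "over 6hrs on 4 cores" report means about 6 hours of wall-clock time with all 4 cores fully busy; the real value would come from the CPU time reported for the task.

```python
# Assumed inputs: ~6 h wall-clock, 4 cores fully used, 500 events.
wall_hours = 6
cores = 4
events = 500

cpu_seconds = wall_hours * 3600 * cores   # 86400 s of CPU time under these assumptions
per_event = cpu_seconds / events

print(f"{per_event:.1f} s/event")         # 172.8 s/event
```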


... very very fast at the beginning of calculation, very very slow at the end

If you refer to BOINC's progress bar, this might be misleading since lots of older tasks (on -dev) processed only 5-20 events. Those differences always confuse BOINC's runtime estimation.
69) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 7976)
Posted 16 Mar 2023 by computezrmle
Post:
You are running 3 computers
2 Windows
1 Linux

Since your Linux computer got ATLAS native, "Run native if available?" is set, but this blocks vbox tasks on Windows.
To get both native and vbox tasks, the Windows and Linux computers must be assigned to different venues.


The fact that your Linux tasks failed is caused by a missing local CVMFS client.
70) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 7970)
Posted 15 Mar 2023 by computezrmle
Post:
Just started a 2nd native task with most of the CVMFS data from the CVMFS cache and most of the Frontier data from the Squid cache.

Time to setup the working environment: ~14 min

It has to be said that my box is intentionally slightly overloaded, running 30 CMS VMs, 4 Theory natives, 2 other CPU tasks and 2 GPU tasks concurrently plus that 4-core ATLAS test.
From experience: on a partly loaded system I would expect a startup time of ~7-8 min.
71) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 7964)
Posted 15 Mar 2023 by computezrmle
Post:
Findings:

1.
On a fully loaded Threadripper it takes ~20 min to complete the initial setup.
This is mainly caused by lots of downloads from CVMFS and Frontier.
The long setup phase has already been mentioned by other testers.


A long setup time would be expected for the first native task, since the CVMFS cache needs to be filled with the new software libraries. I'd be interested to see the timing for subsequent tasks on the same host. On my test host running native tasks (inside CERN, so ideal conditions) and a warm cache it takes around 5 mins to start crunching.

To test this and ensure comparable total load I'll run another task when the 1st one has finished.


2.
Frontier requests are sent to atlasfrontier-ai.cern.ch although they should go to atlascern-frontier.openhtc.io


Right, this is an error on my part, I will fix it.

3.
Looks like worker threads do not log their progress into separate logfiles any more.
This means ATLAS event monitoring (vbox app) will not work any more.


This is a feature of the way the new software works. Instead of multiple processes, it uses multiple threads which makes memory usage much more efficient. (background reading for anyone interested)

This means all the threads log to a single file, log.EVNTtoHITS, which is how single core tasks worked before. I suppose the monitoring worked for single core tasks in the past so it should still work now?

Monitoring needs to be revised.
The old ATLAS version wrote its log information into a couple of files.
In case of a singlecore setup only "log.EVNTtoHITS" was used.
In case of multicore "log.EVNTtoHITS" was used for global information like the #events but the event status was logged to the worker logs.

Now it's all in "log.EVNTtoHITS" but the output patterns are not consistent, e.g.:
11:18:13 ISF_Kernel_FullG4MT_QS.ISF_LongLivedGeant4Tool       108     2    INFO 	 Run:Event 410000:99609	 (29th event for this worker) took 239.9 s. New average 450.6 +- 44.52
.
.
.
11:18:14 AthenaHiveEventLoopMgr                               108     2    INFO   ===>>>  done processing event #99609, run #410000 on slot 2,  106 events processed so far  <<<===
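
A minimal sketch of matching both line types shown above, assuming the two sample lines are representative of the new format. The regular expressions and the `parse_line` helper are illustrative assumptions, not the patterns used by the actual monitoring scripts.

```python
import re

# Per-event timing line written by the simulation tool.
TIMING_RE = re.compile(
    r"Run:Event\s+(?P<run>\d+):(?P<event>\d+)\s+"
    r"\((?P<nth>\d+)(?:st|nd|rd|th) event for this worker\)\s+"
    r"took\s+(?P<secs>[\d.]+)\s*s\."
)

# Completion line written by the event loop manager.
DONE_RE = re.compile(
    r"done processing event #(?P<event>\d+), run #(?P<run>\d+) "
    r"on slot (?P<slot>\d+),\s+(?P<total>\d+) events processed so far"
)


def parse_line(line):
    """Return a dict describing the line, or None if it matches neither pattern."""
    m = TIMING_RE.search(line)
    if m:
        return {"kind": "timing", "run": int(m["run"]), "event": int(m["event"]),
                "nth": int(m["nth"]), "seconds": float(m["secs"])}
    m = DONE_RE.search(line)
    if m:
        return {"kind": "done", "run": int(m["run"]), "event": int(m["event"]),
                "slot": int(m["slot"]), "total": int(m["total"])}
    return None


timing_line = ("11:18:13 ISF_Kernel_FullG4MT_QS.ISF_LongLivedGeant4Tool       108     2    INFO \t "
               "Run:Event 410000:99609\t (29th event for this worker) took 239.9 s. "
               "New average 450.6 +- 44.52")
done_line = ("11:18:14 AthenaHiveEventLoopMgr                               108     2    INFO   "
             "===>>>  done processing event #99609, run #410000 on slot 2,  "
             "106 events processed so far  <<<===")

print(parse_line(timing_line))
print(parse_line(done_line))
```

Since both line types carry the run and event number, a monitor could join them on that pair instead of relying on their position in the file.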
72) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 7961)
Posted 15 Mar 2023 by computezrmle
Post:
Findings:

1.
On a fully loaded Threadripper it takes ~20 min to complete the initial setup.
This is mainly caused by lots of downloads from CVMFS and Frontier.
The long setup phase has already been mentioned by other testers.


2.
Frontier requests are sent to atlasfrontier-ai.cern.ch although they should go to atlascern-frontier.openhtc.io


3.
Looks like worker threads do not log their progress into separate logfiles any more.
This means ATLAS event monitoring (vbox app) will not work any more.
73) Message boards : General Discussion : Major Topic "Xtrack Application" (Message 7938)
Posted 9 Mar 2023 by computezrmle
Post:
@Laurence

Wouldn't it be a good idea to set up a new major topic "Xtrack Application" here:
https://lhcathomedev.cern.ch/lhcathome-dev/forum_index.php
74) Message boards : Theory Application : All errors (Message 7912)
Posted 5 Jan 2023 by computezrmle
Post:
You may have removed the VM entries but you obviously forgot to remove the disk entries.
Use the VirtualBox Media Manager to do this.
75) Message boards : Theory Application : All errors (Message 7910)
Posted 5 Jan 2023 by computezrmle
Post:
You need to manually clean the VirtualBox media registry.


1. Ensure no Theory task from -dev is running or paused
2. Stop BOINC
3. Open the VirtualBox GUI and start the Media Manager
4. Remove all child vdis connected to parent a9d19666-9f42-47d2-9b06-f58d52e3215c
5. Remove all files/folders from the slots where Theory -dev tasks had been.
6. Remove the parent vdi a9d19666-9f42-47d2-9b06-f58d52e3215c (but not its file within the projects folder).
7. Restart BOINC

Take care not to start too many tasks concurrently that try to initialize the same vdi as a multiattach parent, as this may cause a race condition.
This is caused by a workaround that vboxwrapper must use to correctly register a vdi in multiattach mode.
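
To find the child vdis mentioned in step 4 without clicking through the Media Manager, the output of `VBoxManage list hdds` can be filtered. The sketch below is read-only and removes nothing. It assumes the listing consists of "Key: value" blocks separated by blank lines and that each block has a "Parent UUID" field; the helper names are hypothetical.

```python
import re
import subprocess

PARENT = "a9d19666-9f42-47d2-9b06-f58d52e3215c"


def parse_hdds(listing):
    """Split a `VBoxManage list hdds` listing into one dict per disk."""
    disks = []
    for block in re.split(r"\n\s*\n", listing.strip()):
        entry = {}
        for line in block.splitlines():
            # Split at the first colon only, so Windows paths like C:\... survive.
            key, sep, value = line.partition(":")
            if sep:
                entry[key.strip()] = value.strip()
        if entry:
            disks.append(entry)
    return disks


def children_of(listing, parent_uuid):
    """Return the disks whose 'Parent UUID' equals parent_uuid."""
    return [d for d in parse_hdds(listing) if d.get("Parent UUID") == parent_uuid]


def list_children_live(parent_uuid=PARENT):
    """Query the local VirtualBox installation (VBoxManage must be on PATH)."""
    out = subprocess.run(["VBoxManage", "list", "hdds"],
                         capture_output=True, text=True, check=True).stdout
    return children_of(out, parent_uuid)
```

Calling `list_children_live()` on the affected host would return one dict per child disk, including its UUID and location, which can then be removed in the Media Manager as described above.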
76) Message boards : ATLAS Application : ATLAS vbox v.1.27 (Message 7908)
Posted 8 Dec 2022 by computezrmle
Post:
This will happen occasionally.
It is caused by VirtualBox's vdi handling which moves the media entry back and forth between different configuration files.
Version 7.x makes it even worse compared to v6.1 as it violates its own rule not to allow the multiattach attribute within the global media registry.

So far the only solution is to manually clear orphaned media entries.
77) Message boards : News : Server Release 1.4.0 (Message 7906)
Posted 3 Dec 2022 by computezrmle
Post:
Theory validator is not running.
https://lhcathomedev.cern.ch/lhcathome-dev/server_status.php
78) Message boards : ATLAS Application : ATLAS vbox v.1.26 (Message 7893)
Posted 25 Nov 2022 by computezrmle
Post:
It's just an info message, not even a warning.
And it's written by the VirtualBox core.

It should go away if the host computer's RTC and every OS (host/guests) are set to use UTC.
More details can be found in this PR:
https://github.com/BOINC/boinc/pull/4631

Vboxwrapper 26206 used here includes that patch, but the clock and the regkey must be set correctly once by the user.
79) Message boards : General Discussion : Got 0 new tasks (Message 7880)
Posted 11 Nov 2022 by computezrmle
Post:
As for ATLAS:
On the prefs page (https://lhcathomedev.cern.ch/lhcathome-dev/prefs.php?subset=project) disable "Run native if available?"

As for GPU work:
The xtrack queue is empty.
The AMD/ATI message is an old and well-known server issue: it is sent whenever none of the subprojects sends any work.
80) Message boards : News : Server Release 1.4.0 (Message 7876)
Posted 10 Nov 2022 by computezrmle
Post:
Did you set this in your BOINC client's cc_config.xml?
<dont_check_file_sizes>1</dont_check_file_sizes>
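
A quick way to answer that question is to parse the file and look for the option. This is a sketch that assumes the option sits inside the `<options>` section of `cc_config.xml`; the helper name is hypothetical and the file path depends on the BOINC installation.

```python
import xml.etree.ElementTree as ET


def dont_check_file_sizes_enabled(xml_text):
    """Return True if <dont_check_file_sizes> is set to 1 under <options>."""
    root = ET.fromstring(xml_text)
    node = root.find("./options/dont_check_file_sizes")
    return node is not None and (node.text or "").strip() == "1"


sample = """<cc_config>
  <options>
    <dont_check_file_sizes>1</dont_check_file_sizes>
  </options>
</cc_config>"""

print(dont_check_file_sizes_enabled(sample))
```

To check a real installation, read the contents of the client's `cc_config.xml` from the BOINC data directory and pass them to the function.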



