1) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8048)
Posted 28 Mar 2023 by David Cameron
Post:
I was told that we will likely run out of Run 2 simulation tasks on the prod project very soon, so I have gone ahead and released version 3 there so we can start running Run 3 tasks.

Unfortunately I don't think we'll be able to resolve some of the remaining issues like the console monitoring before going live on prod but I think it's better to have something not quite perfect than no tasks at all.
2) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8036)
Posted 22 Mar 2023 by David Cameron
Post:
I'm seeing a 1.08 GB download in production.
Has the new version been transferred from -dev?
Now there is a second download on the same PC, 1.09 GB, in production.
Is the ATLAS application in prod still the old one?
In Germany we say "Holland in Not" (roughly: "big trouble").


No, it's still the old version in prod. It must be just new batches of tasks with large files. I'll ask the submitters why it's like this now.
3) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8035)
Posted 22 Mar 2023 by David Cameron
Post:
It seems the wpads need more testing. I've reverted to the previous settings.
4) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8030)
Posted 22 Mar 2023 by David Cameron
Post:
My ATLAS colleagues told me there is an environment variable to set so that the correct libraries are included when setting up the ATLAS software release. I've set this now and submitted some new tasks.
5) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8027)
Posted 22 Mar 2023 by David Cameron
Post:
There is a missing library for handling the wpads:

EVNTtoHITS 14:06:46 error [pacparser-dlopen.c:57]: config error: cannot dlopen libpacparser.so.1: cannot open shared object file: No such file or directory

I've cancelled all the bad WU in the system, please abort any running tasks.
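
For anyone who wants to check whether a given host can load the library named in the error, here is a minimal sketch using Python's standard `ctypes` module. The helper name `can_dlopen` is made up for illustration, and the check only tells you whether the dynamic loader of the environment you run it in can find the library, which may differ from the environment the ATLAS job itself sets up.

```python
import ctypes


def can_dlopen(name: str) -> bool:
    """Return True if the dynamic loader can open the shared library `name`."""
    try:
        ctypes.CDLL(name)  # uses dlopen() on Linux
    except OSError:
        # Raised when the library (or one of its dependencies) cannot be loaded.
        return False
    return True


if __name__ == "__main__":
    # Library name copied from the error message above.
    print("libpacparser.so.1 loadable:", can_dlopen("libpacparser.so.1"))
```

A result of `False` here is consistent with the "cannot open shared object file" message, but it does not say which search path is missing the library.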
6) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8023)
Posted 22 Mar 2023 by David Cameron
Post:
I have submitted a bunch of 20 event tasks.
7) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8020)
Posted 22 Mar 2023 by David Cameron
Post:
I realise what's going on now. In dev I had changed the way the setup was done at the start of the job and so I think the setup.sh.local file was ignored.

I have changed this now to set things properly, and also set up the squid auto-discovery (Web Proxy Auto Detection (WPAD)), which should find the best squid server to use automatically.
8) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8016)
Posted 21 Mar 2023 by David Cameron
Post:
My testclient running ATLAS tasks from -dev today made 0 requests to atlascern-frontier.openhtc.io but 3225 requests to atlasfrontier-ai.cern.ch via my local Squid.
first and last:
[21/Mar/2023:10:56:45 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNoLdvVxdQ5RCHMNCvb091NwC-L3VQgI8ncJdQ6Jd-b3DfD3c-ULiYdJh3u4BrnC5BV8PL1dFdT9ixKTc1IVXBJLEpMSi1NV1QHuABhy HTTP/1.0" 200 1256 "-" "-" TCP_REFRESH_MODIFIED:HIER_DIRECT
.
.
.
[21/Mar/2023:11:00:20 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNqNkL0KwkAQhF9luSZNCCoSsEixZtfLwf2EvRVJlfd-C4NahJxoypn5ZorJ7LlX4MD9iIIhz4SKzUbPjuqCGQcnSVFdimWY0VoXbRm4GFmyx6gvtwTSXdcA3CQFMKgeM5FpzIY3UDj1D-ykaMvK230MLAxfovUdi1zegK46Htr2fKkAI.24r-sz.8EgCbHAddox.QQHvXj6 HTTP/1.0" 200 1517 "-" "-" TCP_REFRESH_UNMODIFIED:HIER_DIRECT

The requests were sent by this task:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3195547
All my ATLAS tests (native/vbox) within the last few days used the standard files sent by the project server after a project reset, except the local app_config.xml where I configured the RAM settings to be tested.


ATLAS tasks from -prod use atlascern-frontier.openhtc.io as usual.
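
Per-host request counts like the ones quoted above can be pulled out of a Squid access log with a few lines of Python. This is only a sketch: the function name `count_hosts` is hypothetical, the regular expression assumes the log layout seen in the excerpt (a quoted `"GET <url> HTTP/1.0"` field), and the sample lines are shortened stand-ins, not real log entries.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches the quoted request field, e.g. "GET http://host:8000/path HTTP/1.0"
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')


def count_hosts(lines):
    """Count requests per destination host in Squid access log lines."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue  # skip lines that do not look like HTTP requests
        host = urlsplit(match.group(1)).hostname
        if host:
            counts[host] += 1
    return counts


# Shortened, made-up lines in the same layout as the excerpt above.
sample = [
    '[21/Mar/2023:10:56:45 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/type=frontier_request:1:DEFAULT HTTP/1.0" 200 1256 "-" "-" TCP_REFRESH_MODIFIED:HIER_DIRECT',
    '[21/Mar/2023:11:00:20 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/type=frontier_request:1:DEFAULT HTTP/1.0" 200 1517 "-" "-" TCP_REFRESH_UNMODIFIED:HIER_DIRECT',
]

if __name__ == "__main__":
    for host, n in count_hosts(sample).most_common():
        print(f"{n:6d}  {host}")
```

To run it against a real log, pass an open file object instead of `sample`; the log path depends on the local Squid setup.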


Hmm, I really don't understand this. I checked the log of this task and it looks correct:

2023-03-21 09:54:07,695 [wrapper] Content of /home/atlas/RunAtlas//setup.sh.local
export FRONTIER_SERVER="(serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)(serverurl=http://atlasfrontier-ai.cern.ch:8000/atlr)(proxyurl=http://on....

and this configuration is the same in dev and prod.
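
To make the structure of that `FRONTIER_SERVER` value easier to see, here is a small sketch that splits it into its `(key=value)` groups. The function name `parse_frontier_server` is made up for illustration. Only the two `serverurl` entries visible in the log are used, because the `proxyurl` part is truncated there. The sketch assumes the client tries servers in the order they are listed.

```python
import re


def parse_frontier_server(value: str) -> list[tuple[str, str]]:
    """Split a FRONTIER_SERVER string into (key, value) pairs, in order."""
    return re.findall(r"\((\w+)=([^()]*)\)", value)


# Only the part of the setting that is fully visible in the log above.
example = (
    "(serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)"
    "(serverurl=http://atlasfrontier-ai.cern.ch:8000/atlr)"
)

servers = [url for key, url in parse_frontier_server(example) if key == "serverurl"]

if __name__ == "__main__":
    for position, url in enumerate(servers, start=1):
        print(position, url)
```

With this setting the openhtc.io server is listed first and the cern.ch server second, so the configuration itself would not explain requests going only to the second one.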
9) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8013)
Posted 21 Mar 2023 by David Cameron
Post:
David Cameron wrote:
We will use the old setting until we stop running the old version. Then, a constant setting can be used since the memory usage of the new software is independent of number of cores. On average it uses around 2.5GB so we'll probably set something like 3.5GB to have a safety factor.

I just ran another VM (3.5 GB) that used 2.2 GB for the scientific app.
It ran fine but had little RAM headroom.

If 2.5 GB is the expected average for the scientific app I'd like to suggest 4 GB for the VM to leave enough headroom for the OS and the VM internal page cache:
2.5 GB (scientific app) + 0.6 GB (OS) + 0.9 GB (page cache + small headroom) = 4 GB (total)


This sounds reasonable. I would like if possible to require slightly less than 4GB so as not to exclude hosts which have 4GB total RAM (usually ~3.8GB usable). But maybe it is unrealistic for those hosts to run their native OS, vbox OS and ATLAS task all within 4GB.

Edit: the memory here for new tasks is now set to 4GB + 0.1GB * ncores
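
As a quick comparison of the memory settings discussed in this thread, the sketch below evaluates the old rule (3000 MB + 900 MB per core) and the new one (4 GB + 0.1 GB per core) for a few core counts. The function names are made up for illustration, and 1 GB is treated as 1000 MB so the two rules line up; the server's actual rounding may differ.

```python
def old_vm_memory_mb(ncores: int) -> int:
    """Old rule quoted in the thread: 3000 MB + 900 MB per core."""
    return 3000 + 900 * ncores


def new_vm_memory_mb(ncores: int) -> int:
    """New rule: 4 GB + 0.1 GB per core (assuming 1 GB = 1000 MB)."""
    return 4000 + 100 * ncores


# Headroom estimate suggested above: app + OS + page cache, in MB.
suggested_total_mb = 2500 + 600 + 900

if __name__ == "__main__":
    print("suggested VM size:", suggested_total_mb, "MB")
    print("cores  old(MB)  new(MB)")
    for n in (1, 2, 4, 8):
        print(f"{n:5d}  {old_vm_memory_mb(n):7d}  {new_vm_memory_mb(n):7d}")
```

Under these assumptions the new rule asks for slightly more memory than the old one for a single-core VM and less from two cores upwards.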
10) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8012)
Posted 21 Mar 2023 by David Cameron
Post:
With stable checkpointing, 2,000 events with 3.01 wouldn't be a problem for a lot of volunteers, including me.


Unfortunately our software still does not support proper checkpointing despite some efforts to implement it over the years. I would like to re-introduce the long running tasks for people with dedicated reliable resources but unfortunately I don't have time to maintain two app versions at the moment.
11) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8011)
Posted 21 Mar 2023 by David Cameron
Post:
I'm still getting Frontier data from atlasfrontier-ai.cern.ch.


Can you give an example task where you see that? In recent tasks from the last few days I see

export FRONTIER_SERVER="(serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)(serverurl=http://atlasfrontier-ai.cern.ch:8000/atlr)(proxyurl=...
12) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8002)
Posted 21 Mar 2023 by David Cameron
Post:
Not sure if I correctly understand the new RAM setting for ATLAS VMs.

Old:
RAM is set according to 3000 MB + 900 MB * #cores

New:
Fixed RAM setting, currently 2241 MB, may become 4000 MB (?)

Since the new VM should be able to run either old/new ATLAS version how will the RAM setting be configured?


We will use the old setting until we stop running the old version. Then, a constant setting can be used since the memory usage of the new software is independent of number of cores. On average it uses around 2.5GB so we'll probably set something like 3.5GB to have a safety factor.
13) Message boards : ATLAS Application : vbox console monitoring with 3.01 (Message 7995)
Posted 20 Mar 2023 by David Cameron
Post:
Not sure if the VM's CVMFS cache has been cleaned and refilled with ATLAS v3 data.


I ran a v3 task inside the existing prod image (since we'll have to run both in parallel for some time) and used that image here, so the software should be cached. This is why the image is much larger than before (4.4GB vs 2.8GB).

825 MB are caused by swap usage since the VM has only 2241 MB RAM.


This is not intentional, something is not configured correctly after removing the per-core memory scaling. I am checking it.

The monitoring scripts assume a single-core task although I'm running a 4-core VM.


I thought that the new tasks would all behave the same as old single-core tasks, writing to a single file. I changed the code searching for the event times to handle the old and new message format, but maybe something else needs to be changed. I will look deeper.

The dev task I'm currently running still requests Frontier data from atlasfrontier-ai.cern.ch instead of atlascern-frontier.openhtc.io.


This was changed last week but I forgot to restart one service to pick up the changes. New tasks should have the correct frontier.
14) Message boards : ATLAS Application : vbox console monitoring with 3.01 (Message 7986)
Posted 20 Mar 2023 by David Cameron
Post:
The event processing monitoring in console 2 was not working with the new tasks we were trying here. This was due to a slightly different logging format in the new ATLAS software version. I have changed the monitoring tool to work with both old and new formats, and submitted some new tasks (shorter with 50 events each). It would be great if someone can confirm if the console works now or not (it's hard for me to test it myself).
15) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 7984)
Posted 20 Mar 2023 by David Cameron
Post:
Thanks for the input, then we will stay with 200 event tasks when this new version goes to prod.

From the ATLAS side we prefer longer tasks mainly because it makes bookkeeping and data transfer easier, and in fact on the ATLAS grid we run 2000 event tasks now. But of course that is a very different environment and it's more important to keep volunteers happy here.
16) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 7981)
Posted 20 Mar 2023 by David Cameron
Post:
The new ATLAS software is much faster than the old version, so each event takes 3 minutes compared to 5 minutes on average previously. This is why I tried 500 event tasks here compared to the usual 200 on prod.

The downside of faster software is a larger HITS file to upload, so please let me know your opinions on whether you prefer quick tasks with smaller output files or longer tasks with larger output files.
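
To put rough numbers on the trade-off, the sketch below multiplies event counts by the average per-event times quoted above. It is only back-of-the-envelope arithmetic: it ignores the setup phase and upload time, and it does not divide by the number of cores. The function name is made up for illustration.

```python
def processing_minutes(events: int, minutes_per_event: float) -> float:
    """Total event-processing time, ignoring setup and upload."""
    return events * minutes_per_event


scenarios = [
    ("old software, 200 events", 200, 5),
    ("new software, 200 events", 200, 3),
    ("new software, 500 events", 500, 3),
]

if __name__ == "__main__":
    for label, events, per_event in scenarios:
        total = processing_minutes(events, per_event)
        print(f"{label}: {total:.0f} min (~{total / 60:.1f} h)")
```

So a 500-event task on the new software is about 1.5 times the processing time of a 200-event task on the old software (1500 vs 1000 minutes), while a 200-event task on the new software takes about 600 minutes.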
17) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 7973)
Posted 16 Mar 2023 by David Cameron
Post:
I think the others are referring to the time between starting the task and the task using all the CPU cores for processing, which is 5-20 mins depending on load/location/connection.

The vbox tasks at the moment are getting the wrong memory setting (2241MB) - they still succeed but are using swap space so are slower. I'm working on fixing this.
18) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 7969)
Posted 15 Mar 2023 by David Cameron
Post:
I am trying out setting the total memory to 4GB (independent of #cores) for vbox tasks, since I see the memory used is always around 2.3GB. Let me know if you see any problems due to this.
19) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 7968)
Posted 15 Mar 2023 by David Cameron
Post:
Those warnings can be ignored, and the first lines about SSL certificate failures should not be present in new jobs today.

I looked in detail at the logs of this task and it looked like it took 8 or 9 minutes to get going at full steam. But it could still be that the initialisation phase is indeed longer in this new software.
20) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 7963)
Posted 15 Mar 2023 by David Cameron
Post:
Findings:

1.
On a fully loaded Threadripper it takes ~20 min to complete the initial setup.
This is mainly caused by lots of downloads from CVMFS and Frontier.
The long setup phase has already been mentioned by other testers.


A long setup time would be expected for the first native task, since the CVMFS cache needs to be filled with the new software libraries. I'd be interested to see the timing for subsequent tasks on the same host. On my test host running native tasks (inside CERN, so ideal conditions) and a warm cache it takes around 5 mins to start crunching.

2.
Frontier requests are sent to atlasfrontier-ai.cern.ch although they should go to atlascern-frontier.openhtc.io


Right, this is an error on my part, I will fix it.

3.
Looks like worker threads do not log their progress into separate logfiles any more.
This means ATLAS event monitoring (vbox app) will not work any more.


This is a feature of the way the new software works. Instead of multiple processes, it uses multiple threads which makes memory usage much more efficient. (background reading for anyone interested)

This means all the threads log to a single file, log.EVNTtoHITS, which is how single core tasks worked before. I suppose the monitoring worked for single core tasks in the past so it should still work now?

