Message boards : ATLAS Application : ATLAS vbox and native 3.01

David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 8002 - Posted: 21 Mar 2023, 7:53:16 UTC - in response to Message 8001.  

Not sure if I correctly understand the new RAM setting for ATLAS VMs.

Old:
RAM is set according to 3000 MB + 900 MB * #cores

New:
Fixed RAM setting, currently 2241 MB, may become 4000 MB (?)

Since the new VM should be able to run either old/new ATLAS version how will the RAM setting be configured?


We will use the old setting until we stop running the old version. Then, a constant setting can be used since the memory usage of the new software is independent of number of cores. On average it uses around 2.5GB so we'll probably set something like 3.5GB to have a safety factor.
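
To illustrate the difference, here is a minimal Python sketch comparing the old per-core rule with a constant setting. The 3500 MB constant only stands in for the "something like 3.5GB" figure above and is not a final value; the function and constant names are made up for this example.

```python
def old_vm_ram_mb(ncores: int) -> int:
    """Old rule quoted above: 3000 MB plus 900 MB per core."""
    return 3000 + 900 * ncores


# Assumption: placeholder for the "something like 3.5GB" constant setting.
PROPOSED_CONSTANT_MB = 3500

for ncores in (1, 2, 4, 8):
    saved = old_vm_ram_mb(ncores) - PROPOSED_CONSTANT_MB
    print(f"{ncores} cores: old {old_vm_ram_mb(ncores)} MB, "
          f"constant {PROPOSED_CONSTANT_MB} MB, difference {saved} MB")
```

The gap grows with the core count because only the old rule scales with the number of cores.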
ID: 8002 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
maeax

Joined: 22 Apr 16
Posts: 677
Credit: 2,002,766
RAC: 1
Message 8003 - Posted: 21 Mar 2023, 9:19:03 UTC - in response to Message 7985.  

Sorry, but I'm against these smaller tasks. I think it was a good balance with the 200 Events-WUs and vote for 500 events in future

Perhaps you can make it configurable in LHC-Preferences and the user can switch between 200 and 500 tasks, that could help.


We used to have the long-running native app on Linux with 1,000 collisions per task.
Is it possible to reactivate it, maybe with those 2,000?
Then we could choose whether to activate it or not.
ID: 8003 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 0
Message 8004 - Posted: 21 Mar 2023, 9:45:07 UTC - in response to Message 8003.  

KISS (keep it simple and stupid)
There's no need to maintain (and explain) an additional (very long running!) app_version just to pack some more events into a task.
ID: 8004 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 0
Message 8005 - Posted: 21 Mar 2023, 10:00:45 UTC

I'm still getting Frontier data from atlasfrontier-ai.cern.ch.
ID: 8005 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
maeax

Joined: 22 Apr 16
Posts: 677
Credit: 2,002,766
RAC: 1
Message 8006 - Posted: 21 Mar 2023, 10:04:13 UTC - in response to Message 7984.  
Last modified: 21 Mar 2023, 10:15:13 UTC

From the ATLAS side we prefer longer tasks mainly because it makes bookkeeping and data transfer easier, and in fact on the ATLAS grid we run 2000 event tasks now. But of course that is a very different environment and it's more important to keep volunteers happy here.

This info is from David.
The ATLAS project team is the one doing the analysis.
ID: 8006 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 8008 - Posted: 21 Mar 2023, 10:55:24 UTC

With stable checkpointing, 2,000 events with 3.01 wouldn't be a problem for a lot of volunteers, including me.
ID: 8008 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 0
Message 8009 - Posted: 21 Mar 2023, 11:20:09 UTC - in response to Message 8002.  

David Cameron wrote:
We will use the old setting until we stop running the old version. Then, a constant setting can be used since the memory usage of the new software is independent of number of cores. On average it uses around 2.5GB so we'll probably set something like 3.5GB to have a safety factor.

I just ran another VM (3.5 GB) that used 2.2 GB for the scientific app.
It ran fine but had little RAM headroom.

If 2.5 GB is the expected average for the scientific app I'd like to suggest 4 GB for the VM to leave enough headroom for the OS and the VM internal page cache:
2.5 GB (scientific app) + 0.6 GB (OS) + 0.9 GB (page cache + small headroom) = 4 GB (total)
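
The same budget as a small Python sketch, in case someone wants to plug in other numbers. The component values are the estimates from this post; the helper name is only illustrative.

```python
# Component estimates in GB, taken from the post above.
budget_gb = {
    "scientific app": 2.5,
    "guest OS": 0.6,
    "page cache + small headroom": 0.9,
}
total_gb = sum(budget_gb.values())


def headroom_gb(vm_ram_gb: float, app_gb: float) -> float:
    """RAM left in the VM for the guest OS and page cache after the app."""
    return vm_ram_gb - app_gb


print(f"suggested VM size: {total_gb:.1f} GB")
print(f"3.5 GB VM with a 2.2 GB app leaves {headroom_gb(3.5, 2.2):.1f} GB")
```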
ID: 8009 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 61
Message 8010 - Posted: 21 Mar 2023, 12:08:46 UTC

I had 16 dual-core tasks, running 8 concurrently, with the VM set up to use 3500 MB RAM.
15 of them did not use swap at all; one used 8 kB of swap.
I want to try this with 4-core tasks, but no tasks are available.
ID: 8010 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 8011 - Posted: 21 Mar 2023, 16:03:34 UTC - in response to Message 8005.  

I'm still getting Frontier data from atlasfrontier-ai.cern.ch.


Can you give an example task where you see that? In recent tasks from the last few days I see

export FRONTIER_SERVER="(serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)(serverurl=http://atlasfrontier-ai.cern.ch:8000/atlr)(proxyurl=...
ID: 8011 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 8012 - Posted: 21 Mar 2023, 16:07:19 UTC - in response to Message 8008.  

With stable checkpointing, 2,000 events with 3.01 wouldn't be a problem for a lot of volunteers, including me.


Unfortunately our software still does not support proper checkpointing despite some efforts to implement it over the years. I would like to re-introduce the long running tasks for people with dedicated reliable resources but unfortunately I don't have time to maintain two app versions at the moment.
ID: 8012 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 8013 - Posted: 21 Mar 2023, 16:10:35 UTC - in response to Message 8009.  
Last modified: 21 Mar 2023, 16:11:27 UTC

David Cameron wrote:
We will use the old setting until we stop running the old version. Then, a constant setting can be used since the memory usage of the new software is independent of number of cores. On average it uses around 2.5GB so we'll probably set something like 3.5GB to have a safety factor.

I just ran another VM (3.5 GB) that used 2.2 GB for the scientific app.
It ran fine but had little RAM headroom.

If 2.5 GB is the expected average for the scientific app I'd like to suggest 4 GB for the VM to leave enough headroom for the OS and the VM internal page cache:
2.5 GB (scientific app) + 0.6 GB (OS) + 0.9 GB (page cache + small headroom) = 4 GB (total)


This sounds reasonable. If possible, I would like to require slightly less than 4 GB so as not to exclude hosts that have 4 GB total RAM (usually ~3.8 GB usable). But maybe it is unrealistic for those hosts to run their native OS, the vbox OS and an ATLAS task all within 4 GB.

Edit: the memory here for new tasks is now set to 4GB + 0.1GB * ncores
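
As a quick sketch of what the new setting means per core count (Python, with hypothetical function names; the fit check only compares the VM size with the host's usable RAM and ignores what the host OS itself needs):

```python
def new_vm_ram_gb(ncores: int) -> float:
    """Setting from the edit above: 4 GB plus 0.1 GB per core."""
    return 4.0 + 0.1 * ncores


def fits_on_host(ncores: int, usable_host_gb: float) -> bool:
    """Naive check: does the VM allocation fit into the host's usable RAM?"""
    return new_vm_ram_gb(ncores) <= usable_host_gb


for ncores in (1, 2, 4, 8):
    print(f"{ncores} cores: {new_vm_ram_gb(ncores):.1f} GB")

# A host with 4 GB total RAM (about 3.8 GB usable) cannot fit even a 1-core VM.
print(fits_on_host(1, 3.8))
```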
ID: 8013 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 0
Message 8014 - Posted: 21 Mar 2023, 16:27:09 UTC - in response to Message 8011.  

My testclient running ATLAS tasks from -dev today made 0 requests to atlascern-frontier.openhtc.io but 3225 requests to atlasfrontier-ai.cern.ch via my local Squid.
first and last:
[21/Mar/2023:10:56:45 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNoLdvVxdQ5RCHMNCvb091NwC-L3VQgI8ncJdQ6Jd-b3DfD3c-ULiYdJh3u4BrnC5BV8PL1dFdT9ixKTc1IVXBJLEpMSi1NV1QHuABhy HTTP/1.0" 200 1256 "-" "-" TCP_REFRESH_MODIFIED:HIER_DIRECT
.
.
.
[21/Mar/2023:11:00:20 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNqNkL0KwkAQhF9luSZNCCoSsEixZtfLwf2EvRVJlfd-C4NahJxoypn5ZorJ7LlX4MD9iIIhz4SKzUbPjuqCGQcnSVFdimWY0VoXbRm4GFmyx6gvtwTSXdcA3CQFMKgeM5FpzIY3UDj1D-ykaMvK230MLAxfovUdi1zegK46Htr2fKkAI.24r-sz.8EgCbHAddox.QQHvXj6 HTTP/1.0" 200 1517 "-" "-" TCP_REFRESH_UNMODIFIED:HIER_DIRECT

The requests were sent by this task:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3195547
All my ATLAS tests (native/vbox) within the last few days used the standard files sent by the project server after a project reset, except the local app_config.xml where I configured the RAM settings to be tested.


ATLAS tasks from -prod use atlascern-frontier.openhtc.io as usual.
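
For anyone who wants to repeat the count on their own proxy, here is a minimal Python sketch that tallies requests per destination host. It assumes access log lines shaped like the two shown above (a custom Squid log format with the request line in double quotes); other log formats would need a different pattern. The sample lines are shortened copies of the ones above with the p1 payload left out.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches the quoted request line, e.g. "GET http://host:port/path HTTP/1.0"
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')


def count_hosts(lines):
    """Count requests per destination hostname in access log lines."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            host = urlsplit(match.group(1)).hostname
            if host:
                counts[host] += 1
    return counts


sample = [
    '[21/Mar/2023:10:56:45 +0100] "GET http://atlasfrontier-ai.cern.ch:8000'
    '/atlr/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5 HTTP/1.0"'
    ' 200 1256 "-" "-" TCP_REFRESH_MODIFIED:HIER_DIRECT',
    '[21/Mar/2023:11:00:20 +0100] "GET http://atlasfrontier-ai.cern.ch:8000'
    '/atlr/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5 HTTP/1.0"'
    ' 200 1517 "-" "-" TCP_REFRESH_UNMODIFIED:HIER_DIRECT',
]

for host, n in count_hosts(sample).most_common():
    print(host, n)
```

To run it on a real log, pass an open file object instead of the sample list.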
ID: 8014 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 61
Message 8015 - Posted: 21 Mar 2023, 17:06:20 UTC - in response to Message 8013.  
Last modified: 21 Mar 2023, 17:08:42 UTC

I got a 500-event task on my dual-core / 4-thread laptop.
I configured it to run a 4-core VM with 3500 MB RAM.
Swap used so far is 'only' 1024 kB. After 2 hours of uptime, 49 of the 500 events are done.
The host's VBoxHeadless.exe CPU priority is set to below normal, but of course the laptop gets a bit sluggish.
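
A rough runtime estimate from these numbers, as a small Python sketch. It assumes the event rate stays constant, which is only an approximation because the uptime also includes VM boot and task setup; the function name is made up for this example.

```python
def estimated_total_hours(events_done: int, events_total: int,
                          elapsed_hours: float) -> float:
    """Linear extrapolation of the total runtime from progress so far."""
    if events_done <= 0:
        raise ValueError("need at least one finished event to extrapolate")
    return elapsed_hours * events_total / events_done


total = estimated_total_hours(49, 500, 2.0)
print(f"estimated total: {total:.1f} h, remaining: {total - 2.0:.1f} h")
```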
ID: 8015 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 8016 - Posted: 21 Mar 2023, 20:33:35 UTC - in response to Message 8014.  

My testclient running ATLAS tasks from -dev today made 0 requests to atlascern-frontier.openhtc.io but 3225 requests to atlasfrontier-ai.cern.ch via my local Squid.
first and last:
[21/Mar/2023:10:56:45 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNoLdvVxdQ5RCHMNCvb091NwC-L3VQgI8ncJdQ6Jd-b3DfD3c-ULiYdJh3u4BrnC5BV8PL1dFdT9ixKTc1IVXBJLEpMSi1NV1QHuABhy HTTP/1.0" 200 1256 "-" "-" TCP_REFRESH_MODIFIED:HIER_DIRECT
.
.
.
[21/Mar/2023:11:00:20 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNqNkL0KwkAQhF9luSZNCCoSsEixZtfLwf2EvRVJlfd-C4NahJxoypn5ZorJ7LlX4MD9iIIhz4SKzUbPjuqCGQcnSVFdimWY0VoXbRm4GFmyx6gvtwTSXdcA3CQFMKgeM5FpzIY3UDj1D-ykaMvK230MLAxfovUdi1zegK46Htr2fKkAI.24r-sz.8EgCbHAddox.QQHvXj6 HTTP/1.0" 200 1517 "-" "-" TCP_REFRESH_UNMODIFIED:HIER_DIRECT

The requests were sent by this task:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3195547
All my ATLAS tests (native/vbox) within the last few days used the standard files sent by the project server after a project reset, except the local app_config.xml where I configured the RAM settings to be tested.


ATLAS tasks from -prod use atlascern-frontier.openhtc.io as usual.


Hmm, I really don't understand this. I checked the log of this task and it looks correct:

2023-03-21 09:54:07,695 [wrapper] Content of /home/atlas/RunAtlas//setup.sh.local
export FRONTIER_SERVER="(serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)(serverurl=http://atlasfrontier-ai.cern.ch:8000/atlr)(proxyurl=http://on....

and this configuration is the same in dev and prod.
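
For reference, a FRONTIER_SERVER value like the one above can be split into its ordered server and proxy lists with a few lines of Python. This is only a sketch for reading the setting, not how the Frontier client itself parses it. The proxy URL in the example is a hypothetical placeholder, since the real one is truncated in the log excerpt.

```python
import re


def parse_frontier_server(value: str) -> dict:
    """Split a FRONTIER_SERVER string into ordered serverurl/proxyurl lists."""
    parsed = {"serverurl": [], "proxyurl": []}
    for key, url in re.findall(r"\((serverurl|proxyurl)=([^)]*)\)", value):
        parsed[key].append(url)
    return parsed


example = (
    "(serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)"
    "(serverurl=http://atlasfrontier-ai.cern.ch:8000/atlr)"
    "(proxyurl=http://proxy.example:3128)"  # hypothetical proxy, not from the log
)

config = parse_frontier_server(example)
print("servers in order:", config["serverurl"])
print("proxies in order:", config["proxyurl"])
```

The order matters here: the first serverurl entry is the one expected to be used, and later entries act as fallbacks.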
ID: 8016 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
maeax

Joined: 22 Apr 16
Posts: 677
Credit: 2,002,766
RAC: 1
Message 8017 - Posted: 21 Mar 2023, 22:47:26 UTC

The requests were sent by this task:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3195547

These entries are different:

2023-03-21 10:53:07 (65540): Guest Log: Copied input files into RunAtlas.
2023-03-21 10:53:08 (65540): Guest Log: Detected user-configured HTTP proxy at http://<hostname_censored_by_volunteer/>:3128 - will set in /etc/cvmfs/default.local
2023-03-21 10:53:21 (65540): Guest Log: Running cvmfs_config stat atlas.cern.ch
2023-03-21 10:53:22 (65540): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2023-03-21 10:53:22 (65540): Guest Log: 2.6.3.0 1561 0 29644 116996 3 1 3116360 4096000 0 65024 0 0 n/a 0 0 http://s1cern-cvmfs.openhtc.io/cvmfs/atlas.cern.ch http://192.0.2.52:3128 1
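
The stat output is easier to read when the header and value lines are paired up. A minimal Python sketch, using the two lines above with the "Guest Log:" prefix stripped; the function name is only illustrative.

```python
HEADER = ("VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS "
          "CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) "
          "RX(K) SPEED(K/S) HOST PROXY ONLINE")
VALUES = ("2.6.3.0 1561 0 29644 116996 3 1 3116360 4096000 0 65024 0 0 n/a "
          "0 0 http://s1cern-cvmfs.openhtc.io/cvmfs/atlas.cern.ch "
          "http://192.0.2.52:3128 1")


def parse_cvmfs_stat(header_line: str, value_line: str) -> dict:
    """Pair whitespace-separated column names with their values."""
    names, values = header_line.split(), value_line.split()
    if len(names) != len(values):
        raise ValueError("header and value lines have different column counts")
    return dict(zip(names, values))


stat = parse_cvmfs_stat(HEADER, VALUES)
cache_used_pct = 100 * int(stat["CACHEUSE(K)"]) / int(stat["CACHEMAX(K)"])

print("CVMFS host: ", stat["HOST"])
print("CVMFS proxy:", stat["PROXY"])
print(f"cache used:  {cache_used_pct:.1f} %")
```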
ID: 8017 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 0
Message 8019 - Posted: 22 Mar 2023, 8:58:41 UTC - in response to Message 8017.  

This is not related to the Frontier issue.
ID: 8019 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 8020 - Posted: 22 Mar 2023, 9:33:29 UTC - in response to Message 8016.  

I realise what's going on now. In dev I had changed the way the setup was done at the start of the job and so I think the setup.sh.local file was ignored.

I have changed this now to set things properly, and also set up squid auto-discovery (Web Proxy Auto-Discovery, WPAD), which should find the best squid server to use automatically.
ID: 8020 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 0
Message 8021 - Posted: 22 Mar 2023, 9:41:29 UTC
Last modified: 22 Mar 2023, 9:43:03 UTC

Hmm, I really don't understand this. I checked the log of this task and it looks correct:

2023-03-21 09:54:07,695 [wrapper] Content of /home/atlas/RunAtlas//setup.sh.local
export FRONTIER_SERVER="(serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)(serverurl=http://atlasfrontier-ai.cern.ch:8000/atlr)(proxyurl=http://on....

and this configuration is the same in dev and prod.


setup.sh.local shows the Frontier setup that is intended to be used, but the setup that is really used is reported in log.EVNTtoHITS.
The old ATLAS version reports the Frontier proxies there, like:
07:10:02 DBReplicaSvc         INFO Frontier server at (serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)...

Do you have access to the corresponding log.EVNTtoHITS from the example task?


atlasfrontier-ai.cern.ch would be used as the first fallback if atlascern-frontier.openhtc.io doesn't respond.
It is very unlikely that within a couple of days all dev tasks run into fail-over while all prod tasks don't.
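
A small Python sketch for pulling the Frontier servers actually in use out of log.EVNTtoHITS. It assumes lines shaped like the DBReplicaSvc example above from the old ATLAS version; whether the new version logs the same line is not confirmed here. The function name is made up for this example.

```python
import re

# Matches lines such as:
# 07:10:02 DBReplicaSvc         INFO Frontier server at (serverurl=...)...
FRONTIER_LINE_RE = re.compile(r"DBReplicaSvc\s+INFO\s+Frontier server at\s+(.*)$")


def frontier_servers_in_log(lines):
    """Return all serverurl values reported by DBReplicaSvc, in log order."""
    found = []
    for line in lines:
        match = FRONTIER_LINE_RE.search(line)
        if match:
            found.extend(re.findall(r"\(serverurl=([^)]*)\)", match.group(1)))
    return found


sample = [
    "07:10:02 DBReplicaSvc         INFO Frontier server at "
    "(serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)...",
    "07:10:03 SomeOtherSvc         INFO unrelated line",  # hypothetical filler
]
print(frontier_servers_in_log(sample))
```

Comparing this list with the contents of setup.sh.local shows whether the intended setup is the one that is really used.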

<edit>
Sorry, didn't notice your recent post.
</edit>
ID: 8021 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 0
Message 8022 - Posted: 22 Mar 2023, 9:55:42 UTC - in response to Message 8020.  
Last modified: 22 Mar 2023, 9:57:03 UTC

... also set up squid auto-discovery (Web Proxy Auto-Discovery, WPAD), which should find the best squid server to use automatically.

This needs to be tested, especially with complex wpad files.
Frontier clients/pacparser libs run into problems if the server list gets too long.

Please provide some test tasks to check the runtime logs for related Frontier errors/warnings.

<edit>
If possible, not the 500 event ones.
</edit>
ID: 8022 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 8023 - Posted: 22 Mar 2023, 11:03:30 UTC - in response to Message 8022.  

I have submitted a bunch of 20-event tasks.
ID: 8023 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote