ATLAS vbox and native 3.01
Author | Message |
---|---|
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
Not sure if I correctly understand the new RAM setting for ATLAS VMs. We will use the old setting until we stop running the old version. Then a constant setting can be used, since the memory usage of the new software is independent of the number of cores. On average it uses around 2.5 GB, so we'll probably set something like 3.5 GB to have a safety factor. |
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 2 |
Sorry, but I'm against these smaller tasks. I think the 200-event WUs were a good balance, and I vote for 500 events in future. We had the long-runner native app on Linux with 1,000 collisions. |
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 1 |
KISS (keep it simple, stupid). There's no need to maintain (and explain) an additional (very long-running!) app_version just to pack some more events into a task. |
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 1 |
I'm still getting Frontier data from atlasfrontier-ai.cern.ch. |
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 2 |
"From the ATLAS side we prefer longer tasks, mainly because it makes bookkeeping and data transfer easier, and in fact on the ATLAS grid we run 2000-event tasks now. But of course that is a very different environment and it's more important to keep volunteers happy here." This info is from David. The ATLAS project team is the analyzer. |
Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
With stable checkpointing, 2,000 events with 3.01 wouldn't be a problem for a lot of volunteers, including me. |
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 1 |
David Cameron wrote: "We will use the old setting until we stop running the old version. Then a constant setting can be used, since the memory usage of the new software is independent of the number of cores. On average it uses around 2.5 GB, so we'll probably set something like 3.5 GB to have a safety factor." I just ran another VM (3.5 GB) that used 2.2 GB for the scientific app. It ran fine but had little RAM headroom. If 2.5 GB is the expected average for the scientific app, I'd like to suggest 4 GB for the VM to leave enough headroom for the OS and the VM-internal page cache: 2.5 GB (scientific app) + 0.6 GB (OS) + 0.9 GB (page cache + small headroom) = 4 GB (total) |
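The budget in the post above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function names are made up for illustration, and applying the 0.6 GB OS figure to the 3.5 GB test VM is an assumption, not something measured in the thread.

```python
# Rough RAM budget for an ATLAS VM, using the figures quoted in the post above.
SCIENTIFIC_APP_GB = 2.5  # expected average for the scientific app
GUEST_OS_GB = 0.6        # guest OS
PAGE_CACHE_GB = 0.9      # page cache plus small headroom


def vm_ram_gb(app_gb, os_gb, cache_gb):
    """Sum the three components of the VM memory budget, in GB."""
    return round(app_gb + os_gb + cache_gb, 1)


def headroom_gb(vm_gb, app_gb, os_gb):
    """RAM left for page cache and spikes once app and OS are accounted for."""
    return round(vm_gb - app_gb - os_gb, 1)


if __name__ == "__main__":
    # Proposed setting: 2.5 + 0.6 + 0.9
    print(vm_ram_gb(SCIENTIFIC_APP_GB, GUEST_OS_GB, PAGE_CACHE_GB))
    # Test VM from the post: 3.5 GB VM, 2.2 GB used by the scientific app.
    # Assumes (hypothetically) the same 0.6 GB OS footprint.
    print(headroom_gb(3.5, 2.2, GUEST_OS_GB))
```

Under that assumption the 3.5 GB VM would have had roughly 0.7 GB left for page cache, which is consistent with "ran fine but had little RAM headroom".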
Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 2 |
I had 16 dual-core tasks, running 8 concurrently, with each VM set up to use 3500 MB RAM. 15 of them did not use swap at all; 1 used 8 kB of swap. I want to try this with 4-core tasks, but no tasks are available. |
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
"I'm still getting Frontier data from atlasfrontier-ai.cern.ch." Can you give an example task where you see that? In recent tasks from the last few days I see: export FRONTIER_SERVER="(serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)(serverurl=http://atlasfrontier-ai.cern.ch:8000/atlr)(proxyurl=... |
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
"With stable checkpointing, 2,000 events with 3.01 wouldn't be a problem for a lot of volunteers, including me." Unfortunately our software still does not support proper checkpointing, despite some efforts to implement it over the years. I would like to re-introduce the long-running tasks for people with dedicated, reliable resources, but unfortunately I don't have time to maintain two app versions at the moment. |
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
David Cameron wrote: "We will use the old setting until we stop running the old version. Then a constant setting can be used, since the memory usage of the new software is independent of the number of cores. On average it uses around 2.5 GB, so we'll probably set something like 3.5 GB to have a safety factor." This sounds reasonable. If possible, I would like to require slightly less than 4 GB so as not to exclude hosts which have 4 GB total RAM (usually ~3.8 GB usable). But maybe it is unrealistic for those hosts to run their native OS, the vbox OS and an ATLAS task all within 4 GB. Edit: the memory here for new tasks is now set to 4 GB + 0.1 GB * ncores. |
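To make the formula from the edit above concrete, here is a small sketch that evaluates 4 GB + 0.1 GB * ncores and compares the result with a host's usable RAM. The function names are hypothetical, and the 3.8 GB figure is simply the "usually ~3.8 GB usable" estimate from the post. The comparison ignores what the host OS itself needs.

```python
def atlas_vm_memory_gb(ncores: int) -> float:
    """Memory requested for a new-style ATLAS task: 4 GB + 0.1 GB per core."""
    if ncores < 1:
        raise ValueError("ncores must be at least 1")
    return round(4.0 + 0.1 * ncores, 1)


def fits_on_host(ncores: int, usable_gb: float) -> bool:
    """True if the requested VM memory does not exceed the host's usable RAM."""
    return atlas_vm_memory_gb(ncores) <= usable_gb


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, "cores:", atlas_vm_memory_gb(n), "GB")
    # A nominal 4 GB host with about 3.8 GB usable cannot satisfy the request.
    print(fits_on_host(1, 3.8))
```

With this setting even a single-core task asks for 4.1 GB, so hosts with 4 GB of total RAM are excluded, as anticipated in the post.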
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 1 |
My test client running ATLAS tasks from -dev today made 0 requests to atlascern-frontier.openhtc.io but 3225 requests to atlasfrontier-ai.cern.ch via my local Squid. First and last: [21/Mar/2023:10:56:45 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNoLdvVxdQ5RCHMNCvb091NwC-L3VQgI8ncJdQ6Jd-b3DfD3c-ULiYdJh3u4BrnC5BV8PL1dFdT9ixKTc1IVXBJLEpMSi1NV1QHuABhy HTTP/1.0" 200 1256 "-" "-" TCP_REFRESH_MODIFIED:HIER_DIRECT . . . [21/Mar/2023:11:00:20 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNqNkL0KwkAQhF9luSZNCCoSsEixZtfLwf2EvRVJlfd-C4NahJxoypn5ZorJ7LlX4MD9iIIhz4SKzUbPjuqCGQcnSVFdimWY0VoXbRm4GFmyx6gvtwTSXdcA3CQFMKgeM5FpzIY3UDj1D-ykaMvK230MLAxfovUdi1zegK46Htr2fKkAI.24r-sz.8EgCbHAddox.QQHvXj6 HTTP/1.0" 200 1517 "-" "-" TCP_REFRESH_UNMODIFIED:HIER_DIRECT The requests were sent by this task: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3195547 All my ATLAS tests (native/vbox) within the last few days used the standard files sent by the project server after a project reset, except the local app_config.xml, where I configured the RAM settings to be tested. ATLAS tasks from -prod use atlascern-frontier.openhtc.io as usual. |
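A per-host count like the one above can be reproduced from a Squid access log with a short script. The sketch below assumes log lines shaped like the two entries quoted in the post (a quoted request line containing the full URL); the sample lines are shortened, made-up stand-ins, and a different Squid logformat would need a different regular expression.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches the quoted request part of a log line, e.g. "GET http://host/... HTTP/1.0"
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')


def count_hosts(lines):
    """Count requests per destination hostname in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue  # not a request line in the assumed format
        host = urlsplit(match.group(1)).hostname
        if host:
            counts[host] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical, shortened sample lines (payloads replaced by "abc"/"def").
    sample = [
        '[21/Mar/2023:10:56:45 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/type=frontier_request:1:DEFAULT&p1=abc HTTP/1.0" 200 1256 "-" "-" TCP_REFRESH_MODIFIED:HIER_DIRECT',
        '[21/Mar/2023:11:00:20 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/type=frontier_request:1:DEFAULT&p1=def HTTP/1.0" 200 1517 "-" "-" TCP_REFRESH_UNMODIFIED:HIER_DIRECT',
    ]
    for host, n in count_hosts(sample).most_common():
        print(host, n)
```

To run it against a real log, pass an open file object instead of the sample list; the log path depends on the local Squid installation.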
Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 2 |
I got a 500-event task on my dual-core / 4-thread laptop. I configured it to run a 4-core VM with 3500 MB RAM. Swap used so far is 'only' 1024 kB. After 2 hours of uptime, 49 of the 500 events are ready. The host's VBoxHeadless.exe CPU priority is set to below normal, but of course the laptop gets a bit sluggish. |
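For a rough idea of what the progress above implies, the numbers can be extrapolated linearly. This is only a back-of-the-envelope sketch: it assumes a constant event rate and ignores VM boot and setup time, neither of which the post states, and the function name is made up.

```python
def estimate_total_hours(elapsed_hours: float, events_done: int, events_total: int) -> float:
    """Linear extrapolation of total runtime from progress so far."""
    if events_done <= 0:
        raise ValueError("need at least one finished event to extrapolate")
    return elapsed_hours * events_total / events_done


if __name__ == "__main__":
    total = estimate_total_hours(2.0, 49, 500)
    print(round(total, 1), "hours in total")     # about 20.4
    print(round(total - 2.0, 1), "hours to go")  # about 18.4
```

On that assumption a 500-event task on this laptop would take around 20 hours, which is the kind of runtime the discussion about task size and checkpointing is concerned with.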
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
"My test client running ATLAS tasks from -dev today made 0 requests to atlascern-frontier.openhtc.io but 3225 requests to atlasfrontier-ai.cern.ch via my local Squid." Hmm, I really don't understand this. I checked the log of this task and it looks correct: 2023-03-21 09:54:07,695 [wrapper] Content of /home/atlas/RunAtlas//setup.sh.local export FRONTIER_SERVER="(serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)(serverurl=http://atlasfrontier-ai.cern.ch:8000/atlr)(proxyurl=http://on.... and this configuration is the same in dev and prod. |
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 2 |
"The requests were sent by this task:" These entries are different: 2023-03-21 10:53:07 (65540): Guest Log: Copied input files into RunAtlas. 2023-03-21 10:53:08 (65540): Guest Log: Detected user-configured HTTP proxy at http://<hostname_censored_by_volunteer/>:3128 - will set in /etc/cvmfs/default.local 2023-03-21 10:53:21 (65540): Guest Log: Running cvmfs_config stat atlas.cern.ch 2023-03-21 10:53:22 (65540): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE 2023-03-21 10:53:22 (65540): Guest Log: 2.6.3.0 1561 0 29644 116996 3 1 3116360 4096000 0 65024 0 0 n/a 0 0 http://s1cern-cvmfs.openhtc.io/cvmfs/atlas.cern.ch http://192.0.2.52:3128 1 |
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 1 |
This is not related to the Frontier issue. |
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
I realise what's going on now. In dev I had changed the way the setup was done at the start of the job, and so I think the setup.sh.local file was ignored. I have changed this now to set things properly, and also set up Squid auto-discovery (Web Proxy Auto-Discovery, WPAD), which should find the best Squid server to use automatically. |
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 1 |
"Hmm, I really don't understand this. I checked the log of this task and it looks correct:" setup.sh.local shows the Frontier setup that is intended to be used, but the setup that is really used is reported in log.EVNTtoHITS. The old ATLAS version reports the Frontier proxies there, like: 07:10:02 DBReplicaSvc INFO Frontier server at (serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)... Do you have access to the corresponding log.EVNTtoHITS from the example task? atlasfrontier-ai.cern.ch would be used as the first fallback if atlascern-frontier.openhtc.io doesn't respond. It is very unlikely that within a couple of days all dev tasks run into fail-over while all prod tasks don't. <edit> Sorry, didn't notice your recent post. </edit> |
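The fallback order described above can be read straight off the FRONTIER_SERVER value. The following sketch splits such a string into its serverurl and proxyurl entries, preserving order. It assumes, following the post, that the first serverurl is tried first and the second acts as the first fallback; the function name is made up, and the example value contains only the two server entries that appear untruncated in the thread.

```python
import re

# Each entry looks like (serverurl=...) or (proxyurl=...)
ENTRY_RE = re.compile(r"\((serverurl|proxyurl)=([^)]*)\)")


def parse_frontier_server(value: str):
    """Return (servers, proxies) as ordered lists of URLs."""
    servers, proxies = [], []
    for kind, url in ENTRY_RE.findall(value):
        (servers if kind == "serverurl" else proxies).append(url)
    return servers, proxies


if __name__ == "__main__":
    # Only the server entries quoted in full in the thread; the proxy part
    # is truncated there and is deliberately left out.
    value = (
        "(serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)"
        "(serverurl=http://atlasfrontier-ai.cern.ch:8000/atlr)"
    )
    servers, proxies = parse_frontier_server(value)
    print("primary:  ", servers[0])
    print("fallbacks:", servers[1:])
    print("proxies:  ", proxies)
```

Comparing this intended order with the hosts that actually show up in the proxy log or in log.EVNTtoHITS is what reveals whether the configured setup is really in effect.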
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 1 |
"... also set up Squid auto-discovery (Web Proxy Auto-Discovery, WPAD), which should find the best Squid server to use automatically." This needs to be tested, especially with complex WPAD files. Frontier clients/pacparser libs run into problems if the server list gets too long. Please provide some test tasks to check the runtime logs for related Frontier errors/warnings. <edit> If possible, not the 500-event ones. </edit> |
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
I have submitted a bunch of 20-event tasks. |
©2024 CERN