ATLAS vbox and native 3.01

Author	Message
David Cameron Project administrator Project developer Project tester Project scientist Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0	Message 8002 - Posted: 21 Mar 2023, 7:53:16 UTC - in response to Message 8001.
Not sure if I correctly understand the new RAM setting for ATLAS VMs.
Old: RAM is set according to 3000 MB + 900 MB * #cores
New: Fixed RAM setting, currently 2241 MB, may become 4000 MB (?)
Since the new VM should be able to run either old/new ATLAS version how will the RAM setting be configured?
We will use the old setting until we stop running the old version. Then, a constant setting can be used since the memory usage of the new software is independent of number of cores. On average it uses around 2.5GB so we'll probably set something like 3.5GB to have a safety factor.
ID: 8002 · Rating: 0 · rate: / Reply Quote

maeax Send message Joined: 22 Apr 16 Posts: 675 Credit: 1,976,951 RAC: 1,683	Message 8003 - Posted: 21 Mar 2023, 9:19:03 UTC - in response to Message 7985. Sorry, but I'm against these smaller tasks. I think it was a good balance with the 200-event WUs, and I vote for 500 events in future. Perhaps you can make it configurable in LHC-Preferences so the user can switch between 200- and 500-event tasks; that could help. We had the long-running native app on Linux with 1,000 collisions. Is it possible to reactivate it, maybe with those 2,000? Then we could choose whether to activate it or not. ID: 8003 · Rating: 0 · rate: / Reply Quote

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 481 Credit: 394,720 RAC: 0	Message 8004 - Posted: 21 Mar 2023, 9:45:07 UTC - in response to Message 8003. KISS (keep it simple and stupid) There's no need to maintain (and explain) an additional (very long running!) app_version just to pack some more events into a task. ID: 8004 · Rating: 0 · rate: / Reply Quote

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 481 Credit: 394,720 RAC: 0	Message 8005 - Posted: 21 Mar 2023, 10:00:45 UTC I'm still getting Frontier data from atlasfrontier-ai.cern.ch. ID: 8005 · Rating: 0 · rate: / Reply Quote

maeax Send message Joined: 22 Apr 16 Posts: 675 Credit: 1,976,951 RAC: 1,683	Message 8006 - Posted: 21 Mar 2023, 10:04:13 UTC - in response to Message 7984. Last modified: 21 Mar 2023, 10:15:13 UTC From the ATLAS side we prefer longer tasks mainly because it makes bookkeeping and data transfer easier, and in fact on the ATLAS grid we run 2000 event tasks now. But of course that is a very different environment and it's more important to keep volunteers happy here. This Info is from David. The Atlas Project Team is the Analyzer. ID: 8006 · Rating: 0 · rate: / Reply Quote

Yeti Send message Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0	Message 8008 - Posted: 21 Mar 2023, 10:55:24 UTC With stable checkpointing, 2,000 events with 3.01 wouldn't be a problem for a lot of volunteers, including me. ID: 8008 · Rating: 0 · rate: / Reply Quote

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 481 Credit: 394,720 RAC: 0	Message 8009 - Posted: 21 Mar 2023, 11:20:09 UTC - in response to Message 8002.
David Cameron wrote: We will use the old setting until we stop running the old version. Then, a constant setting can be used since the memory usage of the new software is independent of number of cores. On average it uses around 2.5GB so we'll probably set something like 3.5GB to have a safety factor.
I just ran another VM (3.5 GB) that used 2.2 GB for the scientific app. It ran fine but had little RAM headroom. If 2.5 GB is the expected average for the scientific app I'd like to suggest 4 GB for the VM to leave enough headroom for the OS and the VM internal page cache:
2.5 GB (scientific app) + 0.6 GB (OS) + 0.9 GB (page cache + small headroom) = 4 GB (total)
ID: 8009 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1188 Credit: 854,809 RAC: 17	Message 8010 - Posted: 21 Mar 2023, 12:08:46 UTC I had 16 dual-core tasks, running 8 concurrently, with each VM set up to use 3500 MB RAM. 15 of them did not use swap at all; one used 8 kB of swap. I want to try this with 4-core tasks, but no tasks are available. ID: 8010 · Rating: 0 · rate: / Reply Quote

David Cameron Project administrator Project developer Project tester Project scientist Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0	Message 8011 - Posted: 21 Mar 2023, 16:03:34 UTC - in response to Message 8005.
computezrmle wrote: I'm still getting Frontier data from atlasfrontier-ai.cern.ch.
Can you give an example task where you see that? In recent tasks from the last few days I see
export FRONTIER_SERVER="(serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)(serverurl=http://atlasfrontier-ai.cern.ch:8000/atlr)(proxyurl=...
ID: 8011 · Rating: 0 · rate: / Reply Quote

David Cameron Project administrator Project developer Project tester Project scientist Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0	Message 8012 - Posted: 21 Mar 2023, 16:07:19 UTC - in response to Message 8008.
Yeti wrote: With stable checkpointing, 2,000 events with 3.01 wouldn't be a problem for a lot of volunteers, including me.
Unfortunately our software still does not support proper checkpointing despite some efforts to implement it over the years. I would like to re-introduce the long-running tasks for people with dedicated reliable resources, but unfortunately I don't have time to maintain two app versions at the moment.
ID: 8012 · Rating: 0 · rate: / Reply Quote

David Cameron Project administrator Project developer Project tester Project scientist Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0	Message 8013 - Posted: 21 Mar 2023, 16:10:35 UTC - in response to Message 8009. Last modified: 21 Mar 2023, 16:11:27 UTC
computezrmle wrote:
David Cameron wrote: We will use the old setting until we stop running the old version. Then, a constant setting can be used since the memory usage of the new software is independent of number of cores. On average it uses around 2.5GB so we'll probably set something like 3.5GB to have a safety factor.
I just ran another VM (3.5 GB) that used 2.2 GB for the scientific app. It ran fine but had little RAM headroom. If 2.5 GB is the expected average for the scientific app I'd like to suggest 4 GB for the VM to leave enough headroom for the OS and the VM internal page cache:
2.5 GB (scientific app) + 0.6 GB (OS) + 0.9 GB (page cache + small headroom) = 4 GB (total)
This sounds reasonable. I would like if possible to require slightly less than 4GB so as not to exclude hosts which have 4GB total RAM (usually ~3.8GB usable). But maybe it is unrealistic for those hosts to run their native OS, vbox OS and ATLAS task all within 4GB.
Edit: the memory here for new tasks is now set to 4GB + 0.1GB * ncores
ID: 8013 · Rating: 0 · rate: / Reply Quote
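
The two memory rules mentioned in this thread can be compared with a few lines of Python. This is only a sketch: the function names are made up for illustration, the old rule is the one quoted in Message 8002 (3000 MB + 900 MB * #cores), the new rule is the one from the edit in Message 8013 (4GB + 0.1GB * ncores), and treating 1 GB as 1000 MB is an assumption.

```python
def old_vm_ram_mb(ncores: int) -> int:
    """Old rule quoted in the thread: 3000 MB + 900 MB per core."""
    return 3000 + 900 * ncores


def new_vm_ram_mb(ncores: int) -> int:
    """New rule from the edit in Message 8013: 4 GB + 0.1 GB per core.

    Assumption: 1 GB is treated as 1000 MB here.
    """
    return 4000 + 100 * ncores


if __name__ == "__main__":
    # Print both settings side by side for a few core counts.
    for ncores in (1, 2, 4, 8):
        print(ncores, old_vm_ram_mb(ncores), new_vm_ram_mb(ncores))
```

Under these assumptions a single-core VM gets slightly more RAM with the new rule (4100 MB instead of 3900 MB), while a 4-core VM gets less (4400 MB instead of 6600 MB).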

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 481 Credit: 394,720 RAC: 0	Message 8014 - Posted: 21 Mar 2023, 16:27:09 UTC - in response to Message 8011.
My test client running ATLAS tasks from -dev today made 0 requests to atlascern-frontier.openhtc.io but 3225 requests to atlasfrontier-ai.cern.ch via my local Squid. First and last:
[21/Mar/2023:10:56:45 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNoLdvVxdQ5RCHMNCvb091NwC-L3VQgI8ncJdQ6Jd-b3DfD3c-ULiYdJh3u4BrnC5BV8PL1dFdT9ixKTc1IVXBJLEpMSi1NV1QHuABhy HTTP/1.0" 200 1256 "-" "-" TCP_REFRESH_MODIFIED:HIER_DIRECT
. . .
[21/Mar/2023:11:00:20 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNqNkL0KwkAQhF9luSZNCCoSsEixZtfLwf2EvRVJlfd-C4NahJxoypn5ZorJ7LlX4MD9iIIhz4SKzUbPjuqCGQcnSVFdimWY0VoXbRm4GFmyx6gvtwTSXdcA3CQFMKgeM5FpzIY3UDj1D-ykaMvK230MLAxfovUdi1zegK46Htr2fKkAI.24r-sz.8EgCbHAddox.QQHvXj6 HTTP/1.0" 200 1517 "-" "-" TCP_REFRESH_UNMODIFIED:HIER_DIRECT
The requests were sent by this task: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3195547
All my ATLAS tests (native/vbox) within the last few days used the standard files sent by the project server after a project reset, except the local app_config.xml where I configured the RAM settings to be tested.
ATLAS tasks from -prod use atlascern-frontier.openhtc.io as usual.
ID: 8014 · Rating: 0 · rate: / Reply Quote
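
A per-host request count like the one in this post can be reproduced from a Squid access log with a short script. The sketch below assumes log lines shaped like the two examples above (a quoted request line containing an absolute URL). The function name and the sample lines are illustrative only, and the long encoded query strings are replaced by the placeholder SHORTENED.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches the quoted request line, e.g. "GET http://host:port/path HTTP/1.0"
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')


def count_requests_by_host(lines):
    """Return a Counter mapping hostname -> number of logged requests."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # line does not follow the assumed format
        host = urlsplit(match.group(1)).hostname
        if host:
            counts[host] += 1
    return counts


# Illustrative sample lines; the p1 payloads are placeholders, not real data.
SAMPLE_LINES = [
    '[21/Mar/2023:10:56:45 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/'
    'type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=SHORTENED HTTP/1.0" '
    '200 1256 "-" "-" TCP_REFRESH_MODIFIED:HIER_DIRECT',
    '[21/Mar/2023:11:00:20 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/'
    'type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=SHORTENED HTTP/1.0" '
    '200 1517 "-" "-" TCP_REFRESH_UNMODIFIED:HIER_DIRECT',
]

if __name__ == "__main__":
    for host, n in count_requests_by_host(SAMPLE_LINES).most_common():
        print(host, n)
```

To run it against a real log, pass an open file object instead of SAMPLE_LINES; the log path depends on the local Squid setup.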

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1188 Credit: 854,809 RAC: 17	Message 8015 - Posted: 21 Mar 2023, 17:06:20 UTC - in response to Message 8013. Last modified: 21 Mar 2023, 17:08:42 UTC I got a 500-event task on my dual-core / 4-thread laptop. I configured it to run a 4-core VM with 3500 MB RAM. Swap used so far is 'only' 1024 kB. After 2 hours of uptime, 49 of the 500 events are done. The host's VBoxHeadless.exe CPU priority is set to below normal, but of course the laptop gets a bit sluggish. ID: 8015 · Rating: 0 · rate: / Reply Quote

David Cameron Project administrator Project developer Project tester Project scientist Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0	Message 8016 - Posted: 21 Mar 2023, 20:33:35 UTC - in response to Message 8014.
computezrmle wrote:
My test client running ATLAS tasks from -dev today made 0 requests to atlascern-frontier.openhtc.io but 3225 requests to atlasfrontier-ai.cern.ch via my local Squid. First and last:
[21/Mar/2023:10:56:45 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNoLdvVxdQ5RCHMNCvb091NwC-L3VQgI8ncJdQ6Jd-b3DfD3c-ULiYdJh3u4BrnC5BV8PL1dFdT9ixKTc1IVXBJLEpMSi1NV1QHuABhy HTTP/1.0" 200 1256 "-" "-" TCP_REFRESH_MODIFIED:HIER_DIRECT
. . .
[21/Mar/2023:11:00:20 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNqNkL0KwkAQhF9luSZNCCoSsEixZtfLwf2EvRVJlfd-C4NahJxoypn5ZorJ7LlX4MD9iIIhz4SKzUbPjuqCGQcnSVFdimWY0VoXbRm4GFmyx6gvtwTSXdcA3CQFMKgeM5FpzIY3UDj1D-ykaMvK230MLAxfovUdi1zegK46Htr2fKkAI.24r-sz.8EgCbHAddox.QQHvXj6 HTTP/1.0" 200 1517 "-" "-" TCP_REFRESH_UNMODIFIED:HIER_DIRECT
The requests were sent by this task: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3195547
All my ATLAS tests (native/vbox) within the last few days used the standard files sent by the project server after a project reset, except the local app_config.xml where I configured the RAM settings to be tested.
ATLAS tasks from -prod use atlascern-frontier.openhtc.io as usual.
Hmm, I really don't understand this. I checked the log of this task and it looks correct:
2023-03-21 09:54:07,695 [wrapper] Content of /home/atlas/RunAtlas//setup.sh.local
export FRONTIER_SERVER="(serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)(serverurl=http://atlasfrontier-ai.cern.ch:8000/atlr)(proxyurl=http://on....
and this configuration is the same in dev and prod.
ID: 8016 · Rating: 0 · rate: / Reply Quote

maeax Send message Joined: 22 Apr 16 Posts: 675 Credit: 1,976,951 RAC: 1,683	Message 8017 - Posted: 21 Mar 2023, 22:47:26 UTC
The requests were sent by this task: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3195547
These entries are different:
2023-03-21 10:53:07 (65540): Guest Log: Copied input files into RunAtlas.
2023-03-21 10:53:08 (65540): Guest Log: Detected user-configured HTTP proxy at http://<hostname_censored_by_volunteer/>:3128 - will set in /etc/cvmfs/default.local
2023-03-21 10:53:21 (65540): Guest Log: Running cvmfs_config stat atlas.cern.ch
2023-03-21 10:53:22 (65540): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2023-03-21 10:53:22 (65540): Guest Log: 2.6.3.0 1561 0 29644 116996 3 1 3116360 4096000 0 65024 0 0 n/a 0 0 http://s1cern-cvmfs.openhtc.io/cvmfs/atlas.cern.ch http://192.0.2.52:3128 1
ID: 8017 · Rating: 0 · rate: / Reply Quote

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 481 Credit: 394,720 RAC: 0	Message 8019 - Posted: 22 Mar 2023, 8:58:41 UTC - in response to Message 8017. This is not related to the Frontier issue. ID: 8019 · Rating: 0 · rate: / Reply Quote

David Cameron Project administrator Project developer Project tester Project scientist Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0	Message 8020 - Posted: 22 Mar 2023, 9:33:29 UTC - in response to Message 8016. I realise what's going on now. In dev I had changed the way the setup was done at the start of the job, and so I think the setup.sh.local file was ignored. I have now changed this to set things properly, and also set up the squid auto-discovery (Web Proxy Auto Detection (WPAD)), which should find the best squid server to use automatically. ID: 8020 · Rating: 0 · rate: / Reply Quote

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 481 Credit: 394,720 RAC: 0	Message 8021 - Posted: 22 Mar 2023, 9:41:29 UTC Last modified: 22 Mar 2023, 9:43:03 UTC
David Cameron wrote: Hmm, I really don't understand this. I checked the log of this task and it looks correct:
2023-03-21 09:54:07,695 [wrapper] Content of /home/atlas/RunAtlas//setup.sh.local
export FRONTIER_SERVER="(serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)(serverurl=http://atlasfrontier-ai.cern.ch:8000/atlr)(proxyurl=http://on....
and this configuration is the same in dev and prod.
setup.sh.local shows the Frontier setup that is intended to be used, but the setup that is really used is reported in log.EVNTtoHITS. The old ATLAS version reports the Frontier proxies there, like:
07:10:02 DBReplicaSvc INFO Frontier server at (serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)...
Do you have access to the corresponding log.EVNTtoHITS from the example task?
atlasfrontier-ai.cern.ch would be used as 1st fallback if atlascern-frontier.openhtc.io doesn't respond. Very unlikely that within a couple of days all dev tasks run into fail-over while all prod tasks don't.
<edit> Sorry, didn't notice your recent post. </edit>
ID: 8021 · Rating: 0 · rate: / Reply Quote
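
The fallback described in this post is consistent with the order of the serverurl entries in FRONTIER_SERVER. As a rough illustration, the sketch below splits such a string into a server list and a proxy list. The helper name is hypothetical, and the sample string contains only the two serverurl entries visible in this thread, because the proxyurl part is truncated in the quoted log.

```python
import re

# One "(key=value)" group; only the two keys seen in this thread are handled.
ENTRY_RE = re.compile(r"\((serverurl|proxyurl)=([^)]*)\)")


def parse_frontier_server(value):
    """Split a FRONTIER_SERVER string into (servers, proxies), keeping order."""
    servers, proxies = [], []
    for key, url in ENTRY_RE.findall(value):
        if key == "serverurl":
            servers.append(url)
        else:
            proxies.append(url)
    return servers, proxies


# Sample limited to the entries that are fully visible in the thread.
SAMPLE = (
    "(serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)"
    "(serverurl=http://atlasfrontier-ai.cern.ch:8000/atlr)"
)

if __name__ == "__main__":
    servers, proxies = parse_frontier_server(SAMPLE)
    print("first choice:", servers[0])
    print("fallbacks:", servers[1:])
    print("proxies:", proxies)
```

Note that this only shows the intended configuration. As the post points out, the setup that is really used has to be read from log.EVNTtoHITS.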

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 481 Credit: 394,720 RAC: 0	Message 8022 - Posted: 22 Mar 2023, 9:55:42 UTC - in response to Message 8020. Last modified: 22 Mar 2023, 9:57:03 UTC ... also setup the squid auto-discovery (Web Proxy Auto Detection (WPAD)) which should find the best squid server to use automatically. This needs to be tested, especially with complex wpad files. Frontier clients/pacparser libs run into problems if the server list gets too long. Please provide some test tasks to check the runtime logs for related Frontier errors/warnings. <edit> If possible, not the 500 event ones. </edit> ID: 8022 · Rating: 0 · rate: / Reply Quote

David Cameron Project administrator Project developer Project tester Project scientist Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0	Message 8023 - Posted: 22 Mar 2023, 11:03:30 UTC - in response to Message 8022. I have submitted a bunch of 20 event tasks. ID: 8023 · Rating: 0 · rate: / Reply Quote

Development for LHC@home