1)
Message boards :
Theory Application :
Why is there no native_theory on this project?
(Message 8069)
Posted 28 Apr 2023 by computezrmle Post: There's currently no need.
2)
Message boards :
ATLAS Application :
Sometimes atlas jobs use more cpu and memory than required.
(Message 8064)
Posted 27 Apr 2023 by computezrmle Post: Unfortunately David Cameron left CERN to start a new job. Among the last tests he ran here were modifications to find optimized (RAM) settings for ATLAS v3.x. It might be that those tests are not yet completed and need to be finished once the team has found a successor. Besides that, you mentioned ATLAS native but linked to ATLAS vbox tasks.
3)
Message boards :
ATLAS Application :
SSL certificate error in atlas tasks
(Message 8063)
Posted 27 Apr 2023 by computezrmle Post: The same suggestion came up 2 weeks ago at the -prod forum. I just replied to the post there: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5986&postid=48045
4)
Message boards :
ATLAS Application :
ATLAS vbox and native 3.01
(Message 8031)
Posted 22 Mar 2023 by computezrmle Post: Same as described here: https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=614&postid=8026
5)
Message boards :
ATLAS Application :
Huge EVNT Files
(Message 8029)
Posted 22 Mar 2023 by computezrmle Post:
[22/Mar/2023:15:11:31 +0100] "GET http://lhcathome-upload.cern.ch/lhcathome/download//217/bM6LDmhcBz2n9Rq4apoT9bVoABFKDmABFKDmSrVUDm0NXKDmZhlDEn_EVNT.32794564._000002.pool.root.1 HTTP/1.1" 200 1166414003 "-" "BOINC client (x86_64-suse-linux-gnu 7.21.0)" TCP_MISS:HIER_DIRECT
[22/Mar/2023:15:27:33 +0100] "GET http://lhcathome-upload.cern.ch/lhcathome/download//1be/MBeLDmLdBz2nsSi4apGgGQJmABFKDmABFKDmjv4TDmjwcKDmTB0Dkm_EVNT.32794564._000002.pool.root.1 HTTP/1.1" 200 1166414003 "-" "BOINC client (x86_64-suse-linux-gnu 7.21.0)" TCP_MISS:HIER_DIRECT
Just noticed those huge ATLAS EVNT files being downloaded from prod to different clients: 1,166,414,003 => 1.2 GB each
<edit> Next one:
[22/Mar/2023:15:52:53 +0100] "GET http://lhcathome-upload.cern.ch/lhcathome/download//1a4/eP1NDmDOCz2np2BDcpmwOghnABFKDmABFKDmtdFKDmY3TKDmyMPKRn_EVNT.32794564._000003.pool.root.1 HTTP/1.1" 200 1164691364 "-" "BOINC client (x86_64-suse-linux-gnu 7.21.0)" TCP_MISS:HIER_DIRECT
</edit>
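A rough way to spot such transfers is to scan the local Squid access log for large EVNT downloads. This is only a sketch and assumes the same log format as shown above (the byte count sits in the field right after the HTTP status) and a default log location; both may differ per setup:

awk '/_EVNT\./ && $7 > 500000000 {print $1, $4, $7}' /var/log/squid/access.log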
6)
Message boards :
ATLAS Application :
ATLAS vbox and native 3.01
(Message 8028)
Posted 22 Mar 2023 by computezrmle Post: One solution for this is to add the library path to LD_LIBRARY_PATH. It must point to the lib version pacparser/pactester is linked to. Example (works for the old ATLAS version):

pactester_bin="/cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/pacparser/1.3.5-a65a3/x86_64-centos7-gcc62-opt/bin/pactester"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/pacparser/1.3.5-a65a3/x86_64-centos7-gcc62-opt/lib

The export can be made in setup.sh.local for example.
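For illustration, once the library path is set, the binary can be used to check which proxy a given PAC file would return. A minimal sketch; the wpad.dat path is only a placeholder, the test URL is one of the Frontier URLs mentioned in this thread:

# ask the PAC file which proxy it would pick for a Frontier request
"$pactester_bin" -p /path/to/wpad.dat -u http://atlascern-frontier.openhtc.io:8080/atlr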
7)
Message boards :
ATLAS Application :
ATLAS vbox and native 3.01
(Message 8026)
Posted 22 Mar 2023 by computezrmle Post: log.EVNTtoHITS starts but hangs after the last line:
12:16:59 Wed Mar 22 12:16:59 CET 2023
12:16:59 Preloading tcmalloc_minimal.so
12:16:59 Preloading /cvmfs/atlas.cern.ch/repo/sw/software/23.0/AthSimulationExternals/23.0.19/InstallArea/x86_64-centos7-gcc11-opt/lib/libintlc.so.5:/cvmfs/atlas.cern.ch/repo/sw/software/23.0/AthSimulationExternals/23.0.19/InstallArea/x86_64-centos7-gcc11-opt/lib/libimf.so
12:17:08 Py:Sim_tf INFO ****************** STARTING Simulation *****************
12:17:08 Py:Sim_tf INFO **** Transformation run arguments
12:17:08 Py:Sim_tf INFO RunArguments:
12:17:08 AMITag = 's4066'
12:17:08 EVNTFileIO = 'input'
12:17:08 concurrentEvents = 4
12:17:08 conditionsTag = 'OFLCOND-MC21-SDR-RUN3-07'
12:17:08 firstEvent = 99501
12:17:08 geometryVersion = 'ATLAS-R3S-2021-03-02-00'
12:17:08 inputEVNTFile = ['EVNT.29838250._000010.pool.root.1']
12:17:08 inputEVNTFileNentries = 10000
12:17:08 inputEVNTFileType = 'EVNT'
12:17:08 jobNumber = 200
12:17:08 maxEvents = 20
12:17:08 nprocs = 0
12:17:08 outputHITSFile = 'HITS.32413688._000229-2536910-1679482957.pool.root.1'
12:17:08 outputHITSFileType = 'HITS'
12:17:08 perfmon = 'fastmonmt'
12:17:08 postInclude = ['PyJobTransforms.UseFrontier']
12:17:08 preInclude = ['Campaigns.MC23aSimulationMultipleIoV']
12:17:08 randomSeed = 200
12:17:08 runNumber = 601229
12:17:08 simulator = 'FullG4MT_QS'
12:17:08 skipEvents = 9500
12:17:08 threads = 4
12:17:08 totalExecutorSteps = 0
12:17:08 trfSubstepName = 'EVNTtoHITS'
12:17:08 Py:Sim_tf INFO **** Setting-up configuration flags

pilotlog.txt loops printing lines like those:
2023-03-22 11:44:12,410 | INFO | 1693.4158494472504s have passed since pilot start
2023-03-22 11:44:18,540 | INFO | monitor loop #22: job 0:5753961892 is in state 'running'
2023-03-22 11:44:20,310 | INFO | CPU consumption time for pid=121445: 10.57 (rounded to 11)
2023-03-22 11:44:20,310 | INFO | executing command: ps -opid --no-headers --ppid 121445
2023-03-22 11:44:20,469 | INFO | neither /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/memory_monitor_summary.json, nor /home/boinc9/BOINC_TEST/slots/0/memory_monitor_summary.json exist
2023-03-22 11:44:20,469 | INFO | using path: /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/memory_monitor_output.txt (trf name=prmon)
2023-03-22 11:44:20,470 | INFO | executing command: ps aux -q 109100
2023-03-22 11:44:20,585 | INFO | neither /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/memory_monitor_summary.json, nor /home/boinc9/BOINC_TEST/slots/0/memory_monitor_summary.json exist
2023-03-22 11:44:20,585 | INFO | using path: /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/memory_monitor_output.txt (trf name=prmon)
2023-03-22 11:44:20,586 | INFO | max memory (maxPSS) used by the payload is within the allowed limit: 627933 B (2 * maxRSS = 131072000 B)
2023-03-22 11:44:20,587 | INFO | oom_score(pilot) = 666, oom_score(payload) = 666
2023-03-22 11:44:20,587 | INFO | payload log (log.EVNTtoHITS) within allowed size limit (2147483648 B): 1606 B
2023-03-22 11:44:20,587 | INFO | payload log (payload.stdout) within allowed size limit (2147483648 B): 9445 B
2023-03-22 11:44:20,587 | INFO | executing command: df -mP /home/boinc9/BOINC_TEST/slots/0
2023-03-22 11:44:20,628 | INFO | sufficient remaining disk space (102636716032 B)
2023-03-22 11:44:20,628 | INFO | work directory size check will use 61362667520 B as a max limit (10% grace limit added)
2023-03-22 11:44:20,629 | INFO | size of work directory /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892: 27332 B (within 61362667520 B limit)
2023-03-22 11:44:20,629 | INFO | pfn file=/home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/HITS.32413688._000229-2536910-1679482957.pool.root.1 does not exist (skip from workdir size calculation)
2023-03-22 11:44:20,629 | INFO | total size of present files: 0 B (workdir size: 27332 B)
2023-03-22 11:44:20,629 | INFO | output file size check: skipping output file /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/HITS.32413688._000229-2536910-1679482957.pool.root.1 since it does not exist
2023-03-22 11:44:21,871 | INFO | number of running child processes to parent process 121445: 6
2023-03-22 11:44:21,872 | INFO | maximum number of monitored processes: 6
8)
Message boards :
ATLAS Application :
ATLAS vbox and native 3.01
(Message 8022)
Posted 22 Mar 2023 by computezrmle Post:
"... also setup the squid auto-discovery (Web Proxy Auto Detection (WPAD)) which should find the best squid server to use automatically."
This needs to be tested, especially with complex wpad files. Frontier clients/pacparser libs run into problems if the server list gets too long. Please provide some test tasks to check the runtime logs for related Frontier errors/warnings.
<edit> If possible, not the 500 event ones. </edit>
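For illustration, finished test tasks could be checked with a simple grep, run from wherever the task's log.EVNTtoHITS ends up (the exact location depends on the setup):

grep -iE 'frontier.*(error|warn)' log.EVNTtoHITS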
9)
Message boards :
ATLAS Application :
ATLAS vbox and native 3.01
(Message 8021)
Posted 22 Mar 2023 by computezrmle Post: Hmm, I really don't understand this. I checked the log of this task and it looks correct: setup.sh.local shows the Frontier setup that is intended to be used, but the setup that is really used is reported in log.EVNTtoHITS. The old ATLAS version reports the Frontier proxies there, like:
07:10:02 DBReplicaSvc INFO Frontier server at (serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)...
Do you have access to the corresponding log.EVNTtoHITS from the example task?
atlasfrontier-ai.cern.ch would be used as 1st fallback if atlascern-frontier.openhtc.io doesn't respond. Very unlikely that within a couple of days all dev tasks run into fail-over while all prod tasks don't.
<edit> Sorry, didn't notice your recent post. </edit>
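The server a task really used can be pulled out of that log directly, e.g. (run from the directory holding the task's log.EVNTtoHITS; the location varies per setup):

grep 'Frontier server' log.EVNTtoHITS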
10)
Message boards :
ATLAS Application :
ATLAS vbox and native 3.01
(Message 8019)
Posted 22 Mar 2023 by computezrmle Post: This is not related to the Frontier issue.
11)
Message boards :
ATLAS Application :
ATLAS vbox and native 3.01
(Message 8014)
Posted 21 Mar 2023 by computezrmle Post: My test client running ATLAS tasks from -dev today made 0 requests to atlascern-frontier.openhtc.io but 3225 requests to atlasfrontier-ai.cern.ch via my local Squid. First and last:
[21/Mar/2023:10:56:45 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNoLdvVxdQ5RCHMNCvb091NwC-L3VQgI8ncJdQ6Jd-b3DfD3c-ULiYdJh3u4BrnC5BV8PL1dFdT9ixKTc1IVXBJLEpMSi1NV1QHuABhy HTTP/1.0" 200 1256 "-" "-" TCP_REFRESH_MODIFIED:HIER_DIRECT
. . .
[21/Mar/2023:11:00:20 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNqNkL0KwkAQhF9luSZNCCoSsEixZtfLwf2EvRVJlfd-C4NahJxoypn5ZorJ7LlX4MD9iIIhz4SKzUbPjuqCGQcnSVFdimWY0VoXbRm4GFmyx6gvtwTSXdcA3CQFMKgeM5FpzIY3UDj1D-ykaMvK230MLAxfovUdi1zegK46Htr2fKkAI.24r-sz.8EgCbHAddox.QQHvXj6 HTTP/1.0" 200 1517 "-" "-" TCP_REFRESH_UNMODIFIED:HIER_DIRECT
The requests were sent by this task: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3195547
All my ATLAS tests (native/vbox) within the last few days used the standard files sent by the project server after a project reset, except the local app_config.xml where I configured the RAM settings to be tested.
ATLAS tasks from -prod use atlascern-frontier.openhtc.io as usual.
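The per-server counts above can be reproduced with a simple grep against the local Squid access log; the log path below is only the common default and may differ per setup:

grep -c 'atlascern-frontier.openhtc.io' /var/log/squid/access.log
grep -c 'atlasfrontier-ai.cern.ch' /var/log/squid/access.log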
12)
Message boards :
ATLAS Application :
ATLAS vbox and native 3.01
(Message 8009)
Posted 21 Mar 2023 by computezrmle Post:
David Cameron wrote:
We will use the old setting until we stop running the old version. Then, a constant setting can be used since the memory usage of the new software is independent of number of cores. On average it uses around 2.5GB so we'll probably set something like 3.5GB to have a safety factor.

I just ran another VM (3.5 GB) that used 2.2 GB for the scientific app. It ran fine but had little RAM headroom. If 2.5 GB is the expected average for the scientific app I'd like to suggest 4 GB for the VM to leave enough headroom for the OS and the VM internal page cache:
2.5 GB (scientific app) + 0.6 GB (OS) + 0.9 GB (page cache + small headroom) = 4 GB (total)
13)
Message boards :
ATLAS Application :
ATLAS vbox and native 3.01
(Message 8005)
Posted 21 Mar 2023 by computezrmle Post: I'm still getting Frontier data from atlasfrontier-ai.cern.ch.
14)
Message boards :
ATLAS Application :
ATLAS vbox and native 3.01
(Message 8004)
Posted 21 Mar 2023 by computezrmle Post: KISS (keep it simple and stupid)
There's no need to maintain (and explain) an additional (very long running!) app_version just to pack some more events into a task.
15)
Message boards :
ATLAS Application :
ATLAS vbox and native 3.01
(Message 8001)
Posted 21 Mar 2023 by computezrmle Post: Not sure if I correctly understand the new RAM setting for ATLAS VMs.
Old: RAM is set according to 3000 MB + 900 MB * #cores
New: Fixed RAM setting, currently 2241 MB, may become 4000 MB (?)
Since the new VM should be able to run either the old or the new ATLAS version, how will the RAM setting be configured?
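For comparison, the old per-core scheme can be evaluated directly in a shell; ncores=4 is just an example value:

ncores=4
echo "$((3000 + 900 * ncores)) MB"   # old scheme: 6600 MB for a 4-core VM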
16)
Message boards :
ATLAS Application :
vbox console monitoring with 3.01
(Message 7999)
Posted 20 Mar 2023 by computezrmle Post:
"since we'll have to run both in parallel for some time"
Did the old version change its logging behaviour? If not, we would need to keep the old monitoring branch and call it if an old ATLAS version is processed.
17)
Message boards :
ATLAS Application :
vbox console monitoring with 3.01
(Message 7998)
Posted 20 Mar 2023 by computezrmle Post: CP's screenshots show what I described as "messed" logfile lines. See: "worker 1..."
As a result the monitoring script can't extract the runtime (here: 338 s). This leads to the missing values now being reported as "N/A". A side effect is the "flashing text" CP already reported.
18)
Message boards :
ATLAS Application :
vbox console monitoring with 3.01
(Message 7997)
Posted 20 Mar 2023 by computezrmle Post:
"I thought that the new tasks would all behave the same as old single-core tasks, writing to a single file. I changed the code searching for the event times to handle the old and new message format, but maybe something else needs changed. I will look deeper."

@ David
It is not only the search pattern. An additional point is that the old monitoring expects the result lines in order. This was guaranteed in single-core mode within the main log as well as in multicore mode within each of the worker logs. Now the log entries are no longer in order and this has to be respected by the monitoring.
Another point that needs to be checked: the timing averages reported by the workers refer to the worker thread they come from. Hence, they are not valid to calculate the total average.
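To illustrate the last point: the total average has to be weighted by the number of events each worker processed; it is not the plain mean of the worker averages. A minimal sketch with made-up per-worker numbers (events followed by average seconds per event):

# two hypothetical workers: 12 events at 310 s/event, 8 events at 405 s/event
printf '%s\n' "12 310" "8 405" | awk '{sum += $1 * $2; n += $1} END {printf "weighted total average: %.1f s/event\n", sum / n}'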
19)
Message boards :
ATLAS Application :
vbox console monitoring with 3.01
(Message 7994)
Posted 20 Mar 2023 by computezrmle Post:
"Total number of events is displayed (50)"
Same here
20)
Message boards :
ATLAS Application :
vbox console monitoring with 3.01
(Message 7993)
Posted 20 Mar 2023 by computezrmle Post: Another 4-core task is in progress. Manually set the VM's RAM size to 3900 MB (the default used for older 1-core VMs). Top now shows 776 kB swap being used. The differencing image uses around 880 MB and grows slowly while the events are being processed. The main python process uses 2.2 GB RAM and close to 400% CPU which corresponds to the 4-core setup.