41) Message boards : Theory Application : Why is there no native_theory on this project? (Message 8069)
Posted 28 Apr 2023 by computezrmle
Post:
There's currently no need.
42) Message boards : ATLAS Application : Sometimes atlas jobs use more cpu and memory than required. (Message 8064)
Posted 27 Apr 2023 by computezrmle
Post:
Unfortunately David Cameron left CERN to start a new job.
Among the last tests he made here were modifications to find out optimized (RAM) settings for ATLAS v3.x.
Those tests might not yet be completed and may need to be finished once the team has found a successor.

Besides that, you mentioned ATLAS native but linked to ATLAS vbox tasks.
43) Message boards : ATLAS Application : SSL certificate error in atlas tasks (Message 8063)
Posted 27 Apr 2023 by computezrmle
Post:
The same suggestion came up 2 weeks ago at the -prod forum.
I just replied to the post there:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5986&postid=48045
44) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8031)
Posted 22 Mar 2023 by computezrmle
Post:
Same as described here:
https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=614&postid=8026
45) Message boards : ATLAS Application : Huge EVNT Files (Message 8029)
Posted 22 Mar 2023 by computezrmle
Post:
[22/Mar/2023:15:11:31 +0100] "GET http://lhcathome-upload.cern.ch/lhcathome/download//217/bM6LDmhcBz2n9Rq4apoT9bVoABFKDmABFKDmSrVUDm0NXKDmZhlDEn_EVNT.32794564._000002.pool.root.1 HTTP/1.1" 200 1166414003 "-" "BOINC client (x86_64-suse-linux-gnu 7.21.0)" TCP_MISS:HIER_DIRECT

[22/Mar/2023:15:27:33 +0100] "GET http://lhcathome-upload.cern.ch/lhcathome/download//1be/MBeLDmLdBz2nsSi4apGgGQJmABFKDmABFKDmjv4TDmjwcKDmTB0Dkm_EVNT.32794564._000002.pool.root.1 HTTP/1.1" 200 1166414003 "-" "BOINC client (x86_64-suse-linux-gnu 7.21.0)" TCP_MISS:HIER_DIRECT


Just noticed these huge ATLAS EVNT files being downloaded from prod to different clients:
1,166,414,003 bytes => 1.2 GB each


<edit>
Next one:
[22/Mar/2023:15:52:53 +0100] "GET http://lhcathome-upload.cern.ch/lhcathome/download//1a4/eP1NDmDOCz2np2BDcpmwOghnABFKDmABFKDmtdFKDmY3TKDmyMPKRn_EVNT.32794564._000003.pool.root.1 HTTP/1.1" 200 1164691364 "-" "BOINC client (x86_64-suse-linux-gnu 7.21.0)" TCP_MISS:HIER_DIRECT

</edit>
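The size conversion above can be double-checked quickly (a sketch; the byte count is taken from the access log lines):

```shell
# Quick check of the size conversion from the access log above
bytes=1166414003
gb=$(awk -v b="$bytes" 'BEGIN { printf "%.2f", b / 1e9 }')
echo "${bytes} bytes ≈ ${gb} GB"
```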
46) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8028)
Posted 22 Mar 2023 by computezrmle
Post:
One solution is to add the library path to LD_LIBRARY_PATH.
It must point to the lib version pacparser/pactester is linked against.

Example (works for the old ATLAS version):

pactester_bin="/cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/pacparser/1.3.5-a65a3/x86_64-centos7-gcc62-opt/bin/pactester"

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/pacparser/1.3.5-a65a3/x86_64-centos7-gcc62-opt/lib


The export can be made in setup.sh-local for example.
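To avoid mismatches, the lib directory can be derived from the binary path itself (a sketch using the cvmfs paths from the example above):

```shell
# Sketch: derive the matching lib directory from the pactester binary path
# (same cvmfs path as in the example above) and append it to LD_LIBRARY_PATH.
pactester_bin="/cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/pacparser/1.3.5-a65a3/x86_64-centos7-gcc62-opt/bin/pactester"
pacparser_lib="${pactester_bin%/bin/*}/lib"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:${pacparser_lib}"
echo "added: ${pacparser_lib}"
```

On a host with cvmfs mounted, `ldd "${pactester_bin}"` should then resolve libpacparser from that directory.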
47) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8026)
Posted 22 Mar 2023 by computezrmle
Post:
log.EVNTtoHITS starts but hangs after the last line:

12:16:59 Wed Mar 22 12:16:59 CET 2023
12:16:59 Preloading tcmalloc_minimal.so
12:16:59 Preloading /cvmfs/atlas.cern.ch/repo/sw/software/23.0/AthSimulationExternals/23.0.19/InstallArea/x86_64-centos7-gcc11-opt/lib/libintlc.so.5:/cvmfs/atlas.cern.ch/repo/sw/software/23.0/AthSimulationExternals/23.0.19/InstallArea/x86_64-centos7-gcc11-opt/lib/libimf.so
12:17:08 Py:Sim_tf            INFO ****************** STARTING Simulation *****************
12:17:08 Py:Sim_tf            INFO **** Transformation run arguments
12:17:08 Py:Sim_tf            INFO RunArguments:
12:17:08    AMITag = 's4066'
12:17:08    EVNTFileIO = 'input'
12:17:08    concurrentEvents = 4
12:17:08    conditionsTag = 'OFLCOND-MC21-SDR-RUN3-07'
12:17:08    firstEvent = 99501
12:17:08    geometryVersion = 'ATLAS-R3S-2021-03-02-00'
12:17:08    inputEVNTFile = ['EVNT.29838250._000010.pool.root.1']
12:17:08    inputEVNTFileNentries = 10000
12:17:08    inputEVNTFileType = 'EVNT'
12:17:08    jobNumber = 200
12:17:08    maxEvents = 20
12:17:08    nprocs = 0
12:17:08    outputHITSFile = 'HITS.32413688._000229-2536910-1679482957.pool.root.1'
12:17:08    outputHITSFileType = 'HITS'
12:17:08    perfmon = 'fastmonmt'
12:17:08    postInclude = ['PyJobTransforms.UseFrontier']
12:17:08    preInclude = ['Campaigns.MC23aSimulationMultipleIoV']
12:17:08    randomSeed = 200
12:17:08    runNumber = 601229
12:17:08    simulator = 'FullG4MT_QS'
12:17:08    skipEvents = 9500
12:17:08    threads = 4
12:17:08    totalExecutorSteps = 0
12:17:08    trfSubstepName = 'EVNTtoHITS'
12:17:08 Py:Sim_tf            INFO **** Setting-up configuration flags

pilotlog.txt loops, printing lines like these:

2023-03-22 11:44:12,410 | INFO     | 1693.4158494472504s have passed since pilot start
2023-03-22 11:44:18,540 | INFO     | monitor loop #22: job 0:5753961892 is in state 'running'
2023-03-22 11:44:20,310 | INFO     | CPU consumption time for pid=121445: 10.57 (rounded to 11)
2023-03-22 11:44:20,310 | INFO     | executing command: ps -opid --no-headers --ppid 121445
2023-03-22 11:44:20,469 | INFO     | neither /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/memory_monitor_summary.json, nor /home/boinc9/BOINC_TEST/slots/0/memory_monitor_summary.json exist
2023-03-22 11:44:20,469 | INFO     | using path: /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/memory_monitor_output.txt (trf name=prmon)
2023-03-22 11:44:20,470 | INFO     | executing command: ps aux -q 109100
2023-03-22 11:44:20,585 | INFO     | neither /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/memory_monitor_summary.json, nor /home/boinc9/BOINC_TEST/slots/0/memory_monitor_summary.json exist
2023-03-22 11:44:20,585 | INFO     | using path: /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/memory_monitor_output.txt (trf name=prmon)
2023-03-22 11:44:20,586 | INFO     | max memory (maxPSS) used by the payload is within the allowed limit: 627933 B (2 * maxRSS = 131072000 B)
2023-03-22 11:44:20,587 | INFO     | oom_score(pilot) = 666, oom_score(payload) = 666
2023-03-22 11:44:20,587 | INFO     | payload log (log.EVNTtoHITS) within allowed size limit (2147483648 B): 1606 B
2023-03-22 11:44:20,587 | INFO     | payload log (payload.stdout) within allowed size limit (2147483648 B): 9445 B
2023-03-22 11:44:20,587 | INFO     | executing command: df -mP /home/boinc9/BOINC_TEST/slots/0
2023-03-22 11:44:20,628 | INFO     | sufficient remaining disk space (102636716032 B)
2023-03-22 11:44:20,628 | INFO     | work directory size check will use 61362667520 B as a max limit (10% grace limit added)
2023-03-22 11:44:20,629 | INFO     | size of work directory /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892: 27332 B (within 61362667520 B limit)
2023-03-22 11:44:20,629 | INFO     | pfn file=/home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/HITS.32413688._000229-2536910-1679482957.pool.root.1 does not exist (skip from workdir size calculation)
2023-03-22 11:44:20,629 | INFO     | total size of present files: 0 B (workdir size: 27332 B)
2023-03-22 11:44:20,629 | INFO     | output file size check: skipping output file /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/HITS.32413688._000229-2536910-1679482957.pool.root.1 since it does not exist
2023-03-22 11:44:21,871 | INFO     | number of running child processes to parent process 121445: 6
2023-03-22 11:44:21,872 | INFO     | maximum number of monitored processes: 6
48) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8022)
Posted 22 Mar 2023 by computezrmle
Post:
... also setup the squid auto-discovery (Web Proxy Auto Detection (WPAD)) which should find the best squid server to use automatically.

This needs to be tested, especially with complex wpad files.
Frontier clients/pacparser libs run into problems if the server list gets too long.

Please provide some test tasks to check the runtime logs for related Frontier errors/warnings.

<edit>
If possible, not the 500-event ones.
</edit>
49) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8021)
Posted 22 Mar 2023 by computezrmle
Post:
Hmm, I really don't understand this. I checked the log of this task and it looks correct:

2023-03-21 09:54:07,695 [wrapper] Content of /home/atlas/RunAtlas//setup.sh.local
export FRONTIER_SERVER="(serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)(serverurl=http://atlasfrontier-ai.cern.ch:8000/atlr)(proxyurl=http://on....

and this configuration is the same in dev and prod.


setup.sh.local shows the Frontier setup that is intended to be used, but the setup that is actually used is reported in log.EVNTtoHITS.
The old ATLAS version reports the Frontier proxies there, like:
07:10:02 DBReplicaSvc         INFO Frontier server at (serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)...
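That line can be pulled out with a simple grep (a sketch; the sample line below is the one quoted above, written to a temp file for illustration):

```shell
# Sketch: extract the Frontier server the payload actually used from
# a log.EVNTtoHITS-style line (sample line taken from the post above).
logfile=$(mktemp)
echo '07:10:02 DBReplicaSvc         INFO Frontier server at (serverurl=http://atlascern-frontier.openhtc.io:8080/atlr)...' > "$logfile"
frontier=$(grep -m1 -o 'serverurl=[^)]*' "$logfile")
echo "$frontier"
rm -f "$logfile"
```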

Do you have access to the corresponding log.EVNTtoHITS from the example task?


atlasfrontier-ai.cern.ch would be used as the first fallback if atlascern-frontier.openhtc.io doesn't respond.
It's very unlikely that within a couple of days all dev tasks ran into fail-over while all prod tasks didn't.

<edit>
Sorry, didn't notice your recent post.
</edit>
50) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8019)
Posted 22 Mar 2023 by computezrmle
Post:
This is not related to the Frontier issue.
51) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8014)
Posted 21 Mar 2023 by computezrmle
Post:
My test client running ATLAS tasks from -dev today made 0 requests to atlascern-frontier.openhtc.io but 3225 requests to atlasfrontier-ai.cern.ch via my local Squid.
first and last:
[21/Mar/2023:10:56:45 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNoLdvVxdQ5RCHMNCvb091NwC-L3VQgI8ncJdQ6Jd-b3DfD3c-ULiYdJh3u4BrnC5BV8PL1dFdT9ixKTc1IVXBJLEpMSi1NV1QHuABhy HTTP/1.0" 200 1256 "-" "-" TCP_REFRESH_MODIFIED:HIER_DIRECT
.
.
.
[21/Mar/2023:11:00:20 +0100] "GET http://atlasfrontier-ai.cern.ch:8000/atlr/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNqNkL0KwkAQhF9luSZNCCoSsEixZtfLwf2EvRVJlfd-C4NahJxoypn5ZorJ7LlX4MD9iIIhz4SKzUbPjuqCGQcnSVFdimWY0VoXbRm4GFmyx6gvtwTSXdcA3CQFMKgeM5FpzIY3UDj1D-ykaMvK230MLAxfovUdi1zegK46Htr2fKkAI.24r-sz.8EgCbHAddox.QQHvXj6 HTTP/1.0" 200 1517 "-" "-" TCP_REFRESH_UNMODIFIED:HIER_DIRECT

The requests were sent by this task:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3195547
All my ATLAS tests (native/vbox) within the last few days used the standard files sent by the project server after a project reset, except the local app_config.xml where I configured the RAM settings to be tested.


ATLAS tasks from -prod use atlascern-frontier.openhtc.io as usual.
52) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8009)
Posted 21 Mar 2023 by computezrmle
Post:
David Cameron wrote:
We will use the old setting until we stop running the old version. Then, a constant setting can be used since the memory usage of the new software is independent of number of cores. On average it uses around 2.5GB so we'll probably set something like 3.5GB to have a safety factor.

I just ran another VM (3.5 GB) that used 2.2 GB for the scientific app.
It ran fine but had little RAM headroom.

If 2.5 GB is the expected average for the scientific app I'd like to suggest 4 GB for the VM to leave enough headroom for the OS and the VM internal page cache:
2.5 GB (scientific app) + 0.6 GB (OS) + 0.9 GB (page cache + small headroom) = 4 GB (total)
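The suggested budget adds up as a quick sanity check (values from above, in MB):

```shell
# VM RAM budget from the suggestion above, in MB
app=2500     # scientific app (expected average)
os=600       # guest OS
cache=900    # page cache + small headroom
total=$((app + os + cache))
echo "suggested VM size: ${total} MB"
```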
53) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8005)
Posted 21 Mar 2023 by computezrmle
Post:
I'm still getting Frontier data from atlasfrontier-ai.cern.ch.
54) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8004)
Posted 21 Mar 2023 by computezrmle
Post:
KISS (keep it simple, stupid)
There's no need to maintain (and explain) an additional (very long running!) app_version just to pack some more events into a task.
55) Message boards : ATLAS Application : ATLAS vbox and native 3.01 (Message 8001)
Posted 21 Mar 2023 by computezrmle
Post:
Not sure if I correctly understand the new RAM setting for ATLAS VMs.

Old:
RAM is set according to 3000 MB + 900 MB * #cores

New:
Fixed RAM setting, currently 2241 MB, may become 4000 MB (?)

Since the new VM should be able to run either the old or the new ATLAS version, how will the RAM setting be configured?
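The difference can be illustrated with the old formula (numbers taken from above; the new fixed value is still open):

```shell
# Old per-core RAM formula vs. the new fixed setting, values from the post
cores=4
old_mb=$((3000 + 900 * cores))
new_mb=2241            # current fixed value; may become 4000
echo "old: ${old_mb} MB, new: ${new_mb} MB"
```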
56) Message boards : ATLAS Application : vbox console monitoring with 3.01 (Message 7999)
Posted 20 Mar 2023 by computezrmle
Post:
since we'll have to run both in parallel for some time

Did the old version change its logging behaviour?
If not, we would need to keep the old monitoring branch and call it if an old ATLAS version is processed.
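If the old logging didn't change, the monitoring could dispatch on the app version, e.g. (a sketch; how the version would actually be detected is left open):

```shell
# Hypothetical dispatch between old and new log parsing, keyed on the
# ATLAS app version (version detection itself is not shown here).
atlas_version="3.01"
case "${atlas_version}" in
    3.*) parser="new" ;;   # merged multi-worker log format
    *)   parser="old" ;;   # per-worker logs, lines in order
esac
echo "using ${parser} monitoring branch"
```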
57) Message boards : ATLAS Application : vbox console monitoring with 3.01 (Message 7998)
Posted 20 Mar 2023 by computezrmle
Post:
CP's screenshots show what I described as "messed up" logfile lines.
See: "worker 1..."

As a result the monitoring script can't extract the runtime (here: 338 s).
This leads to the missing values now reported as "N/A".
A side effect is the "flashing text" CP already reported.
58) Message boards : ATLAS Application : vbox console monitoring with 3.01 (Message 7997)
Posted 20 Mar 2023 by computezrmle
Post:
I thought that the new tasks would all behave the same as old single-core tasks, writing to a single file. I changed the code searching for the event times to handle the old and new message format, but maybe something else needs changed. I will look deeper.

@ David
It is not only the search pattern.
An additional point is that the old monitoring expects the result lines in order.
This was guaranteed in singlecore mode within the main log as well as in multicore mode within each of the worker logs.

Now the log entries are no longer in order, and the monitoring has to account for this.
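One way to handle this is to restore the order before parsing, e.g. by sorting on the event number (the log format below is made up for illustration):

```shell
# Sketch: with several workers writing to one log, the per-event result
# lines arrive interleaved; sorting on the event number restores order.
# (log format below is made up for illustration)
sorted=$(printf '%s\n' \
    "worker 2 event 4 took 410 s" \
    "worker 1 event 3 took 338 s" \
    "worker 1 event 5 took 352 s" |
    sort -k4,4n)
echo "$sorted"
```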

Another point that needs to be checked:
The timing averages reported by the workers refer only to the worker thread they come from.
Hence, they are not valid for calculating the total average.
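A correct overall average would have to weight each worker's average by its event count, e.g. (event counts and averages below are illustration values, not from a real task):

```shell
# Sketch: the overall average weights each worker's average by its
# event count; a plain mean of the per-worker averages would be wrong.
overall=$(awk 'BEGIN {
    n[1] = 12; avg[1] = 300   # worker 1: 12 events, 300 s average
    n[2] = 8;  avg[2] = 450   # worker 2: 8 events, 450 s average
    for (w in n) { t += n[w] * avg[w]; e += n[w] }
    printf "%.1f", t / e
}')
echo "overall average: ${overall} s"
```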
59) Message boards : ATLAS Application : vbox console monitoring with 3.01 (Message 7994)
Posted 20 Mar 2023 by computezrmle
Post:
Total number of events is displayed (50)
already finished, mean, min, max, estimated time left is not displayed and every 60 seconds several lines of text are flashing (not readable) and cleared.
Worker 1 Event showing showed 2nd, 4th and then 3th event ?? for this worker took ### s (### is changing now and then e.g. 421)


Edit picture and no swap used
Flashing text:...

Same here
60) Message boards : ATLAS Application : vbox console monitoring with 3.01 (Message 7993)
Posted 20 Mar 2023 by computezrmle
Post:
Another 4-core task is in progress.
Manually set the VM's RAM size to 3900 MB (the default used for older 1-core VMs).

Top now shows 776 kB swap being used.
The differencing image uses around 880 MB and grows slowly while the events are being processed.
The main python process uses 2.2 GB RAM and close to 400% CPU which corresponds to the 4-core setup.


©2024 CERN