Message boards : ATLAS Application : ATLAS vbox and native 3.01

Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 760
Credit: 11,789,486
RAC: 2,413
Message 8024 - Posted: 22 Mar 2023, 11:06:18 UTC - in response to Message 8023.  

Thanks David

maeax
Joined: 22 Apr 16
Posts: 673
Credit: 1,922,877
RAC: 1,044
Message 8025 - Posted: 22 Mar 2023, 11:32:04 UTC - in response to Message 8023.  
Last modified: 22 Mar 2023, 12:03:14 UTC

I have those tasks running, but without Squid, because of problems with my own Squid proxy.
https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=4639

Only 1 min of CPU time used after 35 min of run time. Cancelling this task now.

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 479
Credit: 394,720
RAC: 24
Message 8026 - Posted: 22 Mar 2023, 11:47:28 UTC - in response to Message 8023.  

log.EVNTtoHITS starts but hangs after the last line:

12:16:59 Wed Mar 22 12:16:59 CET 2023
12:16:59 Preloading tcmalloc_minimal.so
12:16:59 Preloading /cvmfs/atlas.cern.ch/repo/sw/software/23.0/AthSimulationExternals/23.0.19/InstallArea/x86_64-centos7-gcc11-opt/lib/libintlc.so.5:/cvmfs/atlas.cern.ch/repo/sw/software/23.0/AthSimulationExternals/23.0.19/InstallArea/x86_64-centos7-gcc11-opt/lib/libimf.so
12:17:08 Py:Sim_tf            INFO ****************** STARTING Simulation *****************
12:17:08 Py:Sim_tf            INFO **** Transformation run arguments
12:17:08 Py:Sim_tf            INFO RunArguments:
12:17:08    AMITag = 's4066'
12:17:08    EVNTFileIO = 'input'
12:17:08    concurrentEvents = 4
12:17:08    conditionsTag = 'OFLCOND-MC21-SDR-RUN3-07'
12:17:08    firstEvent = 99501
12:17:08    geometryVersion = 'ATLAS-R3S-2021-03-02-00'
12:17:08    inputEVNTFile = ['EVNT.29838250._000010.pool.root.1']
12:17:08    inputEVNTFileNentries = 10000
12:17:08    inputEVNTFileType = 'EVNT'
12:17:08    jobNumber = 200
12:17:08    maxEvents = 20
12:17:08    nprocs = 0
12:17:08    outputHITSFile = 'HITS.32413688._000229-2536910-1679482957.pool.root.1'
12:17:08    outputHITSFileType = 'HITS'
12:17:08    perfmon = 'fastmonmt'
12:17:08    postInclude = ['PyJobTransforms.UseFrontier']
12:17:08    preInclude = ['Campaigns.MC23aSimulationMultipleIoV']
12:17:08    randomSeed = 200
12:17:08    runNumber = 601229
12:17:08    simulator = 'FullG4MT_QS'
12:17:08    skipEvents = 9500
12:17:08    threads = 4
12:17:08    totalExecutorSteps = 0
12:17:08    trfSubstepName = 'EVNTtoHITS'
12:17:08 Py:Sim_tf            INFO **** Setting-up configuration flags

pilotlog.txt keeps looping, printing lines like these:

2023-03-22 11:44:12,410 | INFO     | 1693.4158494472504s have passed since pilot start
2023-03-22 11:44:18,540 | INFO     | monitor loop #22: job 0:5753961892 is in state 'running'
2023-03-22 11:44:20,310 | INFO     | CPU consumption time for pid=121445: 10.57 (rounded to 11)
2023-03-22 11:44:20,310 | INFO     | executing command: ps -opid --no-headers --ppid 121445
2023-03-22 11:44:20,469 | INFO     | neither /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/memory_monitor_summary.json, nor /home/boinc9/BOINC_TEST/slots/0/memory_monitor_summary.json exist
2023-03-22 11:44:20,469 | INFO     | using path: /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/memory_monitor_output.txt (trf name=prmon)
2023-03-22 11:44:20,470 | INFO     | executing command: ps aux -q 109100
2023-03-22 11:44:20,585 | INFO     | neither /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/memory_monitor_summary.json, nor /home/boinc9/BOINC_TEST/slots/0/memory_monitor_summary.json exist
2023-03-22 11:44:20,585 | INFO     | using path: /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/memory_monitor_output.txt (trf name=prmon)
2023-03-22 11:44:20,586 | INFO     | max memory (maxPSS) used by the payload is within the allowed limit: 627933 B (2 * maxRSS = 131072000 B)
2023-03-22 11:44:20,587 | INFO     | oom_score(pilot) = 666, oom_score(payload) = 666
2023-03-22 11:44:20,587 | INFO     | payload log (log.EVNTtoHITS) within allowed size limit (2147483648 B): 1606 B
2023-03-22 11:44:20,587 | INFO     | payload log (payload.stdout) within allowed size limit (2147483648 B): 9445 B
2023-03-22 11:44:20,587 | INFO     | executing command: df -mP /home/boinc9/BOINC_TEST/slots/0
2023-03-22 11:44:20,628 | INFO     | sufficient remaining disk space (102636716032 B)
2023-03-22 11:44:20,628 | INFO     | work directory size check will use 61362667520 B as a max limit (10% grace limit added)
2023-03-22 11:44:20,629 | INFO     | size of work directory /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892: 27332 B (within 61362667520 B limit)
2023-03-22 11:44:20,629 | INFO     | pfn file=/home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/HITS.32413688._000229-2536910-1679482957.pool.root.1 does not exist (skip from workdir size calculation)
2023-03-22 11:44:20,629 | INFO     | total size of present files: 0 B (workdir size: 27332 B)
2023-03-22 11:44:20,629 | INFO     | output file size check: skipping output file /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/HITS.32413688._000229-2536910-1679482957.pool.root.1 since it does not exist
2023-03-22 11:44:21,871 | INFO     | number of running child processes to parent process 121445: 6
2023-03-22 11:44:21,872 | INFO     | maximum number of monitored processes: 6
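
A hang like this, where log.EVNTtoHITS stops growing while pilotlog.txt keeps cycling, can be spotted by checking how long ago the payload log was last modified. This is only a minimal sketch, assuming a find that supports -mmin (GNU and BSD find do). The function name is_log_stalled and the 10-minute threshold are illustrative assumptions and not part of the pilot:

```shell
# Hypothetical helper (not part of the PanDA pilot): succeeds if the given
# log file has not been modified for more than N minutes.
is_log_stalled() {
    log_file=$1
    max_idle_min=$2
    [ -f "$log_file" ] || return 2          # 2 = file does not exist
    # find prints the path only if the file was last modified more than N minutes ago
    [ -n "$(find "$log_file" -mmin +"$max_idle_min" -print)" ]
}

# Example: run from the slot directory; the threshold of 10 minutes is arbitrary.
if is_log_stalled "log.EVNTtoHITS" 10; then
    echo "log.EVNTtoHITS has not been written to for more than 10 minutes"
fi
```

A long quiet period is not proof of a hang, since the initialisation phase can be slow, so the threshold would need tuning per host.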

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 8027 - Posted: 22 Mar 2023, 13:13:44 UTC

A library needed for handling the WPAD files is missing:

EVNTtoHITS 14:06:46 error [pacparser-dlopen.c:57]: config error: cannot dlopen libpacparser.so.1: cannot open shared object file: No such file or directory

I've cancelled all the bad WUs in the system; please abort any running tasks.

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 479
Credit: 394,720
RAC: 24
Message 8028 - Posted: 22 Mar 2023, 13:37:13 UTC - in response to Message 8027.  

One solution is to add the library path to LD_LIBRARY_PATH.
It must point to the library version that pacparser/pactester is linked against.

Example (works for the old ATLAS version):

pactester_bin="/cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/pacparser/1.3.5-a65a3/x86_64-centos7-gcc62-opt/bin/pactester"

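# Append without leaving an empty entry (an empty entry means "current directory") if LD_LIBRARY_PATH is unset
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}/cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/pacparser/1.3.5-a65a3/x86_64-centos7-gcc62-opt/lib"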


The export can be placed in setup.sh-local, for example.
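
A slightly more defensive variant of the same idea is sketched below. It appends the directory only once, and then asks ldd whether pactester can resolve libpacparser. The helper name append_ld_library_path is an illustrative assumption, and the path is the one from the example for the old ATLAS version; it would have to be adjusted for the release actually in use:

```shell
# Hypothetical helper: append a directory to LD_LIBRARY_PATH unless it is
# already listed, without creating an empty entry when the variable is unset.
append_ld_library_path() {
    dir=$1
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*) ;;   # already present, nothing to do
        *) LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$dir" ;;
    esac
    export LD_LIBRARY_PATH
}

# Path taken from the example for the old ATLAS version.
pacparser_root="/cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/pacparser/1.3.5-a65a3/x86_64-centos7-gcc62-opt"
append_ld_library_path "$pacparser_root/lib"

# ldd reports "not found" for libraries the loader cannot resolve.
if [ -x "$pacparser_root/bin/pactester" ]; then
    ldd "$pacparser_root/bin/pactester" | grep -i pacparser || true
fi
```

If the ldd output still shows libpacparser.so.1 as "not found", the directory does not contain the version the binary was linked against.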

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 8030 - Posted: 22 Mar 2023, 15:26:45 UTC - in response to Message 8028.  

My ATLAS colleagues told me there is an environment variable to set so that the correct libraries are included when setting up the ATLAS software release. I've set this now and submitted some new tasks.

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 479
Credit: 394,720
RAC: 24
Message 8031 - Posted: 22 Mar 2023, 15:46:36 UTC - in response to Message 8030.  


maeax
Joined: 22 Apr 16
Posts: 673
Credit: 1,922,877
RAC: 1,044
Message 8032 - Posted: 22 Mar 2023, 16:11:28 UTC - in response to Message 8030.  
Last modified: 22 Mar 2023, 16:16:17 UTC

I'm seeing a 1.08 GByte download in production.
Has the new version been transferred from -dev?
Now there is a second download on the same PC, 1.09 GByte, in production.
Is the ATLAS application in prod still the old one?
In Germany we say "Holland in Not" ("Holland in distress", meaning things are in serious trouble).

Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 8033 - Posted: 22 Mar 2023, 17:55:19 UTC - in response to Message 8032.  

Quote from Message 8032:
    Seeing 1.08 GByte download in production.
    Is this new version transferred from -dev?
    Now a second download on the same PC with 1.09 GByte in production.
    Is the ATLAS application in prod the old one?
    In Germany we say "Holland in Not".

My VDSL is working at its limit, and it seems to be overloaded by the new 1.2 GB ATLAS tasks on the live (production) project.

During working hours I normally limit ATLAS downloads to 50 MB. I have lifted that limit, but it is still not enough :-(

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 8035 - Posted: 22 Mar 2023, 19:50:42 UTC - in response to Message 8031.  

It seems the WPAD setup needs more testing. I've reverted to the previous settings.

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 8036 - Posted: 22 Mar 2023, 19:52:41 UTC - in response to Message 8032.  

Quote from Message 8032:
    Seeing 1.08 GByte download in production.
    Is this new version transferred from -dev?
    Now a second download on the same PC with 1.09 GByte in production.
    Is the ATLAS application in prod the old one?
    In Germany we say "Holland in Not".

No, it's still the old version in prod. It must just be new batches of tasks with large files. I'll ask the submitters why it's like this now.

maeax
Joined: 22 Apr 16
Posts: 673
Credit: 1,922,877
RAC: 1,044
Message 8037 - Posted: 23 Mar 2023, 5:04:54 UTC - in response to Message 8036.  

Quote from Message 8036:
    No, it's still the old version in prod. It must just be new batches of tasks with large files. I'll ask the submitters why it's like this now.

I stopped one Threadripper 3995 overnight.
The 80 Mbit/s ISP link and the 1 Gbit/s network (including Squid) have been running at the limit since those 1 GByte ATLAS downloads became active.
All ATLAS tasks have finished with a HITS file so far.

Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 8038 - Posted: 23 Mar 2023, 12:26:11 UTC - in response to Message 8037.  
Last modified: 23 Mar 2023, 12:36:44 UTC

Quote from Message 8037:
    Quote from Message 8036:
        No, it's still the old version in prod. It must just be new batches of tasks with large files. I'll ask the submitters why it's like this now.

    Have stopped one Threadripper 3995 overnight.
    80 Mbit/s from ISP and 1 Gbit/s network (including Squid) running at the limit, since those 1 GByte ATLAS downloads are active.
    All ATLAS tasks finishing with a HITS file so far.

David, any news on this?

Meanwhile I have stopped all ATLAS downloads; 1.2 GB per WU is too much. Since yesterday evening I have downloaded 0.4 terabytes from the ATLAS servers alone.

maeax
Joined: 22 Apr 16
Posts: 673
Credit: 1,922,877
RAC: 1,044
Message 8039 - Posted: 23 Mar 2023, 14:57:08 UTC - in response to Message 8038.  

Yeti,
it was caused by a scientist while the tasks were being generated. The file was not meant for BOINC.
You can find the answer from CP in -prod.

Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 8040 - Posted: 23 Mar 2023, 15:09:14 UTC - in response to Message 8039.  

Quote from Message 8039:
    Yeti,
    it was a scientist during the generating of the tasks. It was a file not for BOINC.
    You find the answer from CP in -prod.

Yes, I had already read it, but someone should say when it will be fixed, whether WUs with this mistake will be withdrawn before being sent out, and so on.

I posted it here again because I didn't want to focus the whole community's attention on this point.

maeax
Joined: 22 Apr 16
Posts: 673
Credit: 1,922,877
RAC: 1,044
Message 8041 - Posted: 23 Mar 2023, 16:32:20 UTC - in response to Message 8040.  
Last modified: 23 Mar 2023, 16:32:36 UTC

By the end of today, 500 GByte received, even with a 6-hour interruption for one Threadripper 3995.
And tomorrow?

maeax
Joined: 22 Apr 16
Posts: 673
Credit: 1,922,877
RAC: 1,044
Message 8045 - Posted: 25 Mar 2023, 7:12:03 UTC - in response to Message 8041.  

Over the last two days: 600 GByte total (60 upload / 540 download) each day.
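
To put the daily volume in perspective, it can be converted to an average line rate. This is a rough sketch, assuming decimal units (1 GByte = 10^9 bytes) and traffic spread evenly over 24 hours; the function name avg_mbit_per_s is an illustrative choice:

```shell
# Convert a daily transfer volume in GByte into an average rate in Mbit/s.
# GByte/day * 8 bit/byte * 1000 Mbit/Gbit / 86400 s/day
avg_mbit_per_s() {
    LC_ALL=C awk -v gb="$1" 'BEGIN { printf "%.1f\n", gb * 8 * 1000 / 86400 }'
}

avg_mbit_per_s 540   # daily download volume -> 50.0
avg_mbit_per_s 60    # daily upload volume   -> 5.6
```

Under these assumptions, 540 GByte per day is a sustained 50 Mbit/s, which is more than half of the 80 Mbit/s ISP link mentioned earlier in the thread.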

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1185
Credit: 850,198
RAC: 141
Message 8046 - Posted: 25 Mar 2023, 10:06:57 UTC - in response to Message 7968.  

Quote from Message 7968:
    I looked in detail at the logs of this task and it looked like it took 8 or 9 minutes to get going full steam. But still it could be that the initialisation phase is indeed longer in this new software.

The init phase is much longer on my fastest cruncher.
It is now well over 30 minutes. In production the workers are at full steam after 6 minutes.

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 8048 - Posted: 28 Mar 2023, 11:19:48 UTC

I was told that we will likely run out of Run 2 simulation tasks to run on the prod project very soon, so I have gone ahead and released version 3 there so that we can start running Run 3 tasks.

Unfortunately, I don't think we'll be able to resolve some of the remaining issues, like the console monitoring, before going live on prod, but I think it's better to have something not quite perfect than no tasks at all.

kotenok2000
Joined: 22 Aug 22
Posts: 22
Credit: 46,639
RAC: 21
Message 8057 - Posted: 13 Apr 2023, 16:22:03 UTC

I see this in pilotlog.txt:
2023-04-13 16:18:19,169 | WARNING | Failed to initialize SSL context .. skipped, error: certfile should be a valid filesystem path
2023-04-13 16:18:19,266 | WARNING | cache file=/var/lib/boinc-client/slots/0/cric_pandaqueues.json is not available: [Errno 2] No such file or directory: '/var/lib/boinc-client/slots/0/cric_pandaqueues.json' .. skipped
2023-04-13 16:18:19,266 | INFO | [attempt=1/1] loading data from file=/cvmfs/atlas.cern.ch/repo/sw/local/etc/cric_pandaqueues.json
2023-04-13 16:18:20,792 | INFO | saved data from "/cvmfs/atlas.cern.ch/repo/sw/local/etc/cric_pandaqueues.json" resource into file=/var/lib/boinc-client/slots/0/agis_schedconf.cvmfs.json, length=989.5Kb
2023-04-13 16:18:20,804 | INFO | queuedata: following keys will be overwritten by config values: {'maxwdir_broken': '14336 MB', 'es_stageout_gap': 601}
2023-04-13 16:18:20,806 | INFO | [attempt=1/3] loading data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json
2023-04-13 16:18:21,190 | WARNING | failed to load data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json, error: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1129)> .. trying to use data from cache=/var/lib/boinc-client/slots/0/agis_ddmendpoints.agis.ALL.json
2023-04-13 16:18:21,190 | INFO | will try again after 23s..
2023-04-13 16:18:44,288 | INFO | [attempt=2/3] loading data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json
2023-04-13 16:18:44,424 | WARNING | failed to load data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json, error: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1129)> .. trying to use data from cache=/var/lib/boinc-client/slots/0/agis_ddmendpoints.agis.ALL.json
2023-04-13 16:18:44,424 | INFO | will try again after 15s..
2023-04-13 16:18:59,474 | INFO | [attempt=3/3] loading data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json
2023-04-13 16:18:59,609 | WARNING | failed to load data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json, error: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1129)> .. trying to use data from cache=/var/lib/boinc-client/slots/0/agis_ddmendpoints.agis.ALL.json
2023-04-13 16:18:59,609 | WARNING | cache file=/var/lib/boinc-client/slots/0/agis_ddmendpoints.agis.ALL.json is not available: [Errno 2] No such file or directory: '/var/lib/boinc-client/slots/0/agis_ddmendpoints.agis.ALL.json' .. skipped
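
The repeated CERTIFICATE_VERIFY_FAILED warnings can be summarised per URL to see which endpoints fail verification and how often. This is a small sketch, assuming each pilot log message sits on a single line in the format shown above; failed_tls_urls is an illustrative name and not a pilot tool:

```shell
# Hypothetical helper: count certificate verification failures per URL
# in a pilot log whose messages look like the excerpt above.
failed_tls_urls() {
    grep 'CERTIFICATE_VERIFY_FAILED' "$1" \
        | sed -n 's/.*failed to load data from url=\([^,]*\),.*/\1/p' \
        | sort | uniq -c
}

# Example: run from the slot directory that contains pilotlog.txt.
if [ -f pilotlog.txt ]; then
    failed_tls_urls pilotlog.txt
fi
```

A "self signed certificate in certificate chain" error usually means the client does not trust a CA in the chain it was presented with, for instance because the CA bundle is missing or a proxy re-signs the traffic. Which of these applies here is not clear from the log alone.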