ATLAS vbox and native 3.01
Author | Message |
---|---|
Send message Joined: 8 Apr 15 Posts: 779 Credit: 12,149,380 RAC: 2,344 |
Thanks David |
Send message Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 565 |
I have those tasks running, but without Squid, because of problems with my own Squid instance. https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=4639 Only 1 min of CPU time used over a 35 min duration. Cancelling this task now. |
Send message Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0 |
log.EVNTtoHITS starts but hangs after the last line:
12:16:59 Wed Mar 22 12:16:59 CET 2023
12:16:59 Preloading tcmalloc_minimal.so
12:16:59 Preloading /cvmfs/atlas.cern.ch/repo/sw/software/23.0/AthSimulationExternals/23.0.19/InstallArea/x86_64-centos7-gcc11-opt/lib/libintlc.so.5:/cvmfs/atlas.cern.ch/repo/sw/software/23.0/AthSimulationExternals/23.0.19/InstallArea/x86_64-centos7-gcc11-opt/lib/libimf.so
12:17:08 Py:Sim_tf INFO ****************** STARTING Simulation *****************
12:17:08 Py:Sim_tf INFO **** Transformation run arguments
12:17:08 Py:Sim_tf INFO RunArguments:
12:17:08 AMITag = 's4066'
12:17:08 EVNTFileIO = 'input'
12:17:08 concurrentEvents = 4
12:17:08 conditionsTag = 'OFLCOND-MC21-SDR-RUN3-07'
12:17:08 firstEvent = 99501
12:17:08 geometryVersion = 'ATLAS-R3S-2021-03-02-00'
12:17:08 inputEVNTFile = ['EVNT.29838250._000010.pool.root.1']
12:17:08 inputEVNTFileNentries = 10000
12:17:08 inputEVNTFileType = 'EVNT'
12:17:08 jobNumber = 200
12:17:08 maxEvents = 20
12:17:08 nprocs = 0
12:17:08 outputHITSFile = 'HITS.32413688._000229-2536910-1679482957.pool.root.1'
12:17:08 outputHITSFileType = 'HITS'
12:17:08 perfmon = 'fastmonmt'
12:17:08 postInclude = ['PyJobTransforms.UseFrontier']
12:17:08 preInclude = ['Campaigns.MC23aSimulationMultipleIoV']
12:17:08 randomSeed = 200
12:17:08 runNumber = 601229
12:17:08 simulator = 'FullG4MT_QS'
12:17:08 skipEvents = 9500
12:17:08 threads = 4
12:17:08 totalExecutorSteps = 0
12:17:08 trfSubstepName = 'EVNTtoHITS'
12:17:08 Py:Sim_tf INFO **** Setting-up configuration flags
pilotlog.txt loops printing lines like those:
2023-03-22 11:44:12,410 | INFO | 1693.4158494472504s have passed since pilot start
2023-03-22 11:44:18,540 | INFO | monitor loop #22: job 0:5753961892 is in state 'running'
2023-03-22 11:44:20,310 | INFO | CPU consumption time for pid=121445: 10.57 (rounded to 11)
2023-03-22 11:44:20,310 | INFO | executing command: ps -opid --no-headers --ppid 121445
2023-03-22 11:44:20,469 | INFO | neither /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/memory_monitor_summary.json, nor /home/boinc9/BOINC_TEST/slots/0/memory_monitor_summary.json exist
2023-03-22 11:44:20,469 | INFO | using path: /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/memory_monitor_output.txt (trf name=prmon)
2023-03-22 11:44:20,470 | INFO | executing command: ps aux -q 109100
2023-03-22 11:44:20,585 | INFO | neither /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/memory_monitor_summary.json, nor /home/boinc9/BOINC_TEST/slots/0/memory_monitor_summary.json exist
2023-03-22 11:44:20,585 | INFO | using path: /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/memory_monitor_output.txt (trf name=prmon)
2023-03-22 11:44:20,586 | INFO | max memory (maxPSS) used by the payload is within the allowed limit: 627933 B (2 * maxRSS = 131072000 B)
2023-03-22 11:44:20,587 | INFO | oom_score(pilot) = 666, oom_score(payload) = 666
2023-03-22 11:44:20,587 | INFO | payload log (log.EVNTtoHITS) within allowed size limit (2147483648 B): 1606 B
2023-03-22 11:44:20,587 | INFO | payload log (payload.stdout) within allowed size limit (2147483648 B): 9445 B
2023-03-22 11:44:20,587 | INFO | executing command: df -mP /home/boinc9/BOINC_TEST/slots/0
2023-03-22 11:44:20,628 | INFO | sufficient remaining disk space (102636716032 B)
2023-03-22 11:44:20,628 | INFO | work directory size check will use 61362667520 B as a max limit (10% grace limit added)
2023-03-22 11:44:20,629 | INFO | size of work directory /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892: 27332 B (within 61362667520 B limit)
2023-03-22 11:44:20,629 | INFO | pfn file=/home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/HITS.32413688._000229-2536910-1679482957.pool.root.1 does not exist (skip from workdir size calculation)
2023-03-22 11:44:20,629 | INFO | total size of present files: 0 B (workdir size: 27332 B)
2023-03-22 11:44:20,629 | INFO | output file size check: skipping output file /home/boinc9/BOINC_TEST/slots/0/PanDA_Pilot-5753961892/HITS.32413688._000229-2536910-1679482957.pool.root.1 since it does not exist
2023-03-22 11:44:21,871 | INFO | number of running child processes to parent process 121445: 6
2023-03-22 11:44:21,872 | INFO | maximum number of monitored processes: 6 |
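The pilot log quoted in this post already carries two numbers that quantify the hang: the seconds elapsed since pilot start and the cumulative CPU time of the payload. A minimal Python sketch, assuming the message wording shown in this thread (other pilot versions may log differently); the function name `cpu_efficiency` is hypothetical and not part of the pilot. Note that the elapsed time counts from pilot start, not payload start, so the ratio is only a rough indicator.

```python
import re

# Patterns written against the pilotlog.txt lines quoted in this thread.
# Assumption: other pilot versions may word these messages differently.
ELAPSED_RE = re.compile(r"([\d.]+)s have passed since pilot start")
CPU_RE = re.compile(r"CPU consumption time for pid=\d+: ([\d.]+)")


def cpu_efficiency(log_text, threads=1):
    """Return CPU seconds / (wall seconds * threads) from the last matching
    log lines, or None if either value is missing."""
    elapsed = ELAPSED_RE.findall(log_text)
    cpu = CPU_RE.findall(log_text)
    if not elapsed or not cpu or threads <= 0:
        return None
    wall_s = float(elapsed[-1])
    if wall_s <= 0:
        return None
    return float(cpu[-1]) / (wall_s * threads)


sample = """\
2023-03-22 11:44:12,410 | INFO | 1693.4158494472504s have passed since pilot start
2023-03-22 11:44:20,310 | INFO | CPU consumption time for pid=121445: 10.57 (rounded to 11)
"""

# threads=4 is taken from the "threads = 4" run argument in log.EVNTtoHITS.
eff = cpu_efficiency(sample, threads=4)
print(f"CPU efficiency so far: {eff:.3%}")
```

With the values above the ratio is well under 1 %, which fits a payload that is waiting rather than simulating.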
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
There is a missing library for handling the WPADs:
EVNTtoHITS 14:06:46 error [pacparser-dlopen.c:57]: config error: cannot dlopen libpacparser.so.1: cannot open shared object file: No such file or directory
I've cancelled all the bad WUs in the system; please abort any running tasks. |
Send message Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0 |
One solution for this is to add the library path to LD_LIBRARY_PATH. It must point to the lib version pacparser/pactester is linked to. Example (works for the old ATLAS version):
pactester_bin="/cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/pacparser/1.3.5-a65a3/x86_64-centos7-gcc62-opt/bin/pactester"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/pacparser/1.3.5-a65a3/x86_64-centos7-gcc62-opt/lib
The export can be made in setup.sh-local for example. |
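The export in this post points at the `lib` directory sitting next to the `bin` directory of pactester. A small Python sketch of that derivation, in case it helps to check the path before editing setup.sh-local. The helper names are hypothetical, and it is an assumption that `libpacparser.so.1` (the name from the dlopen error earlier in the thread) lives in that sibling `lib` directory.

```python
import os
from pathlib import Path, PurePosixPath

# Path quoted in the post (old ATLAS release); newer releases will differ.
PACTESTER_BIN = (
    "/cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/"
    "pacparser/1.3.5-a65a3/x86_64-centos7-gcc62-opt/bin/pactester"
)


def sibling_lib_dir(binary_path):
    """Return the lib directory next to the bin directory holding binary_path."""
    return PurePosixPath(binary_path).parent.parent / "lib"


def extend_search_path(current, lib_dir):
    """Append lib_dir to a colon-separated search path unless already present."""
    parts = [p for p in current.split(":") if p]
    if str(lib_dir) not in parts:
        parts.append(str(lib_dir))
    return ":".join(parts)


lib_dir = sibling_lib_dir(PACTESTER_BIN)
new_value = extend_search_path(os.environ.get("LD_LIBRARY_PATH", ""), lib_dir)

# Assumption: the library named in the dlopen error is expected in lib_dir.
has_lib = (Path(lib_dir) / "libpacparser.so.1").exists()

print(f"export LD_LIBRARY_PATH={new_value}")
print(f"libpacparser.so.1 present: {has_lib}")
```

The script only prints the export line; changing the environment inside Python would not affect the shell that later sets up the ATLAS release.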
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
My ATLAS colleagues told me there is an environment variable to set so that the correct libraries are included when setting up the ATLAS software release. I've set this now and submitted some new tasks. |
Send message Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0 |
Same as described here: https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=614&postid=8026 |
Send message Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 565 |
Seeing 1,08 GByte download in production. Has the new version been transferred from -dev? Now a second download on the same PC, 1,09 GByte, in production. Is the ATLAS application in prod still the old one? In Germany we say "Holland in Not" (roughly: we are in serious trouble). |
Send message Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
Seeing 1,08 GByte download in production. My VDSL is running at its limit, and it seems to be overloaded by the new 1.2 GB ATLAS tasks in live. During working hours I normally limit ATLAS to 50 MB download; I have lifted that limit, but it is still not enough :-( |
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
It seems the WPADs need more testing. I've reverted to the previous settings. |
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
Seeing 1,08 GByte download in production. No, it's still the old version in prod. It must be just new batches of tasks with large files. I'll ask the submitters why it's like this now. |
Send message Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 565 |
No, it's still the old version in prod. It must be just new batches of tasks with large files. I'll ask the submitters why it's like this now. I stopped one Threadripper 3995 overnight. The 80 MBit/s ISP link and the 1 GBit/s network (including Squid) have been running at the limit since those 1 GByte ATLAS downloads became active. All ATLAS tasks have finished with a HITS file so far. |
Send message Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
No, it's still the old version in prod. It must be just new batches of tasks with large files. I'll ask the submitters why it's like this now. David, any news on this? Meanwhile I have stopped all ATLAS downloads; 1,2 GB for each WU is too much. Since yesterday evening I have downloaded 0,4 terabyte from the ATLAS servers alone. |
Send message Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 565 |
Yeti, it was a mistake by a scientist while generating the tasks: a file not meant for BOINC was used. You can find the answer from CP in -prod. |
Send message Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
Yeti, Yeah, I had already read it, but someone should say when it will be fixed and whether WUs with this mistake will be cancelled before being sent out, or whatever. I posted it here again because I didn't want to focus the whole community on this point. |
Send message Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 565 |
By the end of today, 500 GByte received... with a 6-hour interruption for one Threadripper 3995. Tomorrow...? |
Send message Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 565 |
Over the last two days: 600 GByte total (60 upload / 540 download) each day. |
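To put the volumes reported in this thread into perspective, the daily download can be converted into an average line rate and compared with the 80 MBit/s ISP link mentioned earlier. A rough Python sketch, assuming decimal units (1 GByte = 10^9 bytes) and taking the 540 GByte/day, 0,4 terabyte and 1,2 GB per WU figures from the posts; the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def average_mbit_per_s(gbytes_per_day):
    """Average line rate in Mbit/s for a daily volume in decimal GByte."""
    bits_per_day = gbytes_per_day * 1e9 * 8
    return bits_per_day / SECONDS_PER_DAY / 1e6


def tasks_for_volume(total_gbytes, gbytes_per_task):
    """Number of tasks whose input files add up to total_gbytes."""
    return total_gbytes / gbytes_per_task


# 540 GByte/day download, as reported above.
rate = average_mbit_per_s(540)

# Share of an 80 Mbit/s link used on average over 24 hours.
utilisation = rate / 80

# 0,4 terabyte at 1,2 GB per WU (figures from an earlier post).
wu_count = tasks_for_volume(400, 1.2)

print(f"average download rate: {rate:.1f} Mbit/s")
print(f"average utilisation of an 80 Mbit/s link: {utilisation:.1%}")
print(f"WUs behind 0.4 TB: about {wu_count:.0f}")
```

An average above half the link capacity over a full day leaves little headroom for peaks, which is consistent with the link being reported as running at its limit.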
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 25 |
I looked in details at the logs of this task and it looked like it took 8 or 9 minutes to get going full steam. But still it could be that the initialisation phase is indeed longer in this new software. Much longer init phase on my fastest cruncher: it is now well over 30 minutes. In production the workers are at full steam after 6 minutes. |
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
I was told that we will likely run out of Run 2 simulation tasks to run on the prod project very soon, so I have gone ahead and released version 3 there so we can start running Run 3 tasks. Unfortunately I don't think we'll be able to resolve some of the remaining issues, like the console monitoring, before going live on prod, but I think it's better to have something not quite perfect than no tasks at all. |
Send message Joined: 22 Aug 22 Posts: 22 Credit: 63,680 RAC: 203 |
I see this in pilotlog.txt
2023-04-13 16:18:19,169 | WARNING | Failed to initialize SSL context .. skipped, error: certfile should be a valid filesystem path
2023-04-13 16:18:19,266 | WARNING | cache file=/var/lib/boinc-client/slots/0/cric_pandaqueues.json is not available: [Errno 2] No such file or directory: '/var/lib/boinc-client/slots/0/cric_pandaqueues.json' .. skipped
2023-04-13 16:18:19,266 | INFO | [attempt=1/1] loading data from file=/cvmfs/atlas.cern.ch/repo/sw/local/etc/cric_pandaqueues.json
2023-04-13 16:18:20,792 | INFO | saved data from "/cvmfs/atlas.cern.ch/repo/sw/local/etc/cric_pandaqueues.json" resource into file=/var/lib/boinc-client/slots/0/agis_schedconf.cvmfs.json, length=989.5Kb
2023-04-13 16:18:20,804 | INFO | queuedata: following keys will be overwritten by config values: {'maxwdir_broken': '14336 MB', 'es_stageout_gap': 601}
2023-04-13 16:18:20,806 | INFO | [attempt=1/3] loading data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json
2023-04-13 16:18:21,190 | WARNING | failed to load data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json, error: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1129)> .. trying to use data from cache=/var/lib/boinc-client/slots/0/agis_ddmendpoints.agis.ALL.json
2023-04-13 16:18:21,190 | INFO | will try again after 23s..
2023-04-13 16:18:44,288 | INFO | [attempt=2/3] loading data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json
2023-04-13 16:18:44,424 | WARNING | failed to load data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json, error: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1129)> .. trying to use data from cache=/var/lib/boinc-client/slots/0/agis_ddmendpoints.agis.ALL.json
2023-04-13 16:18:44,424 | INFO | will try again after 15s..
2023-04-13 16:18:59,474 | INFO | [attempt=3/3] loading data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json
2023-04-13 16:18:59,609 | WARNING | failed to load data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json, error: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1129)> .. trying to use data from cache=/var/lib/boinc-client/slots/0/agis_ddmendpoints.agis.ALL.json
2023-04-13 16:18:59,609 | WARNING | cache file=/var/lib/boinc-client/slots/0/agis_ddmendpoints.agis.ALL.json is not available: [Errno 2] No such file or directory: '/var/lib/boinc-client/slots/0/agis_ddmendpoints.agis.ALL.json' .. skipped |
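The log in this post shows a recognisable pattern: three attempts to load a URL with a pause between them, then a fall-back to a local cache file that turns out to be missing. The Python sketch below illustrates that general pattern only; it is not the pilot's code. The function name, the `fetch` callable and the wait range are assumptions made for illustration (the log only shows waits of 23 s and 15 s).

```python
import json
import random
import time
from pathlib import Path


def load_with_fallback(fetch, cache_path, attempts=3,
                       wait_range=(5, 30), sleep=time.sleep):
    """Call fetch() up to `attempts` times; on repeated failure read a JSON
    cache file instead. Returns None if neither source is available."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError:
            # urllib.error.URLError and ssl.SSLError both derive from OSError.
            if attempt < attempts:
                sleep(random.randint(*wait_range))
    cache = Path(cache_path)
    if cache.is_file():
        return json.loads(cache.read_text())
    return None


def always_fails():
    """Stand-in for a download that fails certificate verification."""
    raise OSError("certificate verify failed (simulated)")


# Record the waits instead of sleeping so the demo finishes immediately.
waits = []
result = load_with_fallback(always_fails,
                            "hypothetical_missing_cache_file.json",
                            sleep=waits.append)
print(result, len(waits))
```

In this pattern a missing cache after a failed download leaves the caller with no data at all, which matches the final "is not available ... skipped" warning in the log.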
©2024 CERN