Thread 'ATLAS vbox and native 3.01'

David Cameron Project administrator Project developer Project tester Project scientist Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0	Message 7946 - Posted: 13 Mar 2023, 10:29:08 UTC v3.01 was just released. This contains updated ATLAS software for the latest version of Run 3 simulation. ID: 7946 · Rating: 0 · rate: / Reply Quote

maeax Send message Joined: 22 Apr 16 Posts: 806 Credit: 4,294,466 RAC: 1,957	Message 7947 - Posted: 13 Mar 2023, 11:14:28 UTC - in response to Message 7946. First Task Win10pro finished: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3193237 ID: 7947 · Rating: 0 · rate: / Reply Quote

David Cameron Project administrator Project developer Project tester Project scientist Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0	Message 7948 - Posted: 13 Mar 2023, 11:26:28 UTC I've started submitting tasks with the Run 3 software now - these are tasks with input file EVNT.29838250._000010.pool.root.1. Please post if you see any problems or notice anything strange, for example in the event monitor or other places. ID: 7948 · Rating: 0 · rate: / Reply Quote

maeax Send message Joined: 22 Apr 16 Posts: 806 Credit: 4,294,466 RAC: 1,957	Message 7949 - Posted: 13 Mar 2023, 19:25:33 UTC - in response to Message 7948. Last modified: 13 Mar 2023, 19:26:36 UTC Atlas-native: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3193330
[2023-03-13 19:35:44] 2023-03-13 18:35:18,761 | WARNING | making sure that job.state is set to failed since a pilot error code is set
[2023-03-13 19:35:44] 2023-03-13 18:35:18,761 | WARNING | format EVNTtoHITS has no such key: dbData
[2023-03-13 19:35:44] 2023-03-13 18:35:18,761 | WARNING | format EVNTtoHITS has no such key: dbTime
[2023-03-13 19:35:44] 2023-03-13 18:35:18,762 | WARNING | wrong length of table data, x=[1678732251.0], y=[623076.0] (must be same and length>=4)
[2023-03-13 19:35:44] 2023-03-13 18:35:18,762 | INFO | payload/TRF did not report the number of read events
ID: 7949 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 1019 Credit: 18,569,023 RAC: 20,596	Message 7950 - Posted: 14 Mar 2023, 10:52:30 UTC Thanks for always being with us over the years, David. Hope all is great at your new job. ID: 7950 · Rating: 0 · rate: / Reply Quote

boboviz Send message Joined: 24 Oct 19 Posts: 330 Credit: 1,142,372 RAC: 1,203	Message 7951 - Posted: 14 Mar 2023, 14:18:24 UTC I see 99 ATLAS WUs queued, but if I try to download, the message is "got 0 new tasks". ID: 7951 · Rating: 0 · rate: / Reply Quote

David Cameron Project administrator Project developer Project tester Project scientist Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0	Message 7952 - Posted: 14 Mar 2023, 15:33:42 UTC Looks like native tasks run well now, but I'm still ironing out a few issues with the vbox version. ID: 7952 · Rating: 0 · rate: / Reply Quote

Yeti Send message Joined: 29 May 15 Posts: 165 Credit: 3,681,653 RAC: 5,994	Message 7953 - Posted: 14 Mar 2023, 15:58:27 UTC Sorry, but for me, it looks very unusual. These 2 tasks seemed to run endlessly, and CPU time was way too low. I'm running Ubuntu 22.04.x. Oh, I see, the results have been uploaded now, and both say "Hits file was produced successfull". Not sure if this is really true: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3193657 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3193668 ID: 7953 · Rating: 0 · rate: / Reply Quote

boboviz Send message Joined: 24 Oct 19 Posts: 330 Credit: 1,142,372 RAC: 1,203	Message 7954 - Posted: 14 Mar 2023, 17:11:59 UTC - in response to Message 7952. Last modified: 14 Mar 2023, 17:14:24 UTC
"Looks like native tasks run well now, but I'm still ironing out a few issues with the vbox version."
That's true. I correctly downloaded the native app on my Linux box.
ID: 7954 · Rating: 0 · rate: / Reply Quote

David Cameron Project administrator Project developer Project tester Project scientist Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0	Message 7955 - Posted: 14 Mar 2023, 18:27:06 UTC - in response to Message 7953.
"Sorry, but for me, it looks very unusual. These 2 tasks seemed to run endlessly, and CPU time was way too low. I'm running Ubuntu 22.04.x. Oh, I see, the results have been uploaded now, and both say 'Hits file was produced successfull'. Not sure if this is really true: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3193657 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3193668"
The tasks are successful, but the run time is quite large compared to the CPU time. These are short tasks simulating 5 events each; I will submit some longer ones with 50 events to see the difference. I think the vbox tasks are also working ok now.
ID: 7955 · Rating: 0 · rate: / Reply Quote
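For anyone who wants to put a number on "run time is quite large compared to the CPU time", here is a minimal Python sketch. It assumes the CPU time shown on a result page is summed over all threads, which is an assumption on my side and not something confirmed in this thread. The function name `cpu_efficiency` and the example numbers are hypothetical.

```python
def cpu_efficiency(cpu_seconds, wall_seconds, n_threads):
    """Fraction of the available thread-time that was spent on the CPU.

    Assumes cpu_seconds is the total over all threads (an assumption,
    not confirmed by the project).
    """
    if wall_seconds <= 0 or n_threads <= 0:
        raise ValueError("wall_seconds and n_threads must be positive")
    return cpu_seconds / (wall_seconds * n_threads)


# Hypothetical numbers for illustration only, not taken from any result above:
# 10 min of CPU time over a 50 min run on 5 threads.
print(f"CPU efficiency: {cpu_efficiency(600, 3000, 5):.1%}")
```

A short 5-event task spends most of its run time in setup, so a low value here would be expected; the longer tasks should give a more meaningful figure.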

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 1019 Credit: 18,569,023 RAC: 20,596	Message 7956 - Posted: 14 Mar 2023, 20:05:24 UTC I'm ready for some long Windows vbox version work. ID: 7956 · Rating: 0 · rate: / Reply Quote

boboviz Send message Joined: 24 Oct 19 Posts: 330 Credit: 1,142,372 RAC: 1,203	Message 7957 - Posted: 14 Mar 2023, 20:28:42 UTC - in response to Message 7955.
"I think the vbox tasks are also working ok now."
Not for windows
14/03/2023 21:26:23 | lhcathome-dev | Requesting new tasks for CPU and AMD/ATI GPU
14/03/2023 21:26:24 | lhcathome-dev | Scheduler request completed: got 0 new tasks
14/03/2023 21:26:24 | lhcathome-dev | No tasks sent
14/03/2023 21:26:24 | lhcathome-dev | No tasks are available for ATLAS Simulation
ID: 7957 · Rating: 0 · rate: / Reply Quote

maeax Send message Joined: 22 Apr 16 Posts: 806 Credit: 4,294,466 RAC: 1,957	Message 7958 - Posted: 14 Mar 2023, 20:56:22 UTC - in response to Message 7957. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3193785 Have five in the pipeline, runtime always 50 min (5 CPU's) Win10pro. ID: 7958 · Rating: 0 · rate: / Reply Quote

David Cameron Project administrator Project developer Project tester Project scientist Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0	Message 7959 - Posted: 15 Mar 2023, 8:45:03 UTC I'm submitting some longer tasks now (500 events). ID: 7959 · Rating: 0 · rate: / Reply Quote

maeax Send message Joined: 22 Apr 16 Posts: 806 Credit: 4,294,466 RAC: 1,957	Message 7960 - Posted: 15 Mar 2023, 8:58:26 UTC - in response to Message 7959. First one is running:https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=2287426 ID: 7960 · Rating: 0 · rate: / Reply Quote

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 544 Credit: 400,710 RAC: 0	Message 7961 - Posted: 15 Mar 2023, 9:43:18 UTC Findings:
1. On a fully loaded Threadripper it takes ~20 min to complete the initial setup. This is mainly caused by lots of downloads from CVMFS and Frontier. The long setup phase has already been mentioned by other testers.
2. Frontier requests are sent to atlasfrontier-ai.cern.ch although they should go to atlascern-frontier.openhtc.io
3. Looks like worker threads do not log their progress into separate logfiles any more. This means ATLAS event monitoring (vbox app) will not work any more.
ID: 7961 · Rating: 0 · rate: / Reply Quote

maeax Send message Joined: 22 Apr 16 Posts: 806 Credit: 4,294,466 RAC: 1,957	Message 7962 - Posted: 15 Mar 2023, 10:04:28 UTC - in response to Message 7960. ATLAS Event Progress Monitoring (v.4.1.0) shows 500 events total, but no other details. ID: 7962 · Rating: 0 · rate: / Reply Quote

David Cameron Project administrator Project developer Project tester Project scientist Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0	Message 7963 - Posted: 15 Mar 2023, 10:04:41 UTC - in response to Message 7961.
"Findings: 1. On a fully loaded Threadripper it takes ~20 min to complete the initial setup. This is mainly caused by lots of downloads from CVMFS and Frontier. The long setup phase has already been mentioned by other testers."
A long setup time would be expected for the first native task, since the CVMFS cache needs to be filled with the new software libraries. I'd be interested to see the timing for subsequent tasks on the same host. On my test host running native tasks (inside CERN, so ideal conditions) and a warm cache it takes around 5 mins to start crunching.
"2. Frontier requests are sent to atlasfrontier-ai.cern.ch although they should go to atlascern-frontier.openhtc.io"
Right, this is an error on my part, I will fix it.
"3. Looks like worker threads do not log their progress into separate logfiles any more. This means ATLAS event monitoring (vbox app) will not work any more."
This is a feature of the way the new software works. Instead of multiple processes, it uses multiple threads which makes memory usage much more efficient. (background reading for anyone interested) This means all the threads log to a single file, log.EVNTtoHITS, which is how single core tasks worked before. I suppose the monitoring worked for single core tasks in the past so it should still work now?
ID: 7963 · Rating: 0 · rate: / Reply Quote

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 544 Credit: 400,710 RAC: 0	Message 7964 - Posted: 15 Mar 2023, 10:31:10 UTC - in response to Message 7963.
"Findings: 1. On a fully loaded Threadripper it takes ~20 min to complete the initial setup. This is mainly caused by lots of downloads from CVMFS and Frontier. The long setup phase has already been mentioned by other testers."
"A long setup time would be expected for the first native task, since the CVMFS cache needs to be filled with the new software libraries. I'd be interested to see the timing for subsequent tasks on the same host. On my test host running native tasks (inside CERN, so ideal conditions) and a warm cache it takes around 5 mins to start crunching."
To test this and ensure comparable total load I'll run another task when the 1st one has finished.
"2. Frontier requests are sent to atlasfrontier-ai.cern.ch although they should go to atlascern-frontier.openhtc.io"
"Right, this is an error on my part, I will fix it."
"3. Looks like worker threads do not log their progress into separate logfiles any more. This means ATLAS event monitoring (vbox app) will not work any more."
"This is a feature of the way the new software works. Instead of multiple processes, it uses multiple threads which makes memory usage much more efficient. (background reading for anyone interested) This means all the threads log to a single file, log.EVNTtoHITS, which is how single core tasks worked before. I suppose the monitoring worked for single core tasks in the past so it should still work now?"
Monitoring needs to be revised. The old ATLAS version wrote its log information into a couple of files. In case of a singlecore setup only "log.EVNTtoHITS" was used. In case of multicore "log.EVNTtoHITS" was used for global information like the #events but the event status was logged to the worker logs.
Now it's all in "log.EVNTtoHITS" but the output patterns are not consistent, e.g.:
11:18:13 ISF_Kernel_FullG4MT_QS.ISF_LongLivedGeant4Tool 108 2 INFO Run:Event 410000:99609 (29th event for this worker) took 239.9 s. New average 450.6 +- 44.52
. . .
11:18:14 AthenaHiveEventLoopMgr 108 2 INFO ===>>> done processing event #99609, run #410000 on slot 2, 106 events processed so far <<<===
ID: 7964 · Rating: 0 · rate: / Reply Quote
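As a rough starting point for a revised monitor, the two patterns quoted above could be matched along these lines. This is only a sketch in Python, not the actual monitoring script: the regular expressions are derived from nothing but those two sample lines, so other messages or later releases may not match, and the names `TIMING_RE`, `DONE_RE` and `parse_line` are made up for illustration.

```python
import re

# Both patterns are derived only from the two sample lines quoted above.
TIMING_RE = re.compile(
    r"Run:Event (?P<run>\d+):(?P<event>\d+) "
    r"\((?P<nth>\d+)(?:st|nd|rd|th) event for this worker\) "
    r"took (?P<secs>[\d.]+) s\. New average (?P<avg>[\d.]+)"
)
DONE_RE = re.compile(
    r"done processing event #(?P<event>\d+), run #(?P<run>\d+) "
    r"on slot (?P<slot>\d+), (?P<total>\d+) events processed so far"
)


def parse_line(line):
    """Return a dict for a recognised line, or None for anything else."""
    m = TIMING_RE.search(line)
    if m:
        return {
            "kind": "timing",
            "run": int(m["run"]),
            "event": int(m["event"]),
            "seconds": float(m["secs"]),
            "average": float(m["avg"]),
        }
    m = DONE_RE.search(line)
    if m:
        return {
            "kind": "done",
            "run": int(m["run"]),
            "event": int(m["event"]),
            "slot": int(m["slot"]),
            "total": int(m["total"]),
        }
    return None


sample = (
    "11:18:14 AthenaHiveEventLoopMgr 108 2 INFO ===>>> done processing "
    "event #99609, run #410000 on slot 2, 106 events processed so far <<<==="
)
print(parse_line(sample))
```

The "done processing" line already carries the running total, so a monitor could probably get the overall progress from that pattern alone and use the timing line only for the per-event duration.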

Yeti Send message Joined: 29 May 15 Posts: 165 Credit: 3,681,653 RAC: 5,994	Message 7965 - Posted: 15 Mar 2023, 10:46:14 UTC - in response to Message 7963.
"A long setup time would be expected for the first native task, since the CVMFS cache needs to be filled with the new software libraries. I'd be interested to see the timing for subsequent tasks on the same host. On my test host running native tasks (inside CERN, so ideal conditions) and a warm cache it takes around 5 mins to start crunching."
I cannot confirm this expected behaviour. First: Remember, I have set up a central Squid proxy that works fine. You can check all my active clients; they are all running one DEV task after another. So I would expect that the second or third task would need less startup time, but they all need the same time. Example: https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=4703
In the log file I still see these warnings:
[2023-03-14 18:19:53] 2023-03-14 17:19:10,304 | INFO | [attempt=2/3] loading data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json
[2023-03-14 18:19:53] 2023-03-14 17:19:10,396 | WARNING | failed to load data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json, error: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[2023-03-14 18:19:53] 2023-03-14 17:19:10,397 | INFO | will try again after 18s..
[2023-03-14 18:19:53] 2023-03-14 17:19:28,484 | INFO | [attempt=3/3] loading data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json
[2023-03-14 18:19:53] 2023-03-14 17:19:28,571 | WARNING | failed to load data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json, error: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[2023-03-14 18:19:53] 2023-03-14 17:19:28,572 | WARNING | cache file=/var/lib/boinc_data/boinc-01/slots/2/agis_ddmendpoints.agis.ALL.json is not available: [Errno 2] No such file or directory: '/var/lib/boinc_data/boinc-0
[2023-03-14 18:19:53] 2023-03-14 17:19:28,606 | INFO | transferring file log.32413688._000229-57598-1678802947.job.log.tgz.1 from /var/lib/boinc_data/boinc-01/slots/2/PanDA_Pilot-5753961892/log.32413688._000229-57598-1
[2023-03-14 18:19:53] 2023-03-14 17:19:28,606 | INFO | executing command: /usr/bin/env mv /var/lib/boinc_data/boinc-01/slots/2/PanDA_Pilot-5753961892/log.32413688._000229-57598-1678802947.job.log.tgz.1 /var/lib/boinc_d
[2023-03-14 18:19:53] 2023-03-14 17:19:28,627 | INFO | Adding to output.list: log.32413688._000229-57598-1678802947.job.log.tgz.1 davs://dav.ndgf.org:443/atlas/disk/atlasdatadisk/rucio/valid1/20/76/log.32413688._000229
[2023-03-14 18:19:53] 2023-03-14 17:19:28,628 | INFO | executing command: ps aux -q 306808
[2023-03-14 18:19:53] 2023-03-14 17:19:28,656 | INFO | summary of transferred files:
[2023-03-14 18:19:53] 2023-03-14 17:19:28,657 | INFO | -- lfn=log.32413688._000229-57598-1678802947.job.log.tgz.1, status_code=0, status=transferred
[2023-03-14 18:19:53] 2023-03-14 17:19:28,657 | INFO | stage-out finished correctly
[2023-03-14 18:19:53] 2023-03-14 17:19:28,704 | INFO | finished stage-out for finished payload, adding job to finished_jobs queue
[2023-03-14 18:19:53] 2023-03-14 17:19:29,298 | INFO | job 5753961892 has state=finished
[2023-03-14 18:19:53] 2023-03-14 17:19:29,299 | INFO | preparing for final server update for job 5753961892 in state='finished'
[2023-03-14 18:19:53] 2023-03-14 17:19:29,299 | INFO | pilot will not update the server (heartbeat message will be written to file)
[2023-03-14 18:19:53] 2023-03-14 17:19:29,300 | INFO | job 5753961892 has finished - writing final server update
[2023-03-14 18:19:53] 2023-03-14 17:19:29,300 | WARNING | format EVNTtoHITS has no such key: dbData
[2023-03-14 18:19:53] 2023-03-14 17:19:29,301 | WARNING | format EVNTtoHITS has no such key: dbTime
[2023-03-14 18:19:53] 2023-03-14 17:19:29,302 | WARNING | wrong length of table data, x=[1678813630.0, 1678813691.0, 1678813752.0, 1678813813.0, 1678813874.0], y=[1051252.0, 1090314.0, 1911720.0, 2161488.0, 2321179.0] (mu
[2023-03-14 18:19:53] 2023-03-14 17:19:29,303 | INFO | total number of processed events: 5 (read)
ID: 7965 · Rating: 0 · rate: / Reply Quote
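To compare warnings between tasks without reading the whole pilot log, something like the following Python sketch might help. It assumes every line follows the layout seen in the excerpts in this thread (a bracketed wrapper timestamp, the pilot timestamp, then level and message separated by pipes) and also tolerates the backslash-escaped pipes that show up in some copies. The names `LINE_RE` and `warnings_by_message` are hypothetical, and the layout could differ in other pilot versions.

```python
import re
from collections import Counter

# Layout assumed from the excerpts in this thread:
# [wrapper timestamp] pilot timestamp | LEVEL | message
LINE_RE = re.compile(
    r"^\[(?P<wrapper_ts>[^\]]+)\]\s+"
    r"(?P<pilot_ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+\\?\|\s*"
    r"(?P<level>[A-Z]+)\s*\\?\|\s*(?P<msg>.*)$"
)


def warnings_by_message(lines):
    """Count WARNING messages; lines that do not fit the layout are skipped."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m["level"] == "WARNING":
            counts[m["msg"]] += 1
    return counts


excerpt = [
    "[2023-03-14 18:19:53] 2023-03-14 17:19:29,300 | WARNING | format EVNTtoHITS has no such key: dbData",
    "[2023-03-14 18:19:53] 2023-03-14 17:19:29,301 | WARNING | format EVNTtoHITS has no such key: dbTime",
    "[2023-03-14 18:19:53] 2023-03-14 17:19:29,303 | INFO | total number of processed events: 5 (read)",
]
for message, count in warnings_by_message(excerpt).items():
    print(count, message)
```

Running it over the logs of a first and a later task on the same host would show whether the same warnings, such as the certificate failure, repeat on every task.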

Development for LHC@home