ATLAS vbox and native 3.01
Author | Message |
---|---|
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
v3.01 was just released. This contains updated ATLAS software for the latest version of Run 3 simulation. |
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 819 |
First Task Win10pro finished: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3193237 |
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
I've started submitting tasks with the Run 3 software now - these are tasks with input file EVNT.29838250._000010.pool.root.1. Please post if you see any problems or notice anything strange, for example in the event monitor or other places. |
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 819 |
Atlas-native: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3193330
[2023-03-13 19:35:44] 2023-03-13 18:35:18,761 | WARNING | making sure that job.state is set to failed since a pilot error code is set
[2023-03-13 19:35:44] 2023-03-13 18:35:18,761 | WARNING | format EVNTtoHITS has no such key: dbData
[2023-03-13 19:35:44] 2023-03-13 18:35:18,761 | WARNING | format EVNTtoHITS has no such key: dbTime
[2023-03-13 19:35:44] 2023-03-13 18:35:18,762 | WARNING | wrong length of table data, x=[1678732251.0], y=[623076.0] (must be same and length>=4)
[2023-03-13 19:35:44] 2023-03-13 18:35:18,762 | INFO | payload/TRF did not report the number of read events |
Joined: 8 Apr 15 Posts: 778 Credit: 12,143,363 RAC: 2,436 |
Thanks for always being with us over the years, David. Hope all is great at your new job. |
Joined: 24 Oct 19 Posts: 165 Credit: 392,902 RAC: 2,415 |
I see 99 ATLAS WUs queued, but if I try to download them, the message is "got 0 new task". |
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
Looks like native tasks run well now, but I'm still ironing out a few issues with the vbox version. |
Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
Sorry, but to me this looks very unusual. These 2 tasks seemed to run endlessly, and CPU time was way too low. I'm running Ubuntu 22.04.x. Oh, I see the results have been uploaded now; both say "Hits file was produced successfull". Not sure if this is really true: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3193657 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3193668 |
Joined: 24 Oct 19 Posts: 165 Credit: 392,902 RAC: 2,415 |
Looks like native tasks run well now, but I'm still ironing out a few issues with the vbox version.
That's true. The native app downloaded correctly on my Linux box. |
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
Sorry, but to me this looks very unusual. These 2 tasks seemed to run endlessly, and CPU time was way too low:
The tasks are successful, but the run time is quite large compared to the CPU time. These are short tasks simulating 5 events each; I will submit some longer ones with 50 events to see the difference. I think the vbox tasks are also working OK now. |
Joined: 8 Apr 15 Posts: 778 Credit: 12,143,363 RAC: 2,436 |
I'm ready for some long Windows vbox version work. |
Joined: 24 Oct 19 Posts: 165 Credit: 392,902 RAC: 2,415 |
I think the vbox tasks are also working ok now.
Not for Windows:
14/03/2023 21:26:23 | lhcathome-dev | Requesting new tasks for CPU and AMD/ATI GPU |
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 819 |
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3193785 I have five in the pipeline; the runtime is always 50 min (5 CPUs) on Win10pro. |
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
I'm submitting some longer tasks now (500 events). |
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 819 |
First one is running: https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=2287426 |
Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0 |
Findings:
1. On a fully loaded Threadripper it takes ~20 min to complete the initial setup. This is mainly caused by lots of downloads from CVMFS and Frontier. The long setup phase has already been mentioned by other testers.
2. Frontier requests are sent to atlasfrontier-ai.cern.ch although they should go to atlascern-frontier.openhtc.io
3. Looks like worker threads do not log their progress into separate logfiles any more. This means ATLAS event monitoring (vbox app) will not work any more. |
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 819 |
ATLAS Event Progress Monitoring (v.4.1.0) shows 500 events in total, but no other details. |
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
Findings:
1. A long setup time would be expected for the first native task, since the CVMFS cache needs to be filled with the new software libraries. I'd be interested to see the timing for subsequent tasks on the same host. On my test host running native tasks (inside CERN, so ideal conditions) with a warm cache, it takes around 5 minutes to start crunching.
2. Right, this is an error on my part; I will fix it.
3. This is a feature of the way the new software works. Instead of multiple processes, it uses multiple threads, which makes memory usage much more efficient (background reading for anyone interested). This means all the threads log to a single file, log.EVNTtoHITS, which is how single-core tasks worked before. I suppose the monitoring worked for single-core tasks in the past, so it should still work now? |
Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0 |
Findings:
1. To test this and ensure a comparable total load, I'll run another task when the first one has finished.
2. Monitoring needs to be revised. The old ATLAS version wrote its log information into a couple of files. In a single-core setup only "log.EVNTtoHITS" was used. In a multicore setup "log.EVNTtoHITS" was used for global information such as the number of events, but the event status was logged to the worker logs. Now it's all in "log.EVNTtoHITS", but the output patterns are not consistent, e.g.:
11:18:13 ISF_Kernel_FullG4MT_QS.ISF_LongLivedGeant4Tool 108 2 INFO Run:Event 410000:99609 (29th event for this worker) took 239.9 s. New average 450.6 +- 44.52
. . .
11:18:14 AthenaHiveEventLoopMgr 108 2 INFO ===>>> done processing event #99609, run #410000 on slot 2, 106 events processed so far <<<=== |
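Editorial sketch: one way a revised monitor might read the two line shapes quoted in the post above from the single "log.EVNTtoHITS" file. This is only an illustration under stated assumptions, not the actual monitoring code: the regular expressions are derived solely from the two sample lines in the post, and the names `DONE_RE`, `TIMING_RE` and `parse_progress` are hypothetical.

```python
import re

# Assumption: shape of the AthenaHiveEventLoopMgr line quoted in the post.
DONE_RE = re.compile(
    r"done processing event #(?P<event>\d+), run #(?P<run>\d+) "
    r"on slot (?P<slot>\d+),\s+(?P<total>\d+) events processed so far"
)

# Assumption: shape of the per-worker timing line quoted in the post.
TIMING_RE = re.compile(
    r"Run:Event (?P<run>\d+):(?P<event>\d+) "
    r"\((?P<nth>\d+)(?:st|nd|rd|th) event for this worker\) "
    r"took (?P<secs>[\d.]+) s\. New average (?P<avg>[\d.]+)"
)


def parse_progress(lines):
    """Return (events_processed, last_event_seconds) seen in the lines.

    Either value is None if no matching line was found.
    """
    processed = None
    last_secs = None
    for line in lines:
        done = DONE_RE.search(line)
        if done:
            # The "processed so far" counter is global across all slots.
            processed = int(done.group("total"))
            continue
        timing = TIMING_RE.search(line)
        if timing:
            last_secs = float(timing.group("secs"))
    return processed, last_secs


# The two sample lines from the post, rebuilt as single strings.
sample = [
    "11:18:13 ISF_Kernel_FullG4MT_QS.ISF_LongLivedGeant4Tool 108 2 INFO "
    "Run:Event 410000:99609 (29th event for this worker) took 239.9 s. "
    "New average 450.6 +- 44.52",
    "11:18:14 AthenaHiveEventLoopMgr 108 2 INFO ===>>> done processing "
    "event #99609, run #410000 on slot 2, 106 events processed so far <<<===",
]

print(parse_progress(sample))
```

Matching on the message text rather than on the leading columns keeps the sketch independent of the component name and the numeric fields before "INFO", whose meaning is not explained in the thread.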
Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
A long setup time would be expected for the first native task, since the CVMFS cache needs to be filled with the new software libraries. I'd be interested to see the timing for subsequent tasks on the same host. On my test host running native tasks (inside CERN, so ideal conditions) and a warm cache it takes around 5 mins to start crunching.
I cannot confirm this expected behaviour. First, remember that I have set up a central Squid proxy that works fine. You can check all my active clients; they are all running one dev task after another. So I would expect the second or third task to need less startup time, but they all need the same time. Example: https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=4703
In the log file I still see these warnings:
[2023-03-14 18:19:53] 2023-03-14 17:19:10,304 | INFO | [attempt=2/3] loading data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json
[2023-03-14 18:19:53] 2023-03-14 17:19:10,396 | WARNING | failed to load data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json, error: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[2023-03-14 18:19:53] 2023-03-14 17:19:10,397 | INFO | will try again after 18s..
[2023-03-14 18:19:53] 2023-03-14 17:19:28,484 | INFO | [attempt=3/3] loading data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json
[2023-03-14 18:19:53] 2023-03-14 17:19:28,571 | WARNING | failed to load data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json, error: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[2023-03-14 18:19:53] 2023-03-14 17:19:28,572 | WARNING | cache file=/var/lib/boinc_data/boinc-01/slots/2/agis_ddmendpoints.agis.ALL.json is not available: [Errno 2] No such file or directory: '/var/lib/boinc_data/boinc-0
[2023-03-14 18:19:53] 2023-03-14 17:19:28,606 | INFO | transferring file log.32413688._000229-57598-1678802947.job.log.tgz.1 from /var/lib/boinc_data/boinc-01/slots/2/PanDA_Pilot-5753961892/log.32413688._000229-57598-1
[2023-03-14 18:19:53] 2023-03-14 17:19:28,606 | INFO | executing command: /usr/bin/env mv /var/lib/boinc_data/boinc-01/slots/2/PanDA_Pilot-5753961892/log.32413688._000229-57598-1678802947.job.log.tgz.1 /var/lib/boinc_d
[2023-03-14 18:19:53] 2023-03-14 17:19:28,627 | INFO | Adding to output.list: log.32413688._000229-57598-1678802947.job.log.tgz.1 davs://dav.ndgf.org:443/atlas/disk/atlasdatadisk/rucio/valid1/20/76/log.32413688._000229
[2023-03-14 18:19:53] 2023-03-14 17:19:28,628 | INFO | executing command: ps aux -q 306808
[2023-03-14 18:19:53] 2023-03-14 17:19:28,656 | INFO | summary of transferred files:
[2023-03-14 18:19:53] 2023-03-14 17:19:28,657 | INFO | -- lfn=log.32413688._000229-57598-1678802947.job.log.tgz.1, status_code=0, status=transferred
[2023-03-14 18:19:53] 2023-03-14 17:19:28,657 | INFO | stage-out finished correctly
[2023-03-14 18:19:53] 2023-03-14 17:19:28,704 | INFO | finished stage-out for finished payload, adding job to finished_jobs queue
[2023-03-14 18:19:53] 2023-03-14 17:19:29,298 | INFO | job 5753961892 has state=finished
[2023-03-14 18:19:53] 2023-03-14 17:19:29,299 | INFO | preparing for final server update for job 5753961892 in state='finished'
[2023-03-14 18:19:53] 2023-03-14 17:19:29,299 | INFO | pilot will not update the server (heartbeat message will be written to file)
[2023-03-14 18:19:53] 2023-03-14 17:19:29,300 | INFO | job 5753961892 has finished - writing final server update
[2023-03-14 18:19:53] 2023-03-14 17:19:29,300 | WARNING | format EVNTtoHITS has no such key: dbData
[2023-03-14 18:19:53] 2023-03-14 17:19:29,301 | WARNING | format EVNTtoHITS has no such key: dbTime
[2023-03-14 18:19:53] 2023-03-14 17:19:29,302 | WARNING | wrong length of table data, x=[1678813630.0, 1678813691.0, 1678813752.0, 1678813813.0, 1678813874.0], y=[1051252.0, 1090314.0, 1911720.0, 2161488.0, 2321179.0] (mu
[2023-03-14 18:19:53] 2023-03-14 17:19:29,303 | INFO | total number of processed events: 5 (read) |
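Editorial sketch: when comparing task logs like the ones posted in this thread, it can help to separate the WARNING entries from the INFO noise. The snippet below is a minimal example, assuming every pilot line has the shape "[wrapper timestamp] pilot timestamp | LEVEL | message" seen in the excerpts above. `PILOT_RE`, `parse_pilot_line` and `count_levels` are hypothetical names for illustration and are not part of the pilot or of BOINC.

```python
import re
from collections import Counter

# Assumed line shape, taken only from the excerpts posted in this thread:
# "[2023-03-14 18:19:53] 2023-03-14 17:19:29,300 | WARNING | message text"
PILOT_RE = re.compile(
    r"^\[(?P<wrapper_ts>[^\]]+)\]\s+"
    r"(?P<pilot_ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+\|\s+"
    r"(?P<level>[A-Z]+)\s+\|\s+(?P<msg>.*)$"
)


def parse_pilot_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = PILOT_RE.match(line.strip())
    return match.groupdict() if match else None


def count_levels(lines):
    """Count how many lines were logged at each level (INFO, WARNING, ...)."""
    counts = Counter()
    for line in lines:
        record = parse_pilot_line(line)
        if record:
            counts[record["level"]] += 1
    return counts


# Two lines copied from the log excerpt above.
sample = [
    "[2023-03-14 18:19:53] 2023-03-14 17:19:29,300 | WARNING | "
    "format EVNTtoHITS has no such key: dbData",
    "[2023-03-14 18:19:53] 2023-03-14 17:19:29,303 | INFO | "
    "total number of processed events: 5 (read)",
]

for line in sample:
    record = parse_pilot_line(line)
    if record and record["level"] == "WARNING":
        print(record["pilot_ts"], record["msg"])
```

Lines that do not follow the assumed shape are skipped rather than raising an error, since excerpts pasted into a forum are often truncated.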
©2024 CERN