Message boards : ATLAS Application : ATLAS vbox and native 3.01

David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 7946 - Posted: 13 Mar 2023, 10:29:08 UTC

v3.01 was just released. This contains updated ATLAS software for the latest version of Run 3 simulation.
maeax

Joined: 22 Apr 16
Posts: 677
Credit: 2,002,766
RAC: 1
Message 7947 - Posted: 13 Mar 2023, 11:14:28 UTC - in response to Message 7946.  

David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 7948 - Posted: 13 Mar 2023, 11:26:28 UTC

I've started submitting tasks with the Run 3 software now - these are tasks with input file EVNT.29838250._000010.pool.root.1.

Please post if you see any problems or notice anything strange, for example in the event monitor or other places.
maeax

Joined: 22 Apr 16
Posts: 677
Credit: 2,002,766
RAC: 1
Message 7949 - Posted: 13 Mar 2023, 19:25:33 UTC - in response to Message 7948.  
Last modified: 13 Mar 2023, 19:26:36 UTC

ATLAS native: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3193330
[2023-03-13 19:35:44] 2023-03-13 18:35:18,761 | WARNING | making sure that job.state is set to failed since a pilot error code is set
[2023-03-13 19:35:44] 2023-03-13 18:35:18,761 | WARNING | format EVNTtoHITS has no such key: dbData
[2023-03-13 19:35:44] 2023-03-13 18:35:18,761 | WARNING | format EVNTtoHITS has no such key: dbTime
[2023-03-13 19:35:44] 2023-03-13 18:35:18,762 | WARNING | wrong length of table data, x=[1678732251.0], y=[623076.0] (must be same and length>=4)
[2023-03-13 19:35:44] 2023-03-13 18:35:18,762 | INFO | payload/TRF did not report the number of read events
Magic Quantum Mechanic

Joined: 8 Apr 15
Posts: 781
Credit: 12,366,630
RAC: 4,304
Message 7950 - Posted: 14 Mar 2023, 10:52:30 UTC

Thanks for always being with us over the years David
Hope all is great at your new job
boboviz

Joined: 24 Oct 19
Posts: 170
Credit: 543,238
RAC: 882
Message 7951 - Posted: 14 Mar 2023, 14:18:24 UTC

I see 99 ATLAS WUs queued, but when I try to download any, the message is "got 0 new tasks".
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 7952 - Posted: 14 Mar 2023, 15:33:42 UTC

Looks like native tasks run well now, but I'm still ironing out a few issues with the vbox version.
Yeti

Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 7953 - Posted: 14 Mar 2023, 15:58:27 UTC

Sorry, but to me this looks very unusual. These 2 tasks seemed to run endlessly, and the CPU time was way too low:



I'm running Ubuntu 22.04.x

Oh, I see, the results have been uploaded now; both say "Hits file was produced successfull". Not sure if this is really true.

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3193657

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3193668
boboviz

Joined: 24 Oct 19
Posts: 170
Credit: 543,238
RAC: 882
Message 7954 - Posted: 14 Mar 2023, 17:11:59 UTC - in response to Message 7952.  
Last modified: 14 Mar 2023, 17:14:24 UTC

[quote]Looks like native tasks run well now, but I'm still ironing out a few issues with the vbox version.[/quote]

That's true.
I downloaded the native app correctly on my Linux box.
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 7955 - Posted: 14 Mar 2023, 18:27:06 UTC - in response to Message 7953.  

[quote]Sorry, but to me this looks very unusual. These 2 tasks seemed to run endlessly, and the CPU time was way too low:

I'm running Ubuntu 22.04.x

Oh, I see, the results have been uploaded now; both say "Hits file was produced successfull". Not sure if this is really true.

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3193657

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3193668[/quote]


The tasks are successful, but the run time is quite large compared to the CPU time. These are short tasks simulating 5 events each; I will submit some longer ones with 50 events to see the difference.

I think the vbox tasks are also working ok now.
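
For anyone who wants to put "run time is quite large compared to the CPU time" into a single number, a rough CPU efficiency figure can help. This is only a sketch: it assumes the reported CPU time is summed over all cores used by the task, and the numbers in the example are made up for illustration, not taken from the results linked above.

```python
def cpu_efficiency(cpu_seconds, wall_seconds, cores):
    """Fraction of the available core time that was actually spent on the CPU.

    Assumption: cpu_seconds is the total over all cores, so the time
    available to the task is wall_seconds * cores.
    """
    if wall_seconds <= 0 or cores <= 0:
        raise ValueError("wall_seconds and cores must be positive")
    return cpu_seconds / (wall_seconds * cores)


# Hypothetical short task: 10 min of CPU time, 50 min of run time, 5 cores.
print(f"{cpu_efficiency(600, 3000, 5):.0%}")
```

A short task spends most of its run time in setup, so its efficiency comes out low; with more events per task the setup cost is spread over more work and the figure should rise.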
Magic Quantum Mechanic

Joined: 8 Apr 15
Posts: 781
Credit: 12,366,630
RAC: 4,304
Message 7956 - Posted: 14 Mar 2023, 20:05:24 UTC

I'm ready for some long Windows vbox version work.
boboviz

Joined: 24 Oct 19
Posts: 170
Credit: 543,238
RAC: 882
Message 7957 - Posted: 14 Mar 2023, 20:28:42 UTC - in response to Message 7955.  

[quote]I think the vbox tasks are also working ok now.[/quote]

Not for Windows:
14/03/2023 21:26:23 | lhcathome-dev | Requesting new tasks for CPU and AMD/ATI GPU
14/03/2023 21:26:24 | lhcathome-dev | Scheduler request completed: got 0 new tasks
14/03/2023 21:26:24 | lhcathome-dev | No tasks sent
14/03/2023 21:26:24 | lhcathome-dev | No tasks are available for ATLAS Simulation
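
If you want to spot this pattern in a longer event log, a small script can pick out the scheduler replies. This is a minimal sketch that assumes the "timestamp | project | message" layout shown in the lines above; the function name is made up for the example, and the timestamp is not parsed because its format depends on the client's locale.

```python
import re

# Matches lines such as:
# 14/03/2023 21:26:24 | lhcathome-dev | Scheduler request completed: got 0 new tasks
SCHEDULER_REPLY = re.compile(
    r"^(?P<stamp>[^|]+?)\s*\|\s*(?P<project>[^|]+?)\s*\|\s*"
    r"Scheduler request completed: got (?P<count>\d+) new tasks\s*$"
)


def tasks_received(lines):
    """Return (project, task_count) for every scheduler reply in the log lines."""
    replies = []
    for line in lines:
        match = SCHEDULER_REPLY.match(line)
        if match:
            replies.append((match["project"], int(match["count"])))
    return replies


log = [
    "14/03/2023 21:26:23 | lhcathome-dev | Requesting new tasks for CPU and AMD/ATI GPU",
    "14/03/2023 21:26:24 | lhcathome-dev | Scheduler request completed: got 0 new tasks",
    "14/03/2023 21:26:24 | lhcathome-dev | No tasks sent",
]
empty = [project for project, count in tasks_received(log) if count == 0]
print(empty)
```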
maeax

Joined: 22 Apr 16
Posts: 677
Credit: 2,002,766
RAC: 1
Message 7958 - Posted: 14 Mar 2023, 20:56:22 UTC - in response to Message 7957.  

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3193785
I have five in the pipeline; the runtime is always 50 min (5 CPUs) on Win10 Pro.
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 7959 - Posted: 15 Mar 2023, 8:45:03 UTC

I'm submitting some longer tasks now (500 events).
maeax

Joined: 22 Apr 16
Posts: 677
Credit: 2,002,766
RAC: 1
Message 7960 - Posted: 15 Mar 2023, 8:58:26 UTC - in response to Message 7959.  

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert

Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 0
Message 7961 - Posted: 15 Mar 2023, 9:43:18 UTC

Findings:

1.
On a fully loaded Threadripper it takes ~20 min to complete the initial setup.
This is mainly caused by lots of downloads from CVMFS and Frontier.
The long setup phase has already been mentioned by other testers.


2.
Frontier requests are sent to atlasfrontier-ai.cern.ch although they should go to atlascern-frontier.openhtc.io


3.
Looks like worker threads do not log their progress into separate logfiles any more.
This means ATLAS event monitoring (vbox app) will not work any more.
maeax

Joined: 22 Apr 16
Posts: 677
Credit: 2,002,766
RAC: 1
Message 7962 - Posted: 15 Mar 2023, 10:04:28 UTC - in response to Message 7960.  

ATLAS Event Progress Monitoring (v4.1.0) shows 500 events total, but no other details.
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 7963 - Posted: 15 Mar 2023, 10:04:41 UTC - in response to Message 7961.  

[quote]Findings:

1.
On a fully loaded Threadripper it takes ~20 min to complete the initial setup.
This is mainly caused by lots of downloads from CVMFS and Frontier.
The long setup phase has already been mentioned by other testers.[/quote]


A long setup time would be expected for the first native task, since the CVMFS cache needs to be filled with the new software libraries. I'd be interested to see the timing for subsequent tasks on the same host. On my test host running native tasks (inside CERN, so ideal conditions) and a warm cache it takes around 5 mins to start crunching.

[quote]2.
Frontier requests are sent to atlasfrontier-ai.cern.ch although they should go to atlascern-frontier.openhtc.io[/quote]


Right, this is an error on my part, I will fix it.

[quote]3.
Looks like worker threads do not log their progress into separate logfiles any more.
This means ATLAS event monitoring (vbox app) will not work any more.[/quote]


This is a feature of the way the new software works. Instead of multiple processes, it uses multiple threads which makes memory usage much more efficient. (background reading for anyone interested)

This means all the threads log to a single file, log.EVNTtoHITS, which is how single core tasks worked before. I suppose the monitoring worked for single core tasks in the past so it should still work now?
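
To illustrate what "all the threads log to a single file" means in practice, here is a generic Python sketch. It is not the ATLAS software (which is not written like this); the logger name, thread names, and message text are invented for the example, and an in-memory buffer stands in for log.EVNTtoHITS.

```python
import io
import logging
import threading


def run_workers(n_workers, events_per_worker):
    """Run worker threads that all write to one shared log and return its lines."""
    stream = io.StringIO()  # stands in for the single shared log file
    handler = logging.StreamHandler(stream)
    # The thread name in every record is the only way to tell workers apart.
    handler.setFormatter(
        logging.Formatter("%(threadName)-10s %(levelname)s %(message)s")
    )
    log = logging.getLogger("shared_log_sketch")
    log.setLevel(logging.INFO)
    log.propagate = False
    log.addHandler(handler)

    def work(slot):
        for event in range(events_per_worker):
            log.info("done processing event %d on slot %d", event, slot)

    threads = [
        threading.Thread(target=work, args=(slot,), name=f"worker-{slot}")
        for slot in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    log.removeHandler(handler)
    return stream.getvalue().splitlines()


lines = run_workers(n_workers=3, events_per_worker=4)
print(len(lines), "lines in one shared log")
```

Lines from different workers end up interleaved in one stream, so anything that monitors progress has to filter by worker or slot inside each line rather than by file name.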
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert

Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 0
Message 7964 - Posted: 15 Mar 2023, 10:31:10 UTC - in response to Message 7963.  

[quote][quote]Findings:

1.
On a fully loaded Threadripper it takes ~20 min to complete the initial setup.
This is mainly caused by lots of downloads from CVMFS and Frontier.
The long setup phase has already been mentioned by other testers.[/quote]

A long setup time would be expected for the first native task, since the CVMFS cache needs to be filled with the new software libraries. I'd be interested to see the timing for subsequent tasks on the same host. On my test host running native tasks (inside CERN, so ideal conditions) and a warm cache it takes around 5 mins to start crunching.[/quote]

To test this and ensure comparable total load I'll run another task when the 1st one has finished.


[quote][quote]2.
Frontier requests are sent to atlasfrontier-ai.cern.ch although they should go to atlascern-frontier.openhtc.io[/quote]

Right, this is an error on my part, I will fix it.

[quote]3.
Looks like worker threads do not log their progress into separate logfiles any more.
This means ATLAS event monitoring (vbox app) will not work any more.[/quote]

This is a feature of the way the new software works. Instead of multiple processes, it uses multiple threads which makes memory usage much more efficient. (background reading for anyone interested)

This means all the threads log to a single file, log.EVNTtoHITS, which is how single core tasks worked before. I suppose the monitoring worked for single core tasks in the past so it should still work now?[/quote]

Monitoring needs to be revised.
The old ATLAS version wrote its log information into a couple of files.
In a single-core setup, only "log.EVNTtoHITS" was used.
In a multicore setup, "log.EVNTtoHITS" was used for global information like the number of events, but the event status was logged to the worker logs.

Now it's all in "log.EVNTtoHITS", but the output patterns are not consistent, e.g.:
11:18:13 ISF_Kernel_FullG4MT_QS.ISF_LongLivedGeant4Tool       108     2    INFO 	 Run:Event 410000:99609	 (29th event for this worker) took 239.9 s. New average 450.6 +- 44.52
.
.
.
11:18:14 AthenaHiveEventLoopMgr                               108     2    INFO   ===>>>  done processing event #99609, run #410000 on slot 2,  106 events processed so far  <<<===
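
As a starting point for revised monitoring, the two patterns above could be picked apart roughly like this. It is only a sketch built from the two sample lines; the regular expressions, function name, and field names are my own guesses and would need checking against a complete log.EVNTtoHITS.

```python
import re

# Per-event timing line written by the simulation tool.
TOOL_LINE = re.compile(
    r"Run:Event (?P<run>\d+):(?P<event>\d+)\s+"
    r"\((?P<nth>\d+)(?:st|nd|rd|th) event for this worker\) "
    r"took (?P<seconds>[\d.]+) s\."
)
# Summary line written by the event loop manager.
LOOP_LINE = re.compile(
    r"done processing event #(?P<event>\d+), run #(?P<run>\d+) "
    r"on slot (?P<slot>\d+),\s+(?P<done>\d+) events processed so far"
)


def parse_progress(line):
    """Return a dict describing a progress line, or None for any other line."""
    match = LOOP_LINE.search(line)
    if match:
        return {
            "kind": "loop",
            "run": int(match["run"]),
            "event": int(match["event"]),
            "slot": int(match["slot"]),
            "done": int(match["done"]),
        }
    match = TOOL_LINE.search(line)
    if match:
        return {
            "kind": "tool",
            "run": int(match["run"]),
            "event": int(match["event"]),
            "nth": int(match["nth"]),
            "seconds": float(match["seconds"]),
        }
    return None


sample = (
    "11:18:14 AthenaHiveEventLoopMgr   108     2    INFO   ===>>>  "
    "done processing event #99609, run #410000 on slot 2,  "
    "106 events processed so far  <<<==="
)
print(parse_progress(sample))
```

The "events processed so far" counter on the event loop line is probably the simplest value to feed a total progress display, while the tool line carries the per-event timing.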
Yeti
Avatar

Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 7965 - Posted: 15 Mar 2023, 10:46:14 UTC - in response to Message 7963.  

[quote]A long setup time would be expected for the first native task, since the CVMFS cache needs to be filled with the new software libraries. I'd be interested to see the timing for subsequent tasks on the same host. On my test host running native tasks (inside CERN, so ideal conditions) and a warm cache it takes around 5 mins to start crunching.[/quote]

I cannot confirm this expected behaviour.

First: Remember, I have set up a central Squid-Proxy that works fine.

You can check all my active clients, they are all running one DEV-Task after another. So, I would expect that the second or third Task would need less startup time, but they all need the same time.

Example: https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=4703

In the log file I still see these warnings:

[2023-03-14 18:19:53] 2023-03-14 17:19:10,304 | INFO | [attempt=2/3] loading data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json
[2023-03-14 18:19:53] 2023-03-14 17:19:10,396 | WARNING | failed to load data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json, error: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[2023-03-14 18:19:53] 2023-03-14 17:19:10,397 | INFO | will try again after 18s..
[2023-03-14 18:19:53] 2023-03-14 17:19:28,484 | INFO | [attempt=3/3] loading data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json
[2023-03-14 18:19:53] 2023-03-14 17:19:28,571 | WARNING | failed to load data from url=https://atlas-cric.cern.ch/cache/ddmendpoints.json, error: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[2023-03-14 18:19:53] 2023-03-14 17:19:28,572 | WARNING | cache file=/var/lib/boinc_data/boinc-01/slots/2/agis_ddmendpoints.agis.ALL.json is not available: [Errno 2] No such file or directory: '/var/lib/boinc_data/boinc-0
[2023-03-14 18:19:53] 2023-03-14 17:19:28,606 | INFO | transferring file log.32413688._000229-57598-1678802947.job.log.tgz.1 from /var/lib/boinc_data/boinc-01/slots/2/PanDA_Pilot-5753961892/log.32413688._000229-57598-1
[2023-03-14 18:19:53] 2023-03-14 17:19:28,606 | INFO | executing command: /usr/bin/env mv /var/lib/boinc_data/boinc-01/slots/2/PanDA_Pilot-5753961892/log.32413688._000229-57598-1678802947.job.log.tgz.1 /var/lib/boinc_d
[2023-03-14 18:19:53] 2023-03-14 17:19:28,627 | INFO | Adding to output.list: log.32413688._000229-57598-1678802947.job.log.tgz.1 davs://dav.ndgf.org:443/atlas/disk/atlasdatadisk/rucio/valid1/20/76/log.32413688._000229
[2023-03-14 18:19:53] 2023-03-14 17:19:28,628 | INFO | executing command: ps aux -q 306808
[2023-03-14 18:19:53] 2023-03-14 17:19:28,656 | INFO | summary of transferred files:
[2023-03-14 18:19:53] 2023-03-14 17:19:28,657 | INFO | -- lfn=log.32413688._000229-57598-1678802947.job.log.tgz.1, status_code=0, status=transferred
[2023-03-14 18:19:53] 2023-03-14 17:19:28,657 | INFO | stage-out finished correctly
[2023-03-14 18:19:53] 2023-03-14 17:19:28,704 | INFO | finished stage-out for finished payload, adding job to finished_jobs queue
[2023-03-14 18:19:53] 2023-03-14 17:19:29,298 | INFO | job 5753961892 has state=finished
[2023-03-14 18:19:53] 2023-03-14 17:19:29,299 | INFO | preparing for final server update for job 5753961892 in state='finished'
[2023-03-14 18:19:53] 2023-03-14 17:19:29,299 | INFO | pilot will not update the server (heartbeat message will be written to file)
[2023-03-14 18:19:53] 2023-03-14 17:19:29,300 | INFO | job 5753961892 has finished - writing final server update
[2023-03-14 18:19:53] 2023-03-14 17:19:29,300 | WARNING | format EVNTtoHITS has no such key: dbData
[2023-03-14 18:19:53] 2023-03-14 17:19:29,301 | WARNING | format EVNTtoHITS has no such key: dbTime
[2023-03-14 18:19:53] 2023-03-14 17:19:29,302 | WARNING | wrong length of table data, x=[1678813630.0, 1678813691.0, 1678813752.0, 1678813813.0, 1678813874.0], y=[1051252.0, 1090314.0, 1911720.0, 2161488.0, 2321179.0] (mu
[2023-03-14 18:19:53] 2023-03-14 17:19:29,303 | INFO | total number of processed events: 5 (read)
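
For readers unfamiliar with the "[attempt=2/3] ... will try again after 18s" lines, the pattern behind them is a bounded retry loop. The sketch below is not the pilot's code: the function name, its parameters, and the fixed 18 s delay are assumptions taken only from what the log shows, and the fetch function is passed in so the loop can be tried without network access.

```python
import time


def load_with_retries(fetch, url, attempts=3, delay=18.0, sleep=time.sleep):
    """Call fetch(url) up to `attempts` times; return its result or None.

    Assumption: failures surface as OSError. urllib.error.URLError, which
    wraps certificate verification failures from urlopen, is a subclass.
    """
    for attempt in range(1, attempts + 1):
        print(f"[attempt={attempt}/{attempts}] loading data from url={url}")
        try:
            return fetch(url)
        except OSError as error:
            print(f"failed to load data from url={url}, error: {error}")
            if attempt < attempts:
                print(f"will try again after {delay:g}s..")
                sleep(delay)
    return None


# With the standard library, a real fetch function could look like:
#   import urllib.request
#   fetch = lambda u: urllib.request.urlopen(u, timeout=30).read()
```

Note that a retry only helps with transient errors. A certificate verification failure fails the same way on every attempt, which matches what the log shows.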

