Message boards : ATLAS Application : ATLAS long simulation 1.01

maeax
Joined: 22 Apr 16
Posts: 664
Credit: 1,807,614
RAC: 2,394
Message 7117 - Posted: 20 Mar 2021, 6:08:46 UTC - in response to Message 7111.  
Last modified: 20 Mar 2021, 6:10:21 UTC

On an AMD FX-8370E a task is also running with 6 cores. It has been running for 10 hours so far.

Finished now: 18 hours of runtime and 4 days of duration. This is not a computer for these 1k-collision tasks. Also a CentOS8 VM.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2958037

tazzduke
Joined: 19 Apr 15
Posts: 4
Credit: 71,032
RAC: 35
Message 7118 - Posted: 20 Mar 2021, 6:22:44 UTC - in response to Message 7109.  

Just a hint for a monitoring one-liner.
Open a console and cd to your BOINC working directory.
watch -n 50 -d -x sh -c "find ./slots -name \"AthenaMP.log\" |sort |xargs -n1 -I {} sh -c \"grep 'New average' {} |tail -n1\""


May still be a long list on a 48-core machine.
:-)


Greetings All,

Running the following workunit, on a XEON E5-2620 V2 using 6 cores, OS is Linux Mint 20.01

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2958159

Using the information above that was posted by computezrmle, this was the output so far.

2021-03-20 14:09:34,009 ISFG4SimSvc INFO Event nr. 156 took 286.8 s. New average 292 +- 9.647
2021-03-20 14:14:26,078 ISFG4SimSvc INFO Event nr. 171 took 252.7 s. New average 269.4 +- 8.411
2021-03-20 14:14:15,815 ISFG4SimSvc INFO Event nr. 164 took 317.9 s. New average 280.9 +- 8.913
2021-03-20 14:13:36,407 ISFG4SimSvc INFO Event nr. 157 took 398.3 s. New average 292.1 +- 9.11
2021-03-20 14:11:33,962 ISFG4SimSvc INFO Event nr. 156 took 196.8 s. New average 293.1 +- 9.583
2021-03-20 14:13:03,249 ISFG4SimSvc INFO Event nr. 166 took 403.2 s. New average 276.1 +- 8.551
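The "New average" lines above can also be condensed into one number per task. The following is only a sketch in Python: it assumes the log format shown in this thread, and the names `LINE_RE`, `parse_event_line` and `mean_of_averages` are made up for illustration and are not part of ATLAS or BOINC.

```python
import re

# Matches lines such as:
# "... Event nr. 156 took 286.8 s. New average 292 +- 9.647"
LINE_RE = re.compile(
    r"Event nr\.\s+(\d+)\s+took\s+([\d.]+)\s+s\.\s+"
    r"New average\s+([\d.]+)\s+\+-\s+([\d.]+)"
)


def parse_event_line(line):
    """Return event number, event time, running average and its error, or None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return {
        "event": int(match.group(1)),
        "took_s": float(match.group(2)),
        "average_s": float(match.group(3)),
        "error_s": float(match.group(4)),
    }


def mean_of_averages(lines):
    """Average the 'New average' values, assuming one line per worker."""
    parsed = [p for p in (parse_event_line(line) for line in lines) if p is not None]
    if not parsed:
        raise ValueError("no 'New average' lines found")
    return sum(p["average_s"] for p in parsed) / len(parsed)


# The six worker lines quoted in this post.
SAMPLE = """\
2021-03-20 14:09:34,009 ISFG4SimSvc INFO Event nr. 156 took 286.8 s. New average 292 +- 9.647
2021-03-20 14:14:26,078 ISFG4SimSvc INFO Event nr. 171 took 252.7 s. New average 269.4 +- 8.411
2021-03-20 14:14:15,815 ISFG4SimSvc INFO Event nr. 164 took 317.9 s. New average 280.9 +- 8.913
2021-03-20 14:13:36,407 ISFG4SimSvc INFO Event nr. 157 took 398.3 s. New average 292.1 +- 9.11
2021-03-20 14:11:33,962 ISFG4SimSvc INFO Event nr. 156 took 196.8 s. New average 293.1 +- 9.583
2021-03-20 14:13:03,249 ISFG4SimSvc INFO Event nr. 166 took 403.2 s. New average 276.1 +- 8.551
""".splitlines()

if __name__ == "__main__":
    print(f"{mean_of_averages(SAMPLE):.1f} s per event across {len(SAMPLE)} workers")
```

The same functions could be fed with the output of the `grep 'New average'` one-liner quoted at the top of this post.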

Workunit is still running at the moment, at 99.782% (running so far for 14hrs and 37 mins)

To check the integrity of the system, I also ran an ATLAS native task from the main project, and it completed successfully.

Now waiting for the workunit to complete before taking any further action.

Cheers

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 473
Credit: 389,411
RAC: 34
Message 7119 - Posted: 20 Mar 2021, 8:27:30 UTC

Logfile entries written to stderr.txt at the beginning of the task are missing in the final report.
They are required to identify a CVMFS or Singularity misconfiguration.

Sergey Kovalchuk
Joined: 11 Mar 16
Posts: 23
Credit: 68,680
RAC: 0
Message 7120 - Posted: 20 Mar 2021, 8:56:59 UTC - in response to Message 7119.  

Logfile entries written to stderr.txt at the beginning of the task are missing in the final report.
They are required to identify a CVMFS or Singularity misconfiguration.

The beginning of the file is lost once the payload is started.

It might make sense to have a separate test application without a payload, with only diagnostic log entries, to check the suitability of the host and project settings (memory, threads, etc.).

There used to be a similar Benchmark Application.

tazzduke
Joined: 19 Apr 15
Posts: 4
Credit: 71,032
RAC: 35
Message 7122 - Posted: 20 Mar 2021, 11:04:51 UTC - in response to Message 7118.  

Greetings

The following workunit returned valid with a hits file.

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2958159

Regards

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 238
Message 7123 - Posted: 20 Mar 2021, 11:10:07 UTC

My single 4-core task is crawling slowly. After 22 hours run time the 4 workers show:
2021-03-20 11:54:30,834 ISFG4SimSvc INFO Event nr. 66 took 1081 s. New average 1170 +- 64.03
2021-03-20 11:45:58,903 ISFG4SimSvc INFO Event nr. 66 took 968.2 s. New average 1166 +- 64.4
2021-03-20 11:52:57,758 ISFG4SimSvc INFO Event nr. 69 took 824.6 s. New average 1134 +- 61.04
2021-03-20 11:44:35,597 ISFG4SimSvc INFO Event nr. 64 took 1503 s. New average 1188 +- 56.21

I will definitely not run these long jobs on this machine when they go into production.

mmonnin
Joined: 20 Jun 17
Posts: 25
Credit: 3,961,376
RAC: 33,390
Message 7124 - Posted: 20 Mar 2021, 12:20:28 UTC - in response to Message 7116.  
Last modified: 20 Mar 2021, 12:29:40 UTC

Unknown image format/type: /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img

try a version of singularity from the server

/cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec -B /cvmfs /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img hostname


maybe a hint will appear, e.g. "unsquashfs not found" or "mkdir /home/boinc: permission denied"

PS if it works - just delete the installed singularity


That command returns my host name. No errors.

Delete it? Like
sudo apt-get remove --auto-remove singularity 


Edit: 2 PCs are getting this. A 3rd, where I just went through the setup thread on LHC, now works after installing gawk; at least it's starting to use some memory/CPU. The two PCs may have had singularity installed via a repository some time ago instead of via cmake etc.

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 473
Credit: 389,411
RAC: 34
Message 7125 - Posted: 20 Mar 2021, 12:59:50 UTC - in response to Message 7123.  

Is it this computer?
https://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=3717

I would expect it to require an average of around 290-300 s per event.
1170 s would be very close to a factor of 4.0.

I guess it's a Linux guest VM running on a Windows host.
Could you check if the guest really runs on 4 cores?
The numbers make me suspect the VM is allowed to use only 1 core.
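The arithmetic behind that suspicion, as a rough Python sketch. The 295 s value is simply the midpoint of the 290-300 s estimate above, and `slowdown_factor` and `visible_cpus` are hypothetical helper names. `os.sched_getaffinity` only exists on Linux, so the sketch falls back to `os.cpu_count()` elsewhere.

```python
import os


def slowdown_factor(observed_s, expected_s):
    """How many times slower an event is than expected."""
    return observed_s / expected_s


def visible_cpus():
    """Number of CPUs this process may actually run on (run it inside the guest VM)."""
    if hasattr(os, "sched_getaffinity"):
        # Linux only: respects CPU affinity masks and cgroup cpusets.
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


if __name__ == "__main__":
    factor = slowdown_factor(1170, 295)
    print(f"observed/expected event time: {factor:.2f}")
    print(f"CPUs visible to this process: {visible_cpus()}")
```

A factor close to the number of configured cores would fit a VM that effectively gets one core, but it cannot distinguish that from four heavily shared cores.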

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 238
Message 7126 - Posted: 20 Mar 2021, 15:01:47 UTC - in response to Message 7125.  

Is it this computer?
https://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=3717
Yes it is.
The Linux VM uses 4 threads of that i7 2600 (4 cores, 8 threads).
The other 4 threads on the host are for 1 long-running multi-core PrimeGrid job and personal PC usage.
I expect that PG job to finish sometime Sunday evening or Monday morning. That will speed up the ATLAS task a bit.
The Linux VM gets 50% CPU usage all the time from the Win10 host.

maeax
Joined: 22 Apr 16
Posts: 664
Credit: 1,807,614
RAC: 2,394
Message 7127 - Posted: 21 Mar 2021, 4:24:00 UTC - in response to Message 7111.  

Tomorrow the same test on one Ryzen 2700 with 6 cores in a CentOS8 VM.

https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=2064234

maeax
Joined: 22 Apr 16
Posts: 664
Credit: 1,807,614
RAC: 2,394
Message 7128 - Posted: 21 Mar 2021, 6:57:01 UTC
Last modified: 21 Mar 2021, 7:00:59 UTC

cpuconsumptiontime: 190923 s - AMD Ryzen 9 3950X
cpuconsumptiontime: 360969 s - AMD FX-8370E
cpuconsumptiontime: 228732 s - AMD Ryzen 7 2700
All in CentOS8 VMs with 6 CPUs.
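To put the three numbers side by side, here is a small Python sketch that converts them to hours and compares them with the fastest host. The CPU times and the 6 cores are taken from this post; `compare` is a hypothetical helper, and dividing CPU time by the core count only gives a lower bound on the run time, since it assumes all cores are busy the whole time.

```python
# cpuconsumptiontime values from this post, in seconds.
CPU_SECONDS = {
    "AMD Ryzen 9 3950X": 190923,
    "AMD FX-8370E": 360969,
    "AMD Ryzen 7 2700": 228732,
}
CORES = 6


def compare(cpu_seconds, cores):
    """Return {host: (cpu_hours, min_wall_hours, ratio_to_fastest)}."""
    fastest = min(cpu_seconds.values())
    result = {}
    for host, seconds in cpu_seconds.items():
        cpu_hours = seconds / 3600
        min_wall_hours = cpu_hours / cores  # lower bound, ignores single-core phases
        result[host] = (cpu_hours, min_wall_hours, seconds / fastest)
    return result


if __name__ == "__main__":
    for host, (cpu_h, wall_h, ratio) in compare(CPU_SECONDS, CORES).items():
        print(f"{host:20s} {cpu_h:7.1f} CPU h  >= {wall_h:5.1f} h wall  x{ratio:.2f}")
```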

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 238
Message 7129 - Posted: 21 Mar 2021, 8:54:03 UTC

Halfway through my first and last (on this host) ATLAS long simulation:
2021-03-21 09:37:09,229 ISFG4SimSvc          INFO        Event nr. 124 took 2482 s. New average 1230 +- 47.41
2021-03-21 09:25:02,948 ISFG4SimSvc          INFO        Event nr. 122 took 595.1 s. New average 1246 +- 49.64
2021-03-21 09:39:32,900 ISFG4SimSvc          INFO        Event nr. 137 took 724.8 s. New average 1122 +- 43.15
2021-03-21 09:29:00,688 ISFG4SimSvc          INFO        Event nr. 127 took 1577 s. New average 1191 +- 41.43

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 473
Credit: 389,411
RAC: 34
Message 7130 - Posted: 21 Mar 2021, 9:23:38 UTC - in response to Message 7129.  

Still wondering.
Can you post a "top" output from that VM?

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 238
Message 7131 - Posted: 21 Mar 2021, 10:21:40 UTC - in response to Message 7130.  
Last modified: 21 Mar 2021, 10:22:22 UTC

Tasks: 234 total,   6 running, 228 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1,9 us,  3,1 sy, 95,0 ni,  0,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
MiB Mem :   5960,4 total,    425,5 free,   3558,1 used,   1976,8 buff/cache
MiB Swap:   1186,4 total,   1163,1 free,     23,3 used.   1982,9 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                          
16248 boinc     39  19 2699732   1,9g 114680 R  98,0  32,6   2650:52 athena.py                                                                                                        
16247 boinc     39  19 2693440   1,9g 114536 R  93,5  32,5   2650:43 athena.py                                                                                                        
16251 boinc     39  19 2696040   1,9g 114388 R  93,5  32,6   2651:36 athena.py                                                                                                        
16254 boinc     39  19 2696516   1,9g 116260 R  91,5  32,5   2650:58 athena.py                                                                                                        
 7961 boinc     39  19 1339856  77012   8168 S   2,9   1,3  91:32.37 python2                                                                                                          
18231 boinc     39  19    1324     52      0 R   1,6   0,0   0:00.05 sh                                                                                                               
15464 boinc     39  19 2665428   1,8g 136004 S   0,3  31,1  12:16.86 athena.py                                                                                                        
 1565 boinc     30  10  238772  19992  12284 S   0,0   0,3  12:42.12 boinc                                                                                                            
 3400 boinc     30  10    6180   3276   2768 S   0,0   0,1  10:31.57 wrapper_26015_x                                                                                                  
 3402 boinc     39  19   20124   3304   3004 S   0,0   0,1   0:00.01 run_atlas                                                                                                        
 3403 boinc     39  19   20124    268      0 S   0,0   0,0   0:00.00 run_atlas                                                                                                        
 3405 boinc     39  19   30700   1952   1724 S   0,0   0,0   0:00.00 awk                                                                                                              
 4425 boinc     39  19  634744  13568   9492 S   0,0   0,2   0:02.93 starter                                                                                                          
 4479 boinc     39  19   15148   3296   3024 S   0,0   0,1   0:00.10 sh                                                                                                               
 4527 boinc     39  19    4364    708    632 S   0,0   0,0   0:00.00 time                                                                                                             
 4528 boinc     39  19   15672   3608   2848 S   0,0   0,1   0:00.84 runpilot2-wrapp                                                                                                  
10893 boinc     39  19   16076   4148   2864 S   0,0   0,1   0:13.02 bash
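One way to read the listing above is to add up the %CPU column of the athena.py processes. The Python sketch below assumes the default top column order (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND) and the decimal commas of this locale; `athena_cpu_percent` is a made-up helper name.

```python
def athena_cpu_percent(top_lines, command="athena.py"):
    """Sum the %CPU column for all rows whose COMMAND matches."""
    total = 0.0
    for line in top_lines:
        fields = line.split()
        # Skip summary lines, the header and other processes.
        if len(fields) < 12 or fields[11] != command:
            continue
        # This locale prints "98,0" instead of "98.0".
        total += float(fields[8].replace(",", "."))
    return total


# A few rows copied from the listing above.
SAMPLE = """\
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
16248 boinc     39  19 2699732   1,9g 114680 R  98,0  32,6   2650:52 athena.py
16247 boinc     39  19 2693440   1,9g 114536 R  93,5  32,5   2650:43 athena.py
16251 boinc     39  19 2696040   1,9g 114388 R  93,5  32,6   2651:36 athena.py
16254 boinc     39  19 2696516   1,9g 116260 R  91,5  32,5   2650:58 athena.py
 7961 boinc     39  19 1339856  77012   8168 S   2,9   1,3  91:32.37 python2
15464 boinc     39  19 2665428   1,8g 136004 S   0,3  31,1  12:16.86 athena.py
""".splitlines()

if __name__ == "__main__":
    total = athena_cpu_percent(SAMPLE)
    print(f"athena.py uses {total:.1f}% CPU, about {total / 100:.2f} virtual cores")
```

Note that this only shows how busy the virtual CPUs look from inside the guest. It says nothing about how much real CPU time the host grants to the VM.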

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 473
Credit: 389,411
RAC: 34
Message 7132 - Posted: 21 Mar 2021, 16:01:14 UTC - in response to Message 7131.  

Nothing in there that explains the long event runtimes.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 238
Message 7141 - Posted: 22 Mar 2021, 14:53:11 UTC

It showed up on the server status page as the maximum runtime (in hours) of the last 100 tasks:

ATLAS very long simulation 428 28 11.15 (3.59 - 74.28)

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2958027

maeax
Joined: 22 Apr 16
Posts: 664
Credit: 1,807,614
RAC: 2,394
Message 7142 - Posted: 22 Mar 2021, 15:57:19 UTC

Crystal,
your task is the Winner in runtime ;-)

These are the last lines from your task:
[2021-03-22 15:17:28] -rw------- 1 boinc boinc 17756160 mrt 22 15:17 result.tar.gz
[2021-03-22 15:17:28] -rw------- 1 boinc boinc 599 mrt 22 15:17 asKMDmxaQhynfZGDcpSWOuwoABFKDmABFKDm2IFNDmGDFKDm36iUdn.diag
[2021-03-22 15:17:28] -rw-r--r-- 1 boinc boinc 8674 mrt 22 15:17 runtime_log.err
[2021-03-22 15:17:28] -rw-r--r-- 1 boinc boinc 397274 mrt 22 15:17 stderr.txt
[2021-03-22 15:17:28] HITS file was successfully produced:
[2021-03-22 15:17:28] -rw------- 1 boinc boinc 667310809 mrt 22 15:14 shared/HITS.pool.root.1
[2021-03-22 15:17:28] *** Contents of shared directory: ***
[2021-03-22 15:17:28] total 1017256

In my tasks the line with the HITS file does not show up:
[2021-03-20 07:00:19] -rw-------. 1 boinc boinc 7219200 20. Mär 07:00 result.tar.gz
[2021-03-20 07:00:19] -rw-------. 1 boinc boinc 580 20. Mär 07:00 L1xKDm3aQhynfZGDcpSWOuwoABFKDmABFKDm2IFNDmQDFKDmwxvb5m.diag
[2021-03-20 07:00:19] -rw-r--r--. 1 boinc boinc 389757 20. M&#19

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 238
Message 7143 - Posted: 22 Mar 2021, 16:07:32 UTC - in response to Message 7142.  

Crystal,
your task is the Winner in runtime ;-)
.
.
From my tasks is the line with the HITS file not showing:
Yeah, I saw the upload of my 636MB HITS file.

zombie67 [MM]
Joined: 26 Feb 15
Posts: 26
Credit: 4,101,356
RAC: 0
Message 7144 - Posted: 22 Mar 2021, 20:34:28 UTC

Let's say you have a 32 core machine. Is it better to run a single task using 32 cores? Or 4 tasks using 8 cores each? Or 8 tasks using 4 cores each? I have tried all three configurations and the credits per hour work out to be roughly the same. As far as I can tell, there is no advantage to me regardless of configuration.

So my question is, does the project have a preference?
Reno, NV
Team: SETI.USA

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 473
Credit: 389,411
RAC: 34
Message 7145 - Posted: 22 Mar 2021, 21:48:06 UTC - in response to Message 7144.  

The tasks have 3 main phases.
1. setup
2. calculation
3. stage out


Phases 1 and 3 run on 1 core.
Phase 2 runs on n cores (in this case real POSIX threads) depending on your setup.

At the end of phase 2 not all threads finish their work at the same time, so some core time is wasted.
This waste depends on
1. how long the average runtime for a single event is (see CP's extremely long runtimes)
2. whether your computer leaves the freed cores idle or not.

The more events a task processes in total, the lower the influence of phases 1/3 and the waste.


Running many low-core tasks concurrently requires more RAM and more total runtime but a bit less waste.
Running a few high-core tasks requires less RAM and less total runtime but a bit more waste.


David Cameron wrote:
ATLAS systems cancel any tasks which have been queued for more than two days

I would suggest taking this as a hint and using a setup that allows a task to finish within 1-1.5 days.
Besides that, it's a matter of personal preference.
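The trade-off can be played through with a very rough model in Python. Everything in it is an assumption made for illustration: 1000 events per task (the "1k collisions" mentioned earlier in the thread), a fixed time per event, 600 s each for setup and stage out, events spread evenly over the cores, and freed cores left idle. The function names are hypothetical.

```python
import math


def wall_time_s(events, cores, event_s, setup_s, stage_out_s):
    """Phases 1 and 3 run on one core, phase 2 runs in rounds of `cores` events."""
    rounds = math.ceil(events / cores)
    return setup_s + rounds * event_s + stage_out_s


def idle_core_seconds(events, cores, event_s, setup_s, stage_out_s):
    """Core time lost to the single-core phases and to the uneven last round."""
    single_core_idle = (cores - 1) * (setup_s + stage_out_s)
    rounds = math.ceil(events / cores)
    tail_idle = (rounds * cores - events) * event_s
    return single_core_idle + tail_idle


if __name__ == "__main__":
    # (cores, seconds per event): the event times are the ones reported in this thread.
    for cores, event_s in [(6, 290), (4, 1170)]:
        wall_h = wall_time_s(1000, cores, event_s, 600, 600) / 3600
        idle_h = idle_core_seconds(1000, cores, event_s, 600, 600) / 3600
        verdict = "within" if wall_h <= 36 else "beyond"
        print(f"{cores} cores, {event_s} s/event: {wall_h:.1f} h wall, "
              f"{idle_h:.1f} idle core-hours, {verdict} 1.5 days")
```

Real events vary a lot in length, so the waste at the end of phase 2 will be larger than this fixed-time model suggests.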