Message boards : ATLAS Application : ATLAS long simulation 1.01

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Apr 16
Posts: 154
Credit: 1,352,539
RAC: 84
Message 7094 - Posted: 18 Mar 2021, 8:33:58 UTC

Hi all,

We are testing long tasks which process 1000 events instead of 200, so please select "ATLAS very long simulation" from the app list and give us feedback here. The app is currently native Linux only, and a minimum of 4 CPUs is required.
ID: 7094 · Rating: 0
maeax
Joined: 22 Apr 16
Posts: 601
Credit: 1,451,296
RAC: 1,367
Message 7095 - Posted: 18 Mar 2021, 9:21:33 UTC - in response to Message 7094.  

This task is using version 1.01, but it seems to have come too early: it processed only 2 collisions:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2957693
[2021-03-18 09:55:19] Singularity works
[2021-03-18 09:56:09] Set ATHENA_PROC_NUMBER=7
[2021-03-18 09:56:09] Starting ATLAS job with PandaID=5002850558
[2021-03-18 09:56:09] Running command: /usr/bin/singularity exec --pwd /var/lib/boinc/slots/4 -B /cvmfs,/var /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img sh start_atlas.sh
[2021-03-18 10:13:06] *** The last 200 lines of the pilot log: ***
[2021-03-18 10:13:06] "externalCpuTime": 5,
[2021-03-18 10:13:06] "processedEvents": 2,
[2021-03-18 10:13:06] "trfPredata": null,
[2021-03-18 10:13:06] "wallTime": 879
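
To tell quickly whether a finished task was one of the short 2-event tests or a real long task, the event count can be pulled out of the pilot log. A minimal sketch, assuming the log contains a "processedEvents" line in the format shown above; the function name and the file name in the usage comment are made up for illustration.

```sh
# Read a pilot log on stdin and print the last "processedEvents" value found.
# Prints nothing if the log has no such line.
processed_events() {
    sed -n 's/.*"processedEvents": *\([0-9][0-9]*\).*/\1/p' | tail -n 1
}

# Hypothetical usage, with the task's stderr output saved to a file:
#   processed_events < stderr.txt
```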
ID: 7095 · Rating: 0
David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Apr 16
Posts: 154
Credit: 1,352,539
RAC: 84
Message 7096 - Posted: 18 Mar 2021, 12:40:38 UTC - in response to Message 7095.  

There are still some short test tasks in the system which process 2 events. I've stopped submitting those short tasks now. The first batch of 10 real long tasks with 1000 events is now available. Once those are finished there will be more available.
ID: 7096 · Rating: 0
maeax
Joined: 22 Apr 16
Posts: 601
Credit: 1,451,296
RAC: 1,367
Message 7097 - Posted: 18 Mar 2021, 12:56:12 UTC - in response to Message 7096.  

Those 10 tasks were picked up by other volunteers faster than light.
The message now is: no tasks available.
ID: 7097 · Rating: 0
David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Apr 16
Posts: 154
Credit: 1,352,539
RAC: 84
Message 7098 - Posted: 19 Mar 2021, 9:42:05 UTC - in response to Message 7097.  

There are more long tasks in the queue now. I finished one task successfully on one of my hosts and it took 12 hours using 4 cores.
ID: 7098 · Rating: 0
maeax
Joined: 22 Apr 16
Posts: 601
Credit: 1,451,296
RAC: 1,367
Message 7099 - Posted: 19 Mar 2021, 10:00:55 UTC - in response to Message 7098.  
Last modified: 19 Mar 2021, 10:54:33 UTC

Thank you David, the first long-running task has been downloaded.
https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=2064041
Edit:
The progress percentage grows very fast towards 100 (in less than one hour).
I have 6 CPUs. Calculated run time: 8 hours for these 1000 collisions.
ID: 7099 · Rating: 0
computezrmle
Joined: 28 Jul 16
Posts: 404
Credit: 374,791
RAC: 0
Message 7100 - Posted: 19 Mar 2021, 12:33:14 UTC

Got 1 task but forgot to make the test account a member of the singularity group.
:-(
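
A quick way to catch this before fetching work is to look at the account's group list. A minimal sketch, assuming the BOINC client runs as a user called boinc (an assumption, adjust to your setup); the helper name is made up, while the group name singularity is the one mentioned above.

```sh
# Succeed if the space-separated group list in $1 contains the group $2.
group_list_has() {
    printf '%s\n' "$1" | tr ' ' '\n' | grep -Fxq -- "$2"
}

# Hypothetical usage ("boinc" is an assumed account name):
#   group_list_has "$(id -nG boinc)" singularity ||
#       echo "boinc is not a member of the singularity group"
```

If the check fails, running something like usermod -a -G singularity boinc as root and then restarting the client is the usual remedy.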

Now the queue is empty again.
:-(


<rsc_fpops_est> is currently set to 43200 GFLOPS.
I suggest increasing it by a factor of 5 before the next batch goes out.
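
To see why the estimate comes out so short, here is a back-of-the-envelope sketch. It assumes that, to first order, BOINC's initial runtime estimate is the task's <rsc_fpops_est> divided by the speed the client projects for the host; the real client applies further corrections. The value above is read as 43200 x 10^9 operations, and the host speed of 18 GFLOPS is a made-up figure for illustration.

```sh
# Print the estimated runtime in minutes.
#   $1 = estimated work in units of 10^9 floating-point operations
#   $2 = projected host speed in GFLOPS (hypothetical value in the examples)
estimate_minutes() {
    awk -v work="$1" -v speed="$2" 'BEGIN {
        if (speed <= 0) exit 1
        printf "%.0f\n", work / speed / 60
    }'
}

# Current setting on the hypothetical 18 GFLOPS host:
#   estimate_minutes 43200 18      # prints 40
# After increasing the setting by a factor of 5:
#   estimate_minutes 216000 18     # prints 200
```

In this simplified model, multiplying <rsc_fpops_est> by 5 scales the estimate by the same factor. For the hypothetical host that is still well short of the 12 hours David reported for a 4-core run.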
ID: 7100 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1147
Credit: 754,546
RAC: 10
Message 7101 - Posted: 19 Mar 2021, 13:05:39 UTC

I got 1 task for my 4-core Ubuntu VM with 6144MB RAM -> https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2958027
The originally estimated elapsed time to finish was 40 minutes.
It took 22 minutes before the 4 athena.py processes were running.
The task has now been running for 1 hour (incl. initializing) and the 4 workers have together completed 5 events (338 seconds up to 1715 seconds per event)
Deadline: 26 Mar 2021, 11:58:53 UTC
ID: 7101 · Rating: 0
computezrmle
Joined: 28 Jul 16
Posts: 404
Credit: 374,791
RAC: 0
Message 7102 - Posted: 19 Mar 2021, 13:13:28 UTC

Got a task that started fine.
BOINC estimates a runtime of 4 min.
;-D

This miscalculation could become a problem when the app moves over to prod since lots of computers will then pull lots of tasks.
The more cores a computer reports the greater the miscalculation.


The time to set up the task (~20 min until the worker threads start) is comparable to "ATLAS native short".
The logfiles look fine, especially log.EVNTtoHITS.
ID: 7102 · Rating: 0
zombie67 [MM]
Joined: 26 Feb 15
Posts: 26
Credit: 3,149,971
RAC: 0
Message 7103 - Posted: 19 Mar 2021, 14:11:02 UTC
Last modified: 19 Mar 2021, 14:11:44 UTC

I am getting nothing but errors:

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2958101

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
05:25:13 (131525): wrapper (7.7.26015): starting
05:25:13 (131525): wrapper: running run_atlas (--nthreads 32)
[2021-03-19 05:25:13] Arguments: --nthreads 32
[2021-03-19 05:25:13] Threads: 32
[2021-03-19 05:25:13] Checking for CVMFS
[2021-03-19 05:25:13] No cvmfs_config command found, will try listing directly
[2021-03-19 05:25:13] ls: cannot access '/cvmfs/atlas.cern.ch/repo/sw': No such file or directory
[2021-03-19 05:25:13] Failed to list /cvmfs/atlas.cern.ch/repo/sw
05:35:14 (131525): run_atlas exited; CPU time 0.005612
05:35:14 (131525): app exit status: 0x1
05:35:14 (131525): called boinc_finish(195)

</stderr_txt>
]]>
ID: 7103 · Rating: 0
computezrmle
Joined: 28 Jul 16
Posts: 404
Credit: 374,791
RAC: 0
Message 7104 - Posted: 19 Mar 2021, 14:27:46 UTC - in response to Message 7103.  

...No cvmfs_config command found...

This is an ATLAS native test.
ATLAS native requires a local CVMFS client to be installed and correctly configured.

See the message board threads over at the -prod pages.
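
For anyone setting up a host for native tasks, a minimal pre-flight sketch along these lines may help. It mirrors the checks visible in the failed task's log earlier in the thread (look for cvmfs_config, otherwise try a plain directory listing); the function names are made up, and the checks that run_atlas actually performs may differ.

```sh
# Succeed if directory $1 can be listed.
can_list() {
    ls "$1" >/dev/null 2>&1
}

# Check CVMFS and look for a singularity binary.
#   $1 = directory to list (defaults to the path from the task log)
# Note: plain POSIX sh, so the variables below are global.
preflight() {
    repo_dir=${1:-/cvmfs/atlas.cern.ch/repo/sw}
    status=0
    if command -v cvmfs_config >/dev/null 2>&1; then
        cvmfs_config probe atlas.cern.ch || status=1
    else
        echo "No cvmfs_config command found, will try listing directly"
        if ! can_list "$repo_dir"; then
            echo "Failed to list $repo_dir"
            status=1
        fi
    fi
    command -v singularity >/dev/null 2>&1 ||
        echo "Note: no singularity binary found in PATH"
    return "$status"
}

# Hypothetical usage:
#   preflight && echo "CVMFS looks usable"
```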
ID: 7104 · Rating: 0
David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Apr 16
Posts: 154
Credit: 1,352,539
RAC: 84
Message 7105 - Posted: 19 Mar 2021, 14:34:14 UTC

I've increased the flops estimate by a factor of 5 for new tasks. I think BOINC's estimation is also based on previous tasks, and since those have been very short, the estimate is very small.

Note that for this app the max CPUs is 48, so if you have a big machine please give 48-core tasks a try!
ID: 7105 · Rating: 0
zombie67 [MM]
Joined: 26 Feb 15
Posts: 26
Credit: 3,149,971
RAC: 0
Message 7106 - Posted: 19 Mar 2021, 14:44:11 UTC - in response to Message 7104.  

...No cvmfs_config command found...

This is an ATLAS native test.
ATLAS native requires a local CVMFS client to be installed and correctly configured.

See the message board threads over at the -prod pages.


I assumed "native" just meant that a VM was no longer needed (no vbox required). It's been so long since I ran the native thing that I forgot all about the additional manual steps required.
Reno, NV
Team: SETI.USA
ID: 7106 · Rating: 0
computezrmle
Joined: 28 Jul 16
Posts: 404
Credit: 374,791
RAC: 0
Message 7107 - Posted: 19 Mar 2021, 14:51:37 UTC - in response to Message 7106.  

I assumed "native" just ... meant that a VM was no longer needed (no vbox required)

That's correct.
But it also means that you are responsible for providing the environment that would otherwise be provided by the VM.

You are welcome to test, but you should first familiarize yourself with the required environment.
ID: 7107 · Rating: 0
zombie67 [MM]
Joined: 26 Feb 15
Posts: 26
Credit: 3,149,971
RAC: 0
Message 7108 - Posted: 19 Mar 2021, 14:57:46 UTC

Yes, I ran the native stuff years ago. It's just been so long that I forgot all about it.
Reno, NV
Team: SETI.USA
ID: 7108 · Rating: 0
computezrmle
Joined: 28 Jul 16
Posts: 404
Credit: 374,791
RAC: 0
Message 7109 - Posted: 19 Mar 2021, 15:07:43 UTC

Just a hint for a monitoring one-liner.
Open a console and cd to your BOINC working directory.
watch -n 50 -d -x sh -c "find ./slots -name \"AthenaMP.log\" |sort |xargs -n1 -I {} sh -c \"grep 'New average' {} |tail -n1\""


May still be a long list on a 48-core machine.
:-)
ID: 7109 · Rating: 0
computezrmle
Joined: 28 Jul 16
Posts: 404
Credit: 374,791
RAC: 0
Message 7110 - Posted: 19 Mar 2021, 18:33:40 UTC

40% done after 5 h 48 min (4 worker threads)
Estimated time left: 8 h 15 min + stage-out

2021-03-19 19:26:52,535 ISFG4SimSvc          INFO        Event nr. 104 took 257.2 s. New average 186.6 +- 7.604
2021-03-19 19:25:21,541 ISFG4SimSvc          INFO        Event nr. 96 took 161.5 s. New average 200.6 +- 8.843
2021-03-19 19:25:38,851 ISFG4SimSvc          INFO        Event nr. 96 took 201 s. New average 201.4 +- 7.967
2021-03-19 19:27:27,660 ISFG4SimSvc          INFO        Event nr. 105 took 302.8 s. New average 183.6 +- 6.909
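
The "New average" lines can also be turned into a rough projection of the event loop's wall time. A minimal sketch, assuming one such line per worker on stdin, events spread evenly over the workers, and setup and stage-out ignored; the function name is made up, and the 1000 events and 4 workers are the figures from this thread.

```sh
# Read "New average" log lines on stdin (one per worker) and print the
# projected wall time of the event loop in hours.
#   $1 = total number of events, $2 = number of worker processes
estimate_hours() {
    awk -v events="$1" -v workers="$2" '
        /New average/ {
            # The value follows the word "average" on the line.
            for (i = 1; i < NF; i++)
                if ($i == "average") { sum += $(i + 1); n++ }
        }
        END {
            if (n == 0 || workers <= 0) exit 1
            printf "%.1f\n", events * (sum / n) / workers / 3600
        }'
}

# Hypothetical usage from the BOINC working directory:
#   find ./slots -name AthenaMP.log | sort |
#       while read -r f; do grep 'New average' "$f" | tail -n 1; done |
#       estimate_hours 1000 4
```

Fed with the four lines above, this gives about 13.4 hours, which is in the same range as the 5 h 48 min elapsed plus 8 h 15 min left.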
ID: 7110 · Rating: 0
maeax
Joined: 22 Apr 16
Posts: 601
Credit: 1,451,296
RAC: 1,367
Message 7111 - Posted: 19 Mar 2021, 21:36:44 UTC - in response to Message 7099.  

https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=2064041
CentOS 8 VM on a Ryzen 3950X with 6 cores: 11.5 hours runtime, 650 MByte upload.
Tomorrow I will run the same test on two Ryzen 2700 machines with 6 cores each in a CentOS 8 VM.
A task is also running on an AMD FX-8370E with 6 cores; it has been running for 10 hours so far.
ID: 7111 · Rating: 0
mmonnin
Joined: 20 Jun 17
Posts: 22
Credit: 1,073,294
RAC: 0
Message 7113 - Posted: 20 Mar 2021, 4:18:47 UTC
Last modified: 20 Mar 2021, 4:19:06 UTC

<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
23:47:51 (28346): wrapper (7.7.26015): starting
23:47:51 (28346): wrapper: running run_atlas (--nthreads 4)
[2021-03-19 23:47:51] Arguments: --nthreads 4
[2021-03-19 23:47:51] Threads: 4
[2021-03-19 23:47:51] Checking for CVMFS
[2021-03-19 23:47:52] Probing /cvmfs/atlas.cern.ch... OK
[2021-03-19 23:47:53] Probing /cvmfs/atlas-condb.cern.ch... OK
[2021-03-19 23:47:54] Probing /cvmfs/grid.cern.ch... OK
[2021-03-19 23:47:56] VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
[2021-03-19 23:47:56] 2.5.2.0 28470 0 23296 81113 3 1 2621483 4194304 0 65024 0 0 n/a 0 0 http://cvmfs-s1bnl.opensciencegrid.org/cvmfs/atlas.cern.ch DIRECT 1
[2021-03-19 23:47:56] CVMFS is ok
[2021-03-19 23:47:56] Using singularity image /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img
[2021-03-19 23:47:56] Checking for singularity binary...
[2021-03-19 23:47:56] Using singularity found in PATH at /usr/bin/singularity
[2021-03-19 23:47:56] Running /usr/bin/singularity --version
[2021-03-19 23:47:56] 2.4.2-dist
[2021-03-19 23:47:56] Checking singularity works with /usr/bin/singularity exec -B /cvmfs /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img hostname
[2021-03-19 23:47:56] Singularity isnt working: ERROR : Unknown image format/type: /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img
[2021-03-19 23:47:56] ABORT : Retval = 255
[2021-03-19 23:47:56] 
23:57:56 (28346): run_atlas exited; CPU time 0.301484
23:57:56 (28346): app exit status: 0x1
23:57:56 (28346): called boinc_finish(195)

</stderr_txt>
]]>

singularity --version
This returns a version, as in the check listed on the main LHC forums. I've run the native app before, but it's been a while.
ID: 7113 · Rating: 0
Sergey Kovalchuk
Joined: 11 Mar 16
Posts: 23
Credit: 68,680
RAC: 0
Message 7116 - Posted: 20 Mar 2021, 4:41:52 UTC - in response to Message 7113.  
Last modified: 20 Mar 2021, 4:50:54 UTC

Unknown image format/type: /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img

Try a version of singularity from the server:

/cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec -B /cvmfs /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img hostname


Maybe a hint will appear, e.g. "unsquashfs not found" or "mkdir /home/boinc: permission denied".

PS: If it works, just delete the locally installed singularity.
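
The same idea can be wrapped in a small helper that picks the first usable binary from a list of candidates. A minimal sketch: the two paths are the ones that appear in this thread, the function name is made up, and preferring the CVMFS copy over the local one is only an assumption for illustration.

```sh
# Print the first argument that is an executable file; fail if none is.
first_executable() {
    for candidate in "$@"; do
        if [ -x "$candidate" ] && [ ! -d "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Hypothetical usage:
#   cvmfs_sing=/cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity
#   sing=$(first_executable "$cvmfs_sing" /usr/bin/singularity) && "$sing" --version
```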
ID: 7116 · Rating: 0


©2023 CERN