Message boards : News : New native Linux ATLAS application
Message board moderation



David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 4747 - Posted: 28 Feb 2017, 8:32:33 UTC - in response to Message 4746.  

Thanks, I think the way the BOINC client works on Ubuntu is different, and that's why it's being killed. I will put in a patch for this straight away.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 4750 - Posted: 1 Mar 2017, 11:17:19 UTC

Today I tried another job:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=313224

The job finished and uploaded a result, but it again killed the running BOINC Manager.
The client itself kept running.
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 4752 - Posted: 1 Mar 2017, 20:00:25 UTC - in response to Message 4750.  

I found this in the log:

2017-03-01 10:55:30| 3314|processes.py| Ignoring BOINC client: pid=2366, ppid=1, args='/home/boinc/BOINC/boinc'
2017-03-01 10:55:30| 3314|processes.py| Ignoring BOINC client: pid=11305, ppid=11270, args='./boincmgr'
2017-03-01 10:55:30| 3314|processes.py| Found orphan process: pid=11385, ppid=1, args='dbus-launch'
2017-03-01 10:55:30| 3314|processes.py| Killed orphaned process 11385 (dbus-launch)
2017-03-01 10:55:30| 3314|processes.py| Found orphan process: pid=11386, ppid=1, args='/bin/dbus-daemon'
2017-03-01 10:55:30| 3314|processes.py| Killed orphaned process 11386 (/bin/dbus-daemon)


I think these processes are part of the BOINC Manager. I'll put in another fix for this.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 4753 - Posted: 2 Mar 2017, 9:18:14 UTC - in response to Message 4752.  

2017-03-01 10:55:30| 3314|processes.py| Ignoring BOINC client: pid=2366, ppid=1, args='/home/boinc/BOINC/boinc'

The BOINC client itself.

2017-03-01 10:55:30| 3314|processes.py| Ignoring BOINC client: pid=11305, ppid=11270, args='./boincmgr'

The BOINC GUI used for monitoring.

dbus*

System-specific. Do not kill it.

I think ppid=1 alone is not a reliable criterion when searching for kill candidates.
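A stricter policy than the parent PID alone could combine the process owner and a whitelist of known binaries. The sketch below is hypothetical code, not the pilot's actual processes.py; `Proc`, `is_kill_candidate` and the whitelist are invented for illustration:

```python
# Hypothetical sketch of a stricter orphan-killing policy than "ppid == 1".
# This is NOT the pilot's actual processes.py; all names here are invented.
from dataclasses import dataclass

# Binaries that must never be killed, even if re-parented to init (ppid == 1).
WHITELIST = ("boinc", "boincmgr", "dbus-launch", "dbus-daemon")

@dataclass
class Proc:
    pid: int
    ppid: int
    uid: int    # owner of the process
    args: str   # command line

def is_kill_candidate(p: Proc, job_uid: int) -> bool:
    """A process is a candidate only if it was re-parented to init,
    belongs to the job's own user, and is not a known system/BOINC binary."""
    if p.ppid != 1:
        return False
    if p.uid != job_uid:
        return False  # never touch other users' processes
    name = p.args.split("/")[-1].split()[0]
    return name not in WHITELIST

procs = [
    Proc(2366, 1, 1000, "/home/boinc/BOINC/boinc"),
    Proc(11385, 1, 1000, "dbus-launch"),
    Proc(4242, 1, 1000, "athena-worker"),  # invented leftover job process
]
print([p.pid for p in procs if is_kill_candidate(p, job_uid=1000)])  # → [4242]
```

Under such a rule the BOINC client, the manager and the dbus helpers from the log above would all be skipped, while a leftover job process owned by the same user would still be reaped.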



So far the native app works well.

CVMFS

Host 1:
Running /usr/bin/cvmfs_config stat atlas.cern.ch:
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.3.3.0 495 507 44956 21781 8 20 2014592 25600001 842 65024 0 50094 99.976 31833 3 http://cvmfs-stratum-one.cern.ch/cvmfs/atlas.cern.ch http://IP.of.my.localsquid:3128 1

Host 2:
Running /usr/bin/cvmfs_config stat atlas.cern.ch:
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.3.3.0 13808 506 35556 21781 9 18 1120900 25600001 998 65024 0 59890 95.0426 436465 3 http://cvmfs-stratum-one.cern.ch/cvmfs/atlas.cern.ch http://IP.of.my.localsquid:3128 1



ATLAS Simulation v0.30 (mt)

The jobs finish successfully and the logs show result files like HITS.10327233._012475-1983772-14445.pool.root.1.

The app works with both singlecore and dualcore settings.
It works on both of my hosts.


What should be checked?

- The result files vary in size from 20 kB to 50 MB.
- There is a huge difference between the actual runtime (singlecore: > 4 h) and the estimated runtime at app start (50 min).
- The credit reward is only about half of the VM's credit reward.
- What can be done to reduce the initial download to only a few MB?
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 875,205
RAC: 450
Message 4754 - Posted: 2 Mar 2017, 9:33:30 UTC
Last modified: 2 Mar 2017, 9:37:46 UTC

I picked up this task last night by accident. Not using the BOINC Manager. It only ran for a few seconds and, on ending, took down the default VNC server (Vino) which I was using to control the host. I had to restart everything to regain control, so it looks as though it's killing more than it should. I don't know why it failed to run - no HITS file. This is Ubuntu 12.04.
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 4755 - Posted: 2 Mar 2017, 10:46:55 UTC - in response to Message 4753.  

I think ppid=1 alone is not a reliable criterion when searching for kill candidates.


This is used on the Grid because each Grid task runs under a different user id, so at the end the ATLAS pilot can be sure that any of its own processes with PPID=1 are safe to kill.

What should be checked?

- The result files vary in size from 20 kB to 50 MB.


They should be 50 MB if the task ran properly.

- There is a huge difference between the actual runtime (singlecore: > 4 h) and the estimated runtime at app start (50 min).


I am running 4-core tasks, and since I have run most of the tasks so far, the estimated time is based on that. As far as I know BOINC doesn't adjust the estimated time according to the number of cores; this is why you can get huge credit scores for single-core tasks, because they go way over the estimate (this has been discussed recently on the ATLAS@Home message boards).

- The credit reward is only about half of the VM's credit reward.


Credit is always a mystery to me... It's not easy to compare credit across projects, even if they are running the same tasks, because it depends on who is running them, the history of results, and so on.

- What can be done to reduce the initial download to only a few MB?


Since for this test we are always running the same task with the same input file, I'll see if I can change some configuration to make it stay on the client.
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 4756 - Posted: 2 Mar 2017, 10:50:48 UTC - in response to Message 4754.  

I picked up this task last night by accident. Not using the BOINC Manager. It only ran for a few seconds and, on ending, took down the default VNC server (Vino) which I was using to control the host. I had to restart everything to regain control, so it looks as though it's killing more than it should. I don't know why it failed to run - no HITS file. This is Ubuntu 12.04.


The failure is probably due to one of our scripts, which had a part that didn't work on Ubuntu; I have since fixed it. I saw from the output that you run BOINC under user "m" - the code for killing orphans doesn't kill anything if the user is "boinc", so in your case it killed all the processes with user "m" and PPID=1. I will think of a better way to protect against this in case people run BOINC under different usernames.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 4758 - Posted: 2 Mar 2017, 13:22:12 UTC

They should be 50 MB if the task ran properly.

HITS.* files from singlecore tasks have a size of 20 kB (-> invalid?).
HITS.* files from dualcore tasks have a size of 50 MB (-> valid?).

With the same input for all test tasks, the result should always be the same.
Any idea (besides "different core numbers", of course :-D)?
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 4759 - Posted: 2 Mar 2017, 13:33:44 UTC

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 4760 - Posted: 2 Mar 2017, 14:33:37 UTC - in response to Message 4759.  

David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 4761 - Posted: 2 Mar 2017, 16:17:39 UTC - in response to Message 4760.  

I checked one log and it looks like another ubuntu compatibility issue (source command not found). I've submitted some more tasks where this is fixed.

I've also now ensured that no processes are killed for anyone; hopefully it works this time.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 4762 - Posted: 2 Mar 2017, 22:31:23 UTC - in response to Message 4760.  

Did you check your SSL libs?
If not, your CVMFS runs into a timeout and the job stops after a few minutes.

A workaround can be found below in this thread:
https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=348&postid=4739

I checked the SSL libs and defined the symbolic links.

All tasks stop after 3.5 minutes. For BOINC they are successful, but they are validated as errors.
Is there something I can do (I could try to fetch a file out of the slot before it is emptied), or can you find something in the returned files, David?
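As a side note on the SSL-lib check: the standard library can at least tell you whether a loadable libssl/libcrypto is visible to the dynamic linker. This is only a diagnostic sketch; the actual workaround is the symlinks described in the linked thread:

```python
# Quick check whether the dynamic linker can locate the SSL libraries
# that CVMFS (and the pilot's Python) need. Stdlib only; a diagnostic
# sketch, not the fix itself -- the fix is creating the compatibility
# symlinks described in the linked thread.
from ctypes.util import find_library

for name in ("ssl", "crypto"):
    path = find_library(name)  # e.g. 'libssl.so.1.0.0', or None if missing
    print(f"lib{name}: {'found ' + path if path else 'NOT FOUND'}")
```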
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 875,205
RAC: 450
Message 4764 - Posted: 3 Mar 2017, 1:24:08 UTC
Last modified: 3 Mar 2017, 1:34:46 UTC

Jobs fail after a few minutes as described below.
Practically no CPU use apart from short bursts by python at intervals.

At the start of the pilotlog.txt file is this:

2017-03-03 00:06:36|11945|SiteInformat| !!WARNING!!2999!! $X509_CERT_DIR is not set and default location /etc/grid-security/certificates does not exist

which looks suspicious. I also grabbed a copy of runtime_log_err.txt as the filename looked promising, but it appears to be the same as pilotlog.txt.
Chris Skull

Joined: 14 Sep 15
Posts: 2
Credit: 7,617
RAC: 0
Message 4765 - Posted: 3 Mar 2017, 9:13:49 UTC

Maybe you could provide a pre-configured VM, so that people who don't have the necessary Linux skills, or who don't have a Linux host at all, can just run the VM, open the pre-installed BOINC Manager and add this project with their own account...?
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 4766 - Posted: 3 Mar 2017, 10:22:48 UTC

I've made a new version of the app with much more debugging, so we should be able to see much more information in the stderr. I've also tried to remove the remaining "source" commands that the Ubuntu dash shell doesn't like. The "sticky" flag is now used for the input root file, so you shouldn't have to re-download it for every task.
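For reference, BOINC's per-file "sticky" behaviour is declared in the workunit input template. An illustrative fragment is below; this is not the project's actual template, and the `<number>` and surrounding elements depend on the real template:

```xml
<file_info>
    <number>0</number>
    <!-- keep this input file on the client after the task finishes,
         so later tasks referencing the same file need not re-download it -->
    <sticky/>
</file_info>
```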

Maybe you could provide a pre-configured VM, so that people who don't have the necessary Linux skills, or who don't have a Linux host at all, can just run the VM, open the pre-installed BOINC Manager and add this project with their own account...?


This is basically what the regular ATLAS vbox app does automatically, so I'm not sure I see any advantage in setting up a VM yourself.
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 4767 - Posted: 3 Mar 2017, 11:10:50 UTC - in response to Message 4725.  

David Cameron wrote:
Today we cannot run a VM inside another VM, and many LHC computing farms (including CERN itself) run a virtualised infrastructure. A native app would allow us to use the spare capacity of all those CPUs.

Sorry, but this is not true.

I'm running ATLAS@Home on a virtualized client and it is number 11 in the top hosts :-)

This is called "nested virtualisation" and is a little bit tricky to set up, but with VMware this works really well, as you can see with my Atlas1.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 4768 - Posted: 3 Mar 2017, 12:14:06 UTC - in response to Message 4766.  
Last modified: 3 Mar 2017, 12:37:16 UTC

I've made a new version of the app with much more debugging, so we should be able to see much more information in the stderr. I've also tried to remove the remaining "source" commands that the Ubuntu dash shell doesn't like. The "sticky" flag is now used for the input root file, so you shouldn't have to re-download it for every task.

I ran 2 tasks. The 2nd one still downloaded the 200 MB.
The second task, however, was requested after the first one had already been reported. Maybe the 200 MB sticks when you still have at least one task loaded when you get a second one.

More important: I have a stderr.txt for you. You're invited to have a look ;)

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=314881


The beginning is truncated (by BOINC, I believe), but in a previous task I saw somewhere: "No LSB modules are available."
Could that be a problem? I have the whole log of that task available.
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 4769 - Posted: 3 Mar 2017, 12:37:10 UTC - in response to Message 4768.  

Here is the problem:

2017-03-03 12:02:56| 10847|RunJobUtilit| /bin/sh: 1: Sim_tf.py: not found


It means one of the setup scripts didn't run properly, so the main command to run the simulation is not available.

Can you find a running task and execute this by hand:

. /var/lib/boinc-client/slots/14//APPS/HEP/ATLAS-19.2.4.9-X86_64-SLC6-GCC47-OPT 1


Then see if you can run Sim_tf.py? If you give it no arguments it will print a bunch of messages and then exit.
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 4770 - Posted: 3 Mar 2017, 12:39:05 UTC - in response to Message 4767.  

David Cameron wrote:
Today we cannot run a VM inside another VM, and many LHC computing farms (including CERN itself) run a virtualised infrastructure. A native app would allow us to use the spare capacity of all those CPUs.

Sorry, but this is not true.

I'm running ATLAS@Home on a virtualized client and it is number 11 in the top hosts :-)

This is called "nested virtualisation" and is a little bit tricky to set up, but with VMware this works really well, as you can see with my Atlas1.


Interesting! I knew that it wasn't possible to run VirtualBox inside another VirtualBox, but I didn't know it was possible with VMware. Do you have a recipe for making this work?
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 4771 - Posted: 3 Mar 2017, 12:48:19 UTC - in response to Message 4769.  

Can you find a running task and execute by hand

. /var/lib/boinc-client/slots/14//APPS/HEP/ATLAS-19.2.4.9-X86_64-SLC6-GCC47-OPT 1


The outcome of that command:
Using AtlasProduction/19.2.4.9 with platform x86_64-slc6-gcc47-opt
        at /cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4
manpath: warning: $MANPATH set, ignoring /etc/manpath.config
ERROR:root:code for hash md5 was not found.
Traceback (most recent call last):
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 139, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 91, in __get_builtin_constructor
    raise ValueError('unsupported hash type %s' % name)
ValueError: unsupported hash type md5
ERROR:root:code for hash sha1 was not found.
Traceback (most recent call last):
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 139, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 91, in __get_builtin_constructor
    raise ValueError('unsupported hash type %s' % name)
ValueError: unsupported hash type sha1
ERROR:root:code for hash sha224 was not found.
Traceback (most recent call last):
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 139, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 91, in __get_builtin_constructor
    raise ValueError('unsupported hash type %s' % name)
ValueError: unsupported hash type sha224
ERROR:root:code for hash sha256 was not found.
Traceback (most recent call last):
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 139, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 91, in __get_builtin_constructor
    raise ValueError('unsupported hash type %s' % name)
ValueError: unsupported hash type sha256
ERROR:root:code for hash sha384 was not found.
Traceback (most recent call last):
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 139, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 91, in __get_builtin_constructor
    raise ValueError('unsupported hash type %s' % name)
ValueError: unsupported hash type sha384
ERROR:root:code for hash sha512 was not found.
Traceback (most recent call last):
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 139, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 91, in __get_builtin_constructor
    raise ValueError('unsupported hash type %s' % name)
ValueError: unsupported hash type sha512


Then see if you can run Sim_tf.py? If you give no arguments it will print a bunch of messages then exit.

I don't have a Sim_tf.py file in that slot (11 in this case).
Only 4 Python files are available: LFCTools.py RunJob.py VmPeak.py outputList.py
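The "unsupported hash type" errors above typically mean that Python's _hashlib extension module (which wraps OpenSSL's digest implementations) could not be loaded, so hashlib has no md5/sha* constructors to fall back on. A generic diagnostic sketch, independent of the CVMFS paths, to test whether a given interpreter's OpenSSL-backed hashes work:

```python
# Diagnostic: does this Python interpreter have working OpenSSL-backed hashes?
# The "unsupported hash type" tracebacks appear when the _hashlib C module
# (linked against libssl/libcrypto) cannot be loaded.
import hashlib

try:
    import _hashlib  # wraps OpenSSL; a missing/broken libssl makes this fail
    print("_hashlib loaded OK")
except ImportError as e:
    print("_hashlib failed to load:", e)

# If _hashlib is broken, constructing md5 raises ValueError just like the log.
try:
    print("md5 works:", hashlib.md5(b"test").hexdigest())
except ValueError as e:
    print("md5 broken:", e)
```

Running this with the CVMFS-provided Python (after sourcing the ATLAS setup script) would show whether the failure comes from that interpreter's OpenSSL linkage rather than from the job itself.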