New native Linux ATLAS application
Author | Message |
---|---|
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
Thanks, I think the way the BOINC client works on Ubuntu is different so that's why it's being killed. I will put a patch in for this straight away. |
Send message Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0 |
Today I tried another job: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=313224 The job finished and uploaded a result, but it again killed the running BOINC Manager. The client itself kept running. |
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
I found this in the log:
2017-03-01 10:55:30| 3314|processes.py| Ignoring BOINC client: pid=2366, ppid=1, args='/home/boinc/BOINC/boinc'
2017-03-01 10:55:30| 3314|processes.py| Ignoring BOINC client: pid=11305, ppid=11270, args='./boincmgr'
2017-03-01 10:55:30| 3314|processes.py| Found orphan process: pid=11385, ppid=1, args='dbus-launch'
2017-03-01 10:55:30| 3314|processes.py| Killed orphaned process 11385 (dbus-launch)
2017-03-01 10:55:30| 3314|processes.py| Found orphan process: pid=11386, ppid=1, args='/bin/dbus-daemon'
2017-03-01 10:55:30| 3314|processes.py| Killed orphaned process 11386 (/bin/dbus-daemon)
These processes are part of BOINC Manager, I think. I'll put in another fix for this. |
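For anyone who wants to check their own pilot log for the same symptom, here is a minimal sketch that pulls out the processes reported as killed. It assumes the log lines have exactly the "Killed orphaned process PID (command)" shape quoted above; the function name `killed_processes` is made up for this example and is not part of the ATLAS pilot.

```python
import re

# Matches the "Killed orphaned process <pid> (<command>)" lines quoted above.
KILLED_RE = re.compile(r"Killed orphaned process (\d+) \((.+?)\)")

def killed_processes(log_text):
    """Return (pid, command) pairs for every process the log reports as killed."""
    return [(int(pid), cmd) for pid, cmd in KILLED_RE.findall(log_text)]

sample = (
    "2017-03-01 10:55:30| 3314|processes.py| Found orphan process: pid=11385, ppid=1, args='dbus-launch'\n"
    "2017-03-01 10:55:30| 3314|processes.py| Killed orphaned process 11385 (dbus-launch)\n"
    "2017-03-01 10:55:30| 3314|processes.py| Killed orphaned process 11386 (/bin/dbus-daemon)\n"
)

print(killed_processes(sample))
```

Anything in that list that is not an ATLAS job process (here both dbus entries) was killed by mistake.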
Send message Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0 |
2017-03-01 10:55:30| 3314|processes.py| Ignoring BOINC client: pid=2366, ppid=1, args='/home/boinc/BOINC/boinc'
The BOINC client itself.
2017-03-01 10:55:30| 3314|processes.py| Ignoring BOINC client: pid=11305, ppid=11270, args='./boincmgr'
The BOINC GUI used for monitoring.
dbus*
System specific. Do not kill it.
I think ppid=1 on its own is not a sufficient criterion when you search for kill candidates.
So far the native app works well. CVMFS Host 1: ATLAS Simulation v0.30 (mt) The jobs finish successfully and the logs show result files like HITS.10327233._012475-1983772-14445.pool.root.1. The app works with both a single-core and a dual-core setting. It works on both of my hosts.
What should be checked?
- The result files differ in size from 20 k to 50 M.
- There is a huge difference between runtime (single core: > 4h) and estimated runtime at app start (50 min).
- The credit reward is only 1/2 (roughly) of the VM's credit reward.
- What can be done to reduce the initial download to only a few MB? |
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
I picked up this task last night by accident. I'm not using the BOINC Manager. The task only ran for a few seconds and, on ending, took down the default VNC server (Vino), which I was using to control the host. I had to restart everything to regain control, so it looks as though it's killing more than it should. I don't know why it failed to run; there is no HITS file. This is Ubuntu 12.04. |
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
"I think ppid=1 (only) is not a perfect criterion to look at when you search for kill candidates."
This is used on the Grid because each Grid task runs under a different user ID, so at the end the ATLAS pilot can be sure that any of its own processes with PPID=1 are safe to kill.
"What should be checked?"
The result files should be 50 MB if the task ran properly.
"- There is a huge difference between runtime (singlecore: > 4h) and estimated runtime at app start (50 min)."
I am running 4-core tasks, and since I have run most of the tasks, the estimated time is based on those. As far as I know, BOINC doesn't adjust the estimated time according to the number of cores. This is why you can get huge credit scores for single-core tasks: they go way over the estimate (this has been discussed recently on the ATLAS@Home message boards).
"- The credit reward is only 1/2 (roughly) of the VM's credit reward."
Credit is always a mystery to me... It's not easy to compare credit from different projects, even if they are running the same tasks, because it depends on who is running them, the history of results and so on.
"- What can be done to reduce the initial download to only a few MB?"
Since for this test we are always running the same task with the same input file, I'll see if I can change some configuration to make it stay on the client. |
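To put rough numbers on the runtime mismatch: a back-of-the-envelope sketch, assuming the simulation scales perfectly linearly with core count (real tasks will not, so treat the result as a lower bound). The 50-minute estimate and the 4-core reference come from the posts above; the function name is only for illustration.

```python
def scaled_estimate_minutes(estimate_minutes, reference_cores, task_cores):
    """Rescale an estimate made on reference_cores to task_cores, assuming perfect linear scaling."""
    return estimate_minutes * reference_cores / task_cores

# 50-minute estimate derived from 4-core runs, applied to a single-core task.
print(scaled_estimate_minutes(50, 4, 1))  # 200.0
```

So a single-core run would be expected to take at least about 200 minutes, which is in the same range as the reported runtime of more than 4 hours.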
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
"I picked up this task last night"
The failure is probably due to one of our scripts, which had a part that didn't work on Ubuntu and which I have since fixed. I saw from the output that you run BOINC under user "m". The code for killing orphans doesn't kill anything if the user is "boinc", so in your case it killed all the processes with user "m" and PPID=1. I will think of a better way to protect against this in case people run BOINC under different usernames. |
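One possible way to tighten the selection, sketched under assumptions: instead of killing every orphan owned by the user, only kill orphans whose PIDs the job itself recorded when it started them. This is not how the ATLAS pilot works; `Proc`, `kill_candidates` and the `job_pids` set are hypothetical names for this example, and the process list is made up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Proc:
    pid: int
    ppid: int
    user: str
    args: str

def kill_candidates(procs, job_user, job_pids):
    """Select orphans (ppid == 1) owned by job_user that the job is known to have started."""
    return [p for p in procs
            if p.ppid == 1 and p.user == job_user and p.pid in job_pids]

procs = [
    Proc(2366, 1, "m", "/home/boinc/BOINC/boinc"),  # BOINC client, must survive
    Proc(11385, 1, "m", "dbus-launch"),             # desktop session helper, must survive
    Proc(20001, 1, "m", "athena.py"),               # hypothetical leftover from the job
]

print([p.pid for p in kill_candidates(procs, "m", job_pids={20001})])
```

With the extra PID check, the username no longer matters: the BOINC client and dbus-launch are left alone even though they match on user and PPID=1.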
Send message Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0 |
"They should be 50MB if the task ran properly"
HITS.* from single-core tasks have a size of 20 k (-> invalid?). HITS.* from dual-core tasks have a size of 50 M (-> valid?). With the same input for all test tasks, the result should always be the same. Any idea (besides "different core numbers" of course :-D)? |
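A quick way to flag suspicious result files locally is a size check. This is only a sketch: the 1 MB threshold is an arbitrary assumption chosen to sit between the roughly 20 kB and 50 MB sizes reported in this thread, and a file of plausible size can still be invalid.

```python
import os

# Assumed threshold, between the ~20 kB and ~50 MB sizes reported above.
MIN_HITS_BYTES = 1_000_000

def hits_size_plausible(size_bytes, min_bytes=MIN_HITS_BYTES):
    """Return True if a HITS file of size_bytes is at least min_bytes long."""
    return size_bytes >= min_bytes

def hits_file_plausible(path, min_bytes=MIN_HITS_BYTES):
    """Same check for a file on disk; a missing file counts as not plausible."""
    return os.path.isfile(path) and hits_size_plausible(os.path.getsize(path), min_bytes)

print(hits_size_plausible(20_000))      # False
print(hits_size_plausible(50_000_000))  # True
```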
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 15 |
I returned 6 tasks after only ~220 seconds of elapsed time. I used single-core, dual-core and 4-core settings:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=314110
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=314108
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=314102
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=314112
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=314109
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=314105
All ended in a validate error (of course). |
Send message Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0 |
"I returned 6 tasks after only ~220 seconds elapsed time. I used single core, dual core and 4-cores"
Did you check your SSL libs? If not, your CVMFS runs into a timeout and the job stops after a few minutes. A workaround can be found in this thread: https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=348&postid=4739 |
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
I checked one log and it looks like another Ubuntu compatibility issue ("source" command not found). I've submitted some more tasks where this is fixed. I've also now ensured that no processes are killed for anyone. Hopefully it works this time. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 15 |
"Did you check your SSL libs?"
I checked the SSL libs and defined the symbolic links. All tasks stop after 3.5 minutes. BOINC reports them as successful, but validation ends in an error. Is there something I can do (I could try to fetch a file out of the slot before it is emptied), or can you find something in the returned files, David? |
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
Jobs fail after a few minutes, as described below. There is practically no CPU use apart from short bursts by python at intervals. At the start of the pilotlog.txt file is this:
2017-03-03 00:06:36|11945|SiteInformat| !!WARNING!!2999!! $X509_CERT_DIR is not set and default location /etc/grid-security/certificates does not exist
which looks suspicious. I also grabbed a copy of runtime_log_err.txt as the filename looked promising, but it appears to be the same as pilotlog.txt. |
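To see what that warning is looking at on your own host, here is a small sketch that checks the same two things named in the message: the `X509_CERT_DIR` environment variable and the default directory `/etc/grid-security/certificates`. Whether the missing directory is actually what makes the job fail is not established in this thread; the function name and its injectable parameters are just for this example.

```python
import os

DEFAULT_CERT_DIR = "/etc/grid-security/certificates"  # path taken from the warning above

def cert_dir_status(environ=os.environ, isdir=os.path.isdir):
    """Return (source, path, exists) for the certificate directory that would be used."""
    cert_dir = environ.get("X509_CERT_DIR")
    if cert_dir:
        return ("env", cert_dir, isdir(cert_dir))
    return ("default", DEFAULT_CERT_DIR, isdir(DEFAULT_CERT_DIR))

print(cert_dir_status())
```

A result starting with "default" and ending in False corresponds to the situation the warning describes.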
Send message Joined: 14 Sep 15 Posts: 2 Credit: 7,617 RAC: 0 |
Maybe you can provide a pre-configured VM, so that people who don't have the Linux skills, or don't have a Linux host at all, can just run a VM, open the pre-installed BOINC Manager and add this project with their own account? |
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
I've made a new version of the app with way more debugging, so we should be able to see much more information in the stderr. I've also tried to remove the remaining "source" commands that the Ubuntu dash shell doesn't like. Now the "sticky" flag is used for the input root file, so you shouldn't have to re-download it for every task.
"Maybe you can provide a pre-configured VM so people who have not that Linux skills or people who don't have a Linux host at all can just run a VM, open pre-installed Boinc manager and add this project with their own account... ?"
This would basically be what the regular ATLAS vbox app does automatically, so I'm not sure I see any advantage in setting up a VM yourself. |
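On the "source" problem mentioned in this post: `source` is a bash builtin, while dash (Ubuntu's /bin/sh) only provides the POSIX equivalent, a single dot. A minimal sketch of the kind of rewrite involved, assuming the command appears at the start of a script line; it deliberately ignores `source` after `;` or `&&`, and the helper name is made up for this example rather than taken from the ATLAS scripts.

```python
import re

# A leading "source" followed by whitespace; indentation is preserved.
SOURCE_RE = re.compile(r"^(\s*)source(\s+)")

def posixify_source(line):
    """Rewrite a leading bash-only `source` into the POSIX `.` builtin."""
    return SOURCE_RE.sub(r"\1.\2", line)

print(posixify_source("source setup.sh"))      # . setup.sh
print(posixify_source("echo source of data"))  # echo source of data
```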
Send message Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
David Cameron wrote: "Today we cannot run a VM inside another VM, and many LHC computing farms (including CERN itself) run a virtualised infrastructure. A native app would allow us to use the spare capacity of all those CPUs."
Sorry, but this is not true. I'm running ATLAS@Home on a virtualized client and it is number 11 in the top hosts :-) This is called "nested virtualisation" and is a little bit tricky to set up, but with VMware it works really well, as you can see with my Atlas1 |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 15 |
"I've made a new version of the app with way more debugging so we should be able to see from the stderr much more information. I've also tried to remove remaining 'source' commands that the Ubuntu dash shell doesn't like. Now the 'sticky' flag is used for the input root file so you shouldn't have to re-download it for every task."
I did 2 tasks. The 2nd still downloaded the 200MB. However, the second task was requested after the first one had already been reported. Maybe the 200MB file only sticks when you still have at least one task loaded when you get the next one.
More important: I have a stderr.txt for you. You're invited to have a look ;) https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=314881
The beginning is truncated (by BOINC, I believe), but in a previous task I saw somewhere: "No LSB modules are available." Could that be a problem? I have the whole log of that task available. |
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
Here is the problem:
2017-03-03 12:02:56| 10847|RunJobUtilit| /bin/sh: 1: Sim_tf.py: not found
It means one of the setup scripts didn't run properly, so the main command to run the simulation is not available. Can you find a running task and execute by hand
. /var/lib/boinc-client/slots/14//APPS/HEP/ATLAS-19.2.4.9-X86_64-SLC6-GCC47-OPT 1
Then see if you can run Sim_tf.py? If you give no arguments it will print a bunch of messages then exit. |
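The "not found" message means the shell could not locate `Sim_tf.py` on its search path. A small sketch of the same lookup from Python, using the standard `shutil.which`; it only tells you whether the command is reachable in the environment the script runs in, so it needs to be run from a shell where the setup script has already been sourced. The wrapper name is just for this example.

```python
import shutil

def command_available(name, path=None):
    """Return the full path of name if it is an executable on the search path, else None."""
    return shutil.which(name, path=path)

# Prints None when the setup script has not put the command on PATH.
print(command_available("Sim_tf.py"))
```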
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
David Cameron wrote: "Today we cannot run a VM inside another VM, and many LHC computing farms (including CERN itself) run a virtualised infrastructure. A native app would allow us to use the spare capacity of all those CPUs."
Interesting! I knew that it wasn't possible to run VirtualBox inside another VirtualBox, but I didn't know it was possible with VMware. Do you have a recipe for making this work? |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 15 |
"Can you find a running task and execute by hand"
The outcome of that command:
Using AtlasProduction/19.2.4.9 with platform x86_64-slc6-gcc47-opt at /cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4
manpath: warning: $MANPATH set, ignoring /etc/manpath.config
ERROR:root:code for hash md5 was not found.
Traceback (most recent call last):
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 139, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 91, in __get_builtin_constructor
    raise ValueError('unsupported hash type %s' % name)
ValueError: unsupported hash type md5
ERROR:root:code for hash sha1 was not found.
Traceback (most recent call last):
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 139, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 91, in __get_builtin_constructor
    raise ValueError('unsupported hash type %s' % name)
ValueError: unsupported hash type sha1
ERROR:root:code for hash sha224 was not found.
Traceback (most recent call last):
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 139, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 91, in __get_builtin_constructor
    raise ValueError('unsupported hash type %s' % name)
ValueError: unsupported hash type sha224
ERROR:root:code for hash sha256 was not found.
Traceback (most recent call last):
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 139, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 91, in __get_builtin_constructor
    raise ValueError('unsupported hash type %s' % name)
ValueError: unsupported hash type sha256
ERROR:root:code for hash sha384 was not found.
Traceback (most recent call last):
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 139, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 91, in __get_builtin_constructor
    raise ValueError('unsupported hash type %s' % name)
ValueError: unsupported hash type sha384
ERROR:root:code for hash sha512 was not found.
Traceback (most recent call last):
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 139, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 91, in __get_builtin_constructor
    raise ValueError('unsupported hash type %s' % name)
ValueError: unsupported hash type sha512
"Then see if you can run Sim_tf.py? If you give no arguments it will print a bunch of messages then exit."
I don't have a file Sim_tf.py in that slot (11 in this case). Four Python files are available: LFCTools.py RunJob.py VmPeak.py outputList.py |
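The traceback above shows `hashlib` failing to build every one of its standard constructors. In CPython that usually points at the OpenSSL-backed `_hashlib` extension not loading, with no builtin fallback available, which would fit the earlier question about SSL libs, although that is an inference and not confirmed here. A minimal probe, to be run with the same Python interpreter that the setup script selects; the function name is only for this example.

```python
import hashlib

def probe_hashes(names=("md5", "sha1", "sha224", "sha256", "sha384", "sha512")):
    """Map each algorithm name to True if hashlib can construct it, else False."""
    result = {}
    for name in names:
        try:
            hashlib.new(name)
            result[name] = True
        except ValueError:
            # hashlib raises ValueError for hash types it cannot provide.
            result[name] = False
    return result

print(probe_hashes())
```

If every entry comes back False, the interpreter cannot find a working hash implementation at all, which matches the errors in this post.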
©2024 CERN