Thread 'New native Linux ATLAS application'

Author	Message
David Cameron Project administrator Project developer Project tester Project scientist Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0	Message 4747 - Posted: 28 Feb 2017, 8:32:33 UTC - in response to Message 4746. Thanks, I think the way the BOINC client works on Ubuntu is different so that's why it's being killed. I will put a patch in for this straight away. ID: 4747 · Rating: 0 · rate: / Reply Quote

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 527 Credit: 400,710 RAC: 0	Message 4750 - Posted: 1 Mar 2017, 11:17:19 UTC Today I tried another job: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=313224 The job finished and uploaded a result but it killed again the running boinc manager. The client itself kept running. ID: 4750 · Rating: 0 · rate: / Reply Quote

David Cameron Project administrator Project developer Project tester Project scientist Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0	Message 4752 - Posted: 1 Mar 2017, 20:00:25 UTC - in response to Message 4750. I found this in the log:
[pre]
2017-03-01 10:55:30| 3314|processes.py| Ignoring BOINC client: pid=2366, ppid=1, args='/home/boinc/BOINC/boinc'
2017-03-01 10:55:30| 3314|processes.py| Ignoring BOINC client: pid=11305, ppid=11270, args='./boincmgr'
2017-03-01 10:55:30| 3314|processes.py| Found orphan process: pid=11385, ppid=1, args='dbus-launch'
2017-03-01 10:55:30| 3314|processes.py| Killed orphaned process 11385 (dbus-launch)
2017-03-01 10:55:30| 3314|processes.py| Found orphan process: pid=11386, ppid=1, args='/bin/dbus-daemon'
2017-03-01 10:55:30| 3314|processes.py| Killed orphaned process 11386 (/bin/dbus-daemon)
[/pre]
These processes are part of BOINC manager, I think. I'll put in another fix for this. ID: 4752 · Rating: 0 · rate: / Reply Quote
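For readers who want to check their own logs for the same symptom, here is a minimal sketch that pulls the "Killed orphaned process" entries out of log lines like the ones quoted above. The field layout (timestamp, pid, module, message, separated by `|`) is inferred from this excerpt only, and `parse_pilot_line` and `killed_processes` are hypothetical helper names, not part of the ATLAS pilot.

```python
import re

# Layout inferred from the excerpt: "timestamp| pid|module| message".
LINE_RE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\|\s*(?P<pid>\d+)\|"
    r"(?P<module>[^|]+)\|\s*(?P<message>.*)$"
)
KILLED_RE = re.compile(r"Killed orphaned process (?P<pid>\d+) \((?P<name>[^)]*)\)")


def parse_pilot_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def killed_processes(lines):
    """Return (pid, name) for every 'Killed orphaned process' message."""
    killed = []
    for line in lines:
        fields = parse_pilot_line(line)
        if fields is None:
            continue
        hit = KILLED_RE.search(fields["message"])
        if hit:
            killed.append((int(hit.group("pid")), hit.group("name")))
    return killed
```

Feeding it the lines of a pilot log gives a quick list of what was taken down, which can then be compared with the processes that went missing on the host.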

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 527 Credit: 400,710 RAC: 0	Message 4753 - Posted: 2 Mar 2017, 9:18:14 UTC - in response to Message 4752.
2017-03-01 10:55:30| 3314|processes.py| Ignoring BOINC client: pid=2366, ppid=1, args='/home/boinc/BOINC/boinc'
The BOINC client itself.
2017-03-01 10:55:30| 3314|processes.py| Ignoring BOINC client: pid=11305, ppid=11270, args='./boincmgr'
The BOINC GUI used for monitoring.
dbus*
System specific. Do not kill it.
I think ppid=1 (only) is not a perfect criterion to look at when you search for kill candidates.
So far the native app works well.
CVMFS
Host 1:
Running /usr/bin/cvmfs_config stat atlas.cern.ch:
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.3.3.0 495 507 44956 21781 8 20 2014592 25600001 842 65024 0 50094 99.976 31833 3 http://cvmfs-stratum-one.cern.ch/cvmfs/atlas.cern.ch http://IP.of.my.localsquid:3128 1
Host 2:
Running /usr/bin/cvmfs_config stat atlas.cern.ch:
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.3.3.0 13808 506 35556 21781 9 18 1120900 25600001 998 65024 0 59890 95.0426 436465 3 http://cvmfs-stratum-one.cern.ch/cvmfs/atlas.cern.ch http://IP.of.my.localsquid:3128 1
ATLAS Simulation v0.30 (mt)
The jobs finish successfully and the logs show result files like HITS.10327233._012475-1983772-14445.pool.root.1. The app works with both a single-core and a dual-core setting, and it works on both of my hosts.
What should be checked?
- The result files differ in size from 20 k to 50 M.
- There is a huge difference between runtime (singlecore: > 4h) and estimated runtime at app start (50 min).
- The credit reward is only 1/2 (roughly) of the VM's credit reward.
- What can be done to reduce the initial download to only a few MB? ID: 4753 · Rating: 0 · rate: / Reply Quote
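The `cvmfs_config stat` output above is a header row followed by one row of values, which is hard to read once the columns no longer line up. A minimal sketch of pairing the two rows is shown below; it assumes both rows are whitespace-separated and that no value contains a space, which holds for the two hosts quoted here. `parse_cvmfs_stat` and `cache_fill_percent` are hypothetical helper names, not CVMFS tools.

```python
def parse_cvmfs_stat(header_line, value_line):
    """Pair each column name with its value; both lines are whitespace-separated."""
    names = header_line.split()
    values = value_line.split()
    if len(names) != len(values):
        raise ValueError(
            "header has %d columns, values have %d" % (len(names), len(values))
        )
    return dict(zip(names, values))


def cache_fill_percent(stat):
    """Cache usage as a percentage of the configured maximum."""
    return 100.0 * int(stat["CACHEUSE(K)"]) / int(stat["CACHEMAX(K)"])
```

With the Host 1 rows, this gives a hit rate of 99.976 % and a cache that is a little under 8 % full, so neither host looks close to its cache limit.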

m Volunteer tester Send message Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 105	Message 4754 - Posted: 2 Mar 2017, 9:33:30 UTC Last modified: 2 Mar 2017, 9:37:46 UTC I picked up this task last night by accident. Not using the BOINC manager. It only ran for a few seconds and, on ending, took down the default VNC server (Vino) which I was using to control the host. Had to restart everything to regain control so it looks as though it's killing more than it should. Don't know why it failed to run - no HITS file. This is Ubuntu 12.04. ID: 4754 · Rating: 0 · rate: / Reply Quote

David Cameron Project administrator Project developer Project tester Project scientist Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0	Message 4755 - Posted: 2 Mar 2017, 10:46:55 UTC - in response to Message 4753.
[quote]I think ppid=1 (only) is not a perfect criterion to look at when you search for kill candidates.[/quote]
This is used on the Grid because each Grid task runs under a different user id, so at the end the ATLAS pilot can be sure that any of its own processes with PPID=1 are safe to kill.
[quote]What should be checked? - The result files differ in size from 20 k to 50 M.[/quote]
They should be 50MB if the task ran properly.
[quote]- There is a huge difference between runtime (singlecore: > 4h) and estimated runtime at app start (50 min).[/quote]
I am running 4-core tasks and since I have run most of the tasks the estimated time is based on that. As far as I know BOINC doesn't adjust the estimated time according to cores, this is why you can get huge credit scores for single-core, because they go way over the estimate (this has been discussed recently on ATLAS@Home message boards).
[quote]- The credit reward is only 1/2 (roughly) of the VM´s credit reward.[/quote]
Credit is always a mystery to me... It's not easy to compare credit from different projects, even if they are running the same tasks because it depends on who is running them, the history of results and so on.
[quote]- What can be done to reduce the initial download to only a few MB?[/quote]
Since for this test we are always running the same task with the same input file, I'll see if I can change some configuration to make it stay on the client. ID: 4755 · Rating: 0 · rate: / Reply Quote
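The runtime mismatch can be illustrated with a little arithmetic. This is only a sketch under a loud assumption: that the work divides perfectly across cores, which a real simulation will not quite achieve. `scale_estimate` is a hypothetical helper, not something BOINC provides.

```python
def scale_estimate(minutes, reference_cores, actual_cores):
    """Rescale a wall-clock estimate from one core count to another.

    Assumes ideal linear scaling: the same total CPU time, spread over
    a different number of cores.
    """
    if reference_cores < 1 or actual_cores < 1:
        raise ValueError("core counts must be at least 1")
    return minutes * reference_cores / actual_cores


# A 50 minute estimate taken from 4-core runs, applied to a single core:
single_core_minutes = scale_estimate(50, reference_cores=4, actual_cores=1)
```

Under that assumption the 50 minute estimate becomes 200 minutes on one core, which is in the same range as the "> 4h" reported for single-core tasks once startup overhead and imperfect scaling are allowed for.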

David Cameron Project administrator Project developer Project tester Project scientist Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0	Message 4756 - Posted: 2 Mar 2017, 10:50:48 UTC - in response to Message 4754.
[quote]I picked up this task last night by accident. Not using the BOINC manager. It only ran for a few seconds and, on ending, took down the default VNC server (Vino) which I was using to control the host. Had to restart everything to regain control so it looks as though it's killing more than it should. Don't know why it failed to run - no HITS file. This is Ubuntu 12.04.[/quote]
The failure is probably due to one of our scripts, which had a part that didn't work on Ubuntu and which I have since fixed. I saw from the output that you run BOINC under user "m". The code for killing orphans doesn't kill anything if the user is "boinc", so in your case it killed all the processes with user "m" and PPID=1. I will think of a better way to protect against this in case people run BOINC under different usernames. ID: 4756 · Rating: 0 · rate: / Reply Quote
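One possible way to narrow the kill criterion, sketched here purely as an illustration and not as what the ATLAS pilot does: record the job's descendants while it is still running, and at the end only treat a PPID=1 process as a candidate if it was in that recorded set. The `Process` record and both function names are hypothetical, and the sketch ignores pid reuse between the two snapshots.

```python
from collections import namedtuple

# Hypothetical record for one row of a process listing.
Process = namedtuple("Process", "pid ppid user args")


def descendants(processes, root_pid):
    """Return the pids of all processes below root_pid in the parent/child tree."""
    children = {}
    for proc in processes:
        children.setdefault(proc.ppid, []).append(proc.pid)
    found, stack = set(), [root_pid]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def kill_candidates(processes, tracked_pids):
    """Orphans (ppid == 1) that the job itself was seen to start earlier."""
    return [p for p in processes if p.ppid == 1 and p.pid in tracked_pids]
```

With this approach a `dbus-launch` or VNC server owned by the same user is left alone, because it never appeared below the job in the process tree, whatever username BOINC runs under.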

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 527 Credit: 400,710 RAC: 0	Message 4758 - Posted: 2 Mar 2017, 13:22:12 UTC
[quote]They should be 50MB if the task ran properly[/quote]
HITS.* from singlecore tasks have a size of 20 k (-> invalid?). HITS.* from dualcore tasks have a size of 50 M (-> valid?). With the same input for all test tasks the result should always be the same. Any idea (besides "different core numbers" of course :-D)? ID: 4758 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1252 Credit: 996,478 RAC: 78	Message 4759 - Posted: 2 Mar 2017, 13:33:44 UTC I returned 6 tasks after only ~220 seconds elapsed time. I used single core, dual core and 4-cores https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=314110 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=314108 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=314102 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=314112 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=314109 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=314105 All validate error (of course). ID: 4759 · Rating: 0 · rate: / Reply Quote

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 527 Credit: 400,710 RAC: 0	Message 4760 - Posted: 2 Mar 2017, 14:33:37 UTC - in response to Message 4759. I returned 6 tasks after only ~220 seconds elapsed time. I used single core, dual core and 4-cores https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=314110 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=314108 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=314102 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=314112 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=314109 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=314105 All validate error (of course). Did you check your SSL libs? If not, your CVMFS runs into a timeout and the job stops after a few minutes. A workaround can be found below in this thread: https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=348&postid=4739 ID: 4760 · Rating: 0 · rate: / Reply Quote

David Cameron Project administrator Project developer Project tester Project scientist Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0	Message 4761 - Posted: 2 Mar 2017, 16:17:39 UTC - in response to Message 4760. I checked one log and it looks like another ubuntu compatibility issue (source command not found). I've submitted some more tasks where this is fixed. I've also now ensured that no processes are killed for anyone, hopefully it works this time. ID: 4761 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1252 Credit: 996,478 RAC: 78	Message 4762 - Posted: 2 Mar 2017, 22:31:23 UTC - in response to Message 4760.
[quote]Did you check your SSL libs? If not, your CVMFS runs into a timeout and the job stops after a few minutes. A workaround can be found below in this thread: https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=348&postid=4739[/quote]
I checked the SSL libs and defined the symbolic links. All tasks stop after 3.5 minutes. BOINC reports them as successful, but validation ends in an error. Is there something I can do (I could try to fetch a file out of the slot before it is emptied), or can you find something in the returned files, David? ID: 4762 · Rating: 0 · rate: / Reply Quote

m Volunteer tester Send message Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 105	Message 4764 - Posted: 3 Mar 2017, 1:24:08 UTC Last modified: 3 Mar 2017, 1:34:46 UTC Jobs fail after a few minutes as described below. Practically no CPU use apart from short bursts by python at intervals. At the start of the pilotlog.txt file is this:-
2017-03-03 00:06:36|11945|SiteInformat| !!WARNING!!2999!! $X509_CERT_DIR is not set and default location /etc/grid-security/certificates does not exist
which looks suspicious. I also grabbed a copy of runtime_log_err.txt as the filename looked promising, but it appears to be the same as pilotlog.txt. ID: 4764 · Rating: 0 · rate: / Reply Quote
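The warning above describes a two-step lookup: use `$X509_CERT_DIR` if it is set, otherwise fall back to `/etc/grid-security/certificates`. A minimal sketch of the same check that volunteers could run on their own host follows. `find_cert_dir` is a hypothetical helper written for this thread, and treating a set-but-missing `$X509_CERT_DIR` as "not found" is an assumption, not documented pilot behaviour.

```python
import os

# Default location named in the warning message.
DEFAULT_CERT_DIR = "/etc/grid-security/certificates"


def find_cert_dir(environ=None, isdir=os.path.isdir):
    """Return the certificate directory that would be used, or None if none exists."""
    environ = os.environ if environ is None else environ
    candidate = environ.get("X509_CERT_DIR")
    if candidate:
        # Assumption: a variable pointing at a missing directory counts as not found.
        return candidate if isdir(candidate) else None
    return DEFAULT_CERT_DIR if isdir(DEFAULT_CERT_DIR) else None
```

Calling `find_cert_dir()` with no arguments checks the real environment. A result of `None` matches the situation the warning reports.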

Chris Skull Send message Joined: 14 Sep 15 Posts: 2 Credit: 7,617 RAC: 0	Message 4765 - Posted: 3 Mar 2017, 9:13:49 UTC Maybe you can provide a pre-configured VM, so that people who don't have the Linux skills, or who don't have a Linux host at all, can just run a VM, open the pre-installed BOINC manager and add this project with their own account...? ID: 4765 · Rating: 0 · rate: / Reply Quote

David Cameron Project administrator Project developer Project tester Project scientist Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0	Message 4766 - Posted: 3 Mar 2017, 10:22:48 UTC I've made a new version of the app with far more debugging, so we should be able to see much more information in the stderr. I've also tried to remove the remaining "source" commands that the Ubuntu dash shell doesn't like. The "sticky" flag is now used for the input root file, so you shouldn't have to re-download it for every task.
[quote]Maybe you can provide a pre-configured VM, so that people who don't have the Linux skills, or who don't have a Linux host at all, can just run a VM, open the pre-installed BOINC manager and add this project with their own account...?[/quote]
This would be basically what the regular ATLAS vbox app does automatically, so I'm not sure I see any advantage in setting up a VM yourself. ID: 4766 · Rating: 0 · rate: / Reply Quote
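Background on the "source" problem: `source` is a bash builtin, while dash (Ubuntu's `/bin/sh`) only provides the POSIX spelling, a single dot. The sketch below is one way to hunt for remaining uses in a script. It is a rough illustration with hypothetical helper names, and its simple pattern does not understand quoting or here-documents.

```python
import re

# Matches "source" used as a command at the start of a line or after ; & or |.
SOURCE_RE = re.compile(r"(^|[;&|]\s*)source(\s+)")


def find_source_lines(script_text):
    """Return 1-based line numbers that appear to call the 'source' builtin."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        stripped = line.lstrip()
        if stripped.startswith("#"):
            continue  # skip comments and the shebang line
        if SOURCE_RE.search(stripped):
            hits.append(number)
    return hits


def to_posix(line):
    """Rewrite 'source file' as '. file', keeping the original indentation."""
    indent = line[: len(line) - len(line.lstrip())]
    return indent + SOURCE_RE.sub(r"\1.\2", line.lstrip())
```

Running `find_source_lines` over each setup script gives the lines to review by hand; `to_posix` shows the replacement for a single line.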

Yeti Send message Joined: 29 May 15 Posts: 158 Credit: 2,926,146 RAC: 1,883	Message 4767 - Posted: 3 Mar 2017, 11:10:50 UTC - in response to Message 4725. David Cameron wrote: Today we cannot run a VM inside another VM, and many LHC computing farms (including CERN itself) run a virtualised infrastructure. A native app would allow us to use the spare capacity of all those CPUs. Sorry, but this is not true. I'm running Atlas@Home on a virtualized Client and it is Number 11 in the TOP-Hosts :-) This is called "nested Virtualisation" and is a little bit tricky to set up, but with VMWare this works really fine as you can see with my Atlas1 ID: 4767 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1252 Credit: 996,478 RAC: 78	Message 4768 - Posted: 3 Mar 2017, 12:14:06 UTC - in response to Message 4766. Last modified: 3 Mar 2017, 12:37:16 UTC I've made a new version of the app with way more debugging so we should be able to see from the stderr much more information. I've also tried to remove remaining "source" commands that the Ubuntu dash shell doesn't like. Now the "sticky" flag is used for the input root file so you shouldn't have to re-download it for every task. I did 2 tasks. The 2nd still downloaded the 200MB. The second task however was requested after the first one was already reported ready. Maybe the 200MB sticks when you have at least 1 task still loaded and get a 2nd one. More important: I have a stderr.txt for you. You're invited to have a look ;) https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=314881 The beginning is truncated (by BOINC, I believe), but from a previous task I somewhere saw: "No LSB modules are available." Could that be a problem? I have the whole log available of that task. ID: 4768 · Rating: 0 · rate: / Reply Quote

David Cameron Project administrator Project developer Project tester Project scientist Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0	Message 4769 - Posted: 3 Mar 2017, 12:37:10 UTC - in response to Message 4768. Here is the problem:
[pre]
2017-03-03 12:02:56| 10847|RunJobUtilit| /bin/sh: 1: Sim_tf.py: not found
[/pre]
It means one of the setup scripts didn't run properly, so the main command to run the simulation is not available. Can you find a running task and execute by hand
[code]. /var/lib/boinc-client/slots/14//APPS/HEP/ATLAS-19.2.4.9-X86_64-SLC6-GCC47-OPT 1[/code]
Then see if you can run Sim_tf.py? If you give no arguments it will print a bunch of messages then exit. ID: 4769 · Rating: 0 · rate: / Reply Quote

David Cameron Project administrator Project developer Project tester Project scientist Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0	Message 4770 - Posted: 3 Mar 2017, 12:39:05 UTC - in response to Message 4767. David Cameron wrote: Today we cannot run a VM inside another VM, and many LHC computing farms (including CERN itself) run a virtualised infrastructure. A native app would allow us to use the spare capacity of all those CPUs. Sorry, but this is not true. I'm running Atlas@Home on a virtualized Client and it is Number 11 in the TOP-Hosts :-) This is called "nested Virtualisation" and is a little bit tricky to set up, but with VMWare this works really fine as you can see with my Atlas1 Interesting! I knew that it wasn't possible to run VirtualBox inside another VirtualBox, but I didn't know it was possible with VMWare. Do you have a recipe for making this work? ID: 4770 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1252 Credit: 996,478 RAC: 78	Message 4771 - Posted: 3 Mar 2017, 12:48:19 UTC - in response to Message 4769.
[quote]Can you find a running task and execute by hand
[code]. /var/lib/boinc-client/slots/14//APPS/HEP/ATLAS-19.2.4.9-X86_64-SLC6-GCC47-OPT 1[/code][/quote]
The outcome of that command:
[pre]
Using AtlasProduction/19.2.4.9 with platform x86_64-slc6-gcc47-opt at /cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4
manpath: warning: $MANPATH set, ignoring /etc/manpath.config
ERROR:root:code for hash md5 was not found.
Traceback (most recent call last):
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 139, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 91, in __get_builtin_constructor
    raise ValueError('unsupported hash type %s' % name)
ValueError: unsupported hash type md5
ERROR:root:code for hash sha1 was not found.
Traceback (most recent call last):
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 139, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 91, in __get_builtin_constructor
    raise ValueError('unsupported hash type %s' % name)
ValueError: unsupported hash type sha1
ERROR:root:code for hash sha224 was not found.
Traceback (most recent call last):
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 139, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 91, in __get_builtin_constructor
    raise ValueError('unsupported hash type %s' % name)
ValueError: unsupported hash type sha224
ERROR:root:code for hash sha256 was not found.
Traceback (most recent call last):
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 139, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 91, in __get_builtin_constructor
    raise ValueError('unsupported hash type %s' % name)
ValueError: unsupported hash type sha256
ERROR:root:code for hash sha384 was not found.
Traceback (most recent call last):
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 139, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 91, in __get_builtin_constructor
    raise ValueError('unsupported hash type %s' % name)
ValueError: unsupported hash type sha384
ERROR:root:code for hash sha512 was not found.
Traceback (most recent call last):
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 139, in <module>
    globals()[__func_name] = __get_hash(__func_name)
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/hashlib.py", line 91, in __get_builtin_constructor
    raise ValueError('unsupported hash type %s' % name)
ValueError: unsupported hash type sha512
[/pre]
[quote]Then see if you can run Sim_tf.py? If you give no arguments it will print a bunch of messages then exit.[/quote]
I don't have a file Sim_tf.py in that slot (11 in this case). There are 4 Python files available: LFCTools.py RunJob.py VmPeak.py outputList.py ID: 4771 · Rating: 0 · rate: / Reply Quote
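The traceback above shows `hashlib` failing to construct every standard hash. In general that often points to the interpreter's OpenSSL-backed hash module not loading, which would fit the SSL library discussion earlier in this thread, though that link is an assumption here and not confirmed. A minimal probe that reports which algorithms a given Python can construct is sketched below; `probe_hashes` is a hypothetical helper name.

```python
import hashlib

# The six algorithms named in the traceback.
NAMES = ("md5", "sha1", "sha224", "sha256", "sha384", "sha512")


def probe_hashes(names=NAMES):
    """Map each algorithm name to True if hashlib can construct it, else False."""
    available = {}
    for name in names:
        try:
            hashlib.new(name)
            available[name] = True
        except ValueError:
            # hashlib raises ValueError for an unsupported hash type.
            available[name] = False
    return available
```

Running this with the Python from the ATLAS release after sourcing the setup script, and again with the system Python, would show whether the problem is specific to the release's interpreter on that host.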

Development for LHC@home