Message boards : Theory Application : New Native App - Linux Only
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 129
Message 5826 - Posted: 11 Feb 2019, 9:19:10 UTC - in response to Message 5808.  

I've no idea what happened here.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2752133
This is Centos7 which (apparently) must have Python 2.7. I've installed Python 3.6 (as well) which fixed this:-
14:34:52 (4883): wrapper: running ../../projects/lhcathomedev.cern.ch_lhcathome-dev/cranky-0.0.13 ()
/usr/bin/env: python3: No such file or directory

but it still doesn't work...


For whatever reason, the python3 command doesn't exist after installing Python 3.6; the command is python36 instead. I have managed to get it working on CentOS 7 with some fiddling, but I will try to improve the setup so that this is not necessary.
Laurence
Message 5827 - Posted: 11 Feb 2019, 9:22:20 UTC - in response to Message 5809.  
Last modified: 11 Feb 2019, 9:22:34 UTC

From this:-
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2752133
came this:-
cranky-0.0.13 INFO: Running Container 'runc'.
nsenter: failed to unshare user namespace: Invalid argument
container_linux.go:336: starting container process caused "process_linux.go:279: running exec setns process for init caused \"exit status 39\""
cranky-0.0.13 ERROR: Container 'runc' failed.

For those more knowledgeable than I, there may be some ideas here:-
https://coderwall.com/p/s_ydlq/using-user-namespaces-on-docker
I've followed the instructions to enable namespaces on the kernel, but now there's no work to try it out... and it's Friday.


User namespaces are enabled in the kernel, which you can check with the following:
grep CONFIG_USER_NS /boot/config-$(uname -r)
CONFIG_USER_NS=y

The issue is that, by default, the maximum number of user namespaces is set to 0. To fix this, run the following as root:
echo 640 > /proc/sys/user/max_user_namespaces
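
The two checks above (kernel option compiled in, limit greater than zero) can also be scripted. This is only a minimal Python sketch and not part of cranky; the function names are made up for illustration, and it assumes the kernel config lives at /boot/config-<release>, as in the grep command above.

```python
import platform


def userns_compiled_in(config_text):
    """Return True if the kernel config text contains CONFIG_USER_NS=y."""
    return any(line.strip() == "CONFIG_USER_NS=y"
               for line in config_text.splitlines())


def userns_limit_ok(limit_text):
    """Return True if the max_user_namespaces value is greater than zero."""
    return int(limit_text.strip()) > 0


def read_text(path):
    """Return the contents of path, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return None


if __name__ == "__main__":
    config = read_text("/boot/config-" + platform.release())
    limit = read_text("/proc/sys/user/max_user_namespaces")
    # None means the file was missing or unreadable on this host.
    print("CONFIG_USER_NS=y:", None if config is None else userns_compiled_in(config))
    print("limit > 0:", None if limit is None else userns_limit_ok(limit))
```

A result of None for the limit means the file is missing, which is a different case from the limit being 0.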
Laurence
Message 5828 - Posted: 11 Feb 2019, 9:24:20 UTC - in response to Message 5822.  

I have cloned and built the latest runc code (from https://github.com/opencontainers/runc) on Debian Stretch, and the task produces this error message:

<core_client_version>7.6.33</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)
</message>
<stderr_txt>
09:10:56 (2772): wrapper (7.7.26015): starting
09:10:56 (2772): wrapper: running ../../projects/lhcathomedev.cern.ch_lhcathome-dev/cranky-0.0.13 ()
cranky-0.0.13 INFO: Starting
cranky-0.0.13 INFO: Detected Theory App
cranky-0.0.13 INFO: Checking CVMFS.
cranky-0.0.13 INFO: Checking runc.
cranky-0.0.13 ERROR: 'runc spec version < 1.1
09:11:02 (2772): cranky exited; CPU time 0.188000
09:11:02 (2772): app exit status: 0x1
09:11:02 (2772): called boinc_finish(195)

</stderr_txt>
]]>

This is the corresponding task:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2752234

The command "runc --version" gives the output: "runc version spec: 1.0.1-dev"

Am I doing something wrong or is the mistake on the application side?


Application side, I will fix this.
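
How cranky actually compares versions is not shown in this thread. Purely as an illustration of why a reported spec version of "1.0.1-dev" fails an "at least 1.1" requirement, here is a minimal Python sketch; the function names are hypothetical, and pre-release suffixes such as "-dev" are simply ignored.

```python
import re


def parse_spec_version(output):
    """Extract the numeric spec version from `runc --version` style output.

    Returns a tuple of ints, e.g. (1, 0, 1), or None if no version is found.
    Suffixes such as "-dev" or "-rc2-dev" are ignored in this sketch.
    """
    match = re.search(r"spec:\s*(\d+(?:\.\d+)*)", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version, minimum=(1, 1)):
    """Compare version tuples element by element."""
    return version is not None and version >= minimum


reported = parse_spec_version("runc version spec: 1.0.1-dev")
print(reported, meets_minimum(reported))
```

Comparing tuples of integers avoids the pitfalls of comparing version strings character by character.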
Laurence
Message 5829 - Posted: 11 Feb 2019, 9:24:58 UTC - in response to Message 5823.  

cranky-0.0.13 ERROR: 'runc spec version < 1.1

The app requires runc to be at least version 1.1.

Have the same situation with opensuse where the standard package is <1.1.


Am testing with opensuse now.
Laurence
Message 5831 - Posted: 11 Feb 2019, 10:56:48 UTC - in response to Message 5828.  

Application side, I will fix this.


ok, I have removed the need for python3, and the runc command is now taken from CVMFS. The guide for CentOS7 is here.

Please give it a try and let me know how it goes.

Cheers,

Laurence
captainjack
Joined: 18 Aug 15
Posts: 14
Credit: 125,335
RAC: 55
Message 5832 - Posted: 11 Feb 2019, 12:03:47 UTC - in response to Message 5831.  

I managed to snag another one of these tasks this morning. It has been running for 58 minutes. Even though it is allocated 2 CPUs, it looks like it is only using 1 CPU.

Please let me know if you need more information.
maeax
Joined: 22 Apr 16
Posts: 672
Credit: 1,900,914
RAC: 5,133
Message 5833 - Posted: 11 Feb 2019, 12:23:56 UTC - in response to Message 5831.  
Last modified: 11 Feb 2019, 12:25:29 UTC

Application side, I will fix this.


ok, I have removed the need for python3, and the runc command is now taken from CVMFS. The guide for CentOS7 is here.

Please give it a try and let me know how it goes.

Cheers,

Laurence

Before changing SL69 to SL610 or SL76:
this is the error on SL69, without Python3 and runc.
<core_client_version>7.5.1</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)
</message>
<stderr_txt>
13:19:19 (27657): wrapper (7.7.26015): starting
13:19:19 (27657): wrapper: running ../../projects/lhcathomedev.cern.ch_lhcathome-dev/cranky-0.0.14 ()
cranky-0.0.14 INFO: Starting
Traceback (most recent call last):
File "../../projects/lhcathomedev.cern.ch_lhcathome-dev/cranky-0.0.14", line 157, in <module>
logging.info("Detected {} App".format(app))
ValueError: zero length field name in format
13:19:20 (27657): cranky exited; CPU time 0.047992
13:19:20 (27657): app exit status: 0x1
13:19:20 (27657): called boinc_finish(195)
</stderr_txt>
]]>
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2752250
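
For reference, the ValueError in the traceback is what older Python versions (2.6, which as far as I know is the default on SL6) raise for an auto-numbered "{}" field. A minimal sketch of forms that avoid it; this is an illustration only, not the actual change made to cranky:

```python
app = "Theory"

# Auto-numbered "{}" fields need Python 2.7, or 3.1 and later.
# An explicit index is accepted by Python 2.6 as well.
message = "Detected {0} App".format(app)

# The %-style operator is another form that works on all of these versions.
legacy_message = "Detected %s App" % app

print(message)
print(legacy_message)
```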
m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 101
Message 5834 - Posted: 11 Feb 2019, 12:55:19 UTC - in response to Message 5827.  
Last modified: 11 Feb 2019, 13:30:41 UTC

By default the maximum number of namespaces is set to 0. To fix this run:
echo 640 > /proc/sys/user/max_user_namespaces

I think that this only applies to kernel versions 644 and up; mine is 514, so it should be OK.
I had tried this:-

sudo sysctl user.max_user_namespaces=15000

and got "... /proc/sys/user/max_user_namespaces" "no such file or directory", so I assumed it didn't apply - yet.
From here:-
For background information, see this bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1340238

In short, on RHEL 7.3 and later starting with kernel version kernel-3.10.0-644.el7 and later, (viewable with grubby --default-kernel) if you're using the --userns-remap parameter in docker config to change the container namespace, you must also ensure that value found in /proc/sys/user/max_user_namespaces is set to a value greater than zero. Zero is its default setting

To reset this value, you can use a call like:

sysctl user.max_user_namespaces=15000


runc here is version 1.0.0. If the setup is to be done from CVMFS do I need to remove the installed version?

I don't see why I can't get "python3" to go to "python3.6" but if it's to be fixed that's great --- lots to learn.

The host is running an Atlas native at the moment so I'll need to wait for it to finish... the job starts again from the beginning if I even think about suspending it... (that's something else that needs sorting out...)
Laurence
Message 5835 - Posted: 11 Feb 2019, 13:04:02 UTC - in response to Message 5832.  

I managed to snag another one of these tasks this morning. It has been running for 58 minutes. Even though it is allocated 2 CPUs, it looks like it is only using 1 CPU.

Please let me know if you need more information.


This may be related to it starting two processes.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 478
Credit: 394,720
RAC: 261
Message 5836 - Posted: 11 Feb 2019, 13:08:37 UTC

1st task finished successfully on my opensuse 42.3 system:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2752265



Environment info:

python --version
Python 2.7.13

python3 --version
Python 3.4.6

runc --version
runc version spec: 1.0.0-rc2-dev


2nd task is currently in progress.
Laurence
Message 5837 - Posted: 11 Feb 2019, 13:17:59 UTC - in response to Message 5833.  
Last modified: 11 Feb 2019, 13:18:10 UTC


Before changing SL69 to SL610 or SL76.

Thanks for testing but it will not work on SL6 due to the lack of user namespaces. I would recommend using CentOS 7 rather than SL7.
Laurence
Message 5838 - Posted: 11 Feb 2019, 13:26:11 UTC - in response to Message 5834.  


I think that this only applies to kernel versions 644 and up; mine is 514, so it should be OK.
I had tried this:-

sudo sysctl user.max_user_namespaces=15000

and got "... /proc/sys/user/max_user_namespaces" "no such file or directory", so I assumed it didn't apply - yet.

Thanks, that is interesting information.

runc here is version 1.0.0. If the setup is to be done from CVMFS do I need to remove the installed version?

It should not matter if it is already on the system. The binary from CVMFS will be used.

I don't see why I can't get "python3" to go to "python3.6" but if it's to be fixed that's great --- lots to learn.

The python command does not seem to be consistent between different operating systems. I am now relying on just python being there, but this is not true for Ubuntu, where the default seems to be python3.
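
One way to cope with the differing command names is to try several candidates in order. The following is a minimal sketch, not what cranky does; the candidate list and the function name are assumptions for illustration. Note that shutil.which itself needs Python 3.3 or later, so a launcher meant to run under Python 2 would need a different lookup.

```python
import shutil


def find_interpreter(candidates=("python3", "python3.6", "python"),
                     which=shutil.which):
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = which(name)
        if path is not None:
            return path
    return None


print(find_interpreter())
```

The `which` parameter exists only so the lookup can be replaced in a test.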
Laurence
Message 5839 - Posted: 11 Feb 2019, 13:28:39 UTC - in response to Message 5836.  

1st task finished successfully on my opensuse 42.3 system:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2752265

Great!


runc --version
runc version spec: 1.0.0-rc2-dev

This we don't need anymore.

I will add the OpenSuse guide here later once I have managed to run it myself from a freshly installed machine.
gyllic
Joined: 10 Mar 17
Posts: 40
Credit: 108,345
RAC: 0
Message 5840 - Posted: 11 Feb 2019, 13:29:46 UTC
Last modified: 11 Feb 2019, 13:30:09 UTC

Now it shows the user namespace error also on Debian Stretch:

cranky-0.0.14 INFO: Starting
cranky-0.0.14 INFO: Detected Theory App
cranky-0.0.14 INFO: Checking CVMFS.
cranky-0.0.14 INFO: Checking runc.
cranky-0.0.14 INFO: Creating the filesystem.
cranky-0.0.14 INFO: Using /cvmfs/cernvm-prod.cern.ch/cvm3
cranky-0.0.14 INFO: Updating config.json.
cranky-0.0.14 INFO: Running Container 'runc'.
nsenter: failed to unshare user namespace: Operation not permitted
container_linux.go:336: starting container process caused "process_linux.go:279: running exec setns process for init caused \"exit status 39\""
cranky-0.0.14 ERROR: Container 'runc' failed.
13:52:12 (2366): cranky exited; CPU time 0.212000
13:52:12 (2366): app exit status: 0xce
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2752261

The "/proc/sys/user/max_user_namespaces" file shows "31300". Also, the flag in the .config file used for kernel building shows "CONFIG_USER_NS=y". Do you know of any way to fix this problem on Debian Stretch (kernel 4.9)?
Laurence
Message 5841 - Posted: 11 Feb 2019, 13:34:15 UTC - in response to Message 5840.  
Last modified: 11 Feb 2019, 13:34:48 UTC

Now it shows the user namespace error also on Debian Stretch:

The "/proc/sys/user/max_user_namespaces" file shows "31300". Also, the flag in the .config file used for kernel building shows "CONFIG_USER_NS=y". Do you know of any way to fix this problem on Debian Stretch (kernel 4.9)?


Try this command for testing:
unshare -U /bin/bash
computezrmle
Message 5842 - Posted: 11 Feb 2019, 13:42:55 UTC

I guess we are running real scientific simulations, e.g. Herwig++, rather than dummy subtasks, right?
Hence the different runtimes of the tasks.

Could the stderr.txt be made more verbose to show the scientific app?
Would be helpful in case of errors.
gyllic
Message 5843 - Posted: 11 Feb 2019, 13:44:40 UTC - in response to Message 5841.  

Try this command for testing:
unshare -U /bin/bash
The output is:
unshare: unshare failed: Operation not permitted

Running it with sudo gives a new prompt in the terminal:
nobody@debian:
where "nobody" is not my normal user.
m
Message 5844 - Posted: 11 Feb 2019, 14:42:35 UTC - in response to Message 5838.  
Last modified: 11 Feb 2019, 15:16:17 UTC

The python command does not seem to be consistent between different operating systems. I am now relying on just python being there, but this is not true for Ubuntu, where the default seems to be python3.

On CentOS 7, python gets you python2.7, which the rest of the system needs. After simply installing python3.x,
you seem to need "python3.x" (note the dot); "python3x" doesn't work, not for me, anyway.

/usr/bin/env python
Python 2.7.5 (default, Nov 6 2016, 00:28:07)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

/usr/bin/env python3
/usr/bin/env: python3: No such file or directory

/usr/bin/env python3.6
Python 3.6.7 (default, Dec 5 2018, 15:02:05)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Type "help", "copyright", "credits" or "license" for more information.

.... trying to get an app_config to get this host to only download one task at a time, but I can't even get that right now..... says "missing start tag", looks OK to me; aaaaarrrgggh.
captainjack
Message 5845 - Posted: 11 Feb 2019, 15:33:13 UTC

m wrote:
.... trying to get an app_config to get this host to only download one task at a time, but I can't even get that right now..... says "missing start tag"

An app_config will not control the number of tasks that get downloaded; it controls the number of tasks that run concurrently. If you want to limit the number of tasks that get downloaded, use the "Max # Jobs" setting in your project preferences.

If you still want to use an app_config for another purpose, post it here and maybe someone else here can help debug.
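
As a starting point for debugging, a generic XML parser can at least confirm that the file is well-formed. This is a minimal Python sketch; APP_NAME_HERE is a placeholder because the app's short name is not given in this thread, and the BOINC client uses its own parser, so passing this check does not guarantee that the client accepts the file.

```python
import xml.etree.ElementTree as ET

# APP_NAME_HERE is a placeholder, not the real short name of the app.
APP_CONFIG = """<app_config>
  <app>
    <name>APP_NAME_HERE</name>
    <max_concurrent>1</max_concurrent>
  </app>
</app_config>
"""


def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        return False, str(error)
    return True, None


print(check_well_formed(APP_CONFIG))
# A missing closing </app> tag is reported as a parse error.
print(check_well_formed("<app_config><app></app_config>"))
```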
Laurence
Message 5846 - Posted: 11 Feb 2019, 16:19:41 UTC - in response to Message 5842.  

I guess we are running real scientific simulations, e.g. Herwig++, rather than dummy subtasks, right?
Hence the different runtimes of the tasks.

Could the stderr.txt be made more verbose to show the scientific app?
Would be helpful in case of errors.

You can find the job log in a subdirectory of the slot directory. Creating a hard link to it should keep it available after the job ends. If you let me know what info you would like to see, I can make it available.
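
The hard link idea can be sketched in Python as follows. The file names here are temporary stand-ins, since the actual log name and slot path are not given in this thread, and a hard link only works when both names are on the same filesystem.

```python
import os
import tempfile


def keep_copy(log_path, keep_path):
    """Create a hard link so the data survives removal of the original name."""
    os.link(log_path, keep_path)


# Self-contained demonstration with temporary stand-in files.
with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "job.log")      # hypothetical log name
    kept = os.path.join(workdir, "job.log.keep")     # hypothetical link name
    with open(original, "w") as handle:
        handle.write("example log line\n")
    keep_copy(original, kept)
    os.remove(original)    # stands in for the slot directory being cleaned up
    with open(kept) as handle:
        surviving = handle.read()

print(surviving)
```

The link has to be created while the job is still running, because it needs the original file to exist.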


©2024 CERN