Message boards : News : New native Linux ATLAS application
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 0
Message 4718 - Posted: 25 Feb 2017, 1:35:46 UTC
Last modified: 25 Feb 2017, 2:04:46 UTC

After waiting for my unmetered data period to start, installed cvmfs only to find that there isn't any work... so can't try it out... serves me right.
However, running chksetup produces this:-

$ cvmfs_config chksetup
Warning: failed to access http://cvmfs-stratum-one.cern.ch/cvmfs/atlas.cern.ch/.cvmfspublished through proxy DIRECT
Warning: failed to use Geo-API with cvmfs-stratum-one.cern.ch


Is this a problem?

Probe works OK:-

$ cvmfs_config probe
Probing /cvmfs/atlas.cern.ch... OK
Probing /cvmfs/atlas-condb.cern.ch... OK
Probing /cvmfs/grid.cern.ch... OK


Running $ sudo cvmfs_talk host.probe
gives:-
atlas.cern.ch:
unknown command
atlas-condb.cern.ch:
unknown command
grid.cern.ch:
unknown command


and

$ sudo cvmfs_talk host.probe.geo
gives:-
atlas.cern.ch:
Seems like CernVM-FS is not running in /var/lib/cvmfs/shared (not found: /var/lib/cvmfs/shared/cvmfs_io.atlas.cern.ch)

atlas-condb.cern.ch:
Seems like CernVM-FS is not running in /var/lib/cvmfs/shared (not found: /var/lib/cvmfs/shared/cvmfs_io.atlas-condb.cern.ch)

grid.cern.ch:
Seems like CernVM-FS is not running in /var/lib/cvmfs/shared (not found: /var/lib/cvmfs/shared/cvmfs_io.grid.cern.ch)


The three files are there but 0 bytes.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 482
Credit: 394,720
RAC: 0
Message 4719 - Posted: 25 Feb 2017, 8:32:17 UTC - in response to Message 4718.  

$ cvmfs_config chksetup
Warning: failed to access http://cvmfs-stratum-one.cern.ch/cvmfs/atlas.cern.ch/.cvmfspublished through proxy DIRECT
Warning: failed to use Geo-API with cvmfs-stratum-one.cern.ch

?
You may check the config files (typos, ...) and/or your firewall.




Execute a "cvmfs_config probe" first to avoid unmount-timeouts.
Use the correct syntax (without the ".")

OK:
cvmfs_talk host probe
cvmfs_talk host probe geo

Not OK:
cvmfs_talk host.probe
cvmfs_talk host.probe.geo
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 0
Message 4721 - Posted: 25 Feb 2017, 12:06:55 UTC - in response to Message 4719.  
Last modified: 25 Feb 2017, 12:20:42 UTC

A couple of hours later it was OK. Whilst it may have been a timeout
(didn't think about that, thanks for the reminder),
I suspect a couple of other hosts running Atlas with the 200M downloads
may have had something to do with it (they wait for the free data time, too).
That's why I no longer run Atlas regularly.

~$ sudo cvmfs_config probe
Probing /cvmfs/atlas.cern.ch... OK
Probing /cvmfs/atlas-condb.cern.ch... OK
Probing /cvmfs/grid.cern.ch... OK
~$ sudo cvmfs_config chksetup
OK
~$


Hope there's work tonight...
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 4725 - Posted: 27 Feb 2017, 9:08:04 UTC

Hi,

Thanks for all the feedback! I'll try to answer some of the questions.

These tasks are deliberately short, running for a few minutes. They process 5 events at a time whereas normal ATLAS tasks process 50 events. I realise a 200MB download for a task of a few minutes is a bit too much to ask... Now that I see some successful tasks I will submit longer-running tasks.

I saw some tasks marked as successful when in fact the validator did not pass the result, so I need to check what happened there. The quick way to know if your WU is successful is to check for a file like HITS.123.xyz.pool.root.1 in the directory listing at the end of the task.
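
As a rough sketch of that check (not the project's validator): assuming the directory listing from the end of the task has been saved or piped as plain text, a small filter can look for a name of the form HITS.*.pool.root.N. The function name listing_has_hits and the file name task.log are made up for illustration.

```shell
# Hypothetical helper: read a directory listing on stdin and succeed
# if any line contains a name like HITS.<something>.pool.root.<number>.
listing_has_hits() {
    grep -Eq 'HITS\..+\.pool\.root\.[0-9]+'
}

# Example use (task.log is a placeholder for wherever the listing ends up):
# listing_has_hits < task.log && echo "HITS file present"
```

The function only reports whether such a name appears; it says nothing about whether the file's contents are valid.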

I will make a post in the ATLAS forum with updated instructions for cvmfs including setting up the proxy.

There are several reasons we are trying this app: firstly removing the need for virtualbox removes the need to install one more piece of software (ok, you need cvmfs instead) and removes a common source of errors. The vboxwrapper has gotten a lot more stable recently but it still causes problems which are very hard to debug. I do not expect a huge performance gain from running natively but even a few percent would be nice.

Probably the most important reason is that it allows us to run on resources which are already virtualised. Today we cannot run a VM inside another VM, and many LHC computing farms (including CERN itself) run a virtualised infrastructure. A native app would allow us to use the spare capacity of all those CPUs.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 861,475
RAC: 15
Message 4728 - Posted: 27 Feb 2017, 10:30:31 UTC - in response to Message 4725.  

Probably the most important reason is that it allows us to run on resources which are already virtualised. Today we cannot run a VM inside another VM, and many LHC computing farms (including CERN itself) run a virtualised infrastructure. A native app would allow us to use the spare capacity of all those CPUs.

That's exactly the reason I could not run ATLAS on my Linux VM.
One small problem: My Linux VM has 14 cores, but a (very) small disk and I cannot run more than a few ATLAS-tasks because of this.

lhcathome-dev 27 Feb 11:21:19 Message from server: ATLAS Simulation needs 439.16MB more disk space. You currently have 5282.89 MB available and it needs 5722.05 MB.

I suppose the required 5.6GB is based on the use of VirtualBox. Is this size still required for native running when CVMFS is already installed?
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 4729 - Posted: 27 Feb 2017, 11:08:07 UTC - in response to Message 4728.  

I suppose the required 5.6GB is based on the use of VirtualBox. Is this size still required for native running when CVMFS is already installed?


Very good point. The 5.6GB is mostly used by the CVMFS cache and operating system stuff inside the image, which of course is not needed in the native app. I've reduced the disk requirement to 1GB for new WU, it could potentially be reduced further but let's see how that goes.
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 4730 - Posted: 27 Feb 2017, 11:11:59 UTC

I found that the WU were not running properly on Ubuntu machines due to one of our scripts assuming bash shell (from what I see dash is the default on Ubuntu). I've fixed the script and will send some new WU now.
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 0
Message 4731 - Posted: 27 Feb 2017, 11:36:56 UTC - in response to Message 4730.  
Last modified: 27 Feb 2017, 12:30:45 UTC

I found that the WU were not running properly on Ubuntu machines due to one of our scripts assuming bash shell (from what I see dash is the default on Ubuntu). I've fixed the script and will send some new WU now.

Task ran for ca 12min and failed to validate. This is Ubuntu 12.04
and bash is installed... never heard of dash.

$ bash --version
GNU bash, version 4.2.25(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>


I'll try another, but after that I must wait until tonight.

EDIT. Failed, although it validated; I think I still got the old version but that will have to do for now.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 861,475
RAC: 15
Message 4732 - Posted: 27 Feb 2017, 12:10:06 UTC - in response to Message 4725.  

The quick way to know if your WU is successful is to check for a file like HITS.123.xyz.pool.root.1 in the directory listing at the end of the task.

All tasks I returned were running for 11-12 minutes and mostly validated OK by BOINC.
But I looked into the slot where a WU was running and, towards the end of the job, I didn't see the root.1 file you mentioned.
What could be wrong?
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 482
Credit: 394,720
RAC: 0
Message 4733 - Posted: 27 Feb 2017, 12:12:25 UTC - in response to Message 4731.  

Same here.
The jobs stay idle and quit after 10-12 minutes.
Sometimes with success (why?), sometimes with validation error.

My hosts run opensuse 13.1 with bash 4.2.53 as default shell.
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 4734 - Posted: 27 Feb 2017, 12:35:42 UTC - in response to Message 4733.  

The problem here is incompatible SSL libraries.

Traceback (most recent call last):
  File "pilot3/pilot.py", line 24, in <module>
    import glexec_utils
  File "/home/boinc/BOINC/slots/8/pilot3/glexec_utils.py", line 39, in <module>
    import CustomEncoder
  File "/home/boinc/BOINC/slots/8/pilot3/CustomEncoder.py", line 1, in <module>
    import ATLASSiteInformation
  File "/home/boinc/BOINC/slots/8/pilot3/ATLASSiteInformation.py", line 10, in <module>
    import ssl
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/ssl.py", line 60, in <module>
    import _ssl             # if we can't import it, let the error propagate
ImportError: libssl.so.10: cannot open shared object file: No such file or directory


The python that we are using from CVMFS expects to use the SSL library libssl.so.10 but I guess you have a different version on openSUSE. This is one of the disadvantages of CVMFS - packaging all the required software in there means it restricts which operating systems it can run on.

Sometimes with success (why?), sometimes with validation error.


I don't understand this either, I will ask the admins.
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 0
Message 4735 - Posted: 27 Feb 2017, 12:37:55 UTC - in response to Message 4733.  
Last modified: 27 Feb 2017, 12:39:33 UTC


Sometimes with success (why?), sometimes with validation error.
.

I got the idea that validation was based on the run time,
i.e. if the task didn't crash within (say) 12 mins, it was good to go, as it were,
with further validation checks done out of sight of BOINC.
Doesn't sound right to me but the idea came from somewhere... and it does seem to behave rather like that.
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 4736 - Posted: 27 Feb 2017, 12:52:09 UTC - in response to Message 4731.  

Task ran for ca 12min and failed to validate. This is Ubuntu 12.04
and bash is installed... never heard of dash.

$ bash --version
GNU bash, version 4.2.25(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>



bash is normally the default shell for users. But many scripts use /bin/sh which on Ubuntu is redirected to dash shell. You can see what your system uses with "ls -l /bin/sh".
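
The same check can be scripted. This is a small sketch, assuming GNU readlink (for the -f option) is available; the function name sh_target is made up for illustration.

```shell
# Hypothetical helper: print the name of the binary that a path
# (default /bin/sh) finally resolves to after following symlinks.
sh_target() {
    basename "$(readlink -f "${1:-/bin/sh}")"
}

sh_target    # prints e.g. "dash" where /bin/sh is a symlink to dash
```

A script that relies on bash-only features is safer with #!/bin/bash as its first line than with #!/bin/sh.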
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 0
Message 4737 - Posted: 27 Feb 2017, 13:08:16 UTC - in response to Message 4734.  
Last modified: 27 Feb 2017, 13:17:16 UTC

The problem here is incompatible SSL libraries.
The python that we are using from CVMFS expects to use the SSL library libssl.so.10 but I guess you have a different version on openSUSE. This is one of the disadvantages of CVMFS - packaging all the required software in there means it restricts which operating systems it can run on.

There must be quite a few versions in use in the different distributions. The system here - which is supported by CVMFS, they say - has libssl.so.0.9.8 and 1.0.0 (unless they're hidden away somewhere) - a long way from 10. Does this mean it's back to VBox? Or can we simply add the required version? If so, where should it be placed so that your Python can find it?

also...
bash is normally the default shell for users. But many scripts use /bin/sh which on Ubuntu is redirected to dash shell. You can see what your system uses with "ls -l /bin/sh".

You're right

~$ ls -l /bin/sh
lrwxrwxrwx 1 root root 4 Jun 21 2015 /bin/sh -> dash

~$
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 4738 - Posted: 27 Feb 2017, 13:52:44 UTC - in response to Message 4729.  

I suppose the required 5.6GB is based on the use of VirtualBox. Is this size still required for native running when CVMFS is already installed?


Very good point. The 5.6GB is mostly used by the CVMFS cache and operating system stuff inside the image, which of course is not needed in the native app. I've reduced the disk requirement to 1GB for new WU, it could potentially be reduced further but let's see how that goes.


Unfortunately this doesn't work... inside the ATLAS code there is a hard-coded minimum limit of 5GB free disk space. If less than that is available it will fail the task. So I had to put the 6GB limit back... I will see if there is any way to decrease this limit.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 482
Credit: 394,720
RAC: 0
Message 4739 - Posted: 27 Feb 2017, 13:52:54 UTC - in response to Message 4734.  

The problem here is incompatible SSL libraries. ...


The following workaround helps:

ln -s libssl.so.1.0.0 libssl.so.10
ln -s libcrypto.so.1.0.0 libcrypto.so.10
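
A slightly more defensive sketch of the same workaround, for anyone who wants to script it. The post does not say which directory the links go in, so the directory is an argument here and /usr/lib64 in the example is only a guess; link_ssl_compat is a made-up name. It only makes sense where the 1.0.0 libraries are binary compatible with what the CVMFS Python expects.

```shell
# Hypothetical helper: inside directory $1, point libssl.so.10 and
# libcrypto.so.10 at the existing 1.0.0 libraries.
link_ssl_compat() {
    dir=$1
    for lib in libssl libcrypto; do
        # Refuse to create a dangling link if the real library is absent.
        if [ ! -e "$dir/$lib.so.1.0.0" ]; then
            echo "missing $dir/$lib.so.1.0.0" >&2
            return 1
        fi
        # Leave any existing file or link with the target name alone.
        if [ ! -e "$dir/$lib.so.10" ] && [ ! -L "$dir/$lib.so.10" ]; then
            ln -s "$lib.so.1.0.0" "$dir/$lib.so.10"
        fi
    done
}

# Example use (needs root; the directory is an assumption):
# link_ssl_compat /usr/lib64
```

Running it twice is harmless, since existing links are left untouched.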
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 861,475
RAC: 15
Message 4741 - Posted: 27 Feb 2017, 14:41:39 UTC - in response to Message 4738.  

Unfortunately this doesn't work... inside the ATLAS code there is a hard-coded minimum limit of 5GB free disk space. If less than that is available it will fail the task. So I had to put the 6GB limit back... I will see if there is any way to decrease this limit.

OK, not that important at the moment. I've 7.05GB free for BOINC with 1 ATLAS-task running.

The 6GB you mentioned is probably 6000000000 bytes = 5722.05MB, exactly the figure BOINC gave when it refused to fetch new tasks.
Are you sure it's in the ATLAS code and not <rsc_disk_bound>6000000000.000000</rsc_disk_bound> in BOINC's server config for ATLAS?
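
The arithmetic is easy to check, assuming BOINC counts 1 MB as 1024 x 1024 bytes (which is what makes the numbers agree). bytes_to_mb is a made-up helper name.

```shell
# Hypothetical helper: convert a byte count to MB (1 MB = 1048576 bytes),
# rounded to two decimals.
bytes_to_mb() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1048576 }'
}

bytes_to_mb 6000000000    # prints 5722.05
```

That matches the server message quoted earlier in the thread: 5282.89 MB available plus 439.16 MB more is 5722.05 MB.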

I had one task running and 3 minutes into it the whole boinc client crashed, taking 13 other tasks down with it.
Only 4 processes keep on running: sh, time, ARCPilot and python. Very suspicious. After restarting BOINC it happened 2 more times with the same job.

At this very moment it is happening again with a new task.
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 4744 - Posted: 27 Feb 2017, 21:23:22 UTC - in response to Message 4741.  

There is a feature of the ATLAS pilot code where it cleans up orphaned processes before exiting. This is useful on the Grid so that no processes are left behind which ATLAS could be charged for. When I first started testing the Linux app I found that the boinc-client process was being killed at the end of the task due to this reason. Since then the pilot code has been changed so that it doesn't kill this process. Can you send me the output of "ps -o pid,ppid,args -u boinc"? Maybe there is something else being killed by the pilot. If the PPID (2nd column) is 1 then the pilot will kill it.
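
To spot candidates under that rule, the ps output can be filtered for a PPID of 1. This is only a sketch for inspecting the listing, not the pilot's own logic; ppid_one is a made-up name.

```shell
# Hypothetical helper: read "ps -o pid,ppid,args" output on stdin and
# print the lines whose PPID (2nd column) is 1, skipping the header.
ppid_one() {
    awk 'NR > 1 && $2 == 1'
}

# Example use:
# ps -o pid,ppid,args -u boinc | ppid_one
```

On a typical setup the boinc client itself will show up here, since it is started by init; anything else in the output is worth a closer look.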
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 861,475
RAC: 15
Message 4745 - Posted: 27 Feb 2017, 21:43:03 UTC - in response to Message 4744.  
Last modified: 27 Feb 2017, 21:46:21 UTC

Can you send me the output of "ps -o pid,ppid,args -u boinc"? Maybe there is something else being killed by the pilot. If the PPID (2nd column) is 1 then the pilot will kill it.

xxxxxx@rekendoos3:~$ ps -o pid,ppid,args -u boinc
PID PPID COMMAND
1253 1 /usr/bin/boinc --check_all_logins --redirectio --dir /var/lib/boinc-client
4613 1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
4615 4613 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
4622 1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
4624 4622 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
4637 1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
4639 4637 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
4645 1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
4647 4645 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
4649 1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
4651 4649 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
4654 1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
4656 4654 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
4663 1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
4665 4663 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
4675 1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
4677 4675 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
5020 1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
5022 5020 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
5023 1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
5025 5023 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
5029 1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
5031 5029 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
5032 1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
5034 5032 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
5038 1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
5040 5038 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
5043 1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
5045 5043 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 482
Credit: 394,720
RAC: 0
Message 4746 - Posted: 28 Feb 2017, 8:26:47 UTC

Finally I got 1 job running for several hours.
At the end it kills the whole boinc client.
I tried this a second time after a computer restart but the result was the same.
Seems to be the same error that is described below in this thread.

Do you need some logs for analysis?