New native Linux ATLAS application
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0
After waiting for my unmetered data period to start, installed cvmfs only to find that there isn't any work... so can't try it out... serves me right. However, running chksetup produces this:

$ cvmfs_config chksetup
Warning: failed to access http://cvmfs-stratum-one.cern.ch/cvmfs/atlas.cern.ch/.cvmfspublished through proxy DIRECT
Warning: failed to use Geo-API with cvmfs-stratum-one.cern.ch

Is this a problem? Probe works OK:

$ cvmfs_config probe
Probing /cvmfs/atlas.cern.ch... OK
Probing /cvmfs/atlas-condb.cern.ch... OK
Probing /cvmfs/grid.cern.ch... OK

Running "sudo cvmfs_talk host.probe" gives:

atlas.cern.ch: unknown command
atlas-condb.cern.ch: unknown command
grid.cern.ch: unknown command

and "sudo cvmfs_talk host.probe.geo" gives:

atlas.cern.ch: Seems like CernVM-FS is not running in /var/lib/cvmfs/shared (not found: /var/lib/cvmfs/shared/cvmfs_io.atlas.cern.ch)
atlas-condb.cern.ch: Seems like CernVM-FS is not running in /var/lib/cvmfs/shared (not found: /var/lib/cvmfs/shared/cvmfs_io.atlas-condb.cern.ch)
grid.cern.ch: Seems like CernVM-FS is not running in /var/lib/cvmfs/shared (not found: /var/lib/cvmfs/shared/cvmfs_io.grid.cern.ch)

The three files are there but 0 bytes.
Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0
$ cvmfs_config chksetup ?

You may check the config files (typos, ...) and/or your firewall. Execute a "cvmfs_config probe" first to avoid unmount timeouts.

Use the correct syntax (without the "."):

OK:
cvmfs_talk host probe
cvmfs_talk host probe geo

Not OK:
cvmfs_talk host.probe
cvmfs_talk host.probe.geo
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0
A couple of hours later it was OK. It may have been a timeout (didn't think about that, thanks for the reminder), but I suspect a couple of other hosts running ATLAS with the 200MB downloads had something to do with it (they wait for the free data time, too). That's why I no longer run ATLAS regularly.

~$ sudo cvmfs_config probe
Probing /cvmfs/atlas.cern.ch... OK
Probing /cvmfs/atlas-condb.cern.ch... OK
Probing /cvmfs/grid.cern.ch... OK
~$ sudo cvmfs_config chksetup
OK
~$

Hope there's work tonight...
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0
Hi,

Thanks for all the feedback! I'll try to answer some of the questions.

These tasks are deliberately short, running for a few minutes. They process 5 events at a time, whereas normal ATLAS tasks process 50 events. I realise a 200MB download for a few minutes of work is a bit too much to ask... Now that I see some successful tasks I will submit longer-running tasks.

I saw some tasks marked as successful when in fact the validator did not pass the result, so I need to check what happened there. The quick way to know if your WU is successful is to check for a file like HITS.123.xyz.pool.root.1 in the directory listing at the end of the task.

I will make a post in the ATLAS forum with updated instructions for cvmfs, including setting up the proxy.

There are several reasons we are trying this app. Firstly, removing the need for VirtualBox means one less piece of software to install (ok, you need cvmfs instead) and removes a common source of errors; the vboxwrapper has gotten a lot more stable recently, but it still causes problems which are very hard to debug. I do not expect a huge performance gain from running natively, but even a few percent would be nice. Probably the most important reason is that it allows us to run on resources which are already virtualised. Today we cannot run a VM inside another VM, and many LHC computing farms (including CERN itself) run a virtualised infrastructure. A native app would allow us to use the spare capacity of all those CPUs.
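As a concrete illustration of that check, a one-liner that looks for the HITS output across the client's slot directories (the data directory path is the Debian/Ubuntu default and an assumption; the numbers in the file name vary per task):

$ ls /var/lib/boinc-client/slots/*/HITS.*.pool.root.1

If the glob matches nothing towards the end of the task, the task most likely did not produce a valid result.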
Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 15
> Probably the most important reason is that it allows us to run on resources which are already virtualised. Today we cannot run a VM inside another VM, and many LHC computing farms (including CERN itself) run a virtualised infrastructure. A native app would allow us to use the spare capacity of all those CPUs.

That's exactly the reason I could not run ATLAS on my Linux VM.

One small problem: my Linux VM has 14 cores but a (very) small disk, and I cannot run more than a few ATLAS tasks because of this:

lhcathome-dev 27 Feb 11:21:19 Message from server: ATLAS Simulation needs 439.16MB more disk space. You currently have 5282.89 MB available and it needs 5722.05 MB.

I suppose the required 5.6GB is based on the use of VirtualBox. Is this size still required for native running when CVMFS is already installed?
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0
> I suppose the required 5.6GB is based on the use of VirtualBox. Is this size still required for native running when CVMFS is already installed?

Very good point. The 5.6GB is mostly used by the CVMFS cache and operating-system stuff inside the image, which of course is not needed in the native app. I've reduced the disk requirement to 1GB for new WUs; it could potentially be reduced further, but let's see how that goes.
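For a native setup the CVMFS cache lives on the host rather than inside a VM image, and its size is set in the client configuration. A minimal /etc/cvmfs/default.local sketch (the values are illustrative assumptions, not project-mandated settings):

CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch,grid.cern.ch
CVMFS_HTTP_PROXY=DIRECT
CVMFS_QUOTA_LIMIT=4096   # cache size limit in MB

After editing, apply the change with "sudo cvmfs_config reload".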
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0
I found that the WUs were not running properly on Ubuntu machines because one of our scripts assumed the bash shell (from what I see, dash is the default /bin/sh on Ubuntu). I've fixed the script and will send some new WUs now.
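To illustrate the kind of breakage (a generic sketch, not the actual project script): a script with a #!/bin/sh shebang that uses bash-only syntax works where /bin/sh is bash but fails under dash:

#!/bin/sh
# Bash-only test syntax: dash aborts here with "[[: not found"
if [[ -f output.root ]]; then
    echo "output found"
fi

# Portable POSIX form that both shells accept:
if [ -f output.root ]; then
    echo "output found"
fi

The alternative fix is to declare #!/bin/bash explicitly in the script.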
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0
> I found that the WUs were not running properly on Ubuntu machines because one of our scripts assumed the bash shell (from what I see, dash is the default /bin/sh on Ubuntu). I've fixed the script and will send some new WUs now.

Task ran for ca 12 min and failed to validate. This is Ubuntu 12.04 and bash is installed... never heard of dash.

$ bash --version
GNU bash, version 4.2.25(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

I'll try another, but after that I must wait until tonight.

EDIT: Failed, although it validated; I think I still got the old version, but that will have to do for now.
Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 15
> The quick way to know if your WU is successful is to check for a file like HITS.123.xyz.pool.root.1 in the directory listing at the end of the task.

All the tasks I returned ran for 11-12 minutes and were mostly validated OK by BOINC. But I looked into the slot where a WU was running and towards the end of the job I didn't see the root.1 file you mentioned. What could be wrong?
Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0
Same here. The jobs stay idle and quit after 10-12 minutes, sometimes with success (why?), sometimes with a validation error. My hosts run openSUSE 13.1 with bash 4.2.53 as the default shell.
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0
The problem here is incompatible SSL libraries.

Traceback (most recent call last):
  File "pilot3/pilot.py", line 24, in <module>
    import glexec_utils
  File "/home/boinc/BOINC/slots/8/pilot3/glexec_utils.py", line 39, in <module>
    import CustomEncoder
  File "/home/boinc/BOINC/slots/8/pilot3/CustomEncoder.py", line 1, in <module>
    import ATLASSiteInformation
  File "/home/boinc/BOINC/slots/8/pilot3/ATLASSiteInformation.py", line 10, in <module>
    import ssl
  File "/cvmfs/atlas.cern.ch/repo/sw/software/x86_64-slc6-gcc47-opt/19.2.4/sw/lcg/external/Python/2.7.3/x86_64-slc6-gcc47-opt/lib/python2.7/ssl.py", line 60, in <module>
    import _ssl # if we can't import it, let the error propagate
ImportError: libssl.so.10: cannot open shared object file: No such file or directory

The Python that we are using from CVMFS expects the SSL library libssl.so.10, but I guess you have a different version on openSUSE. This is one of the disadvantages of CVMFS: packaging all the required software in there restricts which operating systems it can run on.

> Sometimes with success (why?), sometimes with validation error.

I don't understand this either; I will ask the admins.
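To check which libssl versions a host actually provides, the standard glibc tooling works on any distribution; the output line here is an example from an openSUSE-like system, not a guaranteed result:

$ ldconfig -p | grep libssl
        libssl.so.1.0.0 (libc6,x86-64) => /lib64/libssl.so.1.0.0

The point is that libssl.so.10 (the SL6/Red Hat naming that the CVMFS Python build expects) is absent.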
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0
I got the idea that validation was based on the run time, i.e. if the task didn't crash within (say) 12 mins it was good to go, with further validation checks done out of sight of BOINC. Doesn't sound right to me, but the idea came from somewhere... and it does seem to behave rather like that.
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0
> Task ran for ca 12 min and failed to validate. This is Ubuntu 12.04 and bash is installed...

bash is normally the default shell for users, but many scripts use /bin/sh, which on Ubuntu is redirected to the dash shell. You can see what your system uses with "ls -l /bin/sh".
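For completeness (and only if you want to change it; the script fix makes this unnecessary), Debian/Ubuntu lets you repoint /bin/sh interactively:

$ sudo dpkg-reconfigure dash    # answering "No" points /bin/sh back at bash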
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0
> The problem here is incompatible SSL libraries.

There must be quite a few versions in use in the different distributions. The system here - which is supported by CVMFS, they say - has libssl.so.0.9.8 and 1.0.0 (unless they're hidden away somewhere) - a long way from 10. Does this mean it's back to VBox, or can we simply add the required version? If so, where should it be placed so that your Python can find it?

also...

> bash is normally the default shell for users, but many scripts use /bin/sh, which on Ubuntu is redirected to the dash shell. You can see what your system uses with "ls -l /bin/sh".

You're right:

~$ ls -l /bin/sh
lrwxrwxrwx 1 root root 4 Jun 21 2015 /bin/sh -> dash
~$
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0
> I suppose the required 5.6GB is based on the use of VirtualBox. Is this size still required for native running when CVMFS is already installed?

Unfortunately this doesn't work... inside the ATLAS code there is a hard-coded minimum limit of 5GB free disk space; if less than that is available, it will fail the task. So I had to put the 6GB limit back... I will see if there is any way to decrease this limit.
Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0
> The problem here is incompatible SSL libraries. ...

The following workaround helps:

ln -s libssl.so.1.0.0 libssl.so.10
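Spelled out with paths (the library directory is an assumption for a 64-bit openSUSE host; adjust to wherever libssl.so.1.0.0 lives on your system):

$ cd /usr/lib64
$ sudo ln -s libssl.so.1.0.0 libssl.so.10
$ sudo ldconfig    # refresh the linker cache so the new name is found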
Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 15
> Unfortunately this doesn't work... inside the ATLAS code there is a hard-coded minimum limit of 5GB free disk space; if less than that is available, it will fail the task. So I had to put the 6GB limit back... I will see if there is any way to decrease this limit.

OK, not that important at the moment. I have 7.05GB free for BOINC with 1 ATLAS task running. The 6GB you mentioned is probably 6,000,000,000 bytes (= 5722.05 MB, dividing twice by 1024), exactly the amount BOINC mentioned when it was not willing to get new tasks. Are you sure it's in the ATLAS code and not <rsc_disk_bound>6000000000.000000</rsc_disk_bound> in BOINC's server config for ATLAS?

I had one task running, and 3 minutes into it the whole boinc client crashed, taking 13 other tasks down with it. Only 4 processes keep running: sh, time, ARCPilot and python. Very suspicious. After restarting BOINC it happened 2 more times with the same job. At this very moment it is happening again with a new task.
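A quick way to confirm which disk bound the client actually received is to look at the task's entry in client_state.xml in the BOINC data directory (the path assumes the Debian/Ubuntu package layout):

$ grep rsc_disk_bound /var/lib/boinc-client/client_state.xml
<rsc_disk_bound>6000000000.000000</rsc_disk_bound>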
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0
The ATLAS pilot code has a feature where it cleans up orphaned processes before exiting. This is useful on the Grid so that no processes are left behind that ATLAS could be charged for. When I first started testing the Linux app I found that the boinc-client process was being killed at the end of the task for exactly this reason. Since then the pilot code has been changed so that it doesn't kill this process.

Can you send me the output of "ps -o pid,ppid,args -u boinc"? Maybe there is something else being killed by the pilot. If the PPID (2nd column) is 1, then the pilot will kill it.
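To preview which processes such a cleanup would treat as orphans, the same ps listing can be filtered for PPID 1 (an illustrative one-liner, not the pilot's actual code):

$ ps -o pid,ppid,args -u boinc | awk '$2 == 1'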
Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 15
> Can you send me the output of "ps -o pid,ppid,args -u boinc"? Maybe there is something else being killed by the pilot. If the PPID (2nd column) is 1, then the pilot will kill it.

xxxxxx@rekendoos3:~$ ps -o pid,ppid,args -u boinc
 PID  PPID COMMAND
1253     1 /usr/bin/boinc --check_all_logins --redirectio --dir /var/lib/boinc-client
4613  1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
4615  4613 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
4622  1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
4624  4622 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
4637  1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
4639  4637 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
4645  1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
4647  4645 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
4649  1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
4651  4649 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
4654  1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
4656  4654 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
4663  1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
4665  4663 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
4675  1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
4677  4675 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
5020  1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
5022  5020 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
5023  1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
5025  5023 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
5029  1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
5031  5029 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
5032  1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
5034  5032 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
5038  1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
5040  5038 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
5043  1253 ../../projects/boinc.drugdiscoveryathome.com/wrapper_26014_x86_64-pc-linux-gnu
5045  5043 smina_mol2 --cpu 1 --config conf.txt --ligand smina_mol2_data.sdf --out smina_mol2_out.mol2
Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0
Finally I got 1 job running for several hours. At the end it killed the whole boinc client. I tried this a second time after a computer restart, but the result was the same. It seems to be the same error that is described elsewhere in this thread. Do you need some logs for analysis?
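Since the client in the ps listing above was started with --redirectio, its stdout and stderr go to stdoutdae.txt and stderrdae.txt in the BOINC data directory, so those would be the natural logs to capture (assuming the same package layout as in that listing):

$ tail -n 50 /var/lib/boinc-client/stdoutdae.txt /var/lib/boinc-client/stderrdae.txt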