New Experimental ATLAS Application
Author | Message |
---|---|
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
A new experimental ATLAS application, following a similar approach to the Theory application, is now available for evaluation. The jobs are real ATLAS jobs, but for now the output will not be used, so if any of you would actually like to crunch for ATLAS, please do so via their production project: http://atlasathome.cern.ch/ For the rest of you who like to provide brain cycles rather than compute cycles, feel free to try it and provide feedback. Don't forget that you can adjust which applications you would like to run via the project preferences: http://lhcathomedev.cern.ch/vLHCathome-dev/prefs.php?subset=project |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
No Atlas tasks available. |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
Patience. We are just checking a few things before sending out some tasks. |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
The first 50 tasks are available, get them while you can. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
First impressions: Show graphics-no output. Dominant process athena.py (res mem 1.3g? virt mem 2149m?). 2GB mem used to the full, some swapfile as well (currently 85MB). Mem seems a bit tight. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18 |
The ATLAS production project needs 2241MB memory for the VM. In the past it was raised to that from 2048MB, for a reason, I suppose. |
Send message Joined: 9 Apr 15 Posts: 57 Credit: 230,221 RAC: 0 |
Show graphics-no output.
Have a look at localhost:port/logs. Mine started doing this every 2 minutes:
04/18/16 13:14:43 ** condor_starter (CONDOR_STARTER) STARTING UP
04/18/16 13:14:43 ** /usr/sbin/condor_starter
04/18/16 13:14:43 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
04/18/16 13:14:43 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
04/18/16 13:14:43 ** $CondorVersion: 8.0.6 Feb 01 2014 BuildID: 225363 $
04/18/16 13:14:43 ** $CondorPlatform: x86_64_RedHat6 $
04/18/16 13:14:43 ** PID = 31388
04/18/16 13:14:43 ** Log last touched 4/18 13:14:42
04/18/16 13:14:43 ******************************************************
04/18/16 13:14:43 Using config source: /etc/condor/condor_config
04/18/16 13:14:43 Using local config sources:
04/18/16 13:14:43 /etc/condor/config.d/10_security.config
04/18/16 13:14:43 /etc/condor/config.d/14_network.config
04/18/16 13:14:43 /etc/condor/config.d/20_workernode.config
04/18/16 13:14:43 /etc/condor/config.d/30_lease.config
04/18/16 13:14:43 /etc/condor/config.d/35_atlas.config
04/18/16 13:14:43 /etc/condor/config.d/40_ccb.config
04/18/16 13:14:43 /etc/condor/condor_config.local
04/18/16 13:14:43 Daemon Log is logging: D_ALWAYS D_ERROR
04/18/16 13:14:43 DaemonCore: command socket at <10.0.2.15:33345?noUDP>
04/18/16 13:14:43 DaemonCore: private command socket at <10.0.2.15:33345>
04/18/16 13:14:43 ERROR: Could not open canonicalization file '/etc/condor/certificate_mapfile' (No such file or directory)
04/18/16 13:14:44 CCBListener: heartbeat disabled because interval is configured to be 0
04/18/16 13:14:44 CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#9181
04/18/16 13:14:44 Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=6941_4ff3_269122>
04/18/16 13:14:44 Submitting machine is "alicondorce01.cern.ch"
04/18/16 13:14:45 setting the orig job name in starter
04/18/16 13:14:45 setting the orig job iwd in starter
04/18/16 13:14:45 Job has WantIOProxy=true
04/18/16 13:14:45 Initialized IO Proxy.
04/18/16 13:14:45 Done setting resource limits
04/18/16 13:14:45 condor_write(): Socket closed when trying to write 53 bytes to daemon at <10.0.2.15:54469>, fd is 14
04/18/16 13:14:45 Buf::write(): condor_write() failed
04/18/16 13:14:45 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.0.2.15:54469> (try 1 of 3): CEDAR:6002:failed to send EOM
04/18/16 13:14:45 File transfer completed successfully.
04/18/16 13:14:46 Job 268269.0 set to execute immediately
04/18/16 13:14:46 Starting a VANILLA universe job with ID: 268269.0
04/18/16 13:14:46 IWD: /var/lib/condor/execute/dir_31388
04/18/16 13:14:46 Output file: /var/lib/condor/execute/dir_31388/_condor_stdout
04/18/16 13:14:46 Error file: /var/lib/condor/execute/dir_31388/_condor_stderr
04/18/16 13:14:46 Renice expr "10" evaluated to 10
04/18/16 13:14:46 Using wrapper /usr/local/bin/job-wrapper to exec /var/lib/condor/execute/dir_31388/condor_exec.exe
04/18/16 13:14:46 Setting job's virtual memory rlimit to 0 megabytes
04/18/16 13:14:46 Running job as user nobody
04/18/16 13:14:46 Create_Process succeeded, pid=31395
04/18/16 13:16:31 Process exited, pid=31395, status=0
04/18/16 13:16:39 Got SIGQUIT. Performing fast shutdown.
04/18/16 13:16:39 ShutdownFast all jobs.
04/18/16 13:16:39 **** condor_starter (condor_STARTER) pid 31388 EXITING WITH STATUS 0
But now it's picked up a job that's been running 20 mins so far. It seems to be close on available memory, pagefile at 125M and climbing throughout the run. |
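For anyone who wants to skim a starter log like the one above without reading every line, here is a minimal Python sketch that pulls out the entries that look like problems. The line shape (`MM/DD/YY HH:MM:SS message`) is taken from the excerpt; the keyword list and the function name `find_problems` are assumptions made for illustration, not anything Condor itself provides.

```python
import re
from datetime import datetime

# Assumed line shape, based on the excerpt in this thread:
# "MM/DD/YY HH:MM:SS message"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")

# Hypothetical keyword list; adjust it to whatever you consider a problem.
PROBLEM_MARKERS = ("ERROR", "failed", "PERMISSION DENIED")


def find_problems(lines):
    """Return (timestamp, message) pairs for lines containing a problem marker."""
    problems = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a timestamped log entry
        stamp = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
        message = match.group(2)
        if any(marker in message for marker in PROBLEM_MARKERS):
            problems.append((stamp, message))
    return problems


# A few lines copied from the log posted above.
sample = [
    "04/18/16 13:14:43 ERROR: Could not open canonicalization file "
    "'/etc/condor/certificate_mapfile' (No such file or directory)",
    "04/18/16 13:14:45 Buf::write(): condor_write() failed",
    "04/18/16 13:14:45 File transfer completed successfully.",
    "04/18/16 13:16:31 Process exited, pid=31395, status=0",
]

for stamp, message in find_problems(sample):
    print(stamp.isoformat(), message)
```

Note that a plain substring match is crude: a line such as `Daemon Log is logging: D_ALWAYS D_ERROR` would also be flagged, so treat the result as a starting point rather than a diagnosis.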
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
Why 13:14:43 in the logs? They may be stale messages from the original image build. The application was not released until 13:48:41 UTC. |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
I was going to check myself after Rasputin42's comment. Will update it tomorrow. Why 2241? |
Send message Joined: 9 Apr 15 Posts: 57 Credit: 230,221 RAC: 0 |
Why 13:14:43 in the logs? They may be stale messages from the original image build. The application was not released until 13:48:41 UTC.
CONDOR seems to like Pacific rather than UTC! 70 mins into the job now, the swap is at 200M.
[edit] My first run ended, but now the log says:
04/18/16 13:16:44 Running job as user nobody
04/18/16 13:16:44 Create_Process succeeded, pid=4668
04/18/16 14:53:28 Process exited, pid=4668, status=0
04/18/16 14:54:02 condor_write(): Socket closed when trying to write 65536 bytes to daemon at <188.184.187.167:9618>, fd is 14
04/18/16 14:54:02 ReliSock::put_bytes_nobuffer: Send failed.
04/18/16 14:54:02 ReliSock::put_file: failed to put 65536 bytes (put_bytes_nobuffer() returned -1)
04/18/16 14:54:02 DoUpload: STARTER at 10.0.2.15 failed to send file(s) to <188.184.187.167:9618>: error sending /var/lib/condor/execute/dir_4661/EVNT.06480895._029668.pool.root.1
04/18/16 14:54:02 File transfer failed, forcing disconnect.
04/18/16 14:54:02 Returning from CStarter::JobReaper()
04/18/16 14:55:03 PERMISSION DENIED to submit-side@matchsession from host 188.184.187.167 for command 1200 (CA_CMD), access level WRITE: reason: WRITE authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 188.184.187.167,alicondorce01.cern.ch, hostname size = 1, original ip address = 188.184.187.167
04/18/16 14:55:11 PERMISSION DENIED to submit-side@matchsession from host 188.184.187.167 for command 1200 (CA_CMD), access level WRITE: reason: cached result for WRITE; see first case for the full reason
04/18/16 14:55:28 PERMISSION DENIED to submit-side@matchsession from host 188.184.187.167 for command 1200 (CA_CMD), access level WRITE: reason: cached result for WRITE; see first case for the full reason
04/18/16 14:56:00 PERMISSION DENIED to submit-side@matchsession from host 188.184.187.167 for command 1200 (CA_CMD), access level WRITE: reason: cached result for WRITE; see first case for the full reason
[/edit] |
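To check the Pacific-time guess from the post above, a log timestamp can be converted to UTC and compared with the stated release time. This is only a sketch under one assumption: that the VM clock really is on Pacific Daylight Time (UTC-7, which was in effect in April 2016). The thread does not confirm that, and the helper name `log_stamp_to_utc` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the VM clock is on Pacific Daylight Time (UTC-7), as guessed
# in the post. A fixed offset is used so no timezone database is needed.
PDT = timezone(timedelta(hours=-7), "PDT")


def log_stamp_to_utc(stamp, tz=PDT):
    """Interpret a 'MM/DD/YY HH:MM:SS' log stamp in tz and convert it to UTC."""
    local = datetime.strptime(stamp, "%m/%d/%y %H:%M:%S").replace(tzinfo=tz)
    return local.astimezone(timezone.utc)


# Release time quoted in the thread.
release = datetime(2016, 4, 18, 13, 48, 41, tzinfo=timezone.utc)

first_entry = log_stamp_to_utc("04/18/16 13:14:43")
print(first_entry.isoformat())
print("after release:", first_entry > release)
```

If the guess is right, the first log entry falls after the release rather than before it, which would mean the messages are not stale leftovers from the image build.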
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
First "job" finished after 3h40min. Got an output on console F5: ...gfal-copy error: 2 (no such file....)TRANSFER failed to put the file (HTTP404). So far it is not getting a new job. If it does not in the next few minutes, I will shut it down. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18 |
I was going to check myself after Rasputin42's comment. Will update it tomorrow. Why 2241? IIRC: with 2048MB there were jobs running endlessly due to lack of memory, but IIRC again they also did not use swap space in the VM. You have configured the task to be killed after 18 hours, but it would be a pity if lack of memory were the reason for that. Yeah, 2241 is a strange number, but I think they want to use as much as needed and as little as possible? |
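As a rough sanity check on the memory numbers in this thread, the snippet below compares the extra memory gained by going from 2048MB to 2241MB with the swap readings posters reported. The figures are all taken from the posts; the comparison itself is a simplification, since swap usage is not an exact measure of how much memory a job is short.

```python
# Figures quoted in this thread (MB).
OLD_VM_MEMORY_MB = 2048
NEW_VM_MEMORY_MB = 2241
observed_swap_mb = [85, 125, 200]  # swap/pagefile readings reported by posters

# Extra memory the VM gets with the production setting.
headroom_mb = NEW_VM_MEMORY_MB - OLD_VM_MEMORY_MB
print(f"extra memory: {headroom_mb} MB")

# Crude comparison: would the extra memory have absorbed each swap reading?
for swap in observed_swap_mb:
    verdict = "covered" if swap <= headroom_mb else "not fully covered"
    print(f"swap {swap} MB -> {verdict}")
```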
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
The tasks I ran last night ended after a few minutes. http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=148368 I stopped calculating them. |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
The Condor Server fell over last night due to a full disk. It is up and running again now. |
Send message Joined: 9 Apr 15 Posts: 57 Credit: 230,221 RAC: 0 |
The Condor Server fell over last night due to a full disk. It is up and running again now. Yep, I grabbed some and they're running jobs. Looks like someone fixed Apache to show the logs now. Still need to up the mem allocation to avoid paging out tho. |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
A new version (v0.2) is available with the memory set to 2241MB. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I am trying to start a 2nd ATLAS task. It runs for 7 min and finishes. What actually happens if you do a version change while the old one is still running a task? 2016-04-20 09:36:46 (668): Guest Log: [ERROR] App is not supported. Shutting down! EDIT: Are there no jobs? BOINC tasks are available. |
Send message Joined: 9 Apr 15 Posts: 57 Credit: 230,221 RAC: 0 |
A new version (v0.2) is available with the memory set to 2241MB. 2 completed fine so far. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I did a project reset. It is very difficult to actually get a task running. Out of 10 tasks, I can only get 1 running longer than 10 min. If we are out of jobs, could you please post a link where we can see if there are JOBS available? (Not BOINC tasks, ATLAS jobs.) |
Send message Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
I was able to run tasks on Mac, but on Linux I get the same errors as Rasputin42: http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=149233 |
©2024 CERN