Message boards : ATLAS Application : New Experimental ATLAS Application
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 327,239
RAC: 129
Message 2814 - Posted: 18 Apr 2016, 14:28:18 UTC

A new experimental ATLAS application, following a similar approach to the Theory application, is now available for evaluation. The jobs are real ATLAS jobs, but for now the output will not be used, so if any of you would actually like to crunch for ATLAS, please do so via their production project.

http://atlasathome.cern.ch/

For the rest of you who like to provide brain cycles rather than compute cycles, feel free to try it and provide feedback.

Don't forget that you can adjust which applications you would like to run via the project preferences.

http://lhcathomedev.cern.ch/vLHCathome-dev/prefs.php?subset=project
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2815 - Posted: 18 Apr 2016, 15:03:21 UTC - in response to Message 2814.  

No Atlas tasks available.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 327,239
RAC: 129
Message 2816 - Posted: 18 Apr 2016, 15:38:53 UTC - in response to Message 2815.  

Patience. We are just checking a few things before sending out some tasks.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 327,239
RAC: 129
Message 2817 - Posted: 18 Apr 2016, 17:38:40 UTC - in response to Message 2816.  

The first 50 tasks are available; get them while you can.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2818 - Posted: 18 Apr 2016, 18:53:59 UTC
Last modified: 18 Apr 2016, 18:58:36 UTC

First impressions:
Show graphics: no output.
Dominant process is athena.py (res mem 1.3g? virt mem 2149m?).
The 2GB of memory is used to the full, plus some swap as well (currently 85MB).
Memory seems a bit tight.
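For anyone who wants to track the swap figure mentioned above from inside the VM rather than reading it off a console, here is a minimal sketch in Python. It assumes the guest exposes the standard Linux /proc/meminfo file with SwapTotal and SwapFree reported in kB; the function name and the sample values are made up for illustration and are not taken from the ATLAS image.

```python
def swap_used_mb(meminfo_text):
    """Return swap in use, in MB, from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])  # values are in kB
    return (fields["SwapTotal"] - fields["SwapFree"]) / 1024.0


# Hypothetical sample, chosen so the result matches the ~85MB reported above
sample = """MemTotal:        2048000 kB
MemFree:           20480 kB
SwapTotal:       1048576 kB
SwapFree:         961536 kB
"""
print(swap_used_mb(sample))  # 85.0
```

On a real guest the same function could be fed the live file, for example swap_used_mb(open("/proc/meminfo").read()), and polled every few minutes to see whether swap keeps climbing during a run.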
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 431
Message 2820 - Posted: 18 Apr 2016, 20:39:06 UTC

The ATLAS production project needs 2241MB memory for the VM.
It was extended to that from 2048MB in the past, for a reason, I suppose.
Phil

Joined: 9 Apr 15
Posts: 57
Credit: 230,221
RAC: 0
Message 2821 - Posted: 18 Apr 2016, 20:48:06 UTC - in response to Message 2818.  
Last modified: 18 Apr 2016, 20:53:35 UTC

Show graphics-no output.

Have a look at localhost:port/logs

Mine started doing this every 2 minutes:
04/18/16 13:14:43 ** condor_starter (CONDOR_STARTER) STARTING UP
04/18/16 13:14:43 ** /usr/sbin/condor_starter
04/18/16 13:14:43 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
04/18/16 13:14:43 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
04/18/16 13:14:43 ** $CondorVersion: 8.0.6 Feb 01 2014 BuildID: 225363 $
04/18/16 13:14:43 ** $CondorPlatform: x86_64_RedHat6 $
04/18/16 13:14:43 ** PID = 31388
04/18/16 13:14:43 ** Log last touched 4/18 13:14:42
04/18/16 13:14:43 ******************************************************
04/18/16 13:14:43 Using config source: /etc/condor/condor_config
04/18/16 13:14:43 Using local config sources:
04/18/16 13:14:43 /etc/condor/config.d/10_security.config
04/18/16 13:14:43 /etc/condor/config.d/14_network.config
04/18/16 13:14:43 /etc/condor/config.d/20_workernode.config
04/18/16 13:14:43 /etc/condor/config.d/30_lease.config
04/18/16 13:14:43 /etc/condor/config.d/35_atlas.config
04/18/16 13:14:43 /etc/condor/config.d/40_ccb.config
04/18/16 13:14:43 /etc/condor/condor_config.local
04/18/16 13:14:43 Daemon Log is logging: D_ALWAYS D_ERROR
04/18/16 13:14:43 DaemonCore: command socket at <10.0.2.15:33345?noUDP>
04/18/16 13:14:43 DaemonCore: private command socket at <10.0.2.15:33345>
04/18/16 13:14:43 ERROR: Could not open canonicalization file '/etc/condor/certificate_mapfile' (No such file or directory)
04/18/16 13:14:44 CCBListener: heartbeat disabled because interval is configured to be 0
04/18/16 13:14:44 CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#9181
04/18/16 13:14:44 Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=6941_4ff3_269122>
04/18/16 13:14:44 Submitting machine is "alicondorce01.cern.ch"
04/18/16 13:14:45 setting the orig job name in starter
04/18/16 13:14:45 setting the orig job iwd in starter
04/18/16 13:14:45 Job has WantIOProxy=true
04/18/16 13:14:45 Initialized IO Proxy.
04/18/16 13:14:45 Done setting resource limits
04/18/16 13:14:45 condor_write(): Socket closed when trying to write 53 bytes to daemon at <10.0.2.15:54469>, fd is 14
04/18/16 13:14:45 Buf::write(): condor_write() failed
04/18/16 13:14:45 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.0.2.15:54469> (try 1 of 3): CEDAR:6002:failed to send EOM
04/18/16 13:14:45 File transfer completed successfully.
04/18/16 13:14:46 Job 268269.0 set to execute immediately
04/18/16 13:14:46 Starting a VANILLA universe job with ID: 268269.0
04/18/16 13:14:46 IWD: /var/lib/condor/execute/dir_31388
04/18/16 13:14:46 Output file: /var/lib/condor/execute/dir_31388/_condor_stdout
04/18/16 13:14:46 Error file: /var/lib/condor/execute/dir_31388/_condor_stderr
04/18/16 13:14:46 Renice expr "10" evaluated to 10
04/18/16 13:14:46 Using wrapper /usr/local/bin/job-wrapper to exec /var/lib/condor/execute/dir_31388/condor_exec.exe
04/18/16 13:14:46 Setting job's virtual memory rlimit to 0 megabytes
04/18/16 13:14:46 Running job as user nobody
04/18/16 13:14:46 Create_Process succeeded, pid=31395
04/18/16 13:16:31 Process exited, pid=31395, status=0
04/18/16 13:16:39 Got SIGQUIT. Performing fast shutdown.
04/18/16 13:16:39 ShutdownFast all jobs.
04/18/16 13:16:39 **** condor_starter (condor_STARTER) pid 31388 EXITING WITH STATUS 0

But now it's picked up a job that's been running for 20 minutes so far.
It seems to be close on available memory, with the pagefile at 125M and climbing throughout the run.
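The short cycle in the log above (a process created and gone again within a couple of minutes) can be told apart from a real job by pairing the "Create_Process succeeded" and "Process exited" lines for the same pid. Below is a rough sketch in Python, assuming the starter log keeps the exact line format shown above; the function name is hypothetical and this is not part of any Condor tooling.

```python
import re
from datetime import datetime

# Matches only the two event lines of interest in the starter log
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(Create_Process succeeded|Process exited), pid=(\d+)"
)


def job_durations(log_text):
    """Map pid -> run time in seconds for each start/exit pair in the log."""
    started = {}
    durations = {}
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        stamp = datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
        event, pid = m.group(2), int(m.group(3))
        if event.startswith("Create_Process"):
            started[pid] = stamp
        elif pid in started:
            durations[pid] = (stamp - started.pop(pid)).total_seconds()
    return durations


# Two lines copied from the log in this post
excerpt = """04/18/16 13:14:46 Create_Process succeeded, pid=31395
04/18/16 13:16:31 Process exited, pid=31395, status=0
"""
print(job_durations(excerpt))  # {31395: 105.0}
```

For the excerpt the process lived 105 seconds, which fits the "every 2 minutes" pattern rather than a real ATLAS job.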
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 327,239
RAC: 129
Message 2822 - Posted: 18 Apr 2016, 21:15:28 UTC - in response to Message 2821.  

Why 13:14:43 in the logs? They may be stale messages from the original image build. The application was not released until 13:48:41 UTC.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 327,239
RAC: 129
Message 2823 - Posted: 18 Apr 2016, 21:17:56 UTC - in response to Message 2820.  

I was going to check myself after Rasputin42's comment. Will update it tomorrow. Why 2241?
Phil

Joined: 9 Apr 15
Posts: 57
Credit: 230,221
RAC: 0
Message 2824 - Posted: 18 Apr 2016, 21:32:00 UTC - in response to Message 2822.  
Last modified: 18 Apr 2016, 21:58:47 UTC

Why 13:14:43 in the logs? They may be stale messages from the original image build. The application was not released until 13:48:41 UTC.


CONDOR seems to like Pacific rather than UTC!


70 minutes into the job now, and swap is at 200M.

[edit]

My first run ended, but now the log says:
04/18/16 13:16:44 Running job as user nobody
04/18/16 13:16:44 Create_Process succeeded, pid=4668
04/18/16 14:53:28 Process exited, pid=4668, status=0
04/18/16 14:54:02 condor_write(): Socket closed when trying to write 65536 bytes to daemon at <188.184.187.167:9618>, fd is 14
04/18/16 14:54:02 ReliSock::put_bytes_nobuffer: Send failed.
04/18/16 14:54:02 ReliSock::put_file: failed to put 65536 bytes (put_bytes_nobuffer() returned -1)
04/18/16 14:54:02 DoUpload: STARTER at 10.0.2.15 failed to send file(s) to <188.184.187.167:9618>: error sending /var/lib/condor/execute/dir_4661/EVNT.06480895._029668.pool.root.1
04/18/16 14:54:02 File transfer failed, forcing disconnect.
04/18/16 14:54:02 Returning from CStarter::JobReaper()
04/18/16 14:55:03 PERMISSION DENIED to submit-side@matchsession from host 188.184.187.167 for command 1200 (CA_CMD), access level WRITE: reason: WRITE authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 188.184.187.167,alicondorce01.cern.ch, hostname size = 1, original ip address = 188.184.187.167
04/18/16 14:55:11 PERMISSION DENIED to submit-side@matchsession from host 188.184.187.167 for command 1200 (CA_CMD), access level WRITE: reason: cached result for WRITE; see first case for the full reason
04/18/16 14:55:28 PERMISSION DENIED to submit-side@matchsession from host 188.184.187.167 for command 1200 (CA_CMD), access level WRITE: reason: cached result for WRITE; see first case for the full reason
04/18/16 14:56:00 PERMISSION DENIED to submit-side@matchsession from host 188.184.187.167 for command 1200 (CA_CMD), access level WRITE: reason: cached result for WRITE; see first case for the full reason
[/edit]
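The Pacific-versus-UTC remark above can be checked with a quick conversion. This is only a sketch in Python: it assumes the VM clock was on US Pacific daylight time (a fixed UTC-7 offset in April 2016), which the thread suggests but does not confirm, and the function name is made up.

```python
from datetime import datetime, timedelta, timezone

# Assumption: log timestamps are Pacific daylight time, i.e. UTC-7
PDT = timezone(timedelta(hours=-7), "PDT")


def condor_stamp_to_utc(stamp, local_tz=PDT):
    """Convert an 'MM/DD/YY HH:MM:SS' log timestamp to an aware UTC datetime."""
    local = datetime.strptime(stamp, "%m/%d/%y %H:%M:%S").replace(tzinfo=local_tz)
    return local.astimezone(timezone.utc)


print(condor_stamp_to_utc("04/18/16 13:14:43").isoformat())
# 2016-04-18T20:14:43+00:00
```

Under that assumption the 13:14:43 log line corresponds to 20:14:43 UTC, which is after the 13:48:41 UTC release and about half an hour before the log was posted, so the messages would not be stale.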
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2825 - Posted: 18 Apr 2016, 22:27:14 UTC

First "job" finished after 3h40min.
Got an output on console F5.

...gfal-copy error: 2 (no such file....)TRANSFER failed to put the file (HTTP404).

So far it is not getting a new job. If it does not in the next few minutes, I will shut it down.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 431
Message 2832 - Posted: 19 Apr 2016, 7:28:56 UTC - in response to Message 2823.  
Last modified: 19 Apr 2016, 7:32:02 UTC

I was going to check myself after Rasputin42's comment. Will update it tomorrow. Why 2241?

IIRC, with 2048MB there were jobs running endlessly due to lack of memory, but, IIRC again, they also did not use swap space in the VM.

You have configured the task to be killed after 18 hours, but it would be a pity if lack of memory were the reason for that.

Yeah, 2241 is a strange number, but I think they want to use as much as needed and as little as possible?
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2837 - Posted: 19 Apr 2016, 8:27:05 UTC

The tasks I ran last night ended after a few minutes.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=148368

I stopped calculating them.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 327,239
RAC: 129
Message 2838 - Posted: 19 Apr 2016, 9:02:07 UTC - in response to Message 2837.  

The Condor Server fell over last night due to a full disk. It is up and running again now.
Phil

Joined: 9 Apr 15
Posts: 57
Credit: 230,221
RAC: 0
Message 2846 - Posted: 19 Apr 2016, 12:40:10 UTC - in response to Message 2838.  

The Condor Server fell over last night due to a full disk. It is up and running again now.

Yep, I grabbed some and they're running jobs.

Looks like someone fixed Apache to show the logs now. The memory allocation still needs to go up to avoid paging out, though.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 327,239
RAC: 129
Message 2851 - Posted: 19 Apr 2016, 13:16:07 UTC - in response to Message 2846.  

A new version (v0.2) is available with the memory set to 2241MB.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2872 - Posted: 20 Apr 2016, 7:46:30 UTC - in response to Message 2851.  
Last modified: 20 Apr 2016, 7:59:18 UTC

I am trying to start a 2nd ATLAS task.
It runs for 7 minutes and then finishes.
What actually happens if you do a version change while the old version is still running a task?

2016-04-20 09:36:46 (668): Guest Log: [ERROR] App is not supported. Shutting down!


EDIT: Are there no jobs? BOINC tasks are available.
Phil

Joined: 9 Apr 15
Posts: 57
Credit: 230,221
RAC: 0
Message 2873 - Posted: 20 Apr 2016, 8:07:45 UTC - in response to Message 2851.  

A new version (v0.2) is available with the memory set to 2241MB.

2 completed fine so far.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2874 - Posted: 20 Apr 2016, 11:28:46 UTC

I did a project reset.
It is very difficult to actually get a task running.
Out of 10 tasks, I can only get 1 running longer than 10 minutes.
If we are out of jobs, could you please post a link where we can see whether there are JOBS available? (Not BOINC tasks, ATLAS jobs.)
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 2882 - Posted: 21 Apr 2016, 9:41:42 UTC

I was able to run tasks on Mac, but on Linux I get the same errors as Rasputin42:

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=149233


©2024 CERN