Message boards : Theory Application : Ready For Production

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 2911 - Posted: 21 Apr 2016, 19:22:47 UTC

Over the next week the focus will be on improving the Theory application to the quality required for it to be promoted to the production project. As many components are now shared between all the applications, effort spent here will also benefit the other applications. Please post to this thread any issues that need addressing to bring the Theory application up to at least the quality of the Theory application currently running in vLHC@home.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2914 - Posted: 21 Apr 2016, 20:33:01 UTC
Last modified: 21 Apr 2016, 20:49:52 UTC

The startup is not working well. That needs more work.

EDIT: It is also not possible to identify the actual app running (pythia6.xxx or others) from the logs or the console.

04/21/16 22:15:23 Got activate_claim request from shadow (188.184.187.167)
04/21/16 22:15:23 Remote job ID is 271581.0
04/21/16 22:15:23 Got universe "VANILLA" (5) from request classad
04/21/16 22:15:23 State change: claim-activation protocol successful
04/21/16 22:15:23 Changing activity: Idle -> Busy
04/21/16 22:15:26 PERMISSION DENIED to condor@277-617-9521 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.2.15,10.0.2.15, hostname size = 1, original ip address = 10.0.2.15
04/21/16 22:15:29 State change: benchmarks completed
04/21/16 22:19:54 Called deactivate_claim_forcibly()
04/21/16 22:19:56 Got activate claim while starter is still alive.
04/21/16 22:19:56 Telling shadow to try again later.
04/21/16 22:19:58 Got activate claim while starter is still alive.
04/21/16 22:19:58 Telling shadow to try again later.
04/21/16 22:19:59 Got activate claim while starter is still alive.
04/21/16 22:19:59 Telling shadow to try again later.
04/21/16 22:20:00 Got activate claim while starter is still alive.
04/21/16 22:20:00 Telling shadow to try again later.
04/21/16 22:20:01 Got activate claim while starter is still alive.
04/21/16 22:20:01 Telling shadow to try again later.
04/21/16 22:20:02 Got activate claim while starter is still alive.
04/21/16 22:20:02 Telling shadow to try again later.
04/21/16 22:20:03 Got activate claim while starter is still alive.
04/21/16 22:20:03 Telling shadow to try again later.
04/21/16 22:20:04 Got activate claim while starter is still alive.
04/21/16 22:20:04 Telling shadow to try again later.
04/21/16 22:20:06 Got activate claim while starter is still alive.
04/21/16 22:20:06 Telling shadow to try again later.
04/21/16 22:20:07 Got activate claim while starter is still alive.
04/21/16 22:20:07 Telling shadow to try again later.
04/21/16 22:20:08 Got activate claim while starter is still alive.
04/21/16 22:20:08 Telling shadow to try again later.
04/21/16 22:20:09 Got activate claim while starter is still alive.
04/21/16 22:20:09 Telling shadow to try again later.
04/21/16 22:20:10 Got activate claim while starter is still alive.
04/21/16 22:20:10 Telling shadow to try again later.
04/21/16 22:20:11 Got activate claim while starter is still alive.
04/21/16 22:20:11 Telling shadow to try again later.
04/21/16 22:20:12 Got activate claim while starter is still alive.
04/21/16 22:20:12 Telling shadow to try again later.
04/21/16 22:20:13 Got activate claim while starter is still alive.
04/21/16 22:20:13 Telling shadow to try again later.
04/21/16 22:20:15 Got activate claim while starter is still alive.
04/21/16 22:20:15 Telling shadow to try again later.
04/21/16 22:20:16 Got activate claim while starter is still alive.
04/21/16 22:20:16 Telling shadow to try again later.
04/21/16 22:20:17 Got activate claim while starter is still alive.
04/21/16 22:20:17 Telling shadow to try again later.
04/21/16 22:20:18 Got activate claim while starter is still alive.
04/21/16 22:20:18 Telling shadow to try again later.
04/21/16 22:20:19 Got activate claim while starter is still alive.
04/21/16 22:20:19 Telling shadow to try again later.
04/21/16 22:20:19 Called deactivate_claim()
04/21/16 22:20:20 State change: received RELEASE_CLAIM command
04/21/16 22:20:20 Changing state and activity: Claimed/Busy -> Preempting/Vacating
04/21/16 22:20:24 starter (pid 3619) is not responding to the request to hardkill its job. The startd will now directly hard kill the starter and all its decendents.
04/21/16 22:20:24 Starter pid 3619 died on signal 9 (signal 9 (Killed))
04/21/16 22:20:24 State change: starter exited
04/21/16 22:20:24 State change: No preempting claim, returning to owner
04/21/16 22:20:24 Changing state and activity: Preempting/Vacating -> Owner/Idle
04/21/16 22:20:24 State change: IS_OWNER is false
04/21/16 22:20:24 Changing state: Owner -> Unclaimed
04/21/16 22:21:22 Request accepted.
04/21/16 22:21:22 Remote owner is test4theory@cern.ch
04/21/16 22:21:22 State change: claiming protocol successful
04/21/16 22:21:22 Changing state: Unclaimed -> Claimed
04/21/16 22:21:23 Got activate_claim request from shadow (188.184.187.167)
04/21/16 22:21:23 Remote job ID is 271583.0
04/21/16 22:21:23 Got universe "VANILLA" (5) from request classad
04/21/16 22:21:23 State change: claim-activation protocol successful
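
A log like the one above is easier to scan when repeated messages are counted rather than read line by line. The following is only a minimal sketch: the function name `count_messages` is hypothetical, and it assumes every line starts with a `MM/DD/YY HH:MM:SS` timestamp as in the excerpt.

```python
from datetime import datetime

def count_messages(lines):
    """Map each log message to (first_seen, last_seen, count)."""
    stats = {}
    for line in lines:
        parts = line.strip().split(" ", 2)
        if len(parts) < 3:
            continue  # blank or malformed line
        try:
            stamp = datetime.strptime(parts[0] + " " + parts[1], "%m/%d/%y %H:%M:%S")
        except ValueError:
            continue  # line does not start with a timestamp
        first, _, count = stats.get(parts[2], (stamp, stamp, 0))
        stats[parts[2]] = (first, stamp, count + 1)
    return stats

# A few lines copied from the log above.
sample = [
    "04/21/16 22:19:54 Called deactivate_claim_forcibly()",
    "04/21/16 22:19:56 Got activate claim while starter is still alive.",
    "04/21/16 22:19:56 Telling shadow to try again later.",
    "04/21/16 22:19:58 Got activate claim while starter is still alive.",
    "04/21/16 22:19:58 Telling shadow to try again later.",
]

for message, (first, last, count) in count_messages(sample).items():
    print(f"{count:3d}x  {first:%H:%M:%S} to {last:%H:%M:%S}  {message}")
```

Run over the full log, this would show how long the shadow kept retrying while the old starter was still alive.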
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 3206 - Posted: 3 May 2016, 20:16:38 UTC - in response to Message 2911.  

How are things looking now? Apart from the suspend/resume issue, which I have started to look into again, what else needs to be done?
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3208 - Posted: 3 May 2016, 20:28:27 UTC - in response to Message 3206.  
Last modified: 3 May 2016, 20:29:10 UTC

There should be a check on whether a job is actually likely to finish before the end of the task.
I had a number of jobs (the longest was 55858 sec) that were so long that they exceeded the 18h limit, especially if other jobs had already been done in the same task.
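
The kind of check meant here could look roughly like the sketch below. It assumes that an estimate of the job's run time is available before the job starts, which is not something the project is known to provide; the function and parameter names are made up for illustration.

```python
TASK_LIMIT_S = 18 * 3600  # the 18h task limit, in seconds

def job_likely_to_finish(elapsed_task_s, estimated_job_s, limit_s=TASK_LIMIT_S):
    """Return True if a job of the estimated length fits in the time left in the task.

    estimated_job_s is a hypothetical estimate; how to obtain it is left open.
    """
    return elapsed_task_s + estimated_job_s <= limit_s

longest_job_s = 55858  # the longest job reported above

# Seconds of earlier work a task can contain before this job no longer fits.
print(TASK_LIMIT_S - longest_job_s)  # 8942
```

With these numbers, a 55858 sec job only fits if it starts within the first 8942 sec (about 2.5 hours) of the task.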
maeax

Joined: 22 Apr 16
Posts: 677
Credit: 2,002,766
RAC: 2
Message 3212 - Posted: 3 May 2016, 21:20:59 UTC

Hi Laurence,

Linux guests under Windows need AMD-V or VT-x.

The 32-bit production vLHCathome application doesn't need it and works well.

This means that 64-bit vLHCathome can't run in Linux guests.

In the Linux guest, the session information shows AMD-V or VT-x as active.

When Challenge under WEBAPI and vLHCathome-dev under BOINC run together,
both work with AMD-V or VT-x.

Hmm... any idea?
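
One way to see what a Linux system itself reports is to look at the CPU flags in `/proc/cpuinfo`, where `vmx` stands for Intel VT-x and `svm` for AMD-V. The helper below is a small sketch with a made-up name; whether these flags show up inside a guest depends on the hypervisor passing them through.

```python
def virtualization_flags(cpuinfo_text):
    """Return the hardware virtualization flags ('vmx', 'svm') found in cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Keep only the two flags that indicate VT-x or AMD-V.
            found |= {"vmx", "svm"} & set(value.split())
    return found
```

On a Linux system it could be called as `virtualization_flags(open("/proc/cpuinfo").read())`; an empty set means that neither flag is visible to that system.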
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 861,475
RAC: 2
Message 3215 - Posted: 3 May 2016, 21:42:31 UTC

I think the Theory application has achieved production line status.

The issue with extremely long-running jobs occurs very rarely and does not outweigh the fact
that almost no jobs are killed anymore at the end of a task, as has been done for the last 5 years with all tasks at vLHCathome (formerly Test4Theory) after 24 hours of run time.

As you mentioned, Laurence, the negative effect of suspend/resume on jobs not being delivered successfully should be solved, or at least reduced as far as possible.
The whole point of using the VM mechanism adopted by BOINC is to save the state of a task/job without losing CPU cycles.
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 3218 - Posted: 3 May 2016, 21:50:46 UTC - in response to Message 3208.  

We can always adjust the limit if it is too short for some jobs. Experience should help us set these values and limits.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3221 - Posted: 3 May 2016, 22:08:38 UTC - in response to Message 3218.  

Just saying. I would not exactly be happy to have wasted 18h of computing time on a job that never even had a chance to succeed.
On slower computers, these numbers would be even worse.

I will keep an eye on how often these extra-long tasks occur.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 861,475
RAC: 2
Message 3227 - Posted: 4 May 2016, 6:01:07 UTC - in response to Message 3221.  

I will keep an eye on how often these extra-long tasks occur.

From what I've seen in the past, it was always a Sherpa job that ran that long.
I even watched a Sherpa job run longer than 24 hours (years ago).
Somehow job creation knows when a job was too long, because you sometimes see a job with fewer than the maximum of 100,000 events, also from other generators.
Once I got a Sherpa job with 'only' 6,000 events.

There is, though, a chance that a task is looping and consuming cycles without doing real work.
That is the case this 18- or 24-hour limit is designed for.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 861,475
RAC: 2
Message 3235 - Posted: 4 May 2016, 9:36:59 UTC - in response to Message 3206.  

How are things looking now? Apart from the suspend/resume issue, which I have started to look into again, what else needs to be done?

Not a showstopper, but this is working with the current 32-bit production version: http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=217
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 3335 - Posted: 13 May 2016, 10:17:28 UTC - in response to Message 3235.  

How are things looking now? Unless someone complains, nothing else will get fixed :)
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 861,475
RAC: 2
Message 3491 - Posted: 26 May 2016, 9:55:53 UTC - in response to Message 3335.  

How are things looking now? Unless someone complains, nothing else will get fixed :)

Nothing to fix anymore, I think. Ready for production now?
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 3493 - Posted: 26 May 2016, 11:38:07 UTC - in response to Message 3491.  
Last modified: 26 May 2016, 11:38:47 UTC

We think so. The Condor server in use will be switched to a production instance within the next few days. The plan is to update the production project on Monday, June 6th.
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 3522 - Posted: 31 May 2016, 13:48:32 UTC - in response to Message 3493.  

It has been announced.

http://lhcathome2.cern.ch/vLHCathome/forum_thread.php?id=1807
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 3542 - Posted: 6 Jun 2016, 9:16:31 UTC - in response to Message 3522.  

It has been released.

©2024 CERN