Message boards : Theory Application : Ready For Production
Author | Message |
---|---|
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Over the next week the focus will be on improving the Theory application to the quality required for it to be promoted to the production project. As many components are now shared between all the applications, effort spent here will also benefit the other applications. Please post to this thread any issues that need addressing to bring the Theory application up to at least the quality of the Theory application currently running in vLHC@home. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
The startup is not working well. That needs more work.
EDIT: It is also not possible to identify the actual app running (pythia6.xxx or others) from the logs or the console.
04/21/16 22:15:23 Got activate_claim request from shadow (188.184.187.167)
04/21/16 22:15:23 Remote job ID is 271581.0
04/21/16 22:15:23 Got universe "VANILLA" (5) from request classad
04/21/16 22:15:23 State change: claim-activation protocol successful
04/21/16 22:15:23 Changing activity: Idle -> Busy
04/21/16 22:15:26 PERMISSION DENIED to condor@277-617-9521 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.2.15,10.0.2.15, hostname size = 1, original ip address = 10.0.2.15
04/21/16 22:15:29 State change: benchmarks completed
04/21/16 22:19:54 Called deactivate_claim_forcibly()
04/21/16 22:19:56 Got activate claim while starter is still alive.
04/21/16 22:19:56 Telling shadow to try again later.
04/21/16 22:19:58 Got activate claim while starter is still alive.
04/21/16 22:19:58 Telling shadow to try again later.
04/21/16 22:19:59 Got activate claim while starter is still alive.
04/21/16 22:19:59 Telling shadow to try again later.
04/21/16 22:20:00 Got activate claim while starter is still alive.
04/21/16 22:20:00 Telling shadow to try again later.
04/21/16 22:20:01 Got activate claim while starter is still alive.
04/21/16 22:20:01 Telling shadow to try again later.
04/21/16 22:20:02 Got activate claim while starter is still alive.
04/21/16 22:20:02 Telling shadow to try again later.
04/21/16 22:20:03 Got activate claim while starter is still alive.
04/21/16 22:20:03 Telling shadow to try again later.
04/21/16 22:20:04 Got activate claim while starter is still alive.
04/21/16 22:20:04 Telling shadow to try again later.
04/21/16 22:20:06 Got activate claim while starter is still alive.
04/21/16 22:20:06 Telling shadow to try again later.
04/21/16 22:20:07 Got activate claim while starter is still alive.
04/21/16 22:20:07 Telling shadow to try again later.
04/21/16 22:20:08 Got activate claim while starter is still alive.
04/21/16 22:20:08 Telling shadow to try again later.
04/21/16 22:20:09 Got activate claim while starter is still alive.
04/21/16 22:20:09 Telling shadow to try again later.
04/21/16 22:20:10 Got activate claim while starter is still alive.
04/21/16 22:20:10 Telling shadow to try again later.
04/21/16 22:20:11 Got activate claim while starter is still alive.
04/21/16 22:20:11 Telling shadow to try again later.
04/21/16 22:20:12 Got activate claim while starter is still alive.
04/21/16 22:20:12 Telling shadow to try again later.
04/21/16 22:20:13 Got activate claim while starter is still alive.
04/21/16 22:20:13 Telling shadow to try again later.
04/21/16 22:20:15 Got activate claim while starter is still alive.
04/21/16 22:20:15 Telling shadow to try again later.
04/21/16 22:20:16 Got activate claim while starter is still alive.
04/21/16 22:20:16 Telling shadow to try again later.
04/21/16 22:20:17 Got activate claim while starter is still alive.
04/21/16 22:20:17 Telling shadow to try again later.
04/21/16 22:20:18 Got activate claim while starter is still alive.
04/21/16 22:20:18 Telling shadow to try again later.
04/21/16 22:20:19 Got activate claim while starter is still alive.
04/21/16 22:20:19 Telling shadow to try again later.
04/21/16 22:20:19 Called deactivate_claim()
04/21/16 22:20:20 State change: received RELEASE_CLAIM command
04/21/16 22:20:20 Changing state and activity: Claimed/Busy -> Preempting/Vacating
04/21/16 22:20:24 starter (pid 3619) is not responding to the request to hardkill its job. The startd will now directly hard kill the starter and all its decendents.
04/21/16 22:20:24 Starter pid 3619 died on signal 9 (signal 9 (Killed))
04/21/16 22:20:24 State change: starter exited
04/21/16 22:20:24 State change: No preempting claim, returning to owner
04/21/16 22:20:24 Changing state and activity: Preempting/Vacating -> Owner/Idle
04/21/16 22:20:24 State change: IS_OWNER is false
04/21/16 22:20:24 Changing state: Owner -> Unclaimed
04/21/16 22:21:22 Request accepted.
04/21/16 22:21:22 Remote owner is test4theory@cern.ch
04/21/16 22:21:22 State change: claiming protocol successful
04/21/16 22:21:22 Changing state: Unclaimed -> Claimed
04/21/16 22:21:23 Got activate_claim request from shadow (188.184.187.167)
04/21/16 22:21:23 Remote job ID is 271583.0
04/21/16 22:21:23 Got universe "VANILLA" (5) from request classad
04/21/16 22:21:23 State change: claim-activation protocol successful |
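A log like the one in this post is easier to read once the repeated entries are collapsed. The following is only a rough sketch in plain Python, not part of any Condor tooling: it assumes the log is available as text with one `MM/DD/YY HH:MM:SS message` entry per line, and the function name `summarize_log` is made up for illustration.

```python
import re
from collections import Counter

# Matches the "MM/DD/YY HH:MM:SS message" layout seen in the log excerpt.
# Assumption: one entry per line, as in a raw log file.
LOG_LINE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def summarize_log(text):
    """Count identical messages and report their first and last timestamps."""
    counts = Counter()
    first_seen = {}
    last_seen = {}
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip anything that is not a timestamped entry
        stamp, message = match.groups()
        counts[message] += 1
        first_seen.setdefault(message, stamp)
        last_seen[message] = stamp
    # Most frequent messages first; ties keep their order of first appearance.
    return [(msg, n, first_seen[msg], last_seen[msg])
            for msg, n in counts.most_common()]


sample = """\
04/21/16 22:19:54 Called deactivate_claim_forcibly()
04/21/16 22:19:56 Got activate claim while starter is still alive.
04/21/16 22:19:56 Telling shadow to try again later.
04/21/16 22:19:58 Got activate claim while starter is still alive.
04/21/16 22:19:58 Telling shadow to try again later.
04/21/16 22:20:19 Called deactivate_claim()
"""

for message, count, first, last in summarize_log(sample):
    print(f"{count:3d}x  {first} .. {last}  {message}")
```

Run over the full excerpt, this would show at a glance how long the shadow kept retrying while the old starter was still alive.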
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
How are things looking now? Apart from the suspend/resume issue, which I have started to look into again, what else needs to be done? |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
There should be a check on whether a job is actually likely to finish before the end of the task. I had a number of jobs (the longest ran for 55858 sec) that were so long that they exceeded the 18h limit, especially if other jobs had already been done in the same task. |
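The kind of check asked for here could look roughly like the sketch below. It is an illustration only: it assumes that some estimate of a job's run time is available before the job starts, which nothing in this thread confirms, and the names `job_fits` and `safety_margin_s` are hypothetical.

```python
TASK_LIMIT_S = 18 * 3600  # the 18 h task limit mentioned in the post


def job_fits(elapsed_s, estimated_job_s, limit_s=TASK_LIMIT_S, safety_margin_s=0):
    """Return True if a job of the estimated length would end before the limit.

    elapsed_s: time already used by earlier jobs in the same task.
    estimated_job_s: assumed run-time estimate for the next job (hypothetical input).
    """
    return elapsed_s + estimated_job_s + safety_margin_s <= limit_s


longest_job_s = 55858  # longest job reported in the post (about 15.5 h)

print(job_fits(0, longest_job_s))         # True: started at the beginning of a task
print(job_fits(3 * 3600, longest_job_s))  # False: started after 3 h of earlier jobs
```

The second call shows the case described in the post: a 55858 sec job fits into a fresh task, but not into one where earlier jobs have already used three hours.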
Send message Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 2 |
Hi Laurence, Linux guests under Windows need AMD-V or VT-x. The 32-bit production vLHCathome doesn't need it and works well. This means that the 64-bit vLHCathome can't run in Linux guests. In the Linux guest, AMD-V or VT-x is shown as active in the Session Information. When Challenge under WEBAPI and vLHCathome-dev under BOINC are running together, both work with AMD-V or VT-x. Hmmmmm..... any idea? |
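One way to see whether a Linux system (host or guest) has hardware virtualization exposed to it is to look at the CPU flags in `/proc/cpuinfo`: on x86, `vmx` indicates Intel VT-x and `svm` indicates AMD-V. The sketch below only parses text in that format; the function name is made up for illustration, and a missing flag inside a guest may simply mean that nested virtualization is not passed through.

```python
def hw_virtualization_flags(cpuinfo_text):
    """Return the hardware virtualization flags found in /proc/cpuinfo-style text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # "vmx" is Intel VT-x, "svm" is AMD-V.
            found |= set(value.split()) & {"vmx", "svm"}
    return found


sample = "processor : 0\nflags : fpu vme de pse vmx sse2\n"
print(hw_virtualization_flags(sample))
```

On a real system the input would come from reading `/proc/cpuinfo`, for example with `open("/proc/cpuinfo").read()`.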
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 2 |
I think the Theory application has achieved production status. The issue with extremely long-running jobs occurs very rarely, and it does not outweigh the benefit that almost no jobs are killed at the end of a task anymore, as was done after 24 hours of run time with all tasks over the last 5 years at vLHCathome (formerly Test4Theory). As you mentioned, Laurence, the negative effect of suspend/resume, which leads to jobs not being delivered successfully, should be solved or at least reduced as far as possible. The very reason for using the VM mechanism adopted by BOINC is that the state of a task/job can be saved without losing CPU cycles. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
We can always adjust the limit if it is too short for some jobs. Experience should help us set these values and limits. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Just saying. I would not exactly be happy to have wasted 18h of computing time on a job that never even had a chance to succeed. On slower computers, these numbers would be even worse. I will keep an eye on how often these extra long tasks occur. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 2 |
"I will keep an eye on it, how often these extra long tasks occur." From what I've seen in the past, it was always a Sherpa job that ran that long. I even watched a Sherpa job run for longer than 24 hours (years ago). Somehow job creation knows that a job was too long, because you sometimes see a job with fewer than the maximum of 100,000 events, also from other generators. Once I got a Sherpa job with 'only' 6,000 events. There is, though, a chance that a task is looping and consuming cycles without doing real work. If that's the case, that's what this 18- or 24-hour limit is designed for. |
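How job creation actually picks the event count is not stated in this thread. Purely as an illustration of the idea, a cap could be derived from an assumed time per event and a time budget, as in the sketch below; `seconds_per_event`, `budget_s` and the function name are hypothetical, and only the 100,000 maximum comes from the post.

```python
MAX_EVENTS = 100_000  # maximum event count mentioned in the post


def capped_events(seconds_per_event, budget_s, max_events=MAX_EVENTS):
    """Largest event count, up to max_events, expected to fit in the time budget.

    seconds_per_event is an assumed estimate, e.g. taken from earlier runs.
    """
    if seconds_per_event <= 0:
        raise ValueError("seconds_per_event must be positive")
    return max(0, min(max_events, int(budget_s // seconds_per_event)))


print(capped_events(0.5, 18 * 3600))   # fast generator: the full 100000 events fit
print(capped_events(10.0, 18 * 3600))  # slow generator: capped at 6480 events
```

A per-generator estimate of this kind would also explain why slow generators such as Sherpa end up with far fewer events, but that remains a guess.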
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 2 |
"How are things looking now? Apart from the suspend/resume issue that I have started to look into again. What else needs to be done?" No showstopper, but this is working with the current 32-bit production version: http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=217 |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
"How are things looking now?" Unless someone complains, nothing else will get fixed :) |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 2 |
"How are things looking now? Unless someone complains, nothing else will get fixed :)" Nothing to fix anymore, I think. Ready for production now? |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
We think so. The Condor server in use will be switched to a production instance within the next few days. The plan is to update the production project on Monday June 6th. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
It has been announced. http://lhcathome2.cern.ch/vLHCathome/forum_thread.php?id=1807 |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
It has been released. |
©2024 CERN