Task startup issue
Author | Message |
---|---|
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I repeated the test. When starting up 3 tasks, 2 error out and the quota prevents the 3rd task from starting (plenty of memory and CPU assigned/available). 5/20/2016 10:03:03 AM | vLHCathome-dev | Scheduler request completed: got 4 new tasks 5/20/2016 10:03:05 AM | vLHCathome-dev | Starting task Theory_25769_1463691500.251851_0 5/20/2016 10:03:05 AM | vLHCathome-dev | Starting task Theory_4447_1463662382.608424_0 5/20/2016 10:03:05 AM | vLHCathome-dev | Starting task Theory_21880_1463687597.174610_0 5/20/2016 10:03:49 AM | vLHCathome-dev | Computation for task Theory_4447_1463662382.608424_0 finished 5/20/2016 10:03:49 AM | vLHCathome-dev | Computation for task Theory_21880_1463687597.174610_0 finished 5/20/2016 10:03:49 AM | vLHCathome-dev | Starting task Theory_22069_1463688197.530422_0 5/20/2016 10:06:10 AM | vLHCathome-dev | Sending scheduler request: To report completed tasks. 5/20/2016 10:06:10 AM | vLHCathome-dev | Reporting 2 completed tasks 5/20/2016 10:06:10 AM | vLHCathome-dev | Requesting new tasks for CPU 5/20/2016 10:06:12 AM | vLHCathome-dev | Scheduler request completed: got 0 new tasks 5/20/2016 10:06:12 AM | vLHCathome-dev | No tasks sent 5/20/2016 10:06:12 AM | vLHCathome-dev | No tasks are available for Theory Simulation 5/20/2016 10:06:12 AM | vLHCathome-dev | This computer has finished a daily quota of 1 tasks |
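For anyone who wants to spot tasks that die right after startup, here is a minimal Python sketch that pairs "Starting task" and "Computation for task ... finished" lines from an event log like the one above and reports how long each task ran. The function name `task_runtimes` and the regular expression are my own; the line layout is assumed to match the excerpt in this post.

```python
from datetime import datetime
import re

# Assumed layout: "<timestamp> | <project> | <message>", as in the excerpt above.
LINE_RE = re.compile(
    r"^(?P<ts>\d+/\d+/\d+ \d+:\d+:\d+ [AP]M) \| (?P<project>[^|]+?) \| "
    r"(?P<event>Starting task|Computation for task) (?P<task>\S+)"
)


def task_runtimes(log_lines):
    """Map task name -> seconds between its start and finish events."""
    started, runtimes = {}, {}
    for line in log_lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # not a start/finish event
        ts = datetime.strptime(m["ts"], "%m/%d/%Y %I:%M:%S %p")
        if m["event"] == "Starting task":
            started[m["task"]] = ts
        elif m["task"] in started:
            runtimes[m["task"]] = (ts - started[m["task"]]).total_seconds()
    return runtimes


log = [
    "5/20/2016 10:03:05 AM | vLHCathome-dev | Starting task Theory_25769_1463691500.251851_0",
    "5/20/2016 10:03:05 AM | vLHCathome-dev | Starting task Theory_4447_1463662382.608424_0",
    "5/20/2016 10:03:05 AM | vLHCathome-dev | Starting task Theory_21880_1463687597.174610_0",
    "5/20/2016 10:03:49 AM | vLHCathome-dev | Computation for task Theory_4447_1463662382.608424_0 finished",
    "5/20/2016 10:03:49 AM | vLHCathome-dev | Computation for task Theory_21880_1463687597.174610_0 finished",
]
for task, seconds in task_runtimes(log).items():
    print(f"{task}: {seconds:.0f} s")
```

On the lines quoted here, both failed tasks show a runtime of 44 seconds, which is far too short for a real job and points at a startup failure.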
Joined: 13 Feb 15 Posts: 1185 Credit: 846,901 RAC: 2,193 |
Have a look in your vbox.log files. I've seen in the past that starting several VMs at the same time can cause a locking issue when attaching the VirtualBox Extension Pack. |
Joined: 12 Sep 14 Posts: 1067 Credit: 329,531 RAC: 199 |
Yep, you can see the problem in the task log http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=177516. Not sure why this is happening, but it will be a general issue and not specific to the Theory app. 2016-05-20 10:03:36 (2992): Adding VirtualBox Guest Additions to VM. 2016-05-20 10:03:36 (2992): Error 0x80bb0001 in vbox50::VBOX_VM::create_vm (c:\src\boinc\boinc\samples\vboxwrapper\vbox_mscom_impl.cpp:710) 2016-05-20 10:03:36 (2992): Error Source : VirtualBoxWrap 2016-05-20 10:03:36 (2992): Error Description: Cannot register the DVD image 'C:\Program Files\Oracle\VirtualBox/VBoxGuestAdditions.iso' {41753763-2a21-43b1-a3fc-3231b85d41ae} because a CD/DVD image 'C:\Program Files\Oracle\VirtualBox/VBoxGuestAdditions.iso' with UUID {c88268ba-20b4-4b5c-8dcb-f4304ea7cb7d} already exists 2016-05-20 10:03:37 (2992): Powering off VM. |
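The error text suggests that several VMs tried to register the same ISO path at the same moment and only the first one won. The root cause is not established in this thread, so the following is only a toy Python model of that pattern, not the VirtualBox or vboxwrapper API. `MediaRegistry`, `register`, `register_or_reuse` and `start_vms` are hypothetical names made up for illustration.

```python
import threading
import uuid


class MediaRegistry:
    """Toy stand-in for a hypervisor's media registry (NOT the VirtualBox API)."""

    def __init__(self):
        self._by_path = {}
        self._lock = threading.Lock()

    def register(self, path):
        """Strict registration: fails if the path is already registered."""
        new_id = uuid.uuid4()
        with self._lock:
            if path in self._by_path:
                raise RuntimeError(
                    f"{path} already exists with UUID {self._by_path[path]}"
                )
            self._by_path[path] = new_id
        return new_id

    def register_or_reuse(self, path):
        """Idempotent variant: hand back the existing UUID instead of failing."""
        with self._lock:
            return self._by_path.setdefault(path, uuid.uuid4())


def start_vms(register, n=3):
    """Call `register` from n threads at once; return (successes, failures)."""
    results, errors = [], []

    def worker():
        try:
            results.append(register("VBoxGuestAdditions.iso"))
        except RuntimeError as exc:
            errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, errors


strict_ok, strict_failed = start_vms(MediaRegistry().register)
print(len(strict_ok), "registered,", len(strict_failed), "failed")

reuse_ok, reuse_failed = start_vms(MediaRegistry().register_or_reuse)
print(len(reuse_ok), "registered,", len(reuse_failed), "failed")
```

In this model the strict variant lets one of three simultaneous starts through and rejects the other two, while the reuse variant lets all three share one entry. Whether the real fix belongs in VirtualBox or in the wrapper is a separate question.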
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks. So, it is a vbox issue, as it should be able to handle this. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I had a few tasks where none of the consoles showed anything (apart from the dummy message). There was no link to the logs through "Show graphics". Only the top command shows a process "java <defunct>" with high CPU usage and no memory usage (no RES, VIRT or otherwise). It has been running as the first job in the task for more than 120 min and shows no sign of terminating itself. http://lhcathomedev.cern.ch/vLHCathome-dev/workunit.php?wuid=208983 EDIT: I am going to gracefully shut down the task at 20:30 UTC. If anyone wants any more info, please ask before then. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Are we still testing? Has anyone seen the kind of fault I described below? I call it the "java <defunct>" fault. There are also still occasional ultra-long jobs that exceed the 18 h limit. Is anything being done about that? |
Joined: 13 Feb 15 Posts: 1185 Credit: 846,901 RAC: 2,193 |
"Has anyone seen the kind of fault I described below?" I've seen this only once. That was with the single ALICE task I ran. The "java <defunct>" process shown in the 'top' window was using almost 100% CPU, but it was a so-called zombie process. That means that no normal process was running, and the VM also couldn't write the 'shutdown' message into the shared folder after the normal 12 hours of runtime. |
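If you can get a process listing out of the VM, zombies are easy to pick out because their state column starts with `Z`. Below is a small Python sketch that scans text in the layout produced by `ps -eo pid,stat,comm` on Linux. The function name `find_zombies` and the sample listing (PIDs and the non-java processes) are invented for illustration and are not taken from a real task.

```python
def find_zombies(ps_output):
    """Return (pid, command) for every process whose state starts with 'Z'."""
    zombies = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)  # pid, state, rest of the line
        if len(parts) < 3:
            continue
        pid, stat, comm = parts
        if stat.startswith("Z"):
            zombies.append((int(pid), comm.strip()))
    return zombies


# Made-up sample in the column layout of `ps -eo pid,stat,comm`.
SAMPLE = """\
  PID STAT COMMAND
    1 Ss   init
  842 S    condor_master
  917 Z    java <defunct>
"""

print(find_zombies(SAMPLE))
```

A zombie has already exited and is only waiting for its parent to collect its exit status, so the useful follow-up is to look at the parent process rather than at the zombie itself.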
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 181 |
"Has anyone seen the kind of fault I described below?" Like this? This occurred on startup of a Theory task in the beta project. It seems to be stuck. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks, Crystal. I have had this at least 5 or 6 times. Next time, I will save the logs (they are actually there, contrary to what I said before). |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Yes, m, exactly like this. This only happens at the beginning of a task. Once a task starts processing the first job, everything should be OK. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Just got another one. EDIT: and another one. Start.log: 07/12/16 21:40:25 Authorizations yet to be resolved: 07/12/16 21:40:25 allow READ: condor@fsauth/277-617-1197.277-617-1197 root@fsauth/277-617-1197.277-617-1197 07/12/16 21:40:25 allow WRITE: submit-side@matchsession/alicondorce01.cern.ch submit-side@matchsession/188.184.187.167 central-manager@277-617-1197/alicondor01.277-617-1197 condor@fsauth/* submit-side@matchsession/188.185.164.231 submit-side@matchsession/188.185.164.229 */10.0.11.58 submit-side@matchsession/ce504.cern.ch submit-side@matchsession/ce502.cern.ch submit-side@matchsession/ce503.cern.ch submit-side@matchsession/ce501.cern.ch worker-node@277-617-1197/277-617-1197.277-617-1197 central-manager@277-617-1197/277-617-1197.277-617-1197 computing-element@277-617-1197/277-617-1197.277-617-1197 condor@277-617-1197/277-617-1197.277-617-1197 root@fsauth/277-617-1197.277-617-1197 07/12/16 21:40:25 deny WRITE: anonymous@*/* *@unmapped/* 07/12/16 21:40:25 allow NEGOTIATOR: central-manager@277-617-1197/alicondor01.277-617-1197 07/12/16 21:40:25 deny NEGOTIATOR: anonymous@*/* *@unmapped/* 07/12/16 21:40:25 allow ADMINISTRATOR: worker-node@277-617-1197/277-617-1197.277-617-1197 central-manager@277-617-1197/277-617-1197.277-617-1197 computing-element@277-617-1197/277-617-1197.277-617-1197 condor@277-617-1197/277-617-1197.277-617-1197 condor@fsauth/277-617-1197.277-617-1197 */128.142.154.202 central-manager@277-617-1197/alicondor01.277-617-1197 */batchman.cern.ch |
Joined: 12 Sep 14 Posts: 1067 Credit: 329,531 RAC: 199 |
I think the "java <defunct>" processes can be cleaned up by tweaking the Condor configuration. Will look into it tomorrow. |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 181 |
Thanks, Laurence. These zombies are associated with many of the 206 errors I see. Hopefully your fix will make a difference. |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 181 |
'I think the "java <defunct>" processes can be cleaned up by tweaking the Condor configuration. Will look into it tomorrow.' These failures are still occurring. Has the "FIX" been done? |
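Joined: 12 Sep 14 Posts: 1067 Credit: 329,531 RAC: 199 |
I have just committed a fix for the "java <defunct>" fault. The problem was that when Condor starts, it calls the JVM to publish some information about the Java environment in a ClassAd and this call hangs. The solution is just to tell Condor that there is no Java on the machine. |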
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
What exactly is "lost ratio"? What is lost, and where? http://mcplots-dev.cern.ch/production.php?view=status&plots=daily#plots |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 181 |
'I have just committed a fix for the "java <defunct>" fault. The problem was that when Condor starts, it calls the JVM to publish some information about the Java environment in a ClassAd and this call hangs. The solution is just to tell Condor that there is no Java on the machine.' Excellent, thanks Laurence. |
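As a side note on the hang itself: the fix applied here was a Condor configuration change, and the snippet below is not that fix. It is only a generic Python sketch of the alternative technique of putting a timeout around an external probe so that a hanging call cannot block startup. The helper name `probe` and the 10-second default are my own choices.

```python
import subprocess
import sys


def probe(cmd, timeout_s=10.0):
    """Run `cmd`; return its combined output, or None if it fails or hangs."""
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing executable, or the child was killed after the timeout.
        return None
    if done.returncode != 0:
        return None
    return done.stdout + done.stderr


# A command that sleeps longer than the timeout stands in for a hanging JVM.
hanging = [sys.executable, "-c", "import time; time.sleep(5)"]
print(probe(hanging, timeout_s=0.5))

# A well-behaved command returns its output.
print(probe([sys.executable, "-c", "print('ok')"]))
```

To try it against a real JVM you could call `probe(["java", "-version"])`, which returns `None` if Java is missing or does not answer in time.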
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I set the preferences on the account to 1 task, 1 core. It downloads 3 tasks (I have 4 cores). Starting from scratch, it started a task with 2 cores (BOINC lists 1.5 cores). I shut down BOINC and restarted it; after that task finished, the next one ran on 2 cores again. In other words, single-core operation does not work. |
Joined: 13 Feb 15 Posts: 1185 Credit: 846,901 RAC: 2,193 |
In preferences I set the maximum number of tasks per host to 2 and the maximum number of cores per task to 0, and after each change I updated my client. I still get 8 tasks. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I detached from the project and reattached. In preferences I set the maximum number of tasks per host to 0 and the maximum number of cores per task to 0. Now it is running 3 tasks with 2 cores each. (I only have 4 cores, so why does BOINC allow it, given that it thinks it is running 3 * 1.5 cores = 4.5?) EDIT: app_config still lists everything with 1. EDIT 2: Re-reading the config file turns 2 of the 3 tasks to "waiting". EDIT 3: Setting app_config Theory tasks to 3 and re-reading leaves 2 tasks running and one waiting, as it should. |
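For reference, here is a small Python sketch of the two things discussed in this post: the core-budget arithmetic (3 tasks at 1.5 cores each on a 4-core host) and a minimal `app_config.xml` that caps concurrent tasks with `max_concurrent`. The app name "Theory" is an assumption, since the real name has to match what your client reports for the application, and the helper names are my own.

```python
import xml.etree.ElementTree as ET


def overcommitted(n_tasks, cpus_per_task, host_cores):
    """True if the tasks' declared CPU usage exceeds the host's cores."""
    return n_tasks * cpus_per_task > host_cores


def build_app_config(app_name, max_concurrent):
    """Return a minimal app_config.xml document as a string."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name  # assumed app name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")


# 3 tasks * 1.5 cores = 4.5 cores requested on a 4-core host.
print(overcommitted(3, 1.5, 4))

# "Theory" is a placeholder; check the app name your client actually uses.
print(build_app_config("Theory", 3))
```

The generated text would go into `app_config.xml` in the project's directory, followed by re-reading the config files as described in the post above.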
©2024 CERN