Message boards : Theory Application : Task startup issue

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3414 - Posted: 20 May 2016, 8:14:44 UTC
Last modified: 20 May 2016, 8:15:16 UTC

I repeated the test.
When starting up 3 tasks, 2 error out and the quota prevents the 3rd task from starting.
(Plenty of memory and CPU assigned/available.)

5/20/2016 10:03:03 AM | vLHCathome-dev | Scheduler request completed: got 4 new tasks
5/20/2016 10:03:05 AM | vLHCathome-dev | Starting task Theory_25769_1463691500.251851_0
5/20/2016 10:03:05 AM | vLHCathome-dev | Starting task Theory_4447_1463662382.608424_0
5/20/2016 10:03:05 AM | vLHCathome-dev | Starting task Theory_21880_1463687597.174610_0
5/20/2016 10:03:49 AM | vLHCathome-dev | Computation for task Theory_4447_1463662382.608424_0 finished
5/20/2016 10:03:49 AM | vLHCathome-dev | Computation for task Theory_21880_1463687597.174610_0 finished
5/20/2016 10:03:49 AM | vLHCathome-dev | Starting task Theory_22069_1463688197.530422_0
5/20/2016 10:06:10 AM | vLHCathome-dev | Sending scheduler request: To report completed tasks.
5/20/2016 10:06:10 AM | vLHCathome-dev | Reporting 2 completed tasks
5/20/2016 10:06:10 AM | vLHCathome-dev | Requesting new tasks for CPU
5/20/2016 10:06:12 AM | vLHCathome-dev | Scheduler request completed: got 0 new tasks
5/20/2016 10:06:12 AM | vLHCathome-dev | No tasks sent
5/20/2016 10:06:12 AM | vLHCathome-dev | No tasks are available for Theory Simulation
5/20/2016 10:06:12 AM | vLHCathome-dev | This computer has finished a daily quota of 1 tasks
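To make the pattern in a log like this easier to spot, here is a minimal Python sketch that pairs each "Starting task" line with its "Computation for task ... finished" line and reports the elapsed seconds. The line format is taken from the log above; the function names and the 60-second "suspiciously short" threshold are assumptions made for illustration, not part of BOINC.

```python
import re
from datetime import datetime

# Matches "5/20/2016 10:03:05 AM | project | message" as seen in the log above.
LINE_RE = re.compile(
    r"^(\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M) \| ([^|]*?) \| (.*)$"
)
START_RE = re.compile(r"^Starting task (\S+)$")
DONE_RE = re.compile(r"^Computation for task (\S+) finished$")

# Hypothetical threshold: a VM task that ends within a minute probably failed.
SHORT_RUN_SECONDS = 60

SAMPLE_LOG = """\
5/20/2016 10:03:05 AM | vLHCathome-dev | Starting task Theory_25769_1463691500.251851_0
5/20/2016 10:03:05 AM | vLHCathome-dev | Starting task Theory_4447_1463662382.608424_0
5/20/2016 10:03:05 AM | vLHCathome-dev | Starting task Theory_21880_1463687597.174610_0
5/20/2016 10:03:49 AM | vLHCathome-dev | Computation for task Theory_4447_1463662382.608424_0 finished
5/20/2016 10:03:49 AM | vLHCathome-dev | Computation for task Theory_21880_1463687597.174610_0 finished
"""


def task_runtimes(log_text):
    """Return {task_name: seconds between its start and finish lines}."""
    started = {}
    runtimes = {}
    for raw_line in log_text.splitlines():
        match = LINE_RE.match(raw_line.strip())
        if match is None:
            continue  # not an event-log line in the expected format
        stamp = datetime.strptime(match.group(1), "%m/%d/%Y %I:%M:%S %p")
        message = match.group(3).strip()
        start = START_RE.match(message)
        done = DONE_RE.match(message)
        if start:
            started[start.group(1)] = stamp
        elif done and done.group(1) in started:
            name = done.group(1)
            runtimes[name] = (stamp - started[name]).total_seconds()
    return runtimes


def short_lived_tasks(log_text, threshold=SHORT_RUN_SECONDS):
    """Return the sorted names of tasks that finished in under `threshold` seconds."""
    return sorted(
        name for name, secs in task_runtimes(log_text).items() if secs < threshold
    )


if __name__ == "__main__":
    for name in short_lived_tasks(SAMPLE_LOG):
        print(name, task_runtimes(SAMPLE_LOG)[name])
```

On the excerpt above, the two tasks that errored out both ran for 44 seconds, while the task that kept running has no finish line and is therefore not reported.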

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1185
Credit: 846,901
RAC: 2,193
Message 3417 - Posted: 20 May 2016, 9:17:53 UTC

Have a look in your vbox.logs

In the past, when starting several VMs at the same time, I've seen that there can be a locking issue when attaching the VirtualBox Extension Pack.

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,531
RAC: 199
Message 3423 - Posted: 20 May 2016, 14:34:14 UTC - in response to Message 3417.  

Yep, you can see the problem in the task log http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=177516. Not sure why this is happening, but it will be a general issue and not specific to the Theory app.

2016-05-20 10:03:36 (2992): Adding VirtualBox Guest Additions to VM.
2016-05-20 10:03:36 (2992): Error 0x80bb0001 in vbox50::VBOX_VM::create_vm (c:\src\boinc\boinc\samples\vboxwrapper\vbox_mscom_impl.cpp:710)
2016-05-20 10:03:36 (2992): Error Source : VirtualBoxWrap
2016-05-20 10:03:36 (2992): Error Description: Cannot register the DVD image 'C:\Program Files\Oracle\VirtualBox/VBoxGuestAdditions.iso' {41753763-2a21-43b1-a3fc-3231b85d41ae} because a CD/DVD image 'C:\Program Files\Oracle\VirtualBox/VBoxGuestAdditions.iso' with UUID {c88268ba-20b4-4b5c-8dcb-f4304ea7cb7d} already exists
2016-05-20 10:03:37 (2992): Powering off VM.
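The error description names the same ISO path twice with two different UUIDs, which would be consistent with several VMs trying to register the Guest Additions image at the same moment. As a small illustration, the Python sketch below pulls both paths and UUIDs out of such a line so the conflict is easy to see. The regular expression is written against the exact wording of the line above; the function name and the `same_path` field are made up for this example and are not part of vboxwrapper or VirtualBox.

```python
import re

# Pattern written against the exact wording of the error line quoted above.
CONFLICT_RE = re.compile(
    r"Cannot register the DVD image '(?P<new_path>[^']+)' "
    r"\{(?P<new_uuid>[0-9a-f-]{36})\} because a CD/DVD image "
    r"'(?P<old_path>[^']+)' with UUID \{(?P<old_uuid>[0-9a-f-]{36})\} "
    r"already exists"
)

SAMPLE_LINE = (
    r"2016-05-20 10:03:36 (2992): Error Description: Cannot register the DVD image "
    r"'C:\Program Files\Oracle\VirtualBox/VBoxGuestAdditions.iso' "
    r"{41753763-2a21-43b1-a3fc-3231b85d41ae} because a CD/DVD image "
    r"'C:\Program Files\Oracle\VirtualBox/VBoxGuestAdditions.iso' with UUID "
    r"{c88268ba-20b4-4b5c-8dcb-f4304ea7cb7d} already exists"
)


def parse_dvd_conflict(line):
    """Return the paths and UUIDs from a DVD registration conflict, or None."""
    match = CONFLICT_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    # True when the same file is being registered under a second UUID.
    info["same_path"] = info["new_path"] == info["old_path"]
    return info


if __name__ == "__main__":
    print(parse_dvd_conflict(SAMPLE_LINE))
```

A result with `same_path` set to true and two different UUIDs means the image was already registered when this VM tried to register it again.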

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3424 - Posted: 20 May 2016, 14:39:46 UTC - in response to Message 3423.  

Thanks.
So, it is a vbox issue, as it should be able to handle this.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3639 - Posted: 9 Jul 2016, 19:56:26 UTC
Last modified: 9 Jul 2016, 20:04:46 UTC

I had a few tasks where none of the consoles showed anything (apart from the dummy message).
There was no link to the logs through "Show graphics".

Only the top command shows a process "java <defunct>" with high CPU usage and no memory usage (no RES, VIRT or otherwise).
It has been running as the first job in the task for more than 120 min and shows no sign of terminating itself.

http://lhcathomedev.cern.ch/vLHCathome-dev/workunit.php?wuid=208983

EDIT: I am going to gracefully shut down the task at 20:30 UTC.
If anyone wants any more info, please ask before then.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3643 - Posted: 11 Jul 2016, 20:06:38 UTC

Are we still testing?
Has anyone seen the kind of fault I described below?
I call it "java <defunct>" fault.

There are also still occasional ultra-long jobs that exceed the 18 h limit.
Is anything being done about it?

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1185
Credit: 846,901
RAC: 2,193
Message 3649 - Posted: 12 Jul 2016, 19:01:54 UTC - in response to Message 3643.  

Has anyone seen the kind of fault I described below?
I call it "java <defunct>" fault.

I've seen this only once. That was with the single ALICE task I ran.
The "java <defunct>" process shown in the 'top' window was using almost 100% CPU, but it was a so-called 'zombie' process.
That means that no normal process was running, and the VM also couldn't write the 'shutdown' message to the shared folder after the 'normal' 12-hour runtime.
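For anyone who wants to check for zombies inside a Linux VM without watching 'top', here is a rough Python sketch that reads the state field from `/proc/<pid>/stat`, where `Z` marks a zombie. A zombie has no memory mappings left, which fits the report of zero RES/VIRT. The function names are made up for this example, and the scan only works on a Linux system with `/proc` mounted.

```python
import os


def parse_proc_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the first "(" and the last ")".
    """
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    pid = int(stat_line[:open_idx].strip())
    comm = stat_line[open_idx + 1:close_idx]
    state = stat_line[close_idx + 1:].split()[0]
    return pid, comm, state


def is_zombie(stat_line):
    """True when the state field is 'Z' (zombie, shown as <defunct> by ps)."""
    return parse_proc_stat(stat_line)[2] == "Z"


def list_zombies(proc_root="/proc"):
    """Return [(pid, comm)] for every zombie currently visible under /proc."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_proc_stat(handle.read())
        except (OSError, ValueError):
            continue  # process exited while scanning, or unreadable entry
        if state == "Z":
            zombies.append((pid, comm))
    return zombies


if __name__ == "__main__":
    if os.path.isdir("/proc"):
        print(list_zombies())
```

Calling `list_zombies()` inside the VM should list a `java` entry while the fault described here is present.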

m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 181
Message 3650 - Posted: 12 Jul 2016, 19:12:18 UTC - in response to Message 3643.  
Last modified: 12 Jul 2016, 19:14:21 UTC

Has anyone seen the kind of fault I described below?
I call it "java <defunct>" fault.

Like this?
This occurred on startup of a theory task in the beta project - seems to be stuck.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3651 - Posted: 12 Jul 2016, 19:14:34 UTC - in response to Message 3649.  

Thanks Crystal.

I have had this at least 5 or 6 times.
Next time, I will save the logs (they are actually there, contrary to what I said before).

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3652 - Posted: 12 Jul 2016, 19:33:06 UTC - in response to Message 3650.  
Last modified: 12 Jul 2016, 19:35:00 UTC

Yes, m,
exactly like this.

This only happens at the beginning of a task.
Once a task starts processing the first job, everything should be OK.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3653 - Posted: 12 Jul 2016, 19:48:51 UTC
Last modified: 12 Jul 2016, 19:58:44 UTC

Just got another one. EDIT: And another one.
Start.log:

07/12/16 21:40:25 Authorizations yet to be resolved:
07/12/16 21:40:25 allow READ: condor@fsauth/277-617-1197.277-617-1197 root@fsauth/277-617-1197.277-617-1197
07/12/16 21:40:25 allow WRITE: submit-side@matchsession/alicondorce01.cern.ch submit-side@matchsession/188.184.187.167 central-manager@277-617-1197/alicondor01.277-617-1197 condor@fsauth/* submit-side@matchsession/188.185.164.231 submit-side@matchsession/188.185.164.229 */10.0.11.58 submit-side@matchsession/ce504.cern.ch submit-side@matchsession/ce502.cern.ch submit-side@matchsession/ce503.cern.ch submit-side@matchsession/ce501.cern.ch worker-node@277-617-1197/277-617-1197.277-617-1197 central-manager@277-617-1197/277-617-1197.277-617-1197 computing-element@277-617-1197/277-617-1197.277-617-1197 condor@277-617-1197/277-617-1197.277-617-1197 root@fsauth/277-617-1197.277-617-1197
07/12/16 21:40:25 deny WRITE: anonymous@*/* *@unmapped/*
07/12/16 21:40:25 allow NEGOTIATOR: central-manager@277-617-1197/alicondor01.277-617-1197
07/12/16 21:40:25 deny NEGOTIATOR: anonymous@*/* *@unmapped/*
07/12/16 21:40:25 allow ADMINISTRATOR: worker-node@277-617-1197/277-617-1197.277-617-1197 central-manager@277-617-1197/277-617-1197.277-617-1197 computing-element@277-617-1197/277-617-1197.277-617-1197 condor@277-617-1197/277-617-1197.277-617-1197 condor@fsauth/277-617-1197.277-617-1197 */128.142.154.202 central-manager@277-617-1197/alicondor01.277-617-1197 */batchman.cern.ch

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,531
RAC: 199
Message 3654 - Posted: 12 Jul 2016, 20:58:50 UTC - in response to Message 3649.  

I think the "java <defunct>" processes can be cleaned up by tweaking the Condor configuration. Will look into it tomorrow.

m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 181
Message 3661 - Posted: 13 Jul 2016, 11:00:33 UTC - in response to Message 3654.  

Thanks, Laurence. These zombies are associated with many of the 206 errors I see.
Hopefully your fix will make a difference.

m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 181
Message 3769 - Posted: 21 Jul 2016, 22:30:57 UTC - in response to Message 3654.  

I think the "java <defunct>" processes can be cleaned up by tweaking the Condor configuration. Will look into it tomorrow.


These failures are still occurring. Has the "FIX" been done?

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,531
RAC: 199
Message 3806 - Posted: 25 Jul 2016, 11:59:46 UTC - in response to Message 3769.  

I have just committed a fix for the "java <defunct>" fault. The problem was that when Condor starts, it calls the JVM to publish some information about the Java environment in a ClassAd and this call hangs. The solution is just to tell Condor that there is no Java on the machine.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3808 - Posted: 25 Jul 2016, 14:02:56 UTC

What exactly is the "lost ratio"?
What is lost and where?

http://mcplots-dev.cern.ch/production.php?view=status&plots=daily#plots

m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 181
Message 3819 - Posted: 26 Jul 2016, 9:16:35 UTC - in response to Message 3806.  

I have just committed a fix for the "java <defunct>" fault. The problem was that when Condor starts, it calls the JVM to publish some information about the Java environment in a ClassAd and this call hangs. The solution is just to tell Condor that there is no Java on the machine.


Excellent, thanks Laurence.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3820 - Posted: 26 Jul 2016, 19:55:23 UTC

I set the preferences on the account to 1 task and 1 core.
It downloads 3 tasks (I have 4 cores).
Starting from scratch, it started a task with 2 cores (BOINC lists 1.5 cores).

After shutting down BOINC, restarting, and finishing the task, the next one runs on 2 cores again.

In other words, single core operation does not work.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1185
Credit: 846,901
RAC: 2,193
Message 3821 - Posted: 26 Jul 2016, 20:03:46 UTC

In preferences I set:

The maximum number of tasks per host 2
The maximum number of cores per task 0


and after each change I updated my client.

I still get 8 tasks.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3822 - Posted: 26 Jul 2016, 20:30:00 UTC
Last modified: 26 Jul 2016, 20:41:31 UTC

I detached from the project and reattached.

In preferences I set:

The maximum number of tasks per host 0
The maximum number of cores per task 0

Now it is running 3 tasks with 2 cores each. (I only have 4 cores, so why does BOINC allow it, given that it thinks it is running 3 * 1.5 cores = 4.5?)

EDIT: app_config still lists everything with 1.

EDIT 2: Re-reading the config file turns 2 of the 3 tasks to "waiting".

EDIT 3: Setting the Theory tasks in app_config to 3 and re-reading leaves 2 tasks running and one waiting, as it should.
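The core arithmetic in this post can be written down as a small Python sketch. It is only a naive budget check under the assumption that tasks should never claim more cores in total than the host has; it is not how the BOINC client actually schedules work, and the function names are made up for this example.

```python
import math


def max_concurrent_tasks(total_cores, cores_per_task):
    """Largest number of tasks whose combined core claim fits on the host."""
    if cores_per_task <= 0:
        raise ValueError("cores_per_task must be positive")
    return math.floor(total_cores / cores_per_task)


def is_overcommitted(total_cores, cores_per_task, running_tasks):
    """True when the running tasks claim more cores than the host has."""
    return running_tasks * cores_per_task > total_cores


if __name__ == "__main__":
    # Numbers from the post: 4 cores, 1.5 cores listed per task, 3 tasks.
    print(is_overcommitted(4, 1.5, 3))   # 3 * 1.5 = 4.5 claimed on 4 cores
    print(max_concurrent_tasks(4, 1.5))  # tasks that fit under this naive model
```

Under this model, 3 tasks at 1.5 cores each claim 4.5 cores on a 4-core host, and only 2 tasks fit, which matches the "2 running, one waiting" state reached after re-reading the config.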