Message boards : Theory Application : Open Issues

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 186
Message 2951 - Posted: 22 Apr 2016, 18:12:38 UTC
Last modified: 22 Apr 2016, 20:00:26 UTC

Here is the list of open issues. If there is something that is not listed, please post to get it added.

  • Suspend/Resume
  • Job Final State Analysis (and show result in logs/console)
  • Backoff on Errors
  • Download tasks according to requested time (workbuffer) and not a fixed number (see the sketch below)
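
A minimal sketch of the workbuffer item, assuming an average job runtime; the constant, the function name and the numbers are illustrative only, not part of the actual T4T code:

    # Hypothetical sketch: derive the number of jobs to fetch from the
    # requested work-buffer duration instead of using a fixed count.
    AVG_JOB_SECONDS = 3600  # assumed mean runtime of one Theory job

    def jobs_to_download(workbuffer_seconds, ncores):
        """Fetch enough jobs to keep ncores busy for the buffer window."""
        per_core = max(1, round(workbuffer_seconds / AVG_JOB_SECONDS))
        return per_core * ncores

    print(jobs_to_download(86400, 2))  # 1-day buffer on 2 cores -> 48
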

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2952 - Posted: 22 Apr 2016, 18:40:36 UTC
Last modified: 22 Apr 2016, 18:52:19 UTC

I have allocated 2 cores to the task.
The load average is very high (the 15-min average is sometimes up to 1.81).
I can only speculate how high it might be with just one core.
(Maybe I'll try that next.)

This is not a fault as such, but an efficiency issue.

Other tasks (CMS, ATLAS) are nowhere near that bad under the same conditions.

Finished jobs should show in the stderr with pass/fail status.

I would also like to see which app (pythia6.xxx, sherpa, herwig...) actually calculated the job.

EDIT: Not to forget: download tasks according to requested time (workbuffer) and not a fixed number.
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 186
Message 2953 - Posted: 22 Apr 2016, 19:43:16 UTC - in response to Message 2952.  

I have allocated 2 cores to the task.
The load average is very high (the 15-min average is sometimes up to 1.81).
I can only speculate how high it might be with just one core.
(Maybe I'll try that next.)

With 2 cores, 1.81 seems fine. The workload may just be adapting. I would suggest trying with 1 core.

Finished jobs should show in the stderr with pass/fail status.

Will extend the job analysis description.

I would also like to see which app (pythia6.xxx, sherpa, herwig...) actually calculated the job.

Does this happen in the T4T production version? If so, please let me know where you see this.

EDIT: Not to forget: download tasks according to requested time (workbuffer) and not a fixed number.

Will add this.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2955 - Posted: 22 Apr 2016, 19:58:02 UTC


I would also like to see which app (pythia6.xxx, sherpa, herwig...) actually calculated the job.


Does this happen in the T4T production version? If so, please let me know where you see this.


No, this is just for fault-finding. Certain issues may be caused only by certain apps, which we cannot tell if we do not know which app produced the job.

They used to be in the logs a few days ago, but things change...
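
As an illustration of the fault-finding idea, a hedged sketch that scans a job log for a known generator name; the names come from the list above, but the assumption that they appear verbatim in the log is mine:

    # Illustrative sketch only: report which generator a log mentions.
    KNOWN_APPS = ("pythia6", "sherpa", "herwig")  # from the post above

    def find_app(log_text):
        """Return the first known generator named in the log, if any."""
        lowered = log_text.lower()
        for app in KNOWN_APPS:
            if app in lowered:
                return app
        return None
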
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 186
Message 2956 - Posted: 22 Apr 2016, 20:02:31 UTC - in response to Message 2955.  

The logs should be back. A few things broke while refactoring today. If they are not there in new tasks from now on, let me know.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2969 - Posted: 23 Apr 2016, 7:17:32 UTC

I have allocated 2 cores to the task.
The load average is very high (the 15-min average is sometimes up to 1.81).
I can only speculate how high it might be with just one core.
(Maybe I'll try that next.)


I have tested it with only one core.
The load average is, as expected, very high (1.85-1.93 15-min load average).

This is with the agile-runmc app on two tasks.

The apps are still nowhere to be found in the logs.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 846,971
RAC: 1,723
Message 2976 - Posted: 23 Apr 2016, 20:14:41 UTC

My last 4 tasks ended in a computation error for no apparent reason.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=155757

2016-04-23 21:35:59 (5040): Status Report: CPU Time: '35317.362792'
2016-04-23 22:03:57 (5040): Guest Log: [ERROR] Condor exited with 0
2016-04-23 22:03:57 (5040): Guest Log: [INFO] Shutting Down.
2016-04-23 22:03:57 (5040): VM Completion File Detected.
2016-04-23 22:03:57 (5040): VM Completion Message: Condor exited with 0
.
2016-04-23 22:03:57 (5040): Powering off VM.
2016-04-23 22:03:59 (5040): Successfully stopped VM.
2016-04-23 22:04:04 (5040): Deregistering VM. (boinc_394876cc1189c4ec, slot#1)
2016-04-23 22:04:04 (5040): Removing virtual disk drive(s) from VM.
2016-04-23 22:04:04 (5040): Removing network bandwidth throttle group from VM.
2016-04-23 22:04:04 (5040): Removing storage controller(s) from VM.
2016-04-23 22:04:04 (5040): Removing VM from VirtualBox.
22:04:09 (5040): called boinc_finish(1)
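
The log shows Condor exiting with status 0, yet the wrapper calling boinc_finish(1). A hedged sketch of the final-state analysis requested earlier in this thread, assuming only the completion-message format visible above:

    # Sketch of "Job Final State Analysis": map the VM completion
    # message to an exit status. The message format is taken from the
    # log above; the mapping itself is an illustrative assumption.
    import re

    def classify_completion(message):
        """Return 0 when Condor reported a clean exit, 1 otherwise."""
        m = re.search(r"Condor exited with (\d+)", message)
        if m and int(m.group(1)) == 0:
            return 0  # clean shutdown; the task should validate
        return 1      # anything else counts as a computation error

    assert classify_completion("Condor exited with 0") == 0
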
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2977 - Posted: 23 Apr 2016, 20:22:52 UTC
Last modified: 23 Apr 2016, 20:32:27 UTC

Task finished after about 10h with a computation error.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=155454

Everything seemed fine; the last job was successful.

Similar to Crystal Pellet.

2 more tasks failed with a computation error after 10h to 10h30min of runtime.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2978 - Posted: 23 Apr 2016, 20:36:00 UTC
Last modified: 23 Apr 2016, 20:38:48 UTC

The odd thing is in the first few lines of the stderr.


<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
Unzulässige Funktion.
(0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
2016-04-23 12:22:05 (2472): vboxwrapper (7.7.26184): starting



"Unzulässige Funktion" (Incorrect function)---??????

EDIT: My last valid result did not have that.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=155532
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 186
Message 2979 - Posted: 23 Apr 2016, 20:53:45 UTC - in response to Message 2976.  

Sorry about that. We should give credit in this case. Need to double-check what is going on when Condor exits.
Ben Segal
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 12 Sep 14
Posts: 65
Credit: 544
RAC: 0
Message 2980 - Posted: 23 Apr 2016, 20:56:09 UTC - in response to Message 2969.  
Last modified: 23 Apr 2016, 20:56:47 UTC

I have allocated 2 cores to the task.
The load average is very high (the 15-min average is sometimes up to 1.81).
I can only speculate how high it might be with just one core.
(Maybe I'll try that next.)


I have tested it with only one core.
The load average is, as expected, very high (1.85-1.93 15-min load average).

This is with the agile-runmc app on two tasks.

The apps are still nowhere to be found in the logs.

Theory apps are dual-threaded, but the second thread is only used for graphics generation and uses less than half a CPU. In the past (back in the days of cernvmwrapper) we tried allocating 2 cores per task but discontinued it as it wasted half a CPU on average.

In any case, the numerical value of the "load average" doesn't map exactly to the number of CPUs loaded, so don't worry too much about it.
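
For anyone who wants to check this on their own host, a small sketch (Linux/macOS only) relating the 15-min load average to the cores granted to the VM; the interpretation is informal:

    # Informal sketch: 15-min system load average per allotted core.
    import os

    def load_per_core(ncores):
        load1, load5, load15 = os.getloadavg()  # 1-, 5-, 15-min averages
        return load15 / ncores

    # A value near 1.9 on a single-core VM, as reported above, suggests
    # the job could keep almost two cores busy.
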
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2981 - Posted: 23 Apr 2016, 21:04:11 UTC - in response to Message 2980.  

In any case, the numerical value of the "load average" doesn't map exactly to the number of CPUs loaded, so don't worry too much about it.


I am just concerned that 1 core is doing work where nearly 2 would be needed.

It is either wasting some CPU (2 cores) or slowing down the task quite a bit (1 core).
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3042 - Posted: 26 Apr 2016, 19:12:55 UTC

Thanks for adding the finished x.log files.
Now we can see if a job passed or failed.
Ray Murray
Joined: 13 Apr 15
Posts: 138
Credit: 2,945,852
RAC: 0
Message 3086 - Posted: 28 Apr 2016, 20:58:16 UTC
Last modified: 28 Apr 2016, 20:59:13 UTC

Over at VirtualLHC, the 32-bit app has a base memory requirement of only 256 MB, and the Challenge 64-bit app runs happily with 512 MB per core. The apps here request 2 GB, triggering the need for app_configs to limit the number of running tasks so as not to overburden contributors' hosts (see the example below). Is there any prospect of reducing the VM memory requirement before these 64-bit apps are released to production?
(question applies equally for CMS and ATLAS)
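
For anyone needing the workaround in the meantime, a minimal app_config.xml along these lines caps the number of concurrent tasks; the app name below is an assumption, so verify it against client_state.xml first:

    <!-- Example BOINC app_config.xml; place it in the project folder.
         The app name "Theory" is an assumption, check client_state.xml. -->
    <app_config>
      <app>
        <name>Theory</name>
        <max_concurrent>2</max_concurrent>
      </app>
    </app_config>
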
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 186
Message 3090 - Posted: 28 Apr 2016, 22:29:42 UTC - in response to Message 3086.  

Theory yes, CMS and ATLAS no.
