Message boards : Theory Application : Theory native - Wrong Runtime Reporting
Message board moderation

To post messages, you must log in.

AuthorMessage
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 28 Jul 16
Posts: 473
Credit: 389,411
RAC: 38
Message 6117 - Posted: 28 Feb 2019, 13:05:31 UTC

To investigate the suspicious runtime reporting mentioned here I captured the scheduler_request files that are sent to the server when the client reports a task.

An example looks as follows:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2755871
<result>
    <name>Theory_2279-789277-24_0</name>
    <final_cpu_time>3680.719000</final_cpu_time>
    <final_elapsed_time>3040.000000</final_elapsed_time>
    <exit_status>0</exit_status>
    <state>5</state>
    <platform>x86_64-pc-linux-gnu</platform>
    <version_num>418</version_num>
    <plan_class>native_theory</plan_class>
    <final_peak_working_set_size>92508160</final_peak_working_set_size>
    <final_peak_swap_size>417730560</final_peak_swap_size>
    <final_peak_disk_usage>4961035</final_peak_disk_usage>
    <app_version_num>418</app_version_num>
<stderr_out>
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<stderr_txt>
12:34:14 (26060): wrapper (7.7.26015): starting
12:34:14 (26060): wrapper: running ../../projects/lhcathomedev.cern.ch_lhcathome-dev/cranky-0.0.24 ()
11:34:14 2019-02-28: cranky-0.0.24: [INFO] Detected Theory App
11:34:14 2019-02-28: cranky-0.0.24: [INFO] Checking CVMFS.
11:34:16 2019-02-28: cranky-0.0.24: [INFO] Checking runc.
11:34:16 2019-02-28: cranky-0.0.24: [INFO] Creating the filesystem.
11:34:16 2019-02-28: cranky-0.0.24: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3
11:34:16 2019-02-28: cranky-0.0.24: [INFO] Updating config.json.
11:34:16 2019-02-28: cranky-0.0.24: [INFO] Running Container 'runc'.
===> [runRivet] Thu Feb 28 11:34:16 UTC 2019 [boinc pp jets 8000 25 - pythia6 6.428 392 100000 24]
12:24:52 2019-02-28: cranky-0.0.24: [INFO] Preparing output.
13:24:53 (26060): cranky exited; CPU time 3680.719021
13:24:53 (26060): called boinc_finish(0)

</stderr_txt>
]]>
</stderr_out>
<file_info>
    <name>Theory_2279-789277-24_0_r1627165468_result</name>
    <nbytes>27769.000000</nbytes>
    <max_nbytes>500000000.000000</max_nbytes>
    <md5_cksum>d14c090edd4e76b5c69cf60159775cb0</md5_cksum>
    <upload_url>http://lhcathome-test.cern.ch/lhcathome-dev_cgi/file_upload_handler</upload_url>
</file_info>
</result>

Instead of the reported runtime of 3040 s mentioned in <final_elapsed_time> the server DB shows 3,680.72 s.
Thus it's most likely a server issue instead of a client issue.


I also checked ATLAS native on the same host (separate client but same client version and same wrapper version) with tasks from lhc-production.
The runtimes reported there seem to be ok.
Thus I suggest the validation scripts used for Theory native should be reworked.
ID: 6117 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 266
Message 6118 - Posted: 28 Feb 2019, 14:14:06 UTC - in response to Message 6117.  

As far I could see within your results, almost all tasks have the same cpu- and run-time.
Very rarely the cpu-time is lower, but I couldn't find a task with higher cpu-time than run-time.
In my list of tasks, I have all three combinations. Cpu-time lower, equal and higher.

I hope you have set in preferences # of CPUs to 1 and are not using an app_config.xml.
ID: 6118 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 28 Jul 16
Posts: 473
Credit: 389,411
RAC: 38
Message 6119 - Posted: 28 Feb 2019, 14:34:21 UTC - in response to Message 6118.  

I think the task details do not tell the truth.

When the client reports a result it sends 2 time values included in the scheduler_request.
Example:
<final_cpu_time>3680.719000</final_cpu_time>
<final_elapsed_time>3040.000000</final_elapsed_time>

<final_elapsed_time> matches the runtime that can be calculated from the timestamps in the logfile (wrapper ... starting / called boinc_finish).
Thus the BOINC client seems to be correct.

Now, why does the server not report <final_elapsed_time> as "runtime" on the task detail page?
As I wrote, ATLAS native from lhc-production reports it as expected.

#CPUs is set to 1.
app_config.xml only limits the max #of concurrently running tasks.
ID: 6119 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 266
Message 6120 - Posted: 28 Feb 2019, 15:03:31 UTC - in response to Message 6119.  
Last modified: 28 Feb 2019, 15:08:03 UTC

Now, why does the server not report <final_elapsed_time> as "runtime" on the task detail page?
The problem is, it sometimes does and sometimes does not.
Examples: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2755821
and https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2755682

You seem to have configured your machine 3406 with 3 cores.
Could you keep that 3 cores, but run only 2 tasks at a time and see what is stored when cpu-time probably is much higher than run-time?
ID: 6120 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 266
Message 6122 - Posted: 1 Mar 2019, 12:32:56 UTC

A single core VM running this task.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2756121
CPU-time lower than elapsed and correctly stored.

A dual core VM running 1 task. Using the 2nd core for several job-related processes.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2756182
CPU-time higher than elapsed and correctly stored.
ID: 6122 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
maeax

Send message
Joined: 22 Apr 16
Posts: 664
Credit: 1,803,242
RAC: 2,551
Message 6123 - Posted: 1 Mar 2019, 13:05:11 UTC - in response to Message 6122.  

Theory have one job per task.
Is it useful to take more than one CPU in preferences?
We get max. 6 tasks per Computer for testing this project.
So, I think there is no win of performance with more than one CPU, or..?
ID: 6123 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 28 Jul 16
Posts: 473
Credit: 389,411
RAC: 38
Message 6124 - Posted: 1 Mar 2019, 13:42:19 UTC - in response to Message 6122.  

A single core VM running this task.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2756121
CPU-time lower than elapsed and correctly stored.

A dual core VM running 1 task. Using the 2nd core for several job-related processes.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2756182
CPU-time higher than elapsed and correctly stored.

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2756121
echo $(( $(date -d "11:23:41" +%s) - $(date -d "08:06:43" +%s) ))
11818

3 * 3600 + 16 * 60 + 21 = 11781

11818 - 11781 = 37
=> 37 s missing


https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2756182
echo $(( $(date -d "13:22:58" +%s) - $(date -d "11:34:07" +%s) ))
6531

1 * 3600 + 49 * 60 + 5 = 6545

6531 - 6545 = -14
=> 14 s too much
ID: 6124 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 266
Message 6125 - Posted: 1 Mar 2019, 13:50:00 UTC - in response to Message 6123.  

So, I think there is no win of performance with more than one CPU, or..?
When you want to do a maximum number of jobs, 1 CPU/job is the best choice.
When you want to run jobs faster, you could leave 1 or 2 cores free.
Only Sherpa will not have so much profit from that, because of the (very) long initialization and integration phase, mostly done by one process.
During the event processing phase, there are several processes per job active, so they could use free cores to speed up the job.
ID: 6125 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 28 Jul 16
Posts: 473
Credit: 389,411
RAC: 38
Message 6126 - Posted: 1 Mar 2019, 14:31:06 UTC - in response to Message 6123.  

Theory have one job per task.
Is it useful to take more than one CPU in preferences?
We get max. 6 tasks per Computer for testing this project.
So, I think there is no win of performance with more than one CPU, or..?

To get a list of your running BOINC clients run "pgrep -lx boinc".
The output includes the client's PID:
25611 boinc

Then run (using this PID) "pstree -pnl 25611".
The output looks similar to this:
boinc(25611)─┬─wrapper_26015_x(14325)─┬─{wrapper_26015_x}(14326)
             │                        └─cranky-0.0.24(14327)───runc(15342)─┬─{runc}(15343)
             │                                                             ├─{runc}(15344)
             │                                                             ├─{runc}(15345)
             │                                                             ├─{runc}(15346)
             │                                                             ├─{runc}(15347)
             │                                                             ├─{runc}(15348)
             │                                                             ├─{runc}(15350)
             │                                                             ├─{runc}(15351)
             │                                                             └─job(15353)───runRivet.sh(15391)─┬─rungen.sh(17809)───Herwig++(17923)
             │                                                                                               ├─rivetvm.exe(17810)
             │                                                                                               ├─runRivet.sh(17811)───sleep(84751)
             │                                                                                               └─sleep(86070)
             ├─{boinc}(25673)
             ├─wrapper_26015_x(65640)─┬─{wrapper_26015_x}(65642)
             │                        └─cranky-0.0.24(65643)───runc(66703)─┬─{runc}(66704)
             │                                                             ├─{runc}(66705)
             │                                                             ├─{runc}(66706)
             │                                                             ├─{runc}(66707)
             │                                                             ├─{runc}(66708)
             │                                                             ├─{runc}(66711)
             │                                                             ├─{runc}(66713)
             │                                                             └─job(66714)───runRivet.sh(66747)─┬─rungen.sh(68041)───pythia8.exe(68556)
             │                                                                                               ├─rivetvm.exe(68042)
             │                                                                                               ├─runRivet.sh(68043)───sleep(85043)
             │                                                                                               └─sleep(85461)
             └─wrapper_26015_x(74089)─┬─{wrapper_26015_x}(74090)
                                      └─cranky-0.0.24(74091)───runc(75302)─┬─{runc}(75303)
                                                                           ├─{runc}(75304)
                                                                           ├─{runc}(75305)
                                                                           ├─{runc}(75306)
                                                                           ├─{runc}(75307)
                                                                           ├─{runc}(75308)
                                                                           ├─{runc}(75310)
                                                                           ├─job(75312)───runRivet.sh(75352)─┬─rungen.sh(80840)───agile-runmc(80946)
                                                                           │                                 ├─rivetvm.exe(80841)
                                                                           │                                 ├─runRivet.sh(80842)───plotter.exe(85958)
                                                                           │                                 └─sleep(86064)
                                                                           └─{runc}(75321)


Here you see 3 singlecore Theory native tasks (main program: wrapper...).
Each of them has a couple of running child processes.
CPU consuming are rivetvm.exe and the event generators, e.g. Herwig++.
On a 1-core system all processes share that single CPU core.
On a multicore system all processes share all CPU cores.
Hence the CPU-time can be larger than the runtime on a multicore system.

ATM Theory native is designed to be a 1-core app.
If you configure 2 cores for your BOINC client, it will reserve 2 cores but each task will start only 1 cranky.
ID: 6126 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
maeax

Send message
Joined: 22 Apr 16
Posts: 664
Credit: 1,803,242
RAC: 2,551
Message 6127 - Posted: 1 Mar 2019, 14:43:51 UTC - in response to Message 6125.  

Only Sherpa will not have so much profit from that, because of the (very) long initialization and integration phase, mostly done by one process.
During the event processing phase, there are several processes per job active, so they could use free cores to speed up the job.

Have a AMD FX-8Core running with 8-Core use and 6 Theory-tasks parallel in ScientificLinux 7.6:
https://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=3728
So, I think Linux take the CPU it need to finish those tasks.
ID: 6127 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 266
Message 6128 - Posted: 1 Mar 2019, 17:16:30 UTC

During this task I paused my Linux-VM for about 30 minutes and probably more.

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2756253
ID: 6128 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 28 Jul 16
Posts: 473
Credit: 389,411
RAC: 38
Message 6149 - Posted: 5 Mar 2019, 16:49:44 UTC

The wrong time reporting disappears if the computer is heavily overloaded or (in my case) something like a kernel quota simulates a heavy overload.

Method:
- Theory native client is configured to run 4 tasks concurrently
- Kernel CPUquota limits this client to max. 3.6 CPUs (90%)
- Theory native client shares a slice with other BOINC clients (ATLAS, Theory vbox)
- Those other clients get higher CPUShares (especially Theory vbox to avoid the watchdog timer issue)

Examples:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2757396
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2757427
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2757393

Runtime and CPU-time now match the values that are reported via scheduler_request.
ID: 6149 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote

Message boards : Theory Application : Theory native - Wrong Runtime Reporting


©2024 CERN