Thread 'Theory native - Wrong Runtime Reporting'

Author	Message
computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 544 Credit: 400,710 RAC: 0	Message 6117 - Posted: 28 Feb 2019, 13:05:31 UTC estigate the suspicious runtime reporting mentioned here I captured the scheduler_request files that are sent to the server when the client reports a task. An example looks as follows: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2755871 [pre]<result> <name>Theory_2279-789277-24_0</name> <final_cpu_time>3680.719000</final_cpu_time> <final_elapsed_time>3040.000000</final_elapsed_time> <exit_status>0</exit_status> <state>5</state> <platform>x86_64-pc-linux-gnu</platform> <version_num>418</version_num> <plan_class>native_theory</plan_class> <final_peak_working_set_size>92508160</final_peak_working_set_size> <final_peak_swap_size>417730560</final_peak_swap_size> <final_peak_disk_usage>4961035</final_peak_disk_usage> <app_version_num>418</app_version_num> <stderr_out> <core_client_version>7.14.2</core_client_version> <![CDATA[ <stderr_txt> 12:34:14 (26060): wrapper (7.7.26015): starting 12:34:14 (26060): wrapper: running ../../projects/lhcathomedev.cern.ch_lhcathome-dev/cranky-0.0.24 () 11:34:14 2019-02-28: cranky-0.0.24: [INFO] Detected Theory App 11:34:14 2019-02-28: cranky-0.0.24: [INFO] Checking CVMFS. 11:34:16 2019-02-28: cranky-0.0.24: [INFO] Checking runc. 11:34:16 2019-02-28: cranky-0.0.24: [INFO] Creating the filesystem. 11:34:16 2019-02-28: cranky-0.0.24: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3 11:34:16 2019-02-28: cranky-0.0.24: [INFO] Updating config.json. 11:34:16 2019-02-28: cranky-0.0.24: [INFO] Running Container 'runc'. ===> [runRivet] Thu Feb 28 11:34:16 UTC 2019 [boinc pp jets 8000 25 - pythia6 6.428 392 100000 24] 12:24:52 2019-02-28: cranky-0.0.24: [INFO] Preparing output. 13:24:53 (26060): cranky exited; CPU time 3680.719021 13:24:53 (26060): called boinc_finish(0) </stderr_txt> ]]> </stderr_out> <file_info> <name>Theory_2279-789277-24_0_r1627165468_result</name> <nbytes>27769.000000</nbytes> <max_nbytes>500000000.000000</max_nbytes> <md5_cksum>d14c090edd4e76b5c69cf60159775cb0</md5_cksum> <upload_url>http://lhcathome-test.cern.ch/lhcathome-dev_cgi/file_upload_handler</upload_url> </file_info> </result>[/pre] Instead of the reported runtime of 3040 s mentioned in <final_elapsed_time> the server DB shows 3,680.72 s. Thus it's most likely a server issue instead of a client issue. I also checked ATLAS native on the same host (separate client but same client version and same wrapper version) with tasks from lhc-production. The runtimes reported there seem to be ok. Thus I suggest the validation scripts used for Theory native should be reworked. ID: 6117 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,240 RAC: 69	Message 6118 - Posted: 28 Feb 2019, 14:14:06 UTC - in response to Message 6117. As far I could see within your results, almost all tasks have the same cpu- and run-time. Very rarely the cpu-time is lower, but I couldn't find a task with higher cpu-time than run-time. In my list of tasks, I have all three combinations. Cpu-time lower, equal and higher. I hope you have set in preferences # of CPUs to 1 and are not using an app_config.xml. ID: 6118 · Rating: 0 · rate: / Reply Quote

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 544 Credit: 400,710 RAC: 0	Message 6119 - Posted: 28 Feb 2019, 14:34:21 UTC - in response to Message 6118. k the task details do not tell the truth. When the client reports a result it sends 2 time values included in the scheduler_request. Example: [pre]<final_cpu_time>3680.719000</final_cpu_time> <final_elapsed_time>3040.000000</final_elapsed_time>[/pre] <final_elapsed_time> matches the runtime that can be calculated from the timestamps in the logfile (wrapper ... starting / called boinc_finish). Thus the BOINC client seems to be correct. Now, why does the server not report <final_elapsed_time> as "runtime" on the task detail page? As I wrote, ATLAS native from lhc-production reports it as expected. #CPUs is set to 1. app_config.xml only limits the max #of concurrently running tasks. ID: 6119 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,240 RAC: 69	Message 6120 - Posted: 28 Feb 2019, 15:03:31 UTC - in response to Message 6119. Last modified: 28 Feb 2019, 15:08:03 UTC Now, why does the server not report <final_elapsed_time> as "runtime" on the task detail page? The problem is, it sometimes does and sometimes does not. Examples: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2755821 and https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2755682 You seem to have configured your machine 3406 with 3 cores. Could you keep that 3 cores, but run only 2 tasks at a time and see what is stored when cpu-time probably is much higher than run-time? ID: 6120 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,240 RAC: 69	Message 6122 - Posted: 1 Mar 2019, 12:32:56 UTC A single core VM running this task. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2756121 CPU-time lower than elapsed and correctly stored. A dual core VM running 1 task. Using the 2nd core for several job-related processes. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2756182 CPU-time higher than elapsed and correctly stored. ID: 6122 · Rating: 0 · rate: / Reply Quote

maeax Send message Joined: 22 Apr 16 Posts: 803 Credit: 4,286,286 RAC: 2,664	Message 6123 - Posted: 1 Mar 2019, 13:05:11 UTC - in response to Message 6122. Theory have one job per task. Is it useful to take more than one CPU in preferences? We get max. 6 tasks per Computer for testing this project. So, I think there is no win of performance with more than one CPU, or..? ID: 6123 · Rating: 0 · rate: / Reply Quote

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 544 Credit: 400,710 RAC: 0	Message 6124 - Posted: 1 Mar 2019, 13:42:19 UTC - in response to Message 6122. A single core VM running this task. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2756121 CPU-time lower than elapsed and correctly stored. A dual core VM running 1 task. Using the 2nd core for several job-related processes. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2756182 CPU-time higher than elapsed and correctly stored. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2756121 echo $(( $(date -d "11:23:41" +%s) - $(date -d "08:06:43" +%s) )) 11818 3 * 3600 + 16 * 60 + 21 = 11781 11818 - 11781 = 37 => 37 s missing https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2756182 echo $(( $(date -d "13:22:58" +%s) - $(date -d "11:34:07" +%s) )) 6531 1 * 3600 + 49 * 60 + 5 = 6545 6531 - 6545 = -14 => 14 s too much ID: 6124 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,240 RAC: 69	Message 6125 - Posted: 1 Mar 2019, 13:50:00 UTC - in response to Message 6123. So, I think there is no win of performance with more than one CPU, or..? When you want to do a maximum number of jobs, 1 CPU/job is the best choice. When you want to run jobs faster, you could leave 1 or 2 cores free. Only Sherpa will not have so much profit from that, because of the (very) long initialization and integration phase, mostly done by one process. During the event processing phase, there are several processes per job active, so they could use free cores to speed up the job. ID: 6125 · Rating: 0 · rate: / Reply Quote

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 544 Credit: 400,710 RAC: 0	Message 6126 - Posted: 1 Mar 2019, 14:31:06 UTC - in response to Message 6123. ]Theory have one job per task. Is it useful to take more than one CPU in preferences? We get max. 6 tasks per Computer for testing this project. So, I think there is no win of performance with more than one CPU, or..?[/quote] To get a list of your running BOINC clients run "pgrep -lx boinc". The output includes the client's PID: 25611 boinc Then run (using this PID) "pstree -pnl 25611". The output looks similar to this: [pre]boinc(25611)─┬─wrapper_26015_x(14325)─┬─{wrapper_26015_x}(14326) │ └─cranky-0.0.24(14327)───runc(15342)─┬─{runc}(15343) │ ├─{runc}(15344) │ ├─{runc}(15345) │ ├─{runc}(15346) │ ├─{runc}(15347) │ ├─{runc}(15348) │ ├─{runc}(15350) │ ├─{runc}(15351) │ └─job(15353)───runRivet.sh(15391)─┬─rungen.sh(17809)───Herwig++(17923) │ ├─rivetvm.exe(17810) │ ├─runRivet.sh(17811)───sleep(84751) │ └─sleep(86070) ├─{boinc}(25673) ├─wrapper_26015_x(65640)─┬─{wrapper_26015_x}(65642) │ └─cranky-0.0.24(65643)───runc(66703)─┬─{runc}(66704) │ ├─{runc}(66705) │ ├─{runc}(66706) │ ├─{runc}(66707) │ ├─{runc}(66708) │ ├─{runc}(66711) │ ├─{runc}(66713) │ └─job(66714)───runRivet.sh(66747)─┬─rungen.sh(68041)───pythia8.exe(68556) │ ├─rivetvm.exe(68042) │ ├─runRivet.sh(68043)───sleep(85043) │ └─sleep(85461) └─wrapper_26015_x(74089)─┬─{wrapper_26015_x}(74090) └─cranky-0.0.24(74091)───runc(75302)─┬─{runc}(75303) ├─{runc}(75304) ├─{runc}(75305) ├─{runc}(75306) ├─{runc}(75307) ├─{runc}(75308) ├─{runc}(75310) ├─job(75312)───runRivet.sh(75352)─┬─rungen.sh(80840)───agile-runmc(80946) │ ├─rivetvm.exe(80841) │ ├─runRivet.sh(80842)───plotter.exe(85958) │ └─sleep(86064) └─{runc}(75321)[/pre] Here you see 3 singlecore Theory native tasks (main program: wrapper...). Each of them has a couple of running child processes. CPU consuming are rivetvm.exe and the event generators, e.g. Herwig++. On a 1-core system all processes share that single CPU core. On a multicore system all processes share all CPU cores. Hence the CPU-time can be larger than the runtime on a multicore system. ATM Theory native is designed to be a 1-core app. If you configure 2 cores for your BOINC client, it will reserve 2 cores but each task will start only 1 cranky. ID: 6126 · Rating: 0 · rate: / Reply Quote

maeax Send message Joined: 22 Apr 16 Posts: 803 Credit: 4,286,286 RAC: 2,664	Message 6127 - Posted: 1 Mar 2019, 14:43:51 UTC - in response to Message 6125. Only Sherpa will not have so much profit from that, because of the (very) long initialization and integration phase, mostly done by one process. During the event processing phase, there are several processes per job active, so they could use free cores to speed up the job. Have a AMD FX-8Core running with 8-Core use and 6 Theory-tasks parallel in ScientificLinux 7.6: https://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=3728 So, I think Linux take the CPU it need to finish those tasks. ID: 6127 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,240 RAC: 69	Message 6128 - Posted: 1 Mar 2019, 17:16:30 UTC During this task I paused my Linux-VM for about 30 minutes and probably more. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2756253 ID: 6128 · Rating: 0 · rate: / Reply Quote

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 544 Credit: 400,710 RAC: 0	Message 6149 - Posted: 5 Mar 2019, 16:49:44 UTC The wrong time reporting disappears if the computer is heavily overloaded or (in my case) something like a kernel quota simulates a heavy overload. Method: - Theory native client is configured to run 4 tasks concurrently - Kernel CPUquota limits this client to max. 3.6 CPUs (90%) - Theory native client shares a slice with other BOINC clients (ATLAS, Theory vbox) - Those other clients get higher CPUShares (especially Theory vbox to avoid the watchdog timer issue) Examples: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2757396 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2757427 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2757393 Runtime and CPU-time now match the values that are reported via scheduler_request. ID: 6149 · Rating: 0 · rate: / Reply Quote

Development for LHC@home