Theory native - Wrong Runtime Reporting
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 1
To investigate the suspicious runtime reporting mentioned here, I captured the scheduler_request files that are sent to the server when the client reports a task. An example looks as follows:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2755871

<result>
    <name>Theory_2279-789277-24_0</name>
    <final_cpu_time>3680.719000</final_cpu_time>
    <final_elapsed_time>3040.000000</final_elapsed_time>
    <exit_status>0</exit_status>
    <state>5</state>
    <platform>x86_64-pc-linux-gnu</platform>
    <version_num>418</version_num>
    <plan_class>native_theory</plan_class>
    <final_peak_working_set_size>92508160</final_peak_working_set_size>
    <final_peak_swap_size>417730560</final_peak_swap_size>
    <final_peak_disk_usage>4961035</final_peak_disk_usage>
    <app_version_num>418</app_version_num>
    <stderr_out>
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<stderr_txt>
12:34:14 (26060): wrapper (7.7.26015): starting
12:34:14 (26060): wrapper: running ../../projects/lhcathomedev.cern.ch_lhcathome-dev/cranky-0.0.24 ()
11:34:14 2019-02-28: cranky-0.0.24: [INFO] Detected Theory App
11:34:14 2019-02-28: cranky-0.0.24: [INFO] Checking CVMFS.
11:34:16 2019-02-28: cranky-0.0.24: [INFO] Checking runc.
11:34:16 2019-02-28: cranky-0.0.24: [INFO] Creating the filesystem.
11:34:16 2019-02-28: cranky-0.0.24: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3
11:34:16 2019-02-28: cranky-0.0.24: [INFO] Updating config.json.
11:34:16 2019-02-28: cranky-0.0.24: [INFO] Running Container 'runc'.
===> [runRivet] Thu Feb 28 11:34:16 UTC 2019 [boinc pp jets 8000 25 - pythia6 6.428 392 100000 24]
12:24:52 2019-02-28: cranky-0.0.24: [INFO] Preparing output.
13:24:53 (26060): cranky exited; CPU time 3680.719021
13:24:53 (26060): called boinc_finish(0)
</stderr_txt>
]]>
    </stderr_out>
    <file_info>
        <name>Theory_2279-789277-24_0_r1627165468_result</name>
        <nbytes>27769.000000</nbytes>
        <max_nbytes>500000000.000000</max_nbytes>
        <md5_cksum>d14c090edd4e76b5c69cf60159775cb0</md5_cksum>
        <upload_url>http://lhcathome-test.cern.ch/lhcathome-dev_cgi/file_upload_handler</upload_url>
    </file_info>
</result>

Instead of the runtime of 3040 s reported in <final_elapsed_time>, the server DB shows 3,680.72 s. Thus it's most likely a server issue rather than a client issue. I also checked ATLAS native on the same host (a separate client, but the same client version and the same wrapper version) with tasks from lhc-production. The runtimes reported there seem to be OK. I therefore suggest that the validation scripts used for Theory native be reworked.
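For anyone who wants to repeat the check, the two time fields can be pulled out of a captured scheduler request with a pair of sed one-liners. This is only a sketch: the temporary file and the trimmed-down XML stand in for a real sched_request_*.xml from the BOINC data directory, which contains much more.

```shell
# Sketch: extract the two time fields from a captured scheduler request.
# The content below is a trimmed-down stand-in for a real
# sched_request_*.xml from the BOINC data directory.
req=$(mktemp)
cat > "$req" <<'EOF'
<result>
  <final_cpu_time>3680.719000</final_cpu_time>
  <final_elapsed_time>3040.000000</final_elapsed_time>
</result>
EOF
cpu=$(sed -n 's:.*<final_cpu_time>\([0-9.]*\)</final_cpu_time>.*:\1:p' "$req")
elapsed=$(sed -n 's:.*<final_elapsed_time>\([0-9.]*\)</final_elapsed_time>.*:\1:p' "$req")
echo "cpu=$cpu elapsed=$elapsed"
```

If the runtime shown on the task detail page differs from the extracted <final_elapsed_time>, the discrepancy was introduced on the server side.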
Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 2
As far as I could see from your results, almost all tasks have the same CPU time and runtime. Very rarely the CPU time is lower, but I couldn't find a task with a higher CPU time than runtime. In my list of tasks I have all three combinations: CPU time lower, equal, and higher. I hope you have set # of CPUs to 1 in the preferences and are not using an app_config.xml.
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 1
I think the task details do not tell the truth. When the client reports a result, it sends 2 time values in the scheduler_request. Example:

<final_cpu_time>3680.719000</final_cpu_time>
<final_elapsed_time>3040.000000</final_elapsed_time>

<final_elapsed_time> matches the runtime that can be calculated from the timestamps in the logfile (wrapper ... starting / called boinc_finish). Thus the BOINC client seems to be correct. Now, why does the server not report <final_elapsed_time> as "runtime" on the task detail page? As I wrote, ATLAS native from lhc-production reports it as expected. # of CPUs is set to 1; app_config.xml only limits the max # of concurrently running tasks.
Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 2
"Now, why does the server not report <final_elapsed_time> as "runtime" on the task detail page?"

The problem is, it sometimes does and sometimes does not. Examples:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2755821
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2755682

You seem to have configured your machine 3406 with 3 cores. Could you keep those 3 cores but run only 2 tasks at a time, and see what is stored when the CPU time is probably much higher than the runtime?
Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 2
A single core VM running this task:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2756121
CPU time lower than elapsed and correctly stored.

A dual core VM running 1 task, using the 2nd core for several job-related processes:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2756182
CPU time higher than elapsed and correctly stored.
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 2
Theory has one job per task. Is it useful to select more than one CPU in the preferences? We get max. 6 tasks per computer for testing this project. So I think there is no performance win with more than one CPU, or..?
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 1
"A single core VM running this task. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2756121"

echo $(( $(date -d "11:23:41" +%s) - $(date -d "08:06:43" +%s) ))
11818
3 * 3600 + 16 * 60 + 21 = 11781
11818 - 11781 = 37 => 37 s missing

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2756182

echo $(( $(date -d "13:22:58" +%s) - $(date -d "11:34:07" +%s) ))
6531
1 * 3600 + 49 * 60 + 5 = 6545
6531 - 6545 = -14 => 14 s too much
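The same arithmetic can be wrapped in a tiny helper, so the log span is computed in one call (GNU date assumed; both timestamps must fall on the same day):

```shell
# Seconds between two same-day wall-clock timestamps (GNU date assumed).
elapsed() {
    echo $(( $(date -d "$2" +%s) - $(date -d "$1" +%s) ))
}
t1=$(elapsed "08:06:43" "11:23:41")   # log span of the first task
t2=$(elapsed "11:34:07" "13:22:58")   # log span of the second task
echo "$t1 $t2"                        # prints "11818 6531"
```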
Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 2
"So, I think there is no win of performance with more than one CPU, or..?"

When you want to do a maximum number of jobs, 1 CPU/job is the best choice. When you want to run jobs faster, you could leave 1 or 2 cores free. Only Sherpa will not profit much from that, because of the (very) long initialization and integration phase, which is mostly done by one process. During the event-processing phase there are several processes active per job, so they can use the free cores to speed up the job.
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 1
"Theory have one job per task."

To get a list of your running BOINC clients, run "pgrep -lx boinc". The output includes the client's PID:

25611 boinc

Then run (using this PID) "pstree -pnl 25611". The output looks similar to this:

boinc(25611)─┬─wrapper_26015_x(14325)─┬─{wrapper_26015_x}(14326)
             │                        └─cranky-0.0.24(14327)───runc(15342)─┬─{runc}(15343)
             │                                                             ├─{runc}(15344)
             │                                                             ├─{runc}(15345)
             │                                                             ├─{runc}(15346)
             │                                                             ├─{runc}(15347)
             │                                                             ├─{runc}(15348)
             │                                                             ├─{runc}(15350)
             │                                                             ├─{runc}(15351)
             │                                                             └─job(15353)───runRivet.sh(15391)─┬─rungen.sh(17809)───Herwig++(17923)
             │                                                                                               ├─rivetvm.exe(17810)
             │                                                                                               ├─runRivet.sh(17811)───sleep(84751)
             │                                                                                               └─sleep(86070)
             ├─{boinc}(25673)
             ├─wrapper_26015_x(65640)─┬─{wrapper_26015_x}(65642)
             │                        └─cranky-0.0.24(65643)───runc(66703)─┬─{runc}(66704)
             │                                                             ├─{runc}(66705)
             │                                                             ├─{runc}(66706)
             │                                                             ├─{runc}(66707)
             │                                                             ├─{runc}(66708)
             │                                                             ├─{runc}(66711)
             │                                                             ├─{runc}(66713)
             │                                                             └─job(66714)───runRivet.sh(66747)─┬─rungen.sh(68041)───pythia8.exe(68556)
             │                                                                                               ├─rivetvm.exe(68042)
             │                                                                                               ├─runRivet.sh(68043)───sleep(85043)
             │                                                                                               └─sleep(85461)
             └─wrapper_26015_x(74089)─┬─{wrapper_26015_x}(74090)
                                      └─cranky-0.0.24(74091)───runc(75302)─┬─{runc}(75303)
                                                                           ├─{runc}(75304)
                                                                           ├─{runc}(75305)
                                                                           ├─{runc}(75306)
                                                                           ├─{runc}(75307)
                                                                           ├─{runc}(75308)
                                                                           ├─{runc}(75310)
                                                                           ├─job(75312)───runRivet.sh(75352)─┬─rungen.sh(80840)───agile-runmc(80946)
                                                                           │                                  ├─rivetvm.exe(80841)
                                                                           │                                  ├─runRivet.sh(80842)───plotter.exe(85958)
                                                                           │                                  └─sleep(86064)
                                                                           └─{runc}(75321)

Here you see 3 singlecore Theory native tasks (main program: wrapper...). Each of them has a couple of running child processes. The CPU consumers are rivetvm.exe and the event generators, e.g. Herwig++. On a 1-core system all processes share that single CPU core. On a multicore system all processes share all CPU cores. Hence the CPU time can be larger than the runtime on a multicore system. ATM Theory native is designed to be a 1-core app. If you configure 2 cores for your BOINC client, it will reserve 2 cores, but each task will start only 1 cranky.
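The effect described above is easy to reproduce on any multi-core Linux box: start two CPU-bound child processes for one second of wall-clock time, and their combined CPU time approaches two seconds. A minimal sketch (timeout from coreutils assumed):

```shell
# Two CPU-bound children for ~1 s of wall-clock time: on a multi-core
# machine their combined CPU time approaches 2 s, which is exactly how a
# "1-core" task's CPU time can exceed its elapsed time.
start=$(date +%s)
timeout 1 sh -c 'while :; do :; done' &
timeout 1 sh -c 'while :; do :; done' &
wait
end=$(date +%s)
echo "wall-clock: $((end - start)) s"
times   # prints the shell's and its children's accumulated CPU time
```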
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 2
"Only Sherpa will not have so much profit from that, because of the (very) long initialization and integration phase, mostly done by one process."

I have an AMD FX 8-core running with all 8 cores in use and 6 Theory tasks in parallel on Scientific Linux 7.6:
https://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=3728
So, I think Linux takes the CPU it needs to finish those tasks.
Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 2
During this task I paused my Linux VM for about 30 minutes, probably more:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2756253
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 1
The wrong time reporting disappears if the computer is heavily overloaded or (in my case) something like a kernel quota simulates a heavy overload.

Method:
- The Theory native client is configured to run 4 tasks concurrently
- A kernel CPU quota limits this client to max. 3.6 CPUs (90%)
- The Theory native client shares a slice with other BOINC clients (ATLAS, Theory vbox)
- Those other clients get higher CPUShares (especially Theory vbox, to avoid the watchdog timer issue)

Examples:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2757396
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2757427
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2757393

Runtime and CPU time now match the values reported via scheduler_request.
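For reference, a quota like the one above can be expressed as a systemd drop-in for the service that starts the client. This is only a sketch of the snippet's shape, written to a temporary directory: the unit name "boinc-theory.service" is made up, and the numbers mirror the setup above (systemd's CPUQuota counts per CPU, so 3.6 CPUs = 360%). A real deployment would place the file under /etc/systemd/system/<unit>.d/ and run systemctl daemon-reload.

```shell
# Hypothetical systemd drop-in for a BOINC client unit; the unit name
# "boinc-theory.service" is an assumption, the numbers mirror the setup
# described above. Written to a temp dir only to show the file's shape.
d=$(mktemp -d)/boinc-theory.service.d
mkdir -p "$d"
cat > "$d/limits.conf" <<'EOF'
[Service]
# max. 3.6 CPUs for this client (CPUQuota is per-CPU percent)
CPUQuota=360%
# lower relative weight than the ATLAS / Theory vbox clients
CPUShares=512
EOF
cat "$d/limits.conf"
```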
©2024 CERN