Thread 'Veeerrrry long Pythia8'

Ray Murray Joined: 13 Apr 15 Posts: 138 Credit: 3,015,630 RAC: 0	Message 8297 - Posted: 19 Jan 2024, 20:01:08 UTC Last modified: 19 Jan 2024, 20:02:14 UTC
These are often very long, but I got home today to find https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=2377057 has been running for 17 hrs already. I check those I find in case they need to be resuscitated, but this one is actually running fine and still writing to logs, just very slowly (46 mins per 100 events 🐌).
It's running
Fri Jan 19 01:22:14 UTC 2024 [boinc pp jets 13000 300 - pythia8 8.243 CP1-CR1 100000 34]
so it's a 100,000 event job but has so far completed only 3500, so it would only be about 50% complete at the 10 day timeout.
If I run it to timeout, will it send back anything useful that it has done, report that it is only 50% complete and be resent as a 50,000 event job, or will it be resent as a 100,000 event job so the next recipient has the same problem?
I wonder if there is any value in me extending its job duration to 20 days and letting it run to completion? I wouldn't get any credit for running it to timeout or beyond, but I wonder if its final output would still be of value?
I'm tempted just to kill it now and get a replacement that will complete in a timely fashion.
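
The "about 50%" figure in the post above matches a simple linear extrapolation of the average rate so far. A minimal sketch of that arithmetic, assuming the event rate stays constant for the rest of the run (the function name `events_by_deadline` is made up for illustration and is not part of BOINC or the Theory application):

```shell
#!/usr/bin/env bash
# Linear extrapolation: how many events would be finished when the deadline
# hits, assuming the average rate so far stays constant (an assumption).
# Usage: events_by_deadline EVENTS_DONE ELAPSED_HOURS DEADLINE_HOURS
events_by_deadline() {
    local done_events=$1 elapsed_h=$2 deadline_h=$3
    # Integer arithmetic is precise enough for a rough estimate.
    echo $(( done_events * deadline_h / elapsed_h ))
}

# Numbers from the post: 3500 events after 17 hours, 10-day (240 h) timeout.
events_by_deadline 3500 17 240
```

With the numbers from the post this prints 49411, i.e. roughly half of the 100,000 events.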

Magic Quantum Mechanic Joined: 8 Apr 15 Posts: 852 Credit: 16,269,351 RAC: 11,821	Message 8298 - Posted: 20 Jan 2024, 0:29:05 UTC - in response to Message 8297.
As long as the log says it is running it is OK. I have had some run for over 7 days and still get the appropriate Valid credits. Here is one:
Computer ID: 4952
Run time: 7 days 0 hours 7 min 46 sec
CPU time: 7 days 0 hours 13 min 28 sec
Validate state: Valid
Credit: 6,931.31

Ray Murray Joined: 13 Apr 15 Posts: 138 Credit: 3,015,630 RAC: 0	Message 8300 - Posted: 21 Jan 2024, 16:25:12 UTC
Problem solved ... sort of. I gently suspended both running tasks individually and shut down BOINC to do something else. On restart of BOINC, the other task picked up from where it left off, but this one started from scratch again, so there was absolutely no chance of it finishing within the deadline and I've killed it. Hopefully the next recipient will be a whizzier machine and be able to complete it timeously.

[AF>Le_Pommier] Jerome_C2005 Joined: 17 Mar 15 Posts: 106 Credit: 1,038,379 RAC: 0	Message 8308 - Posted: 28 Jan 2024, 12:05:13 UTC - in response to Message 8300.
I have this (pythia = native Theory? or not?)
jerome@VM-Debian-OVH2:~$ systemctl status Theory_2687-2578997-62_0.scope
● Theory_2687-2578997-62_0.scope - /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2687-2578997-62_0
     Loaded: loaded (/run/systemd/transient/Theory_2687-2578997-62_0.scope; transient)
  Transient: yes
     Active: active (running) since Tue 2024-01-23 05:40:39 UTC; 5 days ago
      Tasks: 18 (limit: 4575)
     Memory: 13.2M
        CPU: 2d 17h 32min 6.371s
     CGroup: /system.slice/Theory_2687-2578997-62_0.scope
             ├─1050756 /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2687-2578997-62_0
             ├─1050768 /bin/bash ./job
             ├─1050794 /bin/bash ./runRivet.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62
             ├─1051042 /bin/bash ./runRivet.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62
             ├─1051043 /bin/bash ./rungen.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62 /shared/tmp/tmp.siGUtYGTlG/generator.hepmc
             ├─1051044 /bin/bash ./runRivet.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62
             ├─1051045 /shared/rivetvm/rivetvm.exe -a CMS_2012_I1090423 -i /shared/tmp/tmp.siGUtYGTlG/generator.hepmc -o /shared/tmp/tmp.siGUtYGTlG/flat -H /shared/tmp/tmp.si>
             ├─1051106 /cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/sherpa/1.4.1/x86_64-slc5-gcc43-opt/bin/Sherpa -f /shared/tmp/tmp.siGUtYGTlG/generator.params
             └─3492720 sleep 3
The end of the stderr.txt is
05:40:39 UTC +00:00 2024-01-23: cranky-0.1.4: [INFO] mcplots runspec: boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62
05:40:39 UTC +00:00 2024-01-23: cranky-0.1.4: [INFO] ----,^^^^,<<<~_____---,^^^,<<~____--,^^,<~__;_
10:49:43 UTC +00:00 2024-01-23: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
07:51:32 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
07:59:44 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
12:38:18 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
15:12:07 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
17:47:19 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
18:05:40 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
21:05:55 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
23:04:40 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
00:51:22 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
05:17:15 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
10:00:50 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
11:08:51 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
13:55:01 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
15:45:35 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
18:52:32 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
23:17:08 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
03:58:00 UTC +00:00 2024-01-26: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
05:28:17 UTC +00:00 2024-01-26: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
07:49:50 UTC +00:00 2024-01-26: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
10:43:51 UTC +00:00 2024-01-26: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
15:54:17 UTC +00:00 2024-01-26: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
So the latest event reported is from 2 days ago. It has been running for 5 days but reports only 2 days of CPU...
I had a similar discussion on the LHC forum and computezrmle said I should let them run (on LHC):
[quote]Theory tasks run between a few minutes (min) and 10 days (max). This depends on the task's input data. Locate "runRivet.log" below the worker slot to check how many events a task is configured to process and how many are already done. Together with the time already used you can estimate the remaining runtime. BOINC is not aware of those numbers, hence presents fake estimates based on averages. This fact has also been discussed many times in this forum.[/quote]
But I'm worried about the time spent compared with the CPU used; it seems there is something wrong? (This Debian host has been running for several weeks without rebooting now.) I don't know what to do, or whether this is the same thing you are discussing in this topic.
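
Part of the gap between wall-clock time and CPU time can be checked against the pause/resume entries in stderr.txt. The following is only a rough sketch. It assumes one log entry per line in the format quoted above, GNU `date`, and that the first entry marks the task start in the running state; `running_seconds` is a hypothetical helper, not part of cranky or BOINC.

```shell
#!/usr/bin/env bash
# Sum the seconds a task was running (not paused), based on the
# "Pausing"/"Resuming" entries that cranky writes to stderr.txt.
# Usage: running_seconds STDERR_FILE END_EPOCH
#   END_EPOCH is only used if the last entry leaves the task running,
#   e.g. "$(date +%s)" for "now".
running_seconds() {
    local log=$1 end_epoch=$2
    local state=running start="" total=0
    local clock utc offset day rest epoch
    while read -r clock utc offset day rest; do
        # Entries look like:
        # "10:49:43 UTC +00:00 2024-01-23: cranky-0.1.4: [INFO] ..."
        case $day in
            [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]:) ;;
            *) continue ;;   # skip anything that is not a dated entry
        esac
        epoch=$(date -u -d "${day%:} $clock" +%s 2>/dev/null) || continue
        if [ -z "$start" ]; then start=$epoch; fi   # first entry = task start
        case $rest in
            *"Pausing systemd unit"*)
                if [ "$state" = running ]; then
                    total=$(( total + epoch - start )); state=paused
                fi ;;
            *"Resuming systemd unit"*)
                if [ "$state" = paused ]; then
                    start=$epoch; state=running
                fi ;;
        esac
    done < "$log"
    # Count the still-open interval if the task is running at the end.
    if [ "$state" = running ] && [ -n "$start" ]; then
        total=$(( total + end_epoch - start ))
    fi
    echo "$total"
}
```

Called as `running_seconds stderr.txt "$(date +%s)"`, it prints the running time in seconds. By hand, the intervals quoted above add up to about 26 hours of running time before the final resume, so the pauses appear to account for much of the difference between five days of wall-clock time and the CPU figure.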

[AF>Le_Pommier] Jerome_C2005 Joined: 17 Mar 15 Posts: 106 Credit: 1,038,379 RAC: 0	Message 8313 - Posted: 1 Feb 2024, 13:29:36 UTC - in response to Message 8308.
The thing is still running
jerome@VM-Debian-OVH2:~$ systemctl status Theory_2687-2578997-62_0.scope
● Theory_2687-2578997-62_0.scope - /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2687-2578997-62_0
     Loaded: loaded (/run/systemd/transient/Theory_2687-2578997-62_0.scope; transient)
  Transient: yes
     Active: active (running) since Tue 2024-01-23 05:40:39 UTC; 1 week 2 days ago
      Tasks: 18 (limit: 4575)
     Memory: 13.1M
        CPU: 6d 15h 35min 11.842s
     CGroup: /system.slice/Theory_2687-2578997-62_0.scope
             ├─1045599 sleep 3
             ├─1050756 /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2687-2578997-62_0
             ├─1050768 /bin/bash ./job
             ├─1050794 /bin/bash ./runRivet.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62
             ├─1051042 /bin/bash ./runRivet.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62
             ├─1051043 /bin/bash ./rungen.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62 /shared/tmp/tmp.siGUtYGTlG/generator.hepmc
             ├─1051044 /bin/bash ./runRivet.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62
             ├─1051045 /shared/rivetvm/rivetvm.exe -a CMS_2012_I1090423 -i /shared/tmp/tmp.siGUtYGTlG/generator.hepmc -o /shared/tmp/tmp.siGUtYGTlG/flat -H /shared/tmp/tmp.siGUtYGTlG/generator.yoda -d /sh>
             └─1051106 /cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/sherpa/1.4.1/x86_64-slc5-gcc43-opt/bin/Sherpa -f /shared/tmp/tmp.siGUtYGTlG/generator.params

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 531 Credit: 400,710 RAC: 0	Message 8314 - Posted: 1 Feb 2024, 14:30:33 UTC - in response to Message 8313.
s sometimes get stuck in endless loops.
Check the "runRivet.log" from that task. Has something been written to the log within the last "mmin" minutes (720 min = 12 h)? The command below lists the logs that have not been modified for longer than that.
Be aware! This command will also show logs from tasks that are suspended for roughly the same time span.
[pre]find /path_to_your/BOINC_working_directory/slots -type f -name "runRivet.log" -mmin +720 | xargs -I {} bash -c "head -n1 {}; ls -hal {}"[/pre]
Now, use the path to the log from above for the next command. This shows how many events are already processed (or nothing).
[pre]grep -m1 'processed' <(tac /path_to_your_logfile)[/pre]
Compare that number with the task's total events and the actual runtime of the task. This allows you to calculate the estimated total runtime. If that is far more than 10 days, the task will not finish within the deadline and should be cancelled.
Be aware! Do not use the runtime estimates shown by BOINC. BOINC doesn't know anything about the internal structure/logs of Theory tasks.

[AF>Le_Pommier] Jerome_C2005 Joined: 17 Mar 15 Posts: 106 Credit: 1,038,379 RAC: 0	Message 8315 - Posted: 1 Feb 2024, 16:27:35 UTC - in response to Message 8314. Last modified: 1 Feb 2024, 16:28:39 UTC
Hi computezrmle
I know that slot is 4 so I did the 1st command directly there (was it a good idea?)
jerome@VM-Debian-OVH2:/var/lib/boinc-client/slots/4$ sudo find /var/lib/boinc-client/slots/4 -type f -name "runRivet.log" -mmin +720 | xargs -I {} bash -c "head -n1 {}; ls -hal {}"
[sudo] password for jerome:
===> [runRivet] Tue Jan 23 05:40:39 UTC 2024 [boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62]
-rw-r--r-- 1 boinc boinc 43K Jan 23 05:56 /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log
and then
jerome@VM-Debian-OVH2:/var/lib/boinc-client/slots/4$ sudo grep -m1 'processed' <(tac /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log)
grep: /dev/fd/63: No such file or directory
Does this mean "the task is not processed"? (Because we know this already, I'm not sure what you are searching for there.)
I tried the first command again exactly as you said (without the slot)
jerome@VM-Debian-OVH2:/var/lib/boinc-client/slots/4$ sudo find /var/lib/boinc-client/slots -type f -name "runRivet.log" -mmin +720 | xargs -I {} bash -c "head -n1 {}; ls -hal {}"
===> [runRivet] Fri Jan 26 18:10:26 UTC 2024 [boinc ppbar mb-inelastic 63 - - pythia8 8.303 dire-default 100000 60]
-rw-r--r-- 1 boinc boinc 33K Jan 26 18:10 /var/lib/boinc-client/slots/0/cernvm/shared/runRivet.log
===> [runRivet] Sun Jan 28 06:15:30 UTC 2024 [boinc pp winclusive 7000 20 - sherpa 2.2.7 default 100000 70]
These are the 3 tasks that have been running for a long time (I only mentioned the longest one, but out of 4 tasks I currently have, 3 seem to be long runners)
-rw-r--r-- 1 boinc boinc 20K Jan 28 06:21 /var/lib/boinc-client/slots/1/cernvm/shared/runRivet.log
===> [runRivet] Tue Jan 23 05:40:39 UTC 2024 [boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62]
-rw-r--r-- 1 boinc boinc 43K Jan 23 05:56 /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log
Out of 4 tasks I have running at the moment, 3 are such "long runners" it seems... (I know this, I only mentioned the 1st one, but the other 2 have also been running for days, though for less time.)
So for now I cannot "Compare that number with the task's total events and the actual runtime of the task."
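
The `/dev/fd/63` error above is most likely a side effect of `sudo`: by default it closes inherited file descriptors, so the pipe created by `<( ... )` is gone by the time `grep` runs. A plain pipeline gives the same result without process substitution. A small sketch (the helper name `last_processed` is made up here):

```shell
#!/usr/bin/env bash
# Print the most recent "N events processed" line of a runRivet.log.
# tac reverses the file, so the first match grep sees is the last one written.
# Usage: last_processed /path/to/runRivet.log
last_processed() {
    tac "$1" | grep -m1 'processed'
}
```

If elevated rights are needed, the same pipeline can be run as `sudo tac /path_to_your_logfile | grep -m1 'processed'`, since only `tac` has to read the file.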

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 531 Credit: 400,710 RAC: 0	Message 8316 - Posted: 1 Feb 2024, 17:21:06 UTC - in response to Message 8315.
Why "sudo"? Since you are already in "/var/lib/boinc-client/slots/4" it looks like you have the necessary access rights to list the dirs/files, don't you?
My 1st command is already explained: Look for a "runRivet.log" not modified recently (recently can even be 1-x hours) as this might indicate a task being either in an endless loop or pausing.
The 2nd should be self-explanatory as it uses basic Linux commands. Make yourself familiar with those commands as they are widely used. Here the command is used to locate the last entry of the "processed" pattern in "runRivet.log".
The result should look like:
20100 events processed
A typical 1st line in "runRivet.log" looks like:
===> [runRivet] Tue Jan 30 20:52:08 UTC 2024 [boinc pp jets 13000 1500 - pythia8 8.301 CP1-CR1 [b]100000[/b] 78]
The bold number tells you how many events are to be processed. Here: 100000
So, roughly 20% of the task is done.
Now, look at the task's walltime, say (example): 21 hours
=> This task has another 84 hours to go.
=> It will finish before the 10-days-limit.
If the one-liners I suggested don't work for you (e.g. due to missing access rights), feel free to copy the log to a folder where you have full rights and use an editor to look into the copy.
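
The estimate described above can be written down as a one-line formula: remaining time = elapsed time × (total events − events done) / events done. A minimal sketch, assuming the event rate stays roughly constant; `remaining_hours` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Estimate the remaining runtime in hours by linear extrapolation.
# Usage: remaining_hours EVENTS_DONE EVENTS_TOTAL ELAPSED_HOURS
remaining_hours() {
    awk -v done_events="$1" -v total="$2" -v elapsed="$3" 'BEGIN {
        # Without any processed events there is no rate to extrapolate from.
        if (done_events <= 0) { print "unknown"; exit }
        printf "%.1f\n", elapsed * (total - done_events) / done_events
    }'
}

# Example from the post: 20100 of 100000 events after 21 hours.
remaining_hours 20100 100000 21
```

For the example in the post this prints 83.5, in line with the rough figure of 84 hours.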

[AF>Le_Pommier] Jerome_C2005 Joined: 17 Mar 15 Posts: 106 Credit: 1,038,379 RAC: 0	Message 8317 - Posted: 1 Feb 2024, 20:41:37 UTC - in response to Message 8316. Last modified: 1 Feb 2024, 20:42:20 UTC
No, I could not run any of the commands without sudo, it would throw me an error :/ (except moving into the folder with cd, this works without sudo)
I'm sorry, I'm not good enough in Linux and I cannot invest too much time to become an expert... BOINC is already quite a time-consuming hobby.
I don't understand why it would be a problem if the 1st command could run (using sudo); wouldn't it return the same result?
Regarding the 2nd command, since I don't understand why the "( )" and why the "tac /", I just removed them and this time I got a result:
jerome@VM-Debian-OVH2:~$ sudo grep -m1 'processed' < /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log
0 events processed
Like this time it can find the file and it's looking for "processed" in it. And it says "no event".
So I tried the same command with the other two long tasks returned by the 1st command
jerome@VM-Debian-OVH2:~$ sudo grep -m1 'processed' < /var/lib/boinc-client/slots/0/cernvm/shared/runRivet.log
0 events processed
jerome@VM-Debian-OVH2:~$ sudo grep -m1 'processed' < /var/lib/boinc-client/slots/1/cernvm/shared/runRivet.log
jerome@VM-Debian-OVH2:~$
The one in slot 0 is also "0 event" and the one in slot 1 returns nothing.
So I guess all this means these tasks have not been doing anything since the very beginning? So I must cancel them, right? :(

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 531 Credit: 400,710 RAC: 0	Message 8318 - Posted: 1 Feb 2024, 21:14:45 UTC - in response to Message 8317.
[quote]I don't understand why ... I just removed them and this time I got a result[/quote]
Well, that's your problem! You checked for the first line showing "processed" instead of the last line.
This forum is not the place to give basic lessons. Neither for Linux nor for Windows. If you don't want to invest the time to learn the most simple basics it might be better you don't try to "analyse" anything.
Instead of:
sudo grep -m1 'processed' < /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log
run this:
sudo grep 'processed' /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log

[AF>Le_Pommier] Jerome_C2005 Joined: 17 Mar 15 Posts: 106 Credit: 1,038,379 RAC: 0	Message 8319 - Posted: 1 Feb 2024, 21:37:53 UTC
jerome@VM-Debian-OVH2:~$ sudo grep 'processed' /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log
0 events processed
100 events processed
200 events processed
300 events processed
400 events processed
500 events processed
600 events processed
jerome@VM-Debian-OVH2:~$ sudo grep 'processed' /var/lib/boinc-client/slots/0/cernvm/shared/runRivet.log
0 events processed
jerome@VM-Debian-OVH2:~$ sudo grep 'processed' /var/lib/boinc-client/slots/1/cernvm/shared/runRivet.log
jerome@VM-Debian-OVH2:~$
So only slot 4 has some events processed, the others have none. A good waste of computing power, it seems.
[quote]Compare that number with the task's total events and the actual runtime of the task. This allows to calculate the estimated total runtime.[/quote]
That works if you know what the task's total events are. But I don't.
But you know what, don't worry about me anymore: I cancelled the 4 tasks and switched to another project. I guess you have enough Linux expert volunteers to help you here (and pay for the computing power) to test your applications.

kotenok2000 Joined: 22 Aug 22 Posts: 44 Credit: 84,516 RAC: 14	Message 8321 - Posted: 5 Feb 2024, 9:45:09 UTC - in response to Message 8319.
[quote]A typical 1st line in "runRivet.log" looks like:
===> [runRivet] Tue Jan 30 20:52:08 UTC 2024 [boinc pp jets 13000 1500 - pythia8 8.301 CP1-CR1 [b]100000[/b] 78]
The bold number tells you how many events are to be processed. Here: 100000[/quote]
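
The total can also be read from the log without opening it in an editor. A minimal sketch, assuming the header always has the shape quoted above, with the event count as the second-to-last field of the first line; `total_events` is a made-up helper name:

```shell
#!/usr/bin/env bash
# Print the configured number of events from the first line of a runRivet.log.
# Assumes the header ends with "... <events> <number>]" as in the quoted example.
# Usage: total_events /path/to/runRivet.log
total_events() {
    head -n1 "$1" | awk '{ print $(NF-1) }'
}
```

For the header quoted above this yields 100000, which can then be compared with the last "events processed" line of the same log.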

Development for LHC@home