Message boards : Theory Application : Veeerrrry long Pythia8
Joined: 13 Apr 15 · Posts: 138 · Credit: 2,969,210 · RAC: 0
These are often very long, but I got home today to find https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=2377057 had already been running for 17 hrs. I check those I find in case they need to be resuscitated, but this one is actually running fine and still writing to its logs, just very slowly (46 mins per 100 events 🐌). It's running:

```
Fri Jan 19 01:22:14 UTC 2024 [boinc pp jets 13000 300 - pythia8 8.243 CP1-CR1 100000 34]
```

so it's a 100,000-event job, but it has so far completed only 3,500, so it would only be about 50% complete at the 10-day timeout.

If I run it to timeout, will it send back anything useful from what it has done, report that it is only 50% complete, and be resent as a 50,000-event job? Or will it be resent as a 100,000-event job, so the next recipient has the same problem? Is there any value in me extending its job duration to 20 days and letting it run to completion? I wouldn't get any credit for running it to timeout or beyond, but would its final output still be of value? I'm tempted just to kill it now and get a replacement that will complete in a timely fashion.
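A quick sanity check of that projection (a sketch in shell; the three figures are the ones quoted above, and it assumes the average event rate so far stays constant):

```bash
# 3,500 events in 17 hours, 100,000 events total (figures from the post above)
awk 'BEGIN {
    done = 3500; hours = 17; total = 100000
    rate = done / hours                              # ~206 events/hour on average
    printf "projected total runtime: %.1f days\n", total / rate / 24
}'
```

That comes out at roughly 20 days, which matches the estimate of being only about 50% complete at the 10-day timeout.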
Joined: 8 Apr 15 · Posts: 781 · Credit: 12,406,736 · RAC: 6,383
As long as the log says it is running, it is OK. I have some over 7 days long, with the appropriate Valid credits. Here is one:

```
Computer ID     4952
Run time        7 days 0 hours 7 min 46 sec
CPU time        7 days 0 hours 13 min 28 sec
Validate state  Valid
Credit          6,931.31
```
Joined: 13 Apr 15 · Posts: 138 · Credit: 2,969,210 · RAC: 0
Problem solved ... sort of. I gently suspended both running tasks individually and shut down BOINC to do something else. On restarting BOINC, the other task picked up from where it left off, but this one started from scratch again, so there was absolutely no chance of it finishing within the deadline, and I've killed it. Hopefully the next recipient will be a whizzier machine, able to complete it timeously.
Joined: 17 Mar 15 · Posts: 51 · Credit: 602,329 · RAC: 0
I have this (pythia = native Theory? or not?):

```
jerome@VM-Debian-OVH2:~$ systemctl status Theory_2687-2578997-62_0.scope
● Theory_2687-2578997-62_0.scope - /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2687-2578997-62_0
     Loaded: loaded (/run/systemd/transient/Theory_2687-2578997-62_0.scope; transient)
  Transient: yes
     Active: active (running) since Tue 2024-01-23 05:40:39 UTC; 5 days ago
      Tasks: 18 (limit: 4575)
     Memory: 13.2M
        CPU: 2d 17h 32min 6.371s
     CGroup: /system.slice/Theory_2687-2578997-62_0.scope
             ├─1050756 /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2687-2578997-62_0
             ├─1050768 /bin/bash ./job
             ├─1050794 /bin/bash ./runRivet.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62
             ├─1051042 /bin/bash ./runRivet.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62
             ├─1051043 /bin/bash ./rungen.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62 /shared/tmp/tmp.siGUtYGTlG/generator.hepmc
             ├─1051044 /bin/bash ./runRivet.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62
             ├─1051045 /shared/rivetvm/rivetvm.exe -a CMS_2012_I1090423 -i /shared/tmp/tmp.siGUtYGTlG/generator.hepmc -o /shared/tmp/tmp.siGUtYGTlG/flat -H /shared/tmp/tmp.si>
             ├─1051106 /cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/sherpa/1.4.1/x86_64-slc5-gcc43-opt/bin/Sherpa -f /shared/tmp/tmp.siGUtYGTlG/generator.params
             └─3492720 sleep 3
```

The end of the stderr.txt is:

```
05:40:39 UTC +00:00 2024-01-23: cranky-0.1.4: [INFO] mcplots runspec: boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62
05:40:39 UTC +00:00 2024-01-23: cranky-0.1.4: [INFO] ----,^^^^,<<<~_____---,^^^,<<~____--,^^,<~__;_
10:49:43 UTC +00:00 2024-01-23: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
07:51:32 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
07:59:44 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
12:38:18 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
15:12:07 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
17:47:19 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
18:05:40 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
21:05:55 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
23:04:40 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
00:51:22 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
05:17:15 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
10:00:50 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
11:08:51 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
13:55:01 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
15:45:35 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
18:52:32 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
23:17:08 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
03:58:00 UTC +00:00 2024-01-26: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
05:28:17 UTC +00:00 2024-01-26: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
07:49:50 UTC +00:00 2024-01-26: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
10:43:51 UTC +00:00 2024-01-26: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
15:54:17 UTC +00:00 2024-01-26: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
```

So the latest event reported is 2 days ago. The task has been running for 5 days but reports only 2 days of CPU... I had a similar discussion on the LHC forum, and computezrmle told me I should let them run (on LHC): Theory tasks run between a few minutes (min) and 10 days (max). But I'm worried about the wall time spent against the CPU actually used; it seems there is something wrong? (This Debian host has been running for several weeks now without rebooting.) I don't know what to do, and whether this is the same as what you are discussing in this topic?
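The gap between 5 days active and 2d 17h of CPU is consistent with all those Pausing/Resuming lines: while the scope is paused it accrues no CPU time. If you want the two raw figures side by side, systemd can report them directly (a sketch; the unit name is the one from the post above, and CPUUsageNSec requires CPU accounting, which the "CPU:" line in the status output indicates is enabled):

```bash
# Wall-clock start of the scope vs. accumulated CPU time (in nanoseconds)
systemctl show Theory_2687-2578997-62_0.scope -p ActiveEnterTimestamp -p CPUUsageNSec
```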
Joined: 17 Mar 15 · Posts: 51 · Credit: 602,329 · RAC: 0
The thing is still running:

```
jerome@VM-Debian-OVH2:~$ systemctl status Theory_2687-2578997-62_0.scope
● Theory_2687-2578997-62_0.scope - /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2687-2578997-62_0
     Loaded: loaded (/run/systemd/transient/Theory_2687-2578997-62_0.scope; transient)
  Transient: yes
     Active: active (running) since Tue 2024-01-23 05:40:39 UTC; 1 week 2 days ago
      Tasks: 18 (limit: 4575)
     Memory: 13.1M
        CPU: 6d 15h 35min 11.842s
     CGroup: /system.slice/Theory_2687-2578997-62_0.scope
             ├─1045599 sleep 3
             ├─1050756 /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2687-2578997-62_0
             ├─1050768 /bin/bash ./job
             ├─1050794 /bin/bash ./runRivet.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62
             ├─1051042 /bin/bash ./runRivet.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62
             ├─1051043 /bin/bash ./rungen.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62 /shared/tmp/tmp.siGUtYGTlG/generator.hepmc
             ├─1051044 /bin/bash ./runRivet.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62
             ├─1051045 /shared/rivetvm/rivetvm.exe -a CMS_2012_I1090423 -i /shared/tmp/tmp.siGUtYGTlG/generator.hepmc -o /shared/tmp/tmp.siGUtYGTlG/flat -H /shared/tmp/tmp.siGUtYGTlG/generator.yoda -d /sh>
             └─1051106 /cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/sherpa/1.4.1/x86_64-slc5-gcc43-opt/bin/Sherpa -f /shared/tmp/tmp.siGUtYGTlG/generator.params
```
Joined: 28 Jul 16 · Posts: 484 · Credit: 394,839 · RAC: 0
Sherpas sometimes get stuck in endless loops. Check the "runRivet.log" from that task. Has something been written to the log within the last "mmin" minutes (720 min = 12 h)?

Be aware! This command will also show logs from tasks that have been suspended for roughly the same time span.

```
find /path_to_your/BOINC_working_directory/slots -type f -name "runRivet.log" -mmin +720 |xargs -I {} bash -c "head -n1 {}; ls -hal {}"
```

Now, use the path to the log from above for the next command. This shows how many events have already been processed (or nothing):

```
grep -m1 'processed' <(tac /path_to_your_logfile)
```

Compare that number with the task's total events and the actual runtime of the task. This allows you to calculate the estimated total runtime. If that is far more than 10 days, the task will not finish within the deadline and should be cancelled.

Be aware! Do not use the runtime estimates shown by BOINC. BOINC doesn't know anything about the internal structure/logs of Theory tasks.
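The two checks can be combined into a single pass over all slots (a sketch assuming the /var/lib/boinc-client layout that appears later in this thread; prefix with sudo if your user cannot read the slot directories):

```bash
# For each Theory slot: print the runspec (its second-to-last field is the
# total event count) and the most recent progress line, if any.
for log in /var/lib/boinc-client/slots/*/cernvm/shared/runRivet.log; do
    echo "== $log"
    head -n1 "$log"
    tac "$log" | grep -m1 'processed' || echo "   (no events processed yet)"
done
```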
Joined: 17 Mar 15 · Posts: 51 · Credit: 602,329 · RAC: 0
Hi computezrmle. I know that slot is 4, so I ran the 1st command directly there (was that a good idea?):

```
jerome@VM-Debian-OVH2:/var/lib/boinc-client/slots/4$ sudo find /var/lib/boinc-client/slots/4 -type f -name "runRivet.log" -mmin +720 |xargs -I {} bash -c "head -n1 {}; ls -hal {}"
[sudo] password for jerome:
===> [runRivet] Tue Jan 23 05:40:39 UTC 2024 [boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62]
-rw-r--r-- 1 boinc boinc 43K Jan 23 05:56 /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log
```

and then:

```
jerome@VM-Debian-OVH2:/var/lib/boinc-client/slots/4$ sudo grep -m1 'processed' <(tac /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log)
grep: /dev/fd/63: No such file or directory
```

Does this mean "the task is not processed"? (We know this already; I'm not sure what you are searching for there.)

I tried the 1st command again exactly as you said (without the slot):

```
jerome@VM-Debian-OVH2:/var/lib/boinc-client/slots/4$ sudo find /var/lib/boinc-client/slots -type f -name "runRivet.log" -mmin +720 |xargs -I {} bash -c "head -n1 {}; ls -hal {}"
===> [runRivet] Fri Jan 26 18:10:26 UTC 2024 [boinc ppbar mb-inelastic 63 - - pythia8 8.303 dire-default 100000 60]
-rw-r--r-- 1 boinc boinc 33K Jan 26 18:10 /var/lib/boinc-client/slots/0/cernvm/shared/runRivet.log
===> [runRivet] Sun Jan 28 06:15:30 UTC 2024 [boinc pp winclusive 7000 20 - sherpa 2.2.7 default 100000 70]
-rw-r--r-- 1 boinc boinc 20K Jan 28 06:21 /var/lib/boinc-client/slots/1/cernvm/shared/runRivet.log
===> [runRivet] Tue Jan 23 05:40:39 UTC 2024 [boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62]
-rw-r--r-- 1 boinc boinc 43K Jan 23 05:56 /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log
```

So out of the 4 tasks I have running at the moment, 3 seem to be such "long runners". I only mentioned the longest one, but the other 2 have also been running for days, just for less time. So for now I cannot "compare that number with the task's total events and the actual runtime of the task".
Joined: 28 Jul 16 · Posts: 484 · Credit: 394,839 · RAC: 0
Why "sudo"? Since you are already in "/var/lib/boinc-client/slots/4" it looks like you have the necessary access rights to list the dirs/files, don't you? My 1st command is already explained: Look for a "runRivet.log" not modified recently (recently can even be 1-x hours) as this might indicate a task being either in an endless loop or pausing. The 2nd should be self explaining as it uses basic Linux commands. Make yourself familiar with those commands as they are widely used. Here the command is used to locate the last entry of the "processed" pattern in "runRivet.log". The result should look like: 20100 events processed A typical 1st line in "runRivet.log" looks like: ===> [runRivet] Tue Jan 30 20:52:08 UTC 2024 [boinc pp jets 13000 1500 - pythia8 8.301 CP1-CR1 100000 78] The bold number tells you how many events are to be processed. Here: 100000 So, roughly 20% of the task is done. Now, look at the task's walltime, say (example): 21 hours => This task has another 84 hours to go. => It will finish before the 10-days-limit. If the oneliners I suggested don't work for you (e.g. due to missing access rights) feel free to copy the log to a folder where you have full rights and use an editor to look into the copy. |
Joined: 17 Mar 15 · Posts: 51 · Credit: 602,329 · RAC: 0
No, I could not run any of the commands without sudo; it would throw an error :/ (except moving into the folder with cd, which works without sudo). I'm sorry, I'm not good enough in Linux and I cannot invest too much time to become an expert... BOINC is already quite a time-consuming hobby. I'm not sure I understand: if the 1st command could run (using sudo), why would that be a problem? Wouldn't it return the same result?

Regarding the 2nd command, since I don't understand why the "( )" and why the "tac /", I just removed them, and this time I got a result:

```
jerome@VM-Debian-OVH2:~$ sudo grep -m1 'processed' < /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log
0 events processed
```

Like this, it can find the file and it looks for "processed" in it. And it says "no events". So I tried the same command on the other two long tasks returned by the 1st command:

```
jerome@VM-Debian-OVH2:~$ sudo grep -m1 'processed' < /var/lib/boinc-client/slots/0/cernvm/shared/runRivet.log
0 events processed
jerome@VM-Debian-OVH2:~$ sudo grep -m1 'processed' < /var/lib/boinc-client/slots/1/cernvm/shared/runRivet.log
jerome@VM-Debian-OVH2:~$
```

The one in slot 0 also says "0 events", and the one in slot 1 returns nothing. So I guess all this means these tasks have been doing nothing since the very beginning? So I must cancel them, right? :(
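For what it's worth, the "( )" is bash process substitution: `<(tac file)` feeds the file to grep in reverse order, so `-m1` returns the last "processed" line instead of the first. The "/dev/fd/63: No such file or directory" error is most likely sudo closing the file descriptor the unprivileged shell had set up (a guess based on common sudo defaults). Putting sudo on the command that reads the file and piping avoids the construct entirely:

```bash
# sudo runs tac, which does the privileged read; grep then only sees the pipe
sudo tac /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log | grep -m1 'processed'
```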
Joined: 28 Jul 16 · Posts: 484 · Credit: 394,839 · RAC: 0
> I don't understand why ... I just removed them and this time I got a result

Well, that's your problem! You checked for the first line showing "processed" instead of the last one.

This forum is not the place to give basic lessons, neither for Linux nor for Windows. If you don't want to invest the time to learn the most simple basics, it might be better not to try to "analyse" anything.

Instead of:

```
sudo grep -m1 'processed' < /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log
```

run this:

```
sudo grep 'processed' /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log
```
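If only the most recent progress line is wanted, the unrestricted grep can also be piped through tail (a small variation, not part of the advice above):

```bash
# keep only the last "... events processed" line
sudo grep 'processed' /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log | tail -n1
```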
Joined: 17 Mar 15 · Posts: 51 · Credit: 602,329 · RAC: 0
```
jerome@VM-Debian-OVH2:~$ sudo grep 'processed' /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log
0 events processed
100 events processed
200 events processed
300 events processed
400 events processed
500 events processed
600 events processed
jerome@VM-Debian-OVH2:~$ sudo grep 'processed' /var/lib/boinc-client/slots/0/cernvm/shared/runRivet.log
0 events processed
jerome@VM-Debian-OVH2:~$ sudo grep 'processed' /var/lib/boinc-client/slots/1/cernvm/shared/runRivet.log
jerome@VM-Debian-OVH2:~$
```

So only slot 4 has some events processed; the others have none. A good waste of computing power, it seems.

> Compare that number with the task's total events and the actual runtime of the task.

If you know what the task's total events are. But I don't. But you know what, don't worry about me anymore: I cancelled the 4 tasks and switched to another project. I guess you have enough Linux expert volunteers here to help you (and pay for the computing power) to test your applications.
Joined: 22 Aug 22 · Posts: 22 · Credit: 63,680 · RAC: 0
> A typical 1st line in "runRivet.log" looks like: