Message boards : Theory Application : Veeerrrry long Pythia8

Profile Ray Murray
Joined: 13 Apr 15
Posts: 138
Credit: 2,969,210
RAC: 0
Message 8297 - Posted: 19 Jan 2024, 20:01:08 UTC
Last modified: 19 Jan 2024, 20:02:14 UTC

These are often very long but I got home today to find
https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=2377057 has been running for 17hrs already. I check those I find in case they need to be resuscitated but this one is actually running fine and still writing to logs, just very slowly (46mins per 100 events 🐌)
It's running Fri Jan 19 01:22:14 UTC 2024 [boinc pp jets 13000 300 - pythia8 8.243 CP1-CR1 100000 34], so it's a 100,000 event job, but it has completed only 3,500 events so far, so it would be only about 50% complete at the 10-day timeout. If I run it to timeout, will it send back anything useful from what it has done, along with the fact that it is only 50% complete? And will the job then be resent as a 50,000 event job, or again as a 100,000 event job, so that the next recipient has the same problem?

I wonder if there is any value in me extending its job duration to 20 days and letting it run to completion? I wouldn't get any credit for running it to Timeout or beyond but I wonder if its final output would still be of value?
I'm tempted just to kill it now and get a replacement that will complete in a timely fashion.
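
As a rough cross-check of the 50% figure, the average rate so far (3,500 events in 17 hours) can be projected to the 10-day limit. This is only a sketch: the function name `events_by_deadline` is made up here, and it assumes the event rate stays constant, which the current 46 minutes per 100 events suggests it may not.

```shell
# Project how many events will be done at the deadline,
# assuming the average rate so far stays constant.
# Arguments: events done so far, hours elapsed, deadline in hours.
events_by_deadline() {
    awk -v n="$1" -v elapsed="$2" -v deadline="$3" \
        'BEGIN { printf "%d\n", n * deadline / elapsed }'
}

# Numbers from the post: 3500 events after 17 h, 10-day (240 h) limit.
events_by_deadline 3500 17 240
```

That prints 49411, i.e. a little under half of the 100,000 events. If the slower current rate persists, the fraction reached by the deadline would be lower still.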
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 781
Credit: 12,324,905
RAC: 1,506
Message 8298 - Posted: 20 Jan 2024, 0:29:05 UTC - in response to Message 8297.  

As long as the log says it is running, it is OK. I have had some run for over 7 days and still get the appropriate Valid credit.

Here is one

Computer ID 4952
Run time 7 days 0 hours 7 min 46 sec
CPU time 7 days 0 hours 13 min 28 sec
Validate state Valid
Credit 6,931.31
Profile Ray Murray
Joined: 13 Apr 15
Posts: 138
Credit: 2,969,210
RAC: 0
Message 8300 - Posted: 21 Jan 2024, 16:25:12 UTC

Problem solved ... sort of.
I gently suspended both running tasks individually and shut down BOINC to do something else. On restarting BOINC, the other task picked up from where it left off, but this one started from scratch again, so there was absolutely no chance of it finishing within the deadline and I've killed it. Hopefully the next recipient will be a whizzier machine and be able to complete it timeously.
[AF>Le_Pommier] Jerome_C2005

Joined: 17 Mar 15
Posts: 51
Credit: 602,329
RAC: 0
Message 8308 - Posted: 28 Jan 2024, 12:05:13 UTC - in response to Message 8300.  

I have this (pythia = native theory ? or not ?)

jerome@VM-Debian-OVH2:~$ systemctl status Theory_2687-2578997-62_0.scope
● Theory_2687-2578997-62_0.scope - /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2687-2578997-62_0
     Loaded: loaded (/run/systemd/transient/Theory_2687-2578997-62_0.scope; transient)
  Transient: yes
     Active: active (running) since Tue 2024-01-23 05:40:39 UTC; 5 days ago
      Tasks: 18 (limit: 4575)
     Memory: 13.2M
        CPU: 2d 17h 32min 6.371s
     CGroup: /system.slice/Theory_2687-2578997-62_0.scope
             ├─1050756 /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2687-2578997-62_0
             ├─1050768 /bin/bash ./job
             ├─1050794 /bin/bash ./runRivet.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62
             ├─1051042 /bin/bash ./runRivet.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62
             ├─1051043 /bin/bash ./rungen.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62 /shared/tmp/tmp.siGUtYGTlG/generator.hepmc
             ├─1051044 /bin/bash ./runRivet.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62
             ├─1051045 /shared/rivetvm/rivetvm.exe -a CMS_2012_I1090423 -i /shared/tmp/tmp.siGUtYGTlG/generator.hepmc -o /shared/tmp/tmp.siGUtYGTlG/flat -H /shared/tmp/tmp.si>
             ├─1051106 /cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/sherpa/1.4.1/x86_64-slc5-gcc43-opt/bin/Sherpa -f /shared/tmp/tmp.siGUtYGTlG/generator.params
             └─3492720 sleep 3

The end of the stderr.txt is

05:40:39 UTC +00:00 2024-01-23: cranky-0.1.4: [INFO] mcplots runspec: boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62
05:40:39 UTC +00:00 2024-01-23: cranky-0.1.4: [INFO] ----,^^^^,<<<~_____---,^^^,<<~____--,^^,<~__;_
10:49:43 UTC +00:00 2024-01-23: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
07:51:32 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
07:59:44 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
12:38:18 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
15:12:07 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
17:47:19 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
18:05:40 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
21:05:55 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
23:04:40 UTC +00:00 2024-01-24: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
00:51:22 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
05:17:15 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
10:00:50 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
11:08:51 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
13:55:01 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
15:45:35 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
18:52:32 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
23:17:08 UTC +00:00 2024-01-25: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
03:58:00 UTC +00:00 2024-01-26: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
05:28:17 UTC +00:00 2024-01-26: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
07:49:50 UTC +00:00 2024-01-26: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope
10:43:51 UTC +00:00 2024-01-26: cranky-0.1.4: [INFO] Pausing systemd unit Theory_2687-2578997-62_0.scope
15:54:17 UTC +00:00 2024-01-26: cranky-0.1.4: [INFO] Resuming systemd unit Theory_2687-2578997-62_0.scope

So the latest event reported was 2 days ago.

Running for 5 days but reporting 2 days of CPU... I had a similar discussion on the LHC forum, and computezrmle said I should let them run (on LHC):

Theory tasks run between a few minutes (min) and 10 days (max).
This depends on the task's input data.
Locate "runRivet.log" below the worker slot to check how many events a task is configured to process and how many are already done.
Together with the time already used you can estimate the remaining runtime.
BOINC is not aware of those numbers, hence presents fake estimates based on averages.
This fact has also been discussed many times in this forum.

But I'm worried about the elapsed time compared with the CPU time used; it seems something is wrong? (This Debian host has now been running for several weeks without rebooting.) I don't know what to do, or whether this is the same issue you are discussing in this topic.
[AF>Le_Pommier] Jerome_C2005

Joined: 17 Mar 15
Posts: 51
Credit: 602,329
RAC: 0
Message 8313 - Posted: 1 Feb 2024, 13:29:36 UTC - in response to Message 8308.  

The thing is still running

jerome@VM-Debian-OVH2:~$ systemctl status Theory_2687-2578997-62_0.scope
● Theory_2687-2578997-62_0.scope - /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2687-2578997-62_0
     Loaded: loaded (/run/systemd/transient/Theory_2687-2578997-62_0.scope; transient)
  Transient: yes
     Active: active (running) since Tue 2024-01-23 05:40:39 UTC; 1 week 2 days ago
      Tasks: 18 (limit: 4575)
     Memory: 13.1M
        CPU: 6d 15h 35min 11.842s
     CGroup: /system.slice/Theory_2687-2578997-62_0.scope
             ├─1045599 sleep 3
             ├─1050756 /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2687-2578997-62_0
             ├─1050768 /bin/bash ./job
             ├─1050794 /bin/bash ./runRivet.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62
             ├─1051042 /bin/bash ./runRivet.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62
             ├─1051043 /bin/bash ./rungen.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62 /shared/tmp/tmp.siGUtYGTlG/generator.hepmc
             ├─1051044 /bin/bash ./runRivet.sh boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62
             ├─1051045 /shared/rivetvm/rivetvm.exe -a CMS_2012_I1090423 -i /shared/tmp/tmp.siGUtYGTlG/generator.hepmc -o /shared/tmp/tmp.siGUtYGTlG/flat -H /shared/tmp/tmp.siGUtYGTlG/generator.yoda -d /sh>
             └─1051106 /cvmfs/sft.cern.ch/lcg/external/MCGenerators_hepmc2.06.05/sherpa/1.4.1/x86_64-slc5-gcc43-opt/bin/Sherpa -f /shared/tmp/tmp.siGUtYGTlG/generator.params
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 1
Message 8314 - Posted: 1 Feb 2024, 14:30:33 UTC - in response to Message 8313.  

Sherpas sometimes get stuck in endless loops.


Check the "runRivet.log" from that task.

Has something been written to the log recently? The command below lists logs that have not been modified for more than "mmin" minutes (720 min = 12 h).
Be aware! It will also show logs from tasks that have been suspended for roughly the same time span.
find /path_to_your/BOINC_working_directory/slots -type f -name "runRivet.log" -mmin +720 |xargs -I {} bash -c "head -n1 {}; ls -hal {}"
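
The staleness check can also be wrapped in a small function with the threshold as a parameter. This is only a sketch: the function name `stale_logs` is hypothetical, the demo uses a throwaway directory in place of the real slots directory, and `touch -d` is the GNU coreutils form.

```shell
# List runRivet.log files below a directory that have NOT been modified
# for more than N minutes (default 720 = 12 h).
stale_logs() {
    find "$1" -type f -name "runRivet.log" -mmin +"${2:-720}"
}

# Demo on a temporary tree standing in for the BOINC slots directory.
demo=$(mktemp -d)
mkdir -p "$demo/0" "$demo/1"
touch "$demo/0/runRivet.log"                   # just written: active
touch -d '2 days ago' "$demo/1/runRivet.log"   # quiet for 48 h: stale
stale_logs "$demo" 720                         # lists only the log in slot 1
```

A log listed here is only a candidate for a stuck task; a suspended task looks the same.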



Now, use the path to the log from above for the next command.
This shows how many events are already processed (or nothing).
grep -m1 'processed' <(tac /path_to_your_logfile)

Compare that number with the task's total events and the actual runtime of the task.
This lets you calculate the estimated total runtime.
If that is far more than 10 days the task will not finish within the deadline and should be cancelled.

Be aware! Do not use the runtime estimates shown by BOINC.
BOINC doesn't know anything about the internal structure/logs of Theory tasks.
[AF>Le_Pommier] Jerome_C2005

Joined: 17 Mar 15
Posts: 51
Credit: 602,329
RAC: 0
Message 8315 - Posted: 1 Feb 2024, 16:27:35 UTC - in response to Message 8314.  
Last modified: 1 Feb 2024, 16:28:39 UTC

Hi computezrmle

I know the slot is 4, so I ran the 1st command directly there (was that a good idea?)

jerome@VM-Debian-OVH2:/var/lib/boinc-client/slots/4$ sudo find /var/lib/boinc-client/slots/4 -type f -name "runRivet.log" -mmin +720 |xargs -I {} bash -c "head -n1 {}; ls -hal {}"
[sudo] password for jerome: 
===> [runRivet] Tue Jan 23 05:40:39 UTC 2024 [boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62]
-rw-r--r-- 1 boinc boinc 43K Jan 23 05:56 /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log

and then

jerome@VM-Debian-OVH2:/var/lib/boinc-client/slots/4$ sudo grep -m1 'processed' <(tac /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log)
grep: /dev/fd/63: No such file or directory

Does this mean "the task is not processed"? (We know this already, so I'm not sure what you are searching for there.)
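
The `/dev/fd/63` error most likely says nothing about the task itself. `<(tac ...)` is opened by the calling shell as an extra file descriptor, and sudo by default closes inherited descriptors other than stdin, stdout and stderr before running the command, so `grep` can no longer find it. A pipe avoids the issue. The sketch below uses a made-up helper name `last_processed` and a sample log invented for the demo:

```shell
# Print the last line containing "processed": reverse the file with tac,
# then stop at the first match.
last_processed() {
    tac "$1" | grep -m1 'processed'
}

# Demo on an invented sample log.
sample=$(mktemp)
printf '%s\n' '0 events processed' '100 events processed' 'some other line' > "$sample"
last_processed "$sample"

# For a log that needs root to read, put sudo on the reader instead of on grep:
#   sudo tac /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log | grep -m1 'processed'
```

The demo prints `100 events processed`, the most recent progress entry in the sample.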

I tried again the first command exactly as you said (without the slot)

jerome@VM-Debian-OVH2:/var/lib/boinc-client/slots/4$ sudo find /var/lib/boinc-client/slots -type f -name "runRivet.log" -mmin +720 |xargs -I {} bash -c "head -n1 {}; ls -hal {}"
===> [runRivet] Fri Jan 26 18:10:26 UTC 2024 [boinc ppbar mb-inelastic 63 - - pythia8 8.303 dire-default 100000 60]
-rw-r--r-- 1 boinc boinc 33K Jan 26 18:10 /var/lib/boinc-client/slots/0/cernvm/shared/runRivet.log
===> [runRivet] Sun Jan 28 06:15:30 UTC 2024 [boinc pp winclusive 7000 20 - sherpa 2.2.7 default 100000 70]
-rw-r--r-- 1 boinc boinc 20K Jan 28 06:21 /var/lib/boinc-client/slots/1/cernvm/shared/runRivet.log
===> [runRivet] Tue Jan 23 05:40:39 UTC 2024 [boinc pp jets 7000 80,-,1460 - sherpa 1.4.1 default 100000 62]
-rw-r--r-- 1 boinc boinc 43K Jan 23 05:56 /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log

Out of 4 tasks I have running at the moment, 3 are such "long runners" it seems... (I know this, I only mentioned the 1st one but the other 2 have been running for days also, but less time)

So for now I cannot "Compare that number with the task's total events and the actual runtime of the task."
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 1
Message 8316 - Posted: 1 Feb 2024, 17:21:06 UTC - in response to Message 8315.  

Why "sudo"?
Since you are already in "/var/lib/boinc-client/slots/4" it looks like you have the necessary access rights to list the dirs/files, don't you?

My 1st command is already explained:
Look for a "runRivet.log" not modified recently (recently can even be 1-x hours) as this might indicate a task being either in an endless loop or pausing.


The 2nd should be self-explanatory as it uses basic Linux commands.
Make yourself familiar with those commands as they are widely used.
Here the command is used to locate the last entry of the "processed" pattern in "runRivet.log".
The result should look like:
20100 events processed

A typical 1st line in "runRivet.log" looks like:
===> [runRivet] Tue Jan 30 20:52:08 UTC 2024 [boinc pp jets 13000 1500 - pythia8 8.301 CP1-CR1 100000 78]
The second-to-last number inside the brackets (shown in bold in the original post) tells you how many events are to be processed.
Here: 100000

So, roughly 20% of the task is done.
Now, look at the task's walltime, say (example): 21 hours

=> This task has another 84 hours to go.
=> It will finish before the 10-days-limit.


If the oneliners I suggested don't work for you (e.g. due to missing access rights) feel free to copy the log to a folder where you have full rights and use an editor to look into the copy.
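
The estimate in the worked example can be written down as a one-line calculation. A minimal sketch, assuming the event rate stays constant for the rest of the task; the function name `remaining_hours` is made up for illustration:

```shell
# Estimate the remaining hours from events done, total events and
# hours elapsed, assuming a constant event rate.
remaining_hours() {
    awk -v n="$1" -v total="$2" -v elapsed="$3" 'BEGIN {
        if (n <= 0) { print "unknown"; exit }   # no progress yet: no rate to extrapolate
        printf "%.1f\n", elapsed * (total - n) / n
    }'
}

# The example above: 20100 of 100000 events after 21 hours.
remaining_hours 20100 100000 21
```

With the exact 20100 events this gives about 83.5 hours, consistent with the rounded 84 hours above. With 0 events processed no rate can be derived, so the sketch prints `unknown` rather than dividing by zero.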
[AF>Le_Pommier] Jerome_C2005

Joined: 17 Mar 15
Posts: 51
Credit: 602,329
RAC: 0
Message 8317 - Posted: 1 Feb 2024, 20:41:37 UTC - in response to Message 8316.  
Last modified: 1 Feb 2024, 20:42:20 UTC

No, I could not run any of the commands without sudo; they would throw an error :/ (except moving into the folder with cd, which works without sudo)

I'm sorry, I'm not good enough at Linux and I cannot invest too much time to become an expert... BOINC is already quite a time-consuming hobby.

I'm not sure I understand: if the 1st command could run (using sudo), why would that be a problem? Wouldn't it return the same result?

Regarding the 2nd command, since I don't understand why the "( )" and why the "tac /" are there, I just removed them and this time I got a result:

jerome@VM-Debian-OVH2:~$ sudo grep -m1 'processed' < /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log
0 events processed

So this time it can find the file and it looks for "processed" in it, and it says "0 events".

So I tried the same command with the other two long tasks returned by the 1st command

jerome@VM-Debian-OVH2:~$ sudo grep -m1 'processed' < /var/lib/boinc-client/slots/0/cernvm/shared/runRivet.log
0 events processed
jerome@VM-Debian-OVH2:~$ sudo grep -m1 'processed' < /var/lib/boinc-client/slots/1/cernvm/shared/runRivet.log
jerome@VM-Debian-OVH2:~$ 

The one in slot 0 is also "0 event" and the one in slot 1 returns nothing.

So I guess all this means these tasks have not been doing anything since the very beginning?

So I must cancel them right ? :(
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 1
Message 8318 - Posted: 1 Feb 2024, 21:14:45 UTC - in response to Message 8317.  

I don't understand why ... I just removed them and this time I got a result

Well, that's your problem!
You checked for the first line showing "processed" instead of the last line.


This forum is not the place to give basic lessons.
Neither for Linux nor for Windows.
If you don't want to invest the time to learn the simplest basics, it might be better not to try to "analyse" anything.

Instead of:
sudo grep -m1 'processed' < /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log

run this:
sudo grep 'processed' /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log
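
Put differently, `grep -m1` stops at the first matching line, which in a growing log is the oldest progress entry. To see only the most recent entry without `tac`, the full list of matches can be piped through `tail -n1`. A sketch on an invented sample log (both helper names are hypothetical):

```shell
# First (oldest) and last (newest) "processed" line in a log.
first_match() { grep -m1 'processed' "$1"; }
last_match()  { grep 'processed' "$1" | tail -n1; }

# Invented sample log with four progress entries.
sample=$(mktemp)
printf '%s events processed\n' 0 100 200 300 > "$sample"
first_match "$sample"   # oldest entry
last_match "$sample"    # newest entry
```

On the sample this prints `0 events processed` followed by `300 events processed`, which is why the `-m1` variant without `tac` always reported 0.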
[AF>Le_Pommier] Jerome_C2005

Joined: 17 Mar 15
Posts: 51
Credit: 602,329
RAC: 0
Message 8319 - Posted: 1 Feb 2024, 21:37:53 UTC

jerome@VM-Debian-OVH2:~$ sudo grep 'processed' /var/lib/boinc-client/slots/4/cernvm/shared/runRivet.log
0 events processed
100 events processed
200 events processed
300 events processed
400 events processed
500 events processed
600 events processed
jerome@VM-Debian-OVH2:~$ sudo grep 'processed' /var/lib/boinc-client/slots/0/cernvm/shared/runRivet.log
0 events processed
jerome@VM-Debian-OVH2:~$ sudo grep 'processed' /var/lib/boinc-client/slots/1/cernvm/shared/runRivet.log
jerome@VM-Debian-OVH2:~$ 

So only slot 4 has some events processed; the others have none. A good waste of computing power, it seems.

Compare that number with the task's total events and the actual runtime of the task.
This lets you calculate the estimated total runtime.

That works if you know the task's total events. But I don't.

But you know what, don't worry about me anymore: I cancelled the 4 tasks and switched to another project. I guess you have enough Linux expert volunteers to help you here (and pay for the computing power) to test your applications.
kotenok2000
Joined: 22 Aug 22
Posts: 22
Credit: 63,680
RAC: 1
Message 8321 - Posted: 5 Feb 2024, 9:45:09 UTC - in response to Message 8319.  

A typical 1st line in "runRivet.log" looks like:
===> [runRivet] Tue Jan 30 20:52:08 UTC 2024 [boinc pp jets 13000 1500 - pythia8 8.301 CP1-CR1 100000 78]
The second-to-last number inside the brackets (shown in bold in the original post) tells you how many events are to be processed.
Here: 100000
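
The total can also be read from the log with a short command instead of by eye. A minimal sketch, assuming every header line follows the layout quoted above, with the event count as the second-to-last whitespace-separated field; the function name `total_events` is made up for illustration:

```shell
# Extract the configured event count from the first line of runRivet.log.
# Assumes the count is the second-to-last whitespace-separated field.
total_events() {
    head -n1 "$1" | awk '{ print $(NF-1) }'
}

# Demo on a temporary file holding the header quoted above.
sample=$(mktemp)
echo '===> [runRivet] Tue Jan 30 20:52:08 UTC 2024 [boinc pp jets 13000 1500 - pythia8 8.301 CP1-CR1 100000 78]' > "$sample"
total_events "$sample"
```

This prints 100000. The slot 4 task in this thread has the same total in its header, so its 600 processed events amount to 0.6% of the job.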



©2024 CERN