Message boards : Theory Application : Long Runners

rbpeake
Joined: 15 Apr 15
Posts: 45
Credit: 872,818
RAC: 5,144
Message 8835 - Posted: 16 Jun 2025, 14:54:09 UTC
Last modified: 16 Jun 2025, 14:55:48 UTC

[Windows.]
I have had several work units which have so far clocked over a day of runtime. I aborted one of them https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=2507890 which I now see maeax has running. Will see if he has any issues. But I have other units that I have not aborted.

Shall I continue to run them?

Thanks.

maeax
Joined: 22 Apr 16
Posts: 767
Credit: 3,688,465
RAC: 13,107
Message 8836 - Posted: 16 Jun 2025, 15:24:19 UTC - in response to Message 8835.  

I ran this command in CMD:
cat /proc/filesystems | grep cgroup2

but for the moment it is very, very difficult to find the right instructions to get Docker running.
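
For anyone who prefers a script to the grep above, here is a minimal Python sketch of the same check. It only tells you whether the kernel lists cgroup2 in /proc/filesystems; it says nothing about whether Docker itself is configured correctly. The function name supports_cgroup2 is made up for this example, and unlike grep it matches the exact filesystem name rather than any substring.

```python
from pathlib import Path


def supports_cgroup2(filesystems_text: str) -> bool:
    """Return True if 'cgroup2' is listed as a filesystem name."""
    for line in filesystems_text.splitlines():
        fields = line.split()
        # Each line looks like "[nodev] <fsname>"; the name is the last field.
        if fields and fields[-1] == "cgroup2":
            return True
    return False


if __name__ == "__main__":
    path = Path("/proc/filesystems")
    if path.exists():
        print("cgroup2 listed:", supports_cgroup2(path.read_text()))
    else:
        print("/proc/filesystems not found (not a Linux environment?)")
```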

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1251
Credit: 994,625
RAC: 538
Message 8838 - Posted: 19 Jun 2025, 6:56:00 UTC - in response to Message 8835.  

I aborted this one after running almost a day: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3435781
===> [runRivet] Wed Jun 18 07:55:45 UTC 2025 [boinc pp winclusive 13000 - - pythia8 8.303 dire-default 100000 8]

There was no further progress after the first 0 events processed, but was using 100% CPU.

The last lines of runRivet.log:

-------- PYTHIA Event Listing (hard process) -----------------------------------------------------------------------------------

  no    id  name      status   mothers daughters   colours       p_x       p_y        p_z          e          m
   0    90  (system)     -11    0    0    0    0    0    0     0.000     0.000      0.000  13000.000  13000.000
   1  2212  (p+)         -12    0    0    3    0    0    0     0.000     0.000   6500.000   6500.000      0.938
   2  2212  (p+)         -12    0    0    4    0    0    0     0.000     0.000  -6500.000   6500.000      0.938
   3     2  (u)          -21    1    0    5    0  101    0     0.000     0.000      5.757      5.757      0.000
   4    -1  (dbar)       -21    2    0    5    0    0  101     0.000     0.000   -287.833    287.833      0.000
   5    24  (W+)         -22    3    4    6    7    0    0     0.000     0.000   -282.077    293.590     81.411
   6   -13  mu+           23    5    0    0    0    0    0    23.787   -32.540   -161.522    166.475      0.106
   7    14  nu_mu         23    5    0    0    0    0    0   -23.787    32.540   -120.555    127.114      0.000
                    Charge sum:  1.000       Momentum sum:     0.000     0.000   -282.077    293.590     81.411

-------- End PYTHIA Event Listing -----------------------------------------------------------------------------------------------
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
Rivet.AnalysisHandler: INFO Only using nominal weight: variation weights will be ignored
0 events processed
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
PYTHIA Info in DireTimes::pT2nextQCD_FF: Found large acceptance weight for Dire_fsr_qed_11->11&22_notPartial
PYTHIA Warning in DireTimes::branch_FF: Reject state with kinematically forbidden kT^2.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
PYTHIA Error in Pythia::check: energy-momentum not conserved
PYTHIA Error in Pythia::next: check of event revealed problems
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.
PYTHIA Warning in DireSpace::branch_IF: used up beam momentum; discard splitting.
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added.

maeax
Joined: 22 Apr 16
Posts: 767
Credit: 3,688,465
RAC: 13,107
Message 8840 - Posted: 20 Jun 2025, 10:37:19 UTC

I have had several work units which have so far clocked over a day of runtime. I aborted one of them https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=2507890 which I now see maeax has running. Will see if he has any issues. But I have other units that I have not aborted.

Exit status 0 (0x00000000)
Computer ID 4861
Run time 4 days 6 hours 16 min 14 sec
CPU time 4 days 2 hours 24 min 36 sec
Validate state Valid

Long way, but successful!

rbpeake
Joined: 15 Apr 15
Posts: 45
Credit: 872,818
RAC: 5,144
Message 8844 - Posted: 20 Jun 2025, 22:49:41 UTC - in response to Message 8840.  

Good to know, thanks!

[AF>Le_Pommier] Jerome_C2005
Joined: 17 Mar 15
Posts: 93
Credit: 991,880
RAC: 7,930
Message 8872 - Posted: 5 Jul 2025, 10:35:03 UTC

I have a very long runner called Theory_2914-4738167-19, with 6 days 7 hours of calculation time, 63% progress (though the % has been moving up) and a deadline of tomorrow 05:40, and I cannot even find it in the pending tasks on my account! I checked all the pending tasks on my account one by one: they are named with the "Theory_2922-xxx" pattern, and the other tasks currently running on my iMac do match this pattern.

I had let it run because I read here that they could succeed, but now I have strong doubts...

I've seen some very long tasks running more than 3 days that did succeed (and others running more than 1 day, like this one or that one).

I can see it is Theory Simulation 7.58 (vbox64_theory) while all the others are Theory Simulation 7.61 (vbox64_theory), so maybe the 7.58 version has somehow been "excluded" from the log, as if deprecated, and I still have a forgotten one (which will then never get reported)?

There would be no point in cancelling it now; I'll see what happens tomorrow.

maeax
Joined: 22 Apr 16
Posts: 767
Credit: 3,688,465
RAC: 13,107
Message 8873 - Posted: 5 Jul 2025, 10:48:50 UTC - in response to Message 8872.  

Do you see a .log file on Darwin from this vbox task?

Yes, only version 7.61 has been available since last Thursday, for all OSes.
After 10 days of runtime, the task shuts itself down.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1251
Credit: 994,625
RAC: 538
Message 8875 - Posted: 5 Jul 2025, 11:37:14 UTC - in response to Message 8872.  

In reply to [AF>Le_Pommier] Jerome_C2005's message of 5 Jul 2025:
I have a very long runner, with 6 days 7 hours calculation time, 63% advancement (but the % has been moving up) and a deadline for tomorrow 05:40 called Theory_2914-4738167-19 that I cannot even find in the pending tasks on my account ! I checked all pending tasks one by one on my account, they are named with "Theory_2922-xxx" pattern, the other tasks I have on-going on my iMac do match this pattern.

This is the workunit your task belongs to. It is still in progress, not pending: https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=2502695

fzs600
Joined: 11 Feb 15
Posts: 6
Credit: 2,036,721
RAC: 211
Message 8876 - Posted: 5 Jul 2025, 13:53:38 UTC - in response to Message 8872.  

In reply to [AF>Le_Pommier] Jerome_C2005's message of 5 Jul 2025:
I have a very long runner, with 6 days 7 hours calculation time, 63% advancement (but the % has been moving up) and a deadline for tomorrow 05:40 called Theory_2914-4738167-19 that I cannot even find in the pending tasks on my account ! I checked all pending tasks one by one on my account, they are named with "Theory_2922-xxx" pattern, the other tasks I have on-going on my iMac do match this pattern.

I had let it run because I read here that they could succeed, but now I have strong doubts...

I've seen some very long task with more than 3 days that did succeed (or others with more than 1 day like this one or that one).

I can see it is Theory Simulation 7.58 (vbox64_theory) when all the others are Theory Simulation 7.61 (vbox64_theory), so maybe the 7.58 version has been "excluded" somehow from the log, like deprecated ? and I still have a forgotten one (that will never get reported then) ?

I would make no point cancelling it now, I'll see what happens tomorrow.


I have 4 very long WUs:
13 days: 97%
10 days: 79%
8 days: 63%
and 7 days: 59%

[AF>Le_Pommier] Jerome_C2005
Joined: 17 Mar 15
Posts: 93
Credit: 991,880
RAC: 7,930
Message 8877 - Posted: 7 Jul 2025, 8:31:36 UTC - in response to Message 8875.  
Last modified: 7 Jul 2025, 8:31:58 UTC

In reply to Crystal Pellet's message of 5 Jul 2025:
This is the workunit your task is belonging to. It's still in progress, not pending: ==>> https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=2502695
Oh really great, so it got cancelled by the project.

If you are generating such long runners, it would be reasonable to have a trickle mechanism and extend the deadline, rather than let them die after such a long running time. Yes, I had checked the slot and there was life in the log file. I only saw these messages this morning and I'm not at home, so I have no access to the computer for the moment; I'll only be able to cancel it tonight.

Or is this not supposed to happen, and it is accidental? Like a combination of input parameters that makes it run much longer than expected?

So what is the rule? Must we cancel them if they exceed a certain number of days?

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1251
Credit: 994,625
RAC: 538
Message 8878 - Posted: 7 Jul 2025, 8:50:30 UTC - in response to Message 8877.  
Last modified: 7 Jul 2025, 8:54:54 UTC

The task was not cancelled by the project. It timed out.
Normally a resend would be sent to another client, but this was already the 3rd try for this workunit (3 = max), so the server will validate your result even when it's late.
There is really no rule for when one should cancel a long runner. It's up to the user.
In the slot/shared folder there is the runRivet.log file. There you can find how many events have been processed so far and, on the first line, the maximum number of events (mostly 100000).

The % progress shown in BOINC Manager is useless.
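
As a rough illustration of reading progress from runRivet.log instead of from BOINC Manager, here is a small Python sketch. It assumes the header is the first line of the log, that the job parameters sit in its last [...] group, and that the event maximum is the second-to-last token there; that layout is inferred from the single header line posted earlier in this thread and may not hold for every job. The function names are invented for this example.

```python
import re


def max_events_from_header(first_line):
    """Read the requested event count from the runRivet header line.

    Assumption (from one sample only): the job parameters are in the last
    [...] group and the event count is the second-to-last token.
    """
    groups = re.findall(r"\[([^\]]*)\]", first_line)
    return int(groups[-1].split()[-2])


def last_events_processed(log_text):
    """Return the most recent 'N events processed' count, or None if absent."""
    counts = re.findall(r"^\s*(\d+) events processed", log_text, flags=re.MULTILINE)
    return int(counts[-1]) if counts else None


def event_progress(log_text):
    """Return (processed, maximum, fraction) based on events, not on BOINC's %."""
    maximum = max_events_from_header(log_text.splitlines()[0])
    processed = last_events_processed(log_text)
    fraction = None if processed is None else processed / maximum
    return processed, maximum, fraction


sample = """===> [runRivet] Wed Jun 18 07:55:45 UTC 2025 [boinc pp winclusive 13000 - - pythia8 8.303 dire-default 100000 8]
Rivet.AnalysisHandler: INFO Only using nominal weight: variation weights will be ignored
0 events processed
"""
print(event_progress(sample))  # (0, 100000, 0.0)
```

If the count in the log stops increasing for a long time while the CPU stays busy, that matches the stuck task described earlier in this thread; a log with no "events processed" line at all returns None here.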

Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 806
Credit: 14,880,642
RAC: 13,104
Message 8879 - Posted: 7 Jul 2025, 9:34:35 UTC
Last modified: 7 Jul 2025, 9:35:51 UTC

Yeah, some can run a long time, and I have done many of those over the years.
I always watch the running logs to see if they actually stop doing work, but I will keep running a task for up to 10 days.

Just tonight I decided to abort a couple of them over at production that were still running according to the log files but were over 9 days in, so I could get back to CMS here.

I tend to save info, and I see one from June of last year that came back Valid:
Run time 8 days 18 hours 41 min 28 sec
CPU time 8 days 17 hours 14 min 35 sec
Validate state Valid
Credit 7,539.68


And several others running over 6 hours... so it just depends: check the running log to see whether it is actually working, or if you don't want to run those long ones, just abort them.
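
Watching the log by hand can be approximated with a small Python sketch that checks how long ago the log file was last written. The 6-hour threshold and the slot path are placeholders I made up, not project recommendations. Treat the result only as a hint: later in this thread a runRivet.log sat untouched for over a week and then started updating again.

```python
import os
import time


def is_stalled(last_modified, now, max_idle_hours=6.0):
    """True if the log was last written more than max_idle_hours ago."""
    return (now - last_modified) > max_idle_hours * 3600


def log_is_stalled(path, max_idle_hours=6.0):
    """Apply is_stalled to a real file using its modification time."""
    return is_stalled(os.path.getmtime(path), time.time(), max_idle_hours)


if __name__ == "__main__":
    # Hypothetical path: adjust to the slot directory of the task in question.
    log_path = os.path.join("slots", "0", "shared", "runRivet.log")
    if os.path.exists(log_path):
        print("stalled" if log_is_stalled(log_path) else "still being written")
```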
Mad Scientist For Life

[AF>Le_Pommier] Jerome_C2005
Joined: 17 Mar 15
Posts: 93
Credit: 991,880
RAC: 7,930
Message 8880 - Posted: 7 Jul 2025, 17:38:11 UTC
Last modified: 7 Jul 2025, 17:39:14 UTC

So the task is still running on my iMac and

- the runRivet.log in the slot/shared subfolder has not been updated since 26/06, the last entry is

Integrate 70 of 760:
integrating ME[cbar,bbar->mu+,bbar,cbar,mu-].SubtractedReal.SubtractionIntegral : cbar bbar -> mu+ bbar cbar mu- , iteration 1
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
integrated ( 1.43752e-06 +/- 3.36695e-07 ) nb
epsilon = 0.000251302
---------------------------------------------------
integrating ME[cbar,bbar->mu+,bbar,cbar,mu-].SubtractedReal.SubtractionIntegral : cbar bbar -> mu+ bbar cbar mu- , iteration 2
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
integrated ( 1.29986e-06 +/- 1.16361e-07 ) nb
epsilon = 0.00182504 chi2 = 0.189831
---------------------------------------------------
integrating ME[cbar,bbar->mu+,bbar,cbar,mu-].SubtractedReal.SubtractionIntegral : cbar bbar -> mu+ bbar cbar mu- , iteration 3
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
integrated ( 1.27413e-06 +/- 6.98281e-08 ) nb
epsilon = 0.00216663 chi2 = 0.133114
---------------------------------------------------
integrating ME[cbar,bbar->mu+,bbar,cbar,mu-].SubtractedReal.SubtractionIntegral : cbar bbar -> mu+ bbar cbar mu- , iteration 4
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**

- the stderr.txt in the slot folder is updated frequently (last update 1 hour ago)

2025-07-07 16:50:45 (1157): Status Report: Job Duration: '864000.000000'
2025-07-07 16:50:45 (1157): Status Report: Elapsed Time: '728770.878529'
2025-07-07 16:50:45 (1157): Status Report: CPU Time: '674297.420000'
2025-07-07 18:30:17 (1157): Status Report: Job Duration: '864000.000000'
2025-07-07 18:30:17 (1157): Status Report: Elapsed Time: '734770.878529'
2025-07-07 18:30:17 (1157): Status Report: CPU Time: '679996.710000'

- the 9 VBoxHeadless processes corresponding to the 9 running tasks are all eating CPU

What shall I do?

I'm still asking because I read "so the server will validate your result even when it's late."
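
For reference, the Status Report lines above can be turned into a remaining-time figure with a few lines of Python. This is only a sketch: it assumes that Job Duration (864000 seconds, which is 10 days) is the point at which the task shuts itself down, as mentioned earlier in the thread, and the helper names are invented here.

```python
import re

STATUS_RE = re.compile(
    r"Status Report: (Job Duration|Elapsed Time|CPU Time): '([0-9.]+)'"
)


def latest_status(stderr_text):
    """Return the most recent value (in seconds) of each Status Report field."""
    latest = {}
    for name, value in STATUS_RE.findall(stderr_text):
        latest[name] = float(value)  # later lines overwrite earlier ones
    return latest


def seconds_remaining(status):
    """Seconds left before Elapsed Time reaches Job Duration."""
    return status["Job Duration"] - status["Elapsed Time"]


sample = """2025-07-07 16:50:45 (1157): Status Report: Job Duration: '864000.000000'
2025-07-07 16:50:45 (1157): Status Report: Elapsed Time: '728770.878529'
2025-07-07 16:50:45 (1157): Status Report: CPU Time: '674297.420000'
2025-07-07 18:30:17 (1157): Status Report: Job Duration: '864000.000000'
2025-07-07 18:30:17 (1157): Status Report: Elapsed Time: '734770.878529'
2025-07-07 18:30:17 (1157): Status Report: CPU Time: '679996.710000'
"""
status = latest_status(sample)
print(f"{seconds_remaining(status) / 3600:.1f} hours until the job duration limit")
print(f"CPU efficiency: {status['CPU Time'] / status['Elapsed Time']:.1%}")
```

With the last report above, that gives roughly 36 hours before the 10-day mark is reached.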

maeax
Joined: 22 Apr 16
Posts: 767
Credit: 3,688,465
RAC: 13,107
Message 8881 - Posted: 7 Jul 2025, 17:58:50 UTC

https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=2516998
A never-ending task. The two attempts before your task were also running and never ended.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1251
Credit: 994,625
RAC: 538
Message 8882 - Posted: 8 Jul 2025, 4:57:19 UTC - in response to Message 8880.  

In reply to [AF>Le_Pommier] Jerome_C2005's message of 7 Jul 2025:
What shall I do ?

I'm still asking because I read "so the server will validate your result even when it's late."
You probably don't see "..... events processed" in your runRivet.log.
It's better to abort that task.
I don't know macOS, but on Windows there are 3 VBoxHeadless.exe processes running per task, one of which uses most of the CPU.

[AF>Le_Pommier] Jerome_C2005
Joined: 17 Mar 15
Posts: 93
Credit: 991,880
RAC: 7,930
Message 8883 - Posted: 8 Jul 2025, 7:42:08 UTC - in response to Message 8882.  

In reply to Crystal Pellet's message of 8 Jul 2025:
You probably don't see "..... events processed" in your runRivet.log.
It's better to abort that task.
Again, I'll only be able to check and do this tonight, once I'm back home.

I don't know the MAC-OS, but on Windows you have 3 VBoxHeadless.exe running for 1 task, whereof one process uses the most CPU.
I've always seen as many VBoxHeadless processes as running tasks, each using almost one CPU thread (for Theory; other LHC apps have multithreading capability).

[AF>Le_Pommier] Jerome_C2005
Joined: 17 Mar 15
Posts: 93
Credit: 991,880
RAC: 7,930
Message 8885 - Posted: 8 Jul 2025, 17:47:25 UTC - in response to Message 8883.  
Last modified: 8 Jul 2025, 17:48:15 UTC

Now that I'm home I realize that

- the task has jumped to 94.8% completion
- runRivet.log is updating again and keeps adding stars at the end of the log, like this

Integrate 100 of 760:
integrating ME[cbar,s->e+,cbar,s,e-].SubtractedReal.SubtractionIntegral : cbar s -> e+ cbar s e- , iteration 1
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
integrated ( 1.07461e-06 +/- 6.78547e-07 ) nb
epsilon = 0.00017382
---------------------------------------------------
integrating ME[cbar,s->e+,cbar,s,e-].SubtractedReal.SubtractionIntegral : cbar s -> e+ cbar s e- , iteration 2
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
*******
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
(in the time it took me to write this post, about 10 more stars were added)

So after 9 days 10 hours of calculation, with a claimed 12 hours remaining, there is no point in cancelling it now; we'll see what happens.

[AF>Le_Pommier] Jerome_C2005
Joined: 17 Mar 15
Posts: 93
Credit: 991,880
RAC: 7,930
Message 8889 - Posted: 9 Jul 2025, 19:07:31 UTC - in response to Message 8885.  

The task finally ended this morning. In my BoincTasks history it is marked as OK, but on my account on the website it is a "calculation error", with no credit.

Don't trust the long running tasks.

[AF>Le_Pommier] Jerome_C2005
Joined: 17 Mar 15
Posts: 93
Credit: 991,880
RAC: 7,930
Message 8894 - Posted: 10 Jul 2025, 17:34:21 UTC

Jesus, another one: 5 days 18 hours, 42%, deadline in 5 days, but that is 10 hours before the estimated end date...

[AF>Le_Pommier] Jerome_C2005
Joined: 17 Mar 15
Posts: 93
Credit: 991,880
RAC: 7,930
Message 8906 - Posted: 12 Jul 2025, 11:10:58 UTC

It is an epidemic.

You can see the deadline is always 10 hours before the estimated end date!

They are all v7.61, unlike the huge one I mentioned above that finally failed.

So what? Are 7.61 tasks more reliable, and can I trust that they will end in time?

Or had I better always cancel them if they take more than about one day, because "you never know if they will end", so that all this CPU crunch time is systematically lost forever?