Message boards : Theory Application : Long Runners
Author | Message |
---|---|
Joined: 15 Apr 15 Posts: 45 Credit: 872,818 RAC: 5,144 |
[Windows.] I have had several work units which have so far clocked over a day of runtime. I aborted one of them https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=2507890 which I now see maeax has running. Will see if he has any issues. But I have other units that I have not aborted. Shall I continue to run them? Thanks. |
Joined: 22 Apr 16 Posts: 767 Credit: 3,688,465 RAC: 13,107 |
I ran this command in CMD: cat /proc/filesystems | grep cgroup2 but for the moment it is very, very difficult to find the right instructions to get Docker running. |
Joined: 13 Feb 15 Posts: 1251 Credit: 994,625 RAC: 538 |
I aborted this one after running almost a day: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3435781 ===> [runRivet] Wed Jun 18 07:55:45 UTC 2025 [boinc pp winclusive 13000 - - pythia8 8.303 dire-default 100000 8] There was no further progress after the first 0 events processed, but was using 100% CPU. The last lines of runRivet.log: -------- PYTHIA Event Listing (hard process) ----------------------------------------------------------------------------------- no id name status mothers daughters colours p_x p_y p_z e m 0 90 (system) -11 0 0 0 0 0 0 0.000 0.000 0.000 13000.000 13000.000 1 2212 (p+) -12 0 0 3 0 0 0 0.000 0.000 6500.000 6500.000 0.938 2 2212 (p+) -12 0 0 4 0 0 0 0.000 0.000 -6500.000 6500.000 0.938 3 2 (u) -21 1 0 5 0 101 0 0.000 0.000 5.757 5.757 0.000 4 -1 (dbar) -21 2 0 5 0 0 101 0.000 0.000 -287.833 287.833 0.000 5 24 (W+) -22 3 4 6 7 0 0 0.000 0.000 -282.077 293.590 81.411 6 -13 mu+ 23 5 0 0 0 0 0 23.787 -32.540 -161.522 166.475 0.106 7 14 nu_mu 23 5 0 0 0 0 0 -23.787 32.540 -120.555 127.114 0.000 Charge sum: 1.000 Momentum sum: 0.000 0.000 -282.077 293.590 81.411 -------- End PYTHIA Event Listing ----------------------------------------------------------------------------------------------- WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. Rivet.AnalysisHandler: INFO Only using nominal weight: variation weights will be ignored 0 events processed WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. 
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. PYTHIA Info in DireTimes::pT2nextQCD_FF: Found large acceptance weight for Dire_fsr_qed_11->11&22_notPartial PYTHIA Warning in DireTimes::branch_FF: Reject state with kinematically forbidden kT^2. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. 
WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. PYTHIA Error in Pythia::check: energy-momentum not conserved PYTHIA Error in Pythia::next: check of event revealed problems WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. PYTHIA Warning in DireSpace::branch_IF: used up beam momentum; discard splitting. WARNING::ReaderAsciiHepMC2: weights are empty, an event weight 1.0 will be added. |
Joined: 22 Apr 16 Posts: 767 Credit: 3,688,465 RAC: 13,107 |
I have had several work units which have so far clocked over a day of runtime. I aborted one of them https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=2507890 which I now see maeax has running. Will see if he has any issues. But I have other units that I have not aborted. Exit status 0 (0x00000000) Computer ID 4861 Run time 4 days 6 hours 16 min 14 sec CPU time 4 days 2 hours 24 min 36 sec Validate state Valid Long way, but successful! |
Joined: 15 Apr 15 Posts: 45 Credit: 872,818 RAC: 5,144 |
Good to know, thanks! |
Joined: 17 Mar 15 Posts: 93 Credit: 991,880 RAC: 7,930 |
I have a very long runner, with 6 days 7 hours of calculation time, 63% progress (the % has been moving up) and a deadline of tomorrow 05:40, called Theory_2914-4738167-19, that I cannot even find in the pending tasks on my account! I checked all pending tasks on my account one by one: they are named with the "Theory_2922-xxx" pattern, and the other tasks I have ongoing on my iMac match this pattern. I had let it run because I read here that they could succeed, but now I have strong doubts... I've seen some very long tasks with more than 3 days that did succeed (or others with more than 1 day like this one or that one). I can see it is Theory Simulation 7.58 (vbox64_theory) when all the others are Theory Simulation 7.61 (vbox64_theory), so maybe the 7.58 version has been "excluded" somehow from the log, as if deprecated? And I still have a forgotten one (that will never get reported then)? There would be no point in cancelling it now; I'll see what happens tomorrow. |
Joined: 22 Apr 16 Posts: 767 Credit: 3,688,465 RAC: 13,107 |
Do you see a .log file in Darwin for this vbox task? Yes, only version 7.61 has been available, since last Thursday, for all operating systems. After 10 days of runtime, the task shuts itself down. |
Joined: 13 Feb 15 Posts: 1251 Credit: 994,625 RAC: 538 |
In reply to [AF>Le_Pommier] Jerome_C2005's message of 5 Jul 2025: I have a very long runner, with 6 days 7 hours of calculation time, 63% progress (the % has been moving up) and a deadline of tomorrow 05:40, called Theory_2914-4738167-19, that I cannot even find in the pending tasks on my account! I checked all pending tasks on my account one by one: they are named with the "Theory_2922-xxx" pattern, and the other tasks I have ongoing on my iMac match this pattern. This is the workunit your task belongs to. It's still in progress, not pending: ==>> https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=2502695 |
Joined: 11 Feb 15 Posts: 6 Credit: 2,036,721 RAC: 211 |
In reply to [AF>Le_Pommier] Jerome_C2005's message of 5 Jul 2025: I have a very long runner, with 6 days 7 hours of calculation time, 63% progress (the % has been moving up) and a deadline of tomorrow 05:40, called Theory_2914-4738167-19, that I cannot even find in the pending tasks on my account! I checked all pending tasks on my account one by one: they are named with the "Theory_2922-xxx" pattern, and the other tasks I have ongoing on my iMac match this pattern. I have 4 very long WUs: 13 days: 97%, 10 days: 79%, 8 days: 63% and 7 days: 59% |
Joined: 17 Mar 15 Posts: 93 Credit: 991,880 RAC: 7,930 |
In reply to Crystal Pellet's message of 5 Jul 2025: This is the workunit your task belongs to. It's still in progress, not pending: ==>> https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=2502695 Oh really great, so it got cancelled by the project. If you are generating such long runners it would be reasonable to have a trickle mechanism and extend the deadline, and not let them die after such a long running time. Yes, I had checked the slot and there was life in the log file. I just saw these messages this morning and I'm not at home / have no access to the computer for the moment, so I'll only be able to cancel it tonight. Or is this not supposed to happen and it is accidental, like a combination of input parameters that makes it run much longer than expected? So what is the rule? Must we cancel them if they exceed ... a number of days? |
Joined: 13 Feb 15 Posts: 1251 Credit: 994,625 RAC: 538 |
The task is not cancelled by the project. It has timed out. Normally a resend would be sent to another client, but this was already the 3rd try for this workunit (3 = max), so the server will validate your result even when it's late. There is really no rule for when one should cancel a long runner. It's up to the user. In the slot/shared folder there's the runRivet.log. There you can find how many events have been processed so far and, on the 1st line, the maximum number of events (mostly 100000). The % progress shown in BOINC Manager is useless. |
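The check described above (maximum number of events on the first line of runRivet.log, events processed so far further down) can be sketched in Python. This is only a rough sketch under assumptions: it assumes the first line looks like the header quoted earlier in this thread ("===> [runRivet] ... [boinc pp winclusive 13000 - - pythia8 8.303 dire-default 100000 8]") with the event count as the second-to-last field inside the last bracket, and that progress lines contain "<N> events processed". The function name `rivet_progress` is hypothetical.

```python
import re


def rivet_progress(log_text):
    """Return (events_processed, max_events) parsed from runRivet.log text.

    Either value is None when it cannot be found.
    """
    lines = log_text.splitlines()

    max_events = None
    if lines:
        # Assumption: the job parameters sit in the last [...] group of line 1
        # and the event count is the second-to-last field in that group.
        groups = re.findall(r"\[([^\]]*)\]", lines[0])
        if groups:
            fields = groups[-1].split()
            if len(fields) >= 2 and fields[-2].isdigit():
                max_events = int(fields[-2])

    processed = None
    for line in lines:
        # Keep the last "<N> events processed" value seen in the log.
        match = re.search(r"\b(\d+) events processed", line)
        if match:
            processed = int(match.group(1))

    return processed, max_events


if __name__ == "__main__":
    sample = (
        "===> [runRivet] Wed Jun 18 07:55:45 UTC 2025 "
        "[boinc pp winclusive 13000 - - pythia8 8.303 dire-default 100000 8]\n"
        "0 events processed\n"
        "1200 events processed\n"
    )
    print(rivet_progress(sample))
```

If the processed count stays at 0 (or is missing) while the task keeps using CPU for days, that matches the stuck tasks reported in this thread; a log that is still in an integration phase will not show any "events processed" line yet.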
Joined: 8 Apr 15 Posts: 806 Credit: 14,880,642 RAC: 13,104 |
Yeah, some can run a long time and I have done many of those over the years. I always watch the running logs to see if they actually stop doing work, but will keep running the task until 10 days. Just tonight I decided to abort a couple of them over at production that were still running according to the log files but were over 9 days, so I could get back to CMS here. I tend to save info and I see one from June of last year that ran Valid: Run time 8 days 18 hours 41 min 28 sec CPU time 8 days 17 hours 14 min 35 sec Validate state Valid Credit 7,539.68 And several others running over 6 hours....... so it just depends: check the running log and see if it is actually working, or if you don't want to run those long ones, just abort it. Mad Scientist For Life |
Joined: 17 Mar 15 Posts: 93 Credit: 991,880 RAC: 7,930 |
So the task is still running on my iMac and - the runRivet.log in the slot/shared subfolder has not been updated since 26/06, the last entry is Integrate 70 of 760: integrating ME[cbar,bbar->mu+,bbar,cbar,mu-].SubtractedReal.SubtractionIntegral : cbar bbar -> mu+ bbar cbar mu- , iteration 1 0% 10 20 30 40 50 60 70 80 90 100% |----|----|----|----|----|----|----|----|----|----| *************************************************** integrated ( 1.43752e-06 +/- 3.36695e-07 ) nb epsilon = 0.000251302 --------------------------------------------------- integrating ME[cbar,bbar->mu+,bbar,cbar,mu-].SubtractedReal.SubtractionIntegral : cbar bbar -> mu+ bbar cbar mu- , iteration 2 0% 10 20 30 40 50 60 70 80 90 100% |----|----|----|----|----|----|----|----|----|----| *************************************************** integrated ( 1.29986e-06 +/- 1.16361e-07 ) nb epsilon = 0.00182504 chi2 = 0.189831 --------------------------------------------------- integrating ME[cbar,bbar->mu+,bbar,cbar,mu-].SubtractedReal.SubtractionIntegral : cbar bbar -> mu+ bbar cbar mu- , iteration 3 0% 10 20 30 40 50 60 70 80 90 100% |----|----|----|----|----|----|----|----|----|----| *************************************************** integrated ( 1.27413e-06 +/- 6.98281e-08 ) nb epsilon = 0.00216663 chi2 = 0.133114 --------------------------------------------------- integrating ME[cbar,bbar->mu+,bbar,cbar,mu-].SubtractedReal.SubtractionIntegral : cbar bbar -> mu+ bbar cbar mu- , iteration 4 0% 10 20 30 40 50 60 70 80 90 100% |----|----|----|----|----|----|----|----|----|----| ** - the stderr.txt in the slot folder is updated frequently (last update 1 hour ago) 2025-07-07 16:50:45 (1157): Status Report: Job Duration: '864000.000000' 2025-07-07 16:50:45 (1157): Status Report: Elapsed Time: '728770.878529' 2025-07-07 16:50:45 (1157): Status Report: CPU Time: '674297.420000' 2025-07-07 18:30:17 (1157): Status Report: Job Duration: '864000.000000' 2025-07-07 18:30:17 (1157): Status Report: Elapsed Time: 
'734770.878529' 2025-07-07 18:30:17 (1157): Status Report: CPU Time: '679996.710000' - the 9 VBoxHeadless processes corresponding to the 9 running tasks are all eating CPU What shall I do ? I'm still asking because I read "so the server will validate your result even when it's late." |
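The stderr.txt status lines quoted above can be read mechanically. A minimal Python sketch, assuming the "Status Report: <name>: '<seconds>'" shape seen in this post, and assuming that Job Duration (864000 s, i.e. 10 days) is the run-time limit after which the task shuts itself down, as mentioned earlier in the thread. The function names are hypothetical.

```python
import re

# Matches lines such as:
#   2025-07-07 18:30:17 (1157): Status Report: Elapsed Time: '734770.878529'
STATUS = re.compile(
    r"Status Report: (Job Duration|Elapsed Time|CPU Time): '([0-9.]+)'"
)


def latest_status(stderr_text):
    """Return the most recent value (in seconds) for each status field."""
    latest = {}
    for name, value in STATUS.findall(stderr_text):
        latest[name] = float(value)  # later reports overwrite earlier ones
    return latest


def seconds_until_limit(stderr_text):
    """Seconds left before Elapsed Time reaches Job Duration (assumed limit)."""
    status = latest_status(stderr_text)
    return status["Job Duration"] - status["Elapsed Time"]


if __name__ == "__main__":
    sample = (
        "2025-07-07 18:30:17 (1157): Status Report: Job Duration: '864000.000000'\n"
        "2025-07-07 18:30:17 (1157): Status Report: Elapsed Time: '734770.878529'\n"
        "2025-07-07 18:30:17 (1157): Status Report: CPU Time: '679996.710000'\n"
    )
    left = seconds_until_limit(sample)
    print(f"{left / 3600:.1f} hours left before the assumed 10-day limit")
```

For the values in this post, 864000 - 734770.878529 is about 129229 s, i.e. roughly 35.9 hours of run time left before the assumed limit.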
Joined: 22 Apr 16 Posts: 767 Credit: 3,688,465 RAC: 13,107 |
https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=2516998 A never-ending task. The two tasks that ran before yours never ended either. |
Joined: 13 Feb 15 Posts: 1251 Credit: 994,625 RAC: 538 |
In reply to [AF>Le_Pommier] Jerome_C2005's message of 7 Jul 2025: What shall I do? You probably don't see "..... events processed" in your runRivet.log. It's better to abort that task. I don't know macOS, but on Windows you have 3 VBoxHeadless.exe processes running for 1 task, one of which uses the most CPU. |
Joined: 17 Mar 15 Posts: 93 Credit: 991,880 RAC: 7,930 |
In reply to Crystal Pellet's message of 8 Jul 2025: You probably don't see "..... events processed" in your runRivet.log. Again, this I'll only be able to check and do tonight, once back home. I don't know macOS, but on Windows you have 3 VBoxHeadless.exe processes running for 1 task, one of which uses the most CPU. I've always seen the same number of VBoxHeadless processes as the number of running tasks, each using almost one CPU thread (for Theory; other LHC apps have mt capability) |
Joined: 17 Mar 15 Posts: 93 Credit: 991,880 RAC: 7,930 |
Now that I'm home I realize that - the task has jumped to 94.8% completion - runRivet.log is updating again and keeps adding stars at the end of the log, like this Integrate 100 of 760: integrating ME[cbar,s->e+,cbar,s,e-].SubtractedReal.SubtractionIntegral : cbar s -> e+ cbar s e- , iteration 1 0% 10 20 30 40 50 60 70 80 90 100% |----|----|----|----|----|----|----|----|----|----| *************************************************** integrated ( 1.07461e-06 +/- 6.78547e-07 ) nb epsilon = 0.00017382 --------------------------------------------------- integrating ME[cbar,s->e+,cbar,s,e-].SubtractedReal.SubtractionIntegral : cbar s -> e+ cbar s e- , iteration 2 0% 10 20 30 40 50 60 70 80 90 100% |----|----|----|----|----|----|----|----|----|----| ******* * * * * * * * * * * * * * * * * * * * * * * * * * (in the time I took to write this post, about 10 more stars were added) So after 9 days 10 hours of calculation, with a claimed 12 hours remaining, there is no point in cancelling it now; we'll see what happens. |
Joined: 17 Mar 15 Posts: 93 Credit: 991,880 RAC: 7,930 |
The task finally ended this morning. In my BoincTasks history it is marked as OK, but on the website account it is "calculation error", no credit. Don't trust the long running tasks. |
Joined: 17 Mar 15 Posts: 93 Credit: 991,880 RAC: 7,930 |
Jesus, another one: 5 days 18 hours, 42%, deadline in 5 days, but that is 10 hours before the estimated end date... |
Joined: 17 Mar 15 Posts: 93 Credit: 991,880 RAC: 7,930 |
It is epidemic. You can see the deadline is always 10 hours before the estimated end date! They are all v7.61, unlike the huge one I mentioned above that finally failed. So what? Are 7.61 tasks more reliable, and can I trust that they will end in time? Or had I better always cancel them if they take more than about one day because "you never know if they will end", so that all this CPU crunch time is systematically lost forever in oblivion? |
©2025 CERN