Message boards : Theory Application : Endless Theory job
Joined: 13 Feb 15 Posts: 1188 Credit: 878,593 RAC: 97
===> [runRivet] Fri May 13 16:43:28 CEST 2016 [boinc ee zhad 133 - - sherpa 1.2.3 default 100000 220]
...
...
1900 events processed
Phase_Space_Handler::OneEvent(): Point for '2_4__e-__e+__u__c__ub__cb' exceeds maximum by 0.343478.
Event 2000 ( 1m 1s elapsed / 50m 31s left ) -> ETA: Fri May 13 17:57
2000 events processed
dumping histograms...
Event 2100 ( 1m 5s elapsed / 51m 9s left ) -> ETA: Fri May 13 17:58
2100 events processed
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
Display update finished (55 histograms, 2000 events).
etc etc etc
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Job has been running for 7 hours now and does not seem to go anywhere.

===> [runRivet] Sat May 14 00:34:57 CEST 2016 [boinc ppbar uemb-soft 63 - - sherpa 2.1.1 default 1000 256]
00:34:57 +0200 2016-05-14 [INFO] New Job Starting
Output_Phase::Output_Phase(): Set output interval 1000000000 events.
----------------------------------------------------------
-- SHERPA generates events with the following structure --
----------------------------------------------------------
Perturbative   : Signal_Processes
Perturbative   : Hard_Decays
Perturbative   : Jet_Evolution:CSS
Perturbative   : Lepton_FS_QED_Corrections:Photons
Perturbative   : Multiple_Interactions:Amisic
Perturbative   : Minimum_Bias:Off
Hadronization  : Beam_Remnants
Hadronization  : Hadronization:Ahadic
Hadronization  : Hadron_Decays
Analysis       : HepMC2
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
...
about 400 of the same
Joined: 13 Feb 15 Posts: 1188 Credit: 878,593 RAC: 97
I got the same endless job again:

===> [runRivet] Sat May 14 09:21:16 CEST 2016 [boinc ee zhad 133 - - sherpa 1.2.3 default 100000 220]
...
...
Event 2000 ( 1m 13s elapsed / 1h 8s left ) -> ETA: Sat May 14 10:44
2000 events processed
dumping histograms...
Event 2100 ( 1m 18s elapsed / 1h 39s left ) -> ETA: Sat May 14 10:44
2100 events processed
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
etc etc etc

Edit: My mistake, it's the same job, but not the same project -- Summer Challenge 2015 and not vLHCathome-dev, but in the end the jobs are coming from the same pool.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
It seems the same job is run numerous times, one after the other. Exact same parameters, exact same log length. Up to 12 times. Is that deliberate?

Prepare Rivet parameters ...

Same across even different tasks.
Joined: 13 Feb 15 Posts: 1188 Credit: 878,593 RAC: 97
===> [runRivet] Mon May 16 11:50:59 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - sherpa 2.1.0 default 90000 264]
...
...
18.5146 pb +- ( 0.0356486 pb = 0.192543 % ) 300000 ( 631214 -> 53.3 % ) integration time: ( 7m 56s (7m 31s) elapsed / 16s (15s) left ) [12:01:08]
18.5087 pb +- ( 0.034976 pb = 0.188971 % ) 310000 ( 650036 -> 53.2 % ) integration time: ( 8m 11s (7m 45s) elapsed / 0s (0s) left ) [12:01:23]
2_3__j__j__e-__e+__j : 18.5087 pb +- ( 0.034976 pb = 0.188971 % ) exp. eff: 1.15398 %
reduce max for 2_3__j__j__e-__e+__j to 0.522748 ( eps = 0.001 )
Process_Group::CalculateTotalXSec(): Calculate xs for '2_4__j__j__e-__e+__j__j' (Comix)
Starting the calculation at 12:01:24. Lean back and enjoy ... .
Updating display...
Display update finished (0 histograms, 0 events).
Exception_Handler::GenerateStackTrace(..): Generating stack trace { }
Exception_Handler::SignalHandler: Signal (6) caught. Cannot continue.
Exception_Handler::GenerateStackTrace(..): Generating stack trace { }

The above exception appears 32 times, and then only:

Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
etc etc etc
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
I have the same type of job. The CPU load is very high, >2.29 (15-minute average), and that is with 2 running cores. With only one core the load must be EXTREMELY high.
EDIT: It appears to run the exact same job in 3 tasks simultaneously.
Joined: 12 Sep 14 Posts: 65 Credit: 544 RAC: 0
> It seems the same job is run numerous times, one after the other.

We are looking at this. It is probably a Condor rerun when a job terminates with a nonzero exit code, and looping would cause that.
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 0
> Updating display...

Challenge task: this morning I had the same error for about 10 hours of running, also a Sherpa; other Sherpa jobs are running well. In the Challenge there is the DataBridge and no Condor. After destroying this task, everything has been running well with other tasks up to now.
Edit: This was from midnight (German time) last night until 10 in the morning.
Joined: 4 Mar 16 Posts: 31 Credit: 44,320 RAC: 0
Dear all,
the problem of running the same job several times should be related to the transition between manual and automatic job submission/retrieval. Please let us know if you still see this happening.
Thank you,
Leonardo
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
It is still happening. I had 2 jobs in a row with identical parameters.
EDIT: And I have two separate tasks running the same job simultaneously.
Joined: 12 Sep 14 Posts: 1070 Credit: 334,882 RAC: 0
Our working hypothesis is that jobs which get into endless loops are killed when the VM reaches the 18-hour time limit. Condor sees this as a failed job and resubmits it up to 5 times. We will disable this feature and let MCPlots handle it.
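For illustration only, a minimal Python sketch of the resubmission behaviour described above: an attempt that is killed at the wall-clock limit comes back with a nonzero exit code, and the scheduler then retries it up to a fixed number of times unless that automatic retry is switched off. The command, the 18-hour limit and the helper names are assumptions for the sketch, not the project's actual Condor configuration.

```python
import subprocess

MAX_RETRIES = 5            # resubmission limit described above (assumed)
AUTO_RESUBMIT = True       # False = report the failure upward (e.g. to MCPlots)
                           # instead of rerunning the job locally

def run_attempt(cmd, wall_limit_s):
    """Run one job attempt under a wall-clock limit and return its exit code."""
    try:
        return subprocess.run(cmd, timeout=wall_limit_s).returncode
    except subprocess.TimeoutExpired:
        # A job stuck in an endless loop is killed at the time limit and then
        # looks just like any other failed (nonzero-exit) job to the scheduler.
        return 1

def schedule(cmd, wall_limit_s=18 * 3600):
    attempts = (MAX_RETRIES + 1) if AUTO_RESUBMIT else 1
    rc = 1
    for attempt in range(1, attempts + 1):
        rc = run_attempt(cmd, wall_limit_s)
        if rc == 0:
            break
        print(f"attempt {attempt} failed with exit code {rc}, "
              f"{attempts - attempt} retries left")
    return rc

# Hypothetical usage; the script name and arguments are placeholders.
# schedule(["./runRivet.sh", "boinc", "ppbar", "z", "1960"])
```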
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
I had a job running 11 times in a row. |
Joined: 4 Mar 16 Posts: 31 Credit: 44,320 RAC: 0
Hi Rasputin, can you please retrieve the job executable name from any of them? It should be something like "2016-5xxxxx-2xx.tgz". |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Where would that be? I checked all logs for ".tgz" and found nothing.
Joined: 4 Mar 16 Posts: 31 Credit: 44,320 RAC: 0
Sorry, my typo: I meant ".run", not ".tgz". By the way, you can find the three numbers as revision=2016, runid=5xxxxx and seed=2xx in some output file.
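If it helps to locate those identifiers, here is a small Python sketch that scans a directory of output files for the revision=, runid= and seed= key/value pairs and for the [runRivet] start line that encodes the job parameters. The "shared" log directory and the "*.log" glob are assumptions and will differ per installation.

```python
import re
from pathlib import Path

# Assumed location of the task's output/log files; adjust to wherever
# the runRivet output actually lives on your machine.
LOG_DIR = Path("shared")

# Key/value pairs mentioned above, plus the runRivet start line.
PATTERN = re.compile(r"(revision=\S+|runid=\S+|seed=\S+|===> \[runRivet\].*)")

for log in sorted(LOG_DIR.glob("*.log")):
    for line in log.read_text(errors="replace").splitlines():
        match = PATTERN.search(line)
        if match:
            print(f"{log.name}: {match.group(1)}")
```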
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
There is no "revision=" and no "runid=". There is a "seed=" and a "version=". I have two separate tasks running the same job at the moment.
EDIT: bad example, the tune is different.
Input parameters:
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
EDIT: It appears that the log files of two BOINC tasks started at the same time are virtually identical, with just slight differences in the time stamps.

Better examples. Two jobs run on two different tasks at nearly the same time:

===> [runRivet] Tue May 17 14:28:20 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.428 363 100000 267]
===> [runRivet] Tue May 17 14:28:15 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.428 363 100000 267]

===> [runRivet] Tue May 17 16:37:50 CEST 2016 [boinc ppbar z 1800 -,-,50,130 - pythia6 6.428 394 100000 267]
===> [runRivet] Tue May 17 16:38:43 CEST 2016 [boinc ppbar z 1800 -,-,50,130 - pythia6 6.428 394 100000 267]
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Here are the log file indexes of two DIFFERENT tasks:
[DIR] Parent Directory -
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
EDIT: (It looks to me as if there is a certain time window during which a user/computer requesting jobs gets only one specific job, regardless of whether it is a different task or even a different computer, as long as the request falls within that window? I cannot test this with different computers, as I have only one.)

Here are the starting lines of two jobs of the SAME TASK:

===> [runRivet] Tue May 17 18:03:43 CEST 2016 [boinc ppbar uemb-soft 63 - - pythia8 8.212 tune-AU2ct10 100000 267]
===> [runRivet] Tue May 17 17:56:21 CEST 2016 [boinc ppbar uemb-soft 63 - - pythia8 8.212 tune-AU2ct10 100000 267]
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
I am stopping computing for this sub-project until some things are fixed. I now have the same job running that I had yesterday, with the same MCPlots id number and parameters but a different Condor job id number. I really do not want to calculate the same job over and over.

This job appears to be mixing two actions together:

===> [runRivet] Tue May 17 16:12:35 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - sherpa 2.1.0 default 90000 264]
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
2.44551 pb +- ( 0.0399697 pb = 1.63441 % ) 25000 ( 98162 -> 33.4 % ) full optimization: ( 8m (9m 41s) elapsed / 1h 31m 17s (1h 50m 28s) left ) [16:36:37]
Updating display...
Display update finished (0 histograms, 0 events).
2.42937 pb +- ( 0.03419 pb = 1.40736 % ) 30000 ( 112822 -> 34.1 % ) full optimization: ( 9m 32s (11m 34s) elapsed / 1h 29m 3s (1h 47m 54s) left ) [16:38:09]
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
2.41093 pb +- ( 0.0253617 pb = 1.05195 % ) 40000 ( 141922 -> 34.3 % ) full optimization: ( 12m 45s (15m 24s) elapsed / 1h 26m 6s (1h 43m 54s) left ) [16:41:21]
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
2.3967 pb +- ( 0.0206849 pb = 0.863057 % ) 50000 ( 172547 -> 32.6 % ) full optimization: ( 16m 6s (19m 16s) elapsed / 1h 23m 43s (1h 40m 12s) left ) [16:44:42]
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).