Thread 'Endless Theory job'

Author	Message
Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1256 Credit: 1,012,612 RAC: 156	Message 3347 - Posted: 13 May 2016, 16:10:56 UTC
===> [runRivet] Fri May 13 16:43:28 CEST 2016 [boinc ee zhad 133 - - sherpa 1.2.3 default 100000 220]
...
...
1900 events processed
Phase_Space_Handler::OneEvent(): Point for '2_4__e-__e+__u__c__ub__cb' exceeds maximum by 0.343478.
Event 2000 ( 1m 1s elapsed / 50m 31s left ) -> ETA: Fri May 13 17:57
2000 events processed
dumping histograms...
Event 2100 ( 1m 5s elapsed / 51m 9s left ) -> ETA: Fri May 13 17:58
2100 events processed
Updating display... Display update finished (55 histograms, 2000 events).
Updating display... Display update finished (55 histograms, 2000 events).
Updating display... Display update finished (55 histograms, 2000 events).
Updating display... Display update finished (55 histograms, 2000 events).
etc etc etc

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 42	Message 3356 - Posted: 14 May 2016, 5:38:58 UTC
This job has been running for 7h now and does not seem to be going anywhere.
===> [runRivet] Sat May 14 00:34:57 CEST 2016 [boinc ppbar uemb-soft 63 - - sherpa 2.1.1 default 1000 256]
00:34:57 +0200 2016-05-14 [INFO] New Job Starting
00:34:57 +0200 2016-05-14 [INFO] Condor JobID: 309701
00:35:02 +0200 2016-05-14 [INFO] MCPlots JobID: 30260365
Output_Phase::Output_Phase(): Set output interval 1000000000 events.
----------------------------------------------------------
-- SHERPA generates events with the following structure --
----------------------------------------------------------
Perturbative : Signal_Processes
Perturbative : Hard_Decays
Perturbative : Jet_Evolution:CSS
Perturbative : Lepton_FS_QED_Corrections:Photons
Perturbative : Multiple_Interactions:Amisic
Perturbative : Minimum_Bias:Off
Hadronization : Beam_Remnants
Hadronization : Hadronization:Ahadic
Hadronization : Hadron_Decays
Analysis : HepMC2
Updating display... Display update finished (0 histograms, 0 events).
Updating display... Display update finished (0 histograms, 0 events).
Updating display... Display update finished (0 histograms, 0 events).
Updating display... Display update finished (0 histograms, 0 events).
. . . about 400 of the same
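The symptom in the two logs above (hundreds of "Display update finished" lines with an event count that never advances) could be checked for automatically. The following is a minimal Python sketch, not part of the project's tooling: the function names and the threshold of 100 repeated updates are assumptions chosen for illustration, and the regular expression is based only on the log lines quoted in this thread.

```python
import re

# Matches lines such as "Display update finished (55 histograms, 2000 events)."
UPDATE_RE = re.compile(r"Display update finished \((\d+) histograms, (\d+) events\)")


def count_trailing_stalled_updates(log_text):
    """Count how many of the most recent display updates report the same event count."""
    counts = [int(match.group(2)) for match in UPDATE_RE.finditer(log_text)]
    if not counts:
        return 0
    last = counts[-1]
    stalled = 0
    # Walk backwards until the event count differs from the latest one.
    for value in reversed(counts):
        if value != last:
            break
        stalled += 1
    return stalled


def looks_stalled(log_text, threshold=100):
    """Hypothetical rule of thumb: flag a job after `threshold` unchanged updates."""
    return count_trailing_stalled_updates(log_text) >= threshold
```

A job that is still integrating (as Sherpa does before generating events) also reports 0 events for a while, so a real check would need a threshold long enough to cover that phase.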

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1256 Credit: 1,012,612 RAC: 156	Message 3357 - Posted: 14 May 2016, 7:52:49 UTC Last modified: 14 May 2016, 8:09:38 UTC
I got the same endless job again:
===> [runRivet] Sat May 14 09:21:16 CEST 2016 [boinc ee zhad 133 - - sherpa 1.2.3 default 100000 220]
...
...
Event 2000 ( 1m 13s elapsed / 1h 8s left ) -> ETA: Sat May 14 10:44
2000 events processed
dumping histograms...
Event 2100 ( 1m 18s elapsed / 1h 39s left ) -> ETA: Sat May 14 10:44
2100 events processed
Updating display... Display update finished (55 histograms, 2000 events).
Updating display... Display update finished (55 histograms, 2000 events).
Updating display... Display update finished (55 histograms, 2000 events).
Updating display... Display update finished (55 histograms, 2000 events).
Updating display...
etc etc etc
Edit: My mistake, it's the same job but not the same project: this was Summer Challenge 2015, not vLHCathome-dev. In the end, though, the jobs are coming from the same pool.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 42	Message 3364 - Posted: 15 May 2016, 20:59:01 UTC Last modified: 15 May 2016, 21:21:45 UTC
It seems the same job is run numerous times, one after the other. Exact same parameters, exact same log length. Up to 12 times. Is that deliberate?
Prepare Rivet parameters ...
analysesNames=CDF_2000_S4155203
Same across even different tasks.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1256 Credit: 1,012,612 RAC: 156	Message 3365 - Posted: 16 May 2016, 10:38:10 UTC
===> [runRivet] Mon May 16 11:50:59 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - sherpa 2.1.0 default 90000 264]
...
...
18.5146 pb +- ( 0.0356486 pb = 0.192543 % ) 300000 ( 631214 -> 53.3 % )
integration time: ( 7m 56s (7m 31s) elapsed / 16s (15s) left ) [12:01:08]
18.5087 pb +- ( 0.034976 pb = 0.188971 % ) 310000 ( 650036 -> 53.2 % )
integration time: ( 8m 11s (7m 45s) elapsed / 0s (0s) left ) [12:01:23]
2_3__j__j__e-__e+__j : 18.5087 pb +- ( 0.034976 pb = 0.188971 % ) exp. eff: 1.15398 %
reduce max for 2_3__j__j__e-__e+__j to 0.522748 ( eps = 0.001 )
Process_Group::CalculateTotalXSec(): Calculate xs for '2_4__j__j__e-__e+__j__j' (Comix)
Starting the calculation at 12:01:24. Lean back and enjoy ... .
Updating display... Display update finished (0 histograms, 0 events).
Exception_Handler::GenerateStackTrace(..): Generating stack trace { }
Exception_Handler::SignalHandler: Signal (6) caught. Cannot continue.
Exception_Handler::GenerateStackTrace(..): Generating stack trace { }
32 times the above Exception and then only
Updating display... Display update finished (0 histograms, 0 events).
Updating display... Display update finished (0 histograms, 0 events).
Updating display... Display update finished (0 histograms, 0 events).
etc etc etc

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 42	Message 3366 - Posted: 16 May 2016, 10:56:18 UTC - in response to Message 3365. Last modified: 16 May 2016, 11:03:51 UTC
I have the same type of job. The CPU load is very high, >2.29 (15 min average), and that is with 2 running cores. With only one core the load must be EXTREMELY high.
EDIT: It appears to run the exact same job on 3 tasks simultaneously.

Ben Segal Volunteer moderator Volunteer developer Volunteer tester Joined: 12 Sep 14 Posts: 65 Credit: 544 RAC: 0	Message 3367 - Posted: 16 May 2016, 12:40:58 UTC - in response to Message 3364.
"It seems the same job is run numerous times, one after the other. Exact same parameters, exact same log length. Up to 12 times. Is that deliberate? Prepare Rivet parameters ... analysesNames=CDF_2000_S4155203 Same across even different tasks."
We are looking at this. It is probably a Condor rerun, which happens when a job terminates with a nonzero code, and looping would do that.

maeax Joined: 22 Apr 16 Posts: 782 Credit: 4,057,880 RAC: 44	Message 3368 - Posted: 16 May 2016, 19:04:12 UTC - in response to Message 3357. Last modified: 16 May 2016, 19:11:51 UTC
"Updating display... Display update finished (55 histograms, 2000 events). Updating display... Display update finished (55 histograms, 2000 events). Updating display... etc etc etc"
I had the same error this morning in a Challenge task that had been running for about 10 hours. It was also a sherpa job; other sherpa jobs are running well. In the Challenge there is databridge and no condor. Since I destroyed this task, other tasks have run well so far.
Edit: This was from midnight (German time) last night until 10 o'clock in the morning.

Leonardo Cristella Joined: 4 Mar 16 Posts: 31 Credit: 44,320 RAC: 0	Message 3369 - Posted: 16 May 2016, 22:48:31 UTC - in response to Message 3367.
Dear all, the problem of running the same job several times should be related to the transition between manual and automatic job submission/retrieval. Please let us know if you still see this happening. Thank you, Leonardo

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 42	Message 3371 - Posted: 17 May 2016, 12:24:36 UTC - in response to Message 3369. Last modified: 17 May 2016, 12:57:32 UTC
It is still happening. I had 2 jobs in a row with identical parameters.
EDIT: And I have two separate tasks running the same job simultaneously.

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1150 Credit: 342,328 RAC: 0	Message 3372 - Posted: 17 May 2016, 12:55:07 UTC - in response to Message 3371.
Our working hypothesis is that jobs stuck in endless loops are killed when the VM reaches the 18h time limit. Condor sees this as a failed job and then resubmits it up to 5 times. We will disable this feature and let MCPlots handle it.
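The resubmission behaviour in this working hypothesis can be illustrated with a toy Python model. It is a sketch only and does not use or reproduce Condor's actual mechanism or configuration: `run_with_resubmission`, `looping_job` and `max_resubmits` are hypothetical names, and the limit of 5 is taken from the post above.

```python
def run_with_resubmission(run_job, max_resubmits=5):
    """Run a job, resubmitting after each nonzero exit code.

    Returns the list of exit codes, one per attempt.
    """
    exit_codes = []
    # One initial attempt plus at most `max_resubmits` further attempts.
    for _ in range(1 + max_resubmits):
        code = run_job()
        exit_codes.append(code)
        if code == 0:
            break  # success: no further resubmission
    return exit_codes


def looping_job():
    """Stand-in for a job killed at the VM time limit: it always reports failure."""
    return 1


attempts = run_with_resubmission(looping_job)
print(len(attempts), "attempts:", attempts)
```

Under this model a job that never succeeds runs 6 times in total (one initial attempt plus 5 resubmissions). That is fewer than the "up to 12 times" reported in Message 3364, so a retry limit of this kind may not account for every repeat on its own.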

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 42	Message 3373 - Posted: 17 May 2016, 12:58:24 UTC - in response to Message 3372.
I had a job running 11 times in a row.

Leonardo Cristella Joined: 4 Mar 16 Posts: 31 Credit: 44,320 RAC: 0	Message 3374 - Posted: 17 May 2016, 13:05:54 UTC - in response to Message 3373.
Hi Rasputin, can you please retrieve the job executable name from any of those runs? It should be something like "2016-5xxxxx-2xx.tgz".

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 42	Message 3375 - Posted: 17 May 2016, 14:17:32 UTC - in response to Message 3374.
Where would that be? I checked all logs for ".tgz" and found nothing.

Leonardo Cristella Joined: 4 Mar 16 Posts: 31 Credit: 44,320 RAC: 0	Message 3376 - Posted: 17 May 2016, 14:36:00 UTC - in response to Message 3375.
Sorry, my typo: I meant ".run", not ".tgz". By the way, you can find the three numbers as revision=2016, runid=5xxxxx, seed=2xx in some output file.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 42	Message 3377 - Posted: 17 May 2016, 15:26:40 UTC - in response to Message 3376. Last modified: 17 May 2016, 15:37:21 UTC
There is no "revision=" and no "runid=". There is a "seed=" and a "version=". I have two separate tasks running the same job at the moment.
EDIT: Bad example, the tune is different.
Input parameters:
mode=boinc beam=ppbar process=z energy=1800 params=-,-,50,130 specific=- generator=pythia6 version=6.428 tune=394 nevts=100000 seed=267
Prepare Rivet parameters ...
analysesNames=CDF_2000_S4155203
======================================================================
Input parameters:
mode=boinc beam=ppbar process=z energy=1800 params=-,-,50,130 specific=- generator=pythia6 version=6.428 tune=352 nevts=100000 seed=267
Prepare Rivet parameters ...
analysesNames=CDF_2000_S4155203
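Spotting a single differing field such as the tune is easier with a field-by-field comparison. Below is a small Python sketch that parses the two "Input parameters" blocks quoted above and reports only the keys that differ. The helper names `parse_parameters` and `diff_parameters` are made up for this illustration, and the parsing assumes whitespace-separated key=value tokens as they appear in this post.

```python
def parse_parameters(text):
    """Parse whitespace-separated key=value tokens into a dict."""
    params = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep and key:
            params[key] = value
    return params


def diff_parameters(a, b):
    """Return {key: (value_a, value_b)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


job_a = parse_parameters(
    "mode=boinc beam=ppbar process=z energy=1800 params=-,-,50,130 specific=- "
    "generator=pythia6 version=6.428 tune=394 nevts=100000 seed=267"
)
job_b = parse_parameters(
    "mode=boinc beam=ppbar process=z energy=1800 params=-,-,50,130 specific=- "
    "generator=pythia6 version=6.428 tune=352 nevts=100000 seed=267"
)
print(diff_parameters(job_a, job_b))  # {'tune': ('394', '352')}
```

An empty result would mean the two jobs were started with identical parameters, including the seed.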

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 42	Message 3378 - Posted: 17 May 2016, 15:49:42 UTC Last modified: 17 May 2016, 15:58:27 UTC
EDIT: It appears that the log files of two BOINC tasks started at the same time are virtually identical, with just slight differences in time stamps.
Better examples. Two jobs run on two different tasks at nearly the same time:
===> [runRivet] Tue May 17 14:28:20 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.428 363 100000 267]
===> [runRivet] Tue May 17 14:28:15 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.428 363 100000 267]
===> [runRivet] Tue May 17 16:37:50 CEST 2016 [boinc ppbar z 1800 -,-,50,130 - pythia6 6.428 394 100000 267]
===> [runRivet] Tue May 17 16:38:43 CEST 2016 [boinc ppbar z 1800 -,-,50,130 - pythia6 6.428 394 100000 267]
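Duplicates like these can be found by grouping the "===> [runRivet]" header lines by their bracketed parameter string. The Python sketch below does this for any concatenation of log text. It is an illustration under assumptions: the regular expression is derived only from the header lines shown in this thread, and `find_duplicate_jobs` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Captures the timestamp and the bracketed parameter list of a runRivet header.
HEADER_RE = re.compile(r"===> \[runRivet\] ([^\[]+?) \[([^\]]*)\]")


def find_duplicate_jobs(log_text):
    """Map each parameter string seen more than once to its list of start timestamps."""
    starts = defaultdict(list)
    for timestamp, params in HEADER_RE.findall(log_text):
        starts[params].append(timestamp)
    return {params: stamps for params, stamps in starts.items() if len(stamps) > 1}


logs = """
===> [runRivet] Tue May 17 14:28:20 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.428 363 100000 267]
===> [runRivet] Tue May 17 14:28:15 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.428 363 100000 267]
===> [runRivet] Tue May 17 16:37:50 CEST 2016 [boinc ppbar z 1800 -,-,50,130 - pythia6 6.428 394 100000 267]
===> [runRivet] Tue May 17 16:38:43 CEST 2016 [boinc ppbar z 1800 -,-,50,130 - pythia6 6.428 394 100000 267]
"""
for params, stamps in find_duplicate_jobs(logs).items():
    print(params, "->", stamps)
```

Because the last field of the parameter string is the seed, two headers with the same string indicate the same job rather than two independent runs of the same configuration.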

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 42	Message 3379 - Posted: 17 May 2016, 16:08:06 UTC
Here are the log file indexes of two DIFFERENT tasks:

[DIR] Parent Directory                     -
[ ]   MasterLog       17-May-2016 15:13  1.9K
[ ]   StartLog        17-May-2016 18:03  6.8K
[ ]   StarterLog      17-May-2016 18:03   24K
[ ]   finished_1.log  17-May-2016 14:28   45K
[ ]   finished_2.log  17-May-2016 15:23   57K
[ ]   finished_3.log  17-May-2016 16:05   54K
[ ]   finished_4.log  17-May-2016 16:37   44K
[ ]   finished_5.log  17-May-2016 17:14   58K
[ ]   finished_6.log  17-May-2016 17:56   57K
[ ]   finished_7.log  17-May-2016 18:03   70K
[ ]   running.log     17-May-2016 18:05   40K
[ ]   stderr.log      17-May-2016 18:03   630
[TXT] stdout.log      17-May-2016 18:03  1.8K

[DIR] Parent Directory                     -
[ ]   MasterLog       17-May-2016 15:13  1.9K
[ ]   StartLog        17-May-2016 18:04  6.8K
[ ]   StarterLog      17-May-2016 18:04   24K
[ ]   finished_1.log  17-May-2016 14:28   45K
[ ]   finished_2.log  17-May-2016 15:24   57K
[ ]   finished_3.log  17-May-2016 16:06   54K
[ ]   finished_4.log  17-May-2016 16:38   44K
[ ]   finished_5.log  17-May-2016 17:14   58K
[ ]   finished_6.log  17-May-2016 17:57   57K
[ ]   finished_7.log  17-May-2016 18:04   70K
[ ]   running.log     17-May-2016 18:05   34K
[ ]   stderr.log      17-May-2016 18:04   630
[TXT] stdout.log      17-May-2016 18:05  2.0K

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 42	Message 3380 - Posted: 17 May 2016, 16:20:23 UTC Last modified: 17 May 2016, 16:33:16 UTC
EDIT: (It looks to me as if there is a certain time window in which a USER/COMPUTER requesting jobs gets only one specific job, regardless of whether the request comes from a different task or even a different computer. I cannot test this with different computers, as I have only one.)
Here are the starting lines of two jobs of the SAME TASK:
===> [runRivet] Tue May 17 18:03:43 CEST 2016 [boinc ppbar uemb-soft 63 - - pythia8 8.212 tune-AU2ct10 100000 267]
===> [runRivet] Tue May 17 17:56:21 CEST 2016 [boinc ppbar uemb-soft 63 - - pythia8 8.212 tune-AU2ct10 100000 267]

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 42	Message 3381 - Posted: 17 May 2016, 17:19:49 UTC Last modified: 17 May 2016, 17:20:17 UTC
I am stopping computing for this sub-project until some things are fixed. I now have the same job running that I had yesterday, with the same MCPlots ID number and parameters but a different Condor job ID number. I really do not want to calculate the same job over and over.
This job appears to be mixing two actions together:
===> [runRivet] Tue May 17 16:12:35 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - sherpa 2.1.0 default 90000 264]
Display update finished (0 histograms, 0 events).
Updating display... Display update finished (0 histograms, 0 events).
2.44551 pb +- ( 0.0399697 pb = 1.63441 % ) 25000 ( 98162 -> 33.4 % )
full optimization: ( 8m (9m 41s) elapsed / 1h 31m 17s (1h 50m 28s) left ) [16:36:37]
Updating display... Display update finished (0 histograms, 0 events).
2.42937 pb +- ( 0.03419 pb = 1.40736 % ) 30000 ( 112822 -> 34.1 % )
full optimization: ( 9m 32s (11m 34s) elapsed / 1h 29m 3s (1h 47m 54s) left ) [16:38:09]
Updating display... Display update finished (0 histograms, 0 events).
Updating display... Display update finished (0 histograms, 0 events).
Updating display... Display update finished (0 histograms, 0 events).
2.41093 pb +- ( 0.0253617 pb = 1.05195 % ) 40000 ( 141922 -> 34.3 % )
full optimization: ( 12m 45s (15m 24s) elapsed / 1h 26m 6s (1h 43m 54s) left ) [16:41:21]
Updating display... Display update finished (0 histograms, 0 events).
Updating display... Display update finished (0 histograms, 0 events).
Updating display... Display update finished (0 histograms, 0 events).
2.3967 pb +- ( 0.0206849 pb = 0.863057 % ) 50000 ( 172547 -> 32.6 % )
full optimization: ( 16m 6s (19m 16s) elapsed / 1h 23m 43s (1h 40m 12s) left ) [16:44:42]
Updating display... Display update finished (0 histograms, 0 events).
Updating display... Display update finished (0 histograms, 0 events).
Updating display... Display update finished (0 histograms, 0 events).

Development for LHC@home