Message boards : Theory Application : Endless Theory job
Joined: 22 Apr 16 · Posts: 677 · Credit: 2,002,766 · RAC: 0
I have the same looping sherpa-task in the Challenge as days before.

databridge-client.log:

[20/05/2016 09:29:31] ERROR: The queue file is gone
[20/05/2016 09:29:31] INFO: A recoverable error occured, will retry in a second
[20/05/2016 09:29:32] INFO: Fetching next job in queue
[20/05/2016 09:29:35] INFO: Starting job 2085677b-a46a-4aa0-aff6-9f464eef81b9

Challenge job.out:

test4theory-bd2e19ec-c251-4b35-b5ac-1ed8303a0d8b
===> [runRivet] Fri May 20 09:29:35 CEST 2016 [boinc ee zhad 133 - - sherpa 1.2.3 default 100000 220]
Setting environment...
MCGENERATORS=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_lcgcmt65
gcc = /cvmfs/sft.cern.ch/lcg/external/gcc/4.7.2/x86_64-slc6-gcc47-opt/bin/gcc
gcc version = 4.7.2
RIVET=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_lcgcmt65/rivet/2.4.0/x86_64-slc6-gcc47-opt
RIVET_REF_PATH=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_lcgcmt65/rivet/2.4.0/x86_64-slc6-gcc47-opt/share/Rivet
RIVET_ANALYSIS_PATH=/tmp/tmp.HbQUR6JHYi/analyses
ROOTSYS=/cvmfs/sft.cern.ch/lcg/app/releases/ROOT/5.34.19/x86_64-slc6-gcc47-opt/root
Input parameters:
mode=boinc beam=ee process=zhad energy=133 params=- specific=- generator=sherpa version=1.2.3 tune=default nevts=100000 seed=220
Prepare temporary directories and files ...
workd=/tmp/tmp.HbQUR6JHYi
tmpd=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X
tmp_params=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X/generator.params
tmp_hepmc=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X/generator.hepmc
tmp_yoda=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X/generator.yoda
tmp_jobs=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X/jobs.log
tmpd_flat=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X/flat
tmpd_dump=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X/dump
tmpd_html=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X/html
Event 1900 ( 1m 32s elapsed / 1h 19m 44s left ) -> ETA: Fri May 20 11:23
1900 events processed
Phase_Space_Handler::OneEvent(): Point for '2_4__e-__e+__u__c__ub__cb' exceeds maximum by 0.343478.
Event 2000 ( 1m 37s elapsed / 1h 19m 39s left ) -> ETA: Fri May 20 11:23
2000 events processed
dumping histograms...
Event 2100 ( 1m 40s elapsed / 1h 18m 20s left ) -> ETA: Fri May 20 11:22
2100 events processed
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
Display update finished (55 histograms, 2000 events).

I gave it a "destroy" to get a new task.

Edit: I saw that the time in the "Event 1900" and "Event 2000" lines is in the future!
Joined: 13 Feb 15 · Posts: 1188 · Credit: 878,593 · RAC: 72
> [boinc ee zhad 133 - - sherpa 1.2.3 default 100000 220]

I saw that job at least ten times. I then reset the VM to get a new job, but that job refuses to leave the queue.

> Edit: I saw that the time in the "Event 1900" and "Event 2000" lines is in the future!

Yeah, of course that date-time group is in the future, because it is the 'ETA', which means Estimated Time of Arrival (or of completion, here).
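To illustrate: the ETA is simply the current time plus the estimated time remaining, so it always lies in the future while the job is still running. A minimal sketch of the arithmetic in Python (the "current" time below is hypothetical, picked only to match the format of the log lines quoted above; this is not code from runRivet):

from datetime import datetime, timedelta

# ETA = current time + estimated time remaining, so it is always
# in the future while the job is still running.
now = datetime(2016, 5, 20, 10, 3)          # hypothetical "current" time
time_left = timedelta(hours=1, minutes=20)  # cf. "1h 19m 44s left", rounded
print("ETA:", (now + time_left).strftime("%a %b %d %H:%M"))
# -> ETA: Fri May 20 11:23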
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0
Duplicate jobs are happening again. The first two numbers show "slot number, finished_x_log number".

EDIT: Looks like only jobs with "seed=268" are affected, as they were 2 days ago.

3,1 ===> [runRivet] Fri May 20 22:40:41 CEST 2016 [boinc pp uemb-hard 200 4 - pythia6 6.428 a 100000 273]
3,2 ===> [runRivet] Fri May 20 22:57:03 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.212 tune-1 100000 273]
3,3 ===> [runRivet] Sat May 21 00:11:38 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.205 tune-A14-CTEQL1 100000 268]
3,4 ===> [runRivet] Sat May 21 01:24:02 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.205 tune-A14-CTEQL1 100000 268]
3,5 ===> [runRivet] Sat May 21 02:32:15 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.428 375 100000 268]
3,6 ===> [runRivet] Sat May 21 03:15:21 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.428 396 100000 268] --------------
3,7 ===> [runRivet] Sat May 21 03:52:34 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.425 d6t 100000 268]
3,8 ===> [runRivet] Sat May 21 04:29:22 CEST 2016 [boinc ppbar z 1800 -,-,50,130 - pythia8 8.183 default 100000 268]
3,9 ===> [runRivet] Sat May 21 05:20:22 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.428 396 100000 268] --------------
3,10 ===> [runRivet] Sat May 21 05:57:19 CEST 2016 [boinc ppbar uemb-soft 900 - - pythia6 6.427 379 100000 268]
3,11 ===> [runRivet] Sat May 21 06:14:19 CEST 2016 [boinc ppbar z 1800 -,-,50,130 - pythia8 8.205 tune-AU2lox 100000 268]
3,12 ===> [runRivet] Sat May 21 07:19:05 CEST 2016 [boinc ppbar z 1800 -,-,50,130 - pythia8 8.183 default 100000 268]
3,13 ===> [runRivet] Sat May 21 08:10:47 CEST 2016 [boinc ppbar z 1800 -,-,50,130 - pythia8 8.183 default 100000 268]
3,14 ===> [runRivet] Sat May 21 09:03:13 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.425 dw 100000 268]
3,15 ===> [runRivet] Sat May 21 09:48:07 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.212 tune-monashstar 100000 268]
Joined: 13 Feb 15 · Posts: 1188 · Credit: 878,593 · RAC: 72
> Duplicate jobs are happening again.

This thread is called 'Endless Theory job', but it seems your 'duplicates' are all finishing normally, is that right? If so, it may be better to create a new thread about the duplicate job issue.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0
True, but on the other hand, a job that is run again and again and again is endless, isn't it? |
Joined: 4 Mar 16 · Posts: 31 · Credit: 44,320 · RAC: 0
I agree with Crystal, but all the information is in this thread now, and hopefully the issue is going to be solved.
Joined: 4 Mar 16 · Posts: 31 · Credit: 44,320 · RAC: 0
On my side I see at most 26 "fresh" jobs (Condor JobID > 316094) out of 77 total jobs running. Maybe you got some "old" jobs (Condor JobID < 316094) which were affected by the duplicate issue; can you confirm?
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0
Apart from one, all the jobs I have done today have an ID < 316000. My MC plot count has actually decreased!
Joined: 4 Mar 16 · Posts: 31 · Credit: 44,320 · RAC: 0
OK, at least that is consistent. There are about 250 old jobs in the queue, but I just enabled the submission of new jobs with higher priority, so you should start to get them in 20 minutes or so.
Joined: 13 Feb 15 · Posts: 1188 · Credit: 878,593 · RAC: 72
An endless job in a multi-core VM is very annoying. To get rid of it, I have to kill all the other jobs too.

===> [runRivet] Sat Jul 30 11:17:39 CEST 2016 [boinc ppbar jets 1960 37 - sherpa 1.4.2 default 75000 394]
. . . .
12000 events processed
Event 12100 ( 11m 26s elapsed / 59m 29s left ) -> ETA: Sat Jul 30 13:42
12100 events processed
Event 12200 ( 11m 37s elapsed / 59m 49s left ) -> ETA: Sat Jul 30 13:42
12200 events processed
Updating display...
Display update finished (37 histograms, 12000 events).
Updating display...
Display update finished (37 histograms, 12000 events).
Updating display...
Display update finished (37 histograms, 12000 events).
Updating display...
Display update finished (37 histograms, 12000 events).
Updating display...
Display update finished (37 histograms, 12000 events).
Updating display...
Display update finished (37 histograms, 12000 events).
Updating display...
Display update finished (37 histograms, 12000 events).
etc etc etc
Joined: 12 Sep 14 · Posts: 1070 · Credit: 334,882 · RAC: 0
An improvement for this situation is on its way. I just have to find some time to implement it.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0
===> [runRivet] Sun Jul 31 00:07:55 CEST 2016 [boinc pp uemb-soft 53 - - sherpa 2.1.1 default 1000 394]
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
etc. etc. etc.

00:07:53 +0200 2016-07-31 [INFO] New Job Starting in slot4
00:07:54 +0200 2016-07-31 [INFO] Condor JobID: 1347271 in slot4
00:07:59 +0200 2016-07-31 [INFO] MCPlots JobID: 32135745 in slot4

>>>>>> run-time > 8h 30min

Here is a list of jobs likely to fail or never end:
http://mcplots-dev.cern.ch/production.php?view=runs&rev=2016&display=unsucc
Joined: 13 Feb 15 · Posts: 1188 · Credit: 878,593 · RAC: 72
> An improvement for this situation is on its way. I just have to find some time to implement it.

The suggested solution will not solve this. The endless job is using a core, not idling. So far there is nothing to detect this kind of looping except my human eye.
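There is a machine-readable symptom in the job.out excerpts quoted above, though: the event counter in the "Display update finished" lines stops advancing while the updates keep coming. A minimal sketch of a detector built on that idea (the stall threshold and the command-line usage are assumptions, not part of the actual Theory wrapper):

import re
import sys

# A job is treated as "looping" if the event counter has not advanced
# over the last N display updates; threshold and log format are based
# on the job.out excerpts quoted in this thread.
STALL_UPDATES = 20

def looks_stuck(log_text: str, stall_updates: int = STALL_UPDATES) -> bool:
    events = [int(m.group(1)) for m in re.finditer(
        r"Display update finished \(\d+ histograms, (\d+) events\)", log_text)]
    if len(events) < stall_updates:
        return False
    tail = events[-stall_updates:]
    return len(set(tail)) == 1  # counter frozen across all recent updates

if __name__ == "__main__":
    with open(sys.argv[1]) as f:  # e.g. the slot's job.out
        print("stuck" if looks_stuck(f.read()) else "progressing")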
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0
I think I have a way to calculate the best termination time past 12h and minimize the time lost:

- Take the time of last change from all the StarterLog.slotX files.
- Find the time from the oldest one to the present.
- Take the sum of all the remaining slots' times to the present.
- If the sum is greater than the oldest ---> terminate.

This check should be done every 5 minutes or so after 12h of run time.

EDIT: It also needs a minimum trigger time, like 2h or so, below which nothing is terminated.

What do you guys think? Something like the sketch below.
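A minimal sketch of that watchdog, assuming the trigger is meant to fire once the stalled (oldest) slot's idle time outweighs the other slots' combined time since their last change; the StarterLog path, the thresholds, and the shutdown hook are all assumptions:

import glob
import os
import time

CHECK_AFTER = 12 * 3600  # only consider termination past 12h of VM run time
MIN_TRIGGER = 2 * 3600   # minimum trigger time: ignore stalls shorter than 2h
CHECK_EVERY = 5 * 60     # re-check every 5 minutes

# Hypothetical location; adjust to wherever the VM keeps its starter logs.
LOG_GLOB = "/var/log/condor/StarterLog.slot*"

def should_terminate(vm_runtime: float) -> bool:
    """Decide whether terminating the whole VM now minimises the time lost."""
    if vm_runtime < CHECK_AFTER:
        return False
    now = time.time()
    ages = sorted(now - os.path.getmtime(p) for p in glob.glob(LOG_GLOB))
    if len(ages) < 2:
        return False  # nothing to compare on a single-core VM
    oldest, others = ages[-1], ages[:-1]
    # Fire once the stalled slot has wasted more time than the other slots
    # would lose by having their current jobs killed along with the VM.
    return oldest >= MIN_TRIGGER and oldest > sum(others)

if __name__ == "__main__":
    start = time.time()
    while True:
        if should_terminate(time.time() - start):
            print("terminating VM")  # hand off to the wrapper's shutdown here
            break
        time.sleep(CHECK_EVERY)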
Joined: 12 Sep 14 · Posts: 1070 · Credit: 334,882 · RAC: 0
This is kind of similar to what I already suggested. I just need time to implement it. It is already on the task tracker. |
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0
Yes, but it is more specific about how to actually do it, and it is geared towards a single job that runs excessively long. (That is what we were trying to address, wasn't it?)
Joined: 13 Apr 15 · Posts: 138 · Credit: 2,969,210 · RAC: 0
Unusual to have a looping Pythia job:

04:23:58 +0100 2016-08-11 [INFO] New Job Starting in slot2
04:23:58 +0100 2016-08-11 [INFO] Condor JobID: 1506145 in slot2
04:24:03 +0100 2016-08-11 [INFO] MCPlots JobID: 32185396 in slot2

Input parameters:
mode=boinc beam=ppbar process=uemb-soft energy=1960 params=- specific=- generator=pythia8 version=8.165 tune=default-MBR nevts=100000 seed=402

33200 events processed
33300 events processed
Display update finished (4 histograms, 33000 events).
33400 events processed
Updating display...
Display update finished (4 histograms, 33000 events).

I only just spotted the 17+ hr runtime, so I looked inside and found the looper, with the other core sitting idle for 8 hrs. There are only a few minutes left until it terminates naturally at 18 hrs, so I'll just let it do that rather than resetting the VM.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0
Is it looping? It seems to make progress. (It might just be veeery slow) |
Joined: 13 Apr 15 · Posts: 138 · Credit: 2,969,210 · RAC: 0
Yes. The

Updating display...
Display update finished (4 histograms, 33000 events).

repeated until the 18 hr cutoff. It's gone now and I haven't had that job back (yet). I thought it unusual, as I think I've only seen Sherpas looping before.
Joined: 13 Feb 15 · Posts: 1188 · Credit: 878,593 · RAC: 72
> Yes. The
> Updating display...
> Display update finished (4 histograms, 33000 events).
> repeated until the 18 hr cutoff.

But you had increasing "events processed" counts in between . . . The display update is done at fixed time intervals, even when the next 1000 events have not been reached yet.