Message boards : Theory Application : Endless Theory job
Joined: 22 Apr 16 · Posts: 677 · Credit: 2,002,766 · RAC: 0
I have the same looping sherpa-task in the Challenge as days before.

databridge-client.log:

[20/05/2016 09:29:31] ERROR: The queue file is gone
[20/05/2016 09:29:31] INFO: A recoverable error occured, will retry in a second
[20/05/2016 09:29:32] INFO: Fetching next job in queue
[20/05/2016 09:29:35] INFO: Starting job 2085677b-a46a-4aa0-aff6-9f464eef81b9

Challenge job.out:

test4theory-bd2e19ec-c251-4b35-b5ac-1ed8303a0d8b
===> [runRivet] Fri May 20 09:29:35 CEST 2016 [boinc ee zhad 133 - - sherpa 1.2.3 default 100000 220]
Setting environment...
MCGENERATORS=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_lcgcmt65
gcc = /cvmfs/sft.cern.ch/lcg/external/gcc/4.7.2/x86_64-slc6-gcc47-opt/bin/gcc
gcc version = 4.7.2
RIVET=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_lcgcmt65/rivet/2.4.0/x86_64-slc6-gcc47-opt
RIVET_REF_PATH=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_lcgcmt65/rivet/2.4.0/x86_64-slc6-gcc47-opt/share/Rivet
RIVET_ANALYSIS_PATH=/tmp/tmp.HbQUR6JHYi/analyses
ROOTSYS=/cvmfs/sft.cern.ch/lcg/app/releases/ROOT/5.34.19/x86_64-slc6-gcc47-opt/root
Input parameters:
mode=boinc beam=ee process=zhad energy=133 params=- specific=- generator=sherpa version=1.2.3 tune=default nevts=100000 seed=220
Prepare temporary directories and files ...
workd=/tmp/tmp.HbQUR6JHYi
tmpd=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X
tmp_params=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X/generator.params
tmp_hepmc=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X/generator.hepmc
tmp_yoda=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X/generator.yoda
tmp_jobs=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X/jobs.log
tmpd_flat=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X/flat
tmpd_dump=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X/dump
tmpd_html=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X/html
Event 1900 ( 1m 32s elapsed / 1h 19m 44s left ) -> ETA: Fri May 20 11:23
1900 events processed
Phase_Space_Handler::OneEvent(): Point for '2_4__e-__e+__u__c__ub__cb' exceeds maximum by 0.343478.
Event 2000 ( 1m 37s elapsed / 1h 19m 39s left ) -> ETA: Fri May 20 11:23
2000 events processed
dumping histograms...
Event 2100 ( 1m 40s elapsed / 1h 18m 20s left ) -> ETA: Fri May 20 11:22
2100 events processed
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
Display update finished (55 histograms, 2000 events).

I gave it a "destroy" to get a new task.

Edit: I saw that the time in the "Event 1900" and "Event 2000" lines is in the future!
Joined: 13 Feb 15 · Posts: 1188 · Credit: 878,593 · RAC: 72
> [boinc ee zhad 133 - - sherpa 1.2.3 default 100000 220]

I saw that job at least ten times. I then reset the VM to get a new job, but that job refuses to leave the queue.

> Edit: I saw that the time in the "Event 1900" and "Event 2000" lines is in the future!

Yeah, of course that date-time group is in the future, because it is the 'ETA', which means Estimated Time of Arrival (or of completion, here).
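To illustrate: the ETA is simply the current time plus the estimated time remaining, so it always lies in the future while the job is still running. A minimal sketch of the arithmetic in Python (the "current" time below is hypothetical, picked only to match the format of the log lines quoted above; this is not code from runRivet):

from datetime import datetime, timedelta

# ETA = current time + estimated time remaining, so it is always
# in the future while the job is still running.
now = datetime(2016, 5, 20, 10, 3)          # hypothetical "current" time
time_left = timedelta(hours=1, minutes=20)  # cf. "1h 19m 44s left", rounded
print("ETA:", (now + time_left).strftime("%a %b %d %H:%M"))
# -> ETA: Fri May 20 11:23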
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0
Duplicate jobs are happening again. The first two numbers show "slot number, finished_x_log number".

EDIT: Looks like only jobs with "seed=268" are affected, as they were 2 days ago.

3,1 ===> [runRivet] Fri May 20 22:40:41 CEST 2016 [boinc pp uemb-hard 200 4 - pythia6 6.428 a 100000 273]
3,2 ===> [runRivet] Fri May 20 22:57:03 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.212 tune-1 100000 273]
3,3 ===> [runRivet] Sat May 21 00:11:38 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.205 tune-A14-CTEQL1 100000 268]
3,4 ===> [runRivet] Sat May 21 01:24:02 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.205 tune-A14-CTEQL1 100000 268]
3,5 ===> [runRivet] Sat May 21 02:32:15 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.428 375 100000 268]
3,6 ===> [runRivet] Sat May 21 03:15:21 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.428 396 100000 268] --------------
3,7 ===> [runRivet] Sat May 21 03:52:34 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.425 d6t 100000 268]
3,8 ===> [runRivet] Sat May 21 04:29:22 CEST 2016 [boinc ppbar z 1800 -,-,50,130 - pythia8 8.183 default 100000 268]
3,9 ===> [runRivet] Sat May 21 05:20:22 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.428 396 100000 268] --------------
3,10 ===> [runRivet] Sat May 21 05:57:19 CEST 2016 [boinc ppbar uemb-soft 900 - - pythia6 6.427 379 100000 268]
3,11 ===> [runRivet] Sat May 21 06:14:19 CEST 2016 [boinc ppbar z 1800 -,-,50,130 - pythia8 8.205 tune-AU2lox 100000 268]
3,12 ===> [runRivet] Sat May 21 07:19:05 CEST 2016 [boinc ppbar z 1800 -,-,50,130 - pythia8 8.183 default 100000 268]
3,13 ===> [runRivet] Sat May 21 08:10:47 CEST 2016 [boinc ppbar z 1800 -,-,50,130 - pythia8 8.183 default 100000 268]
3,14 ===> [runRivet] Sat May 21 09:03:13 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.425 dw 100000 268]
3,15 ===> [runRivet] Sat May 21 09:48:07 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.212 tune-monashstar 100000 268]
Joined: 13 Feb 15 · Posts: 1188 · Credit: 878,593 · RAC: 72
> Duplicate jobs are happening again.

This thread is called 'Endless Theory job', but it seems your 'duplicates' are all finishing normally, is that right? If so, it may be better to create a new thread about the duplicate job issue.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0
True, but on the other hand, a job that is run again and again and again is endless, isn't it? |
Joined: 4 Mar 16 · Posts: 31 · Credit: 44,320 · RAC: 0
I agree with Crystal, but all the information is in this thread now, and hopefully the issue is going to be solved.
Joined: 4 Mar 16 · Posts: 31 · Credit: 44,320 · RAC: 0
On my side I see at most 26 "fresh" jobs (Condor JobID > 316094) out of 77 total jobs running. Maybe you got some "old" jobs (Condor JobID < 316094) which were affected by the duplicate issue; can you confirm?
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0
Apart from one, all the jobs I have done today have an ID < 316000. My MC plot count has actually decreased!
Joined: 4 Mar 16 · Posts: 31 · Credit: 44,320 · RAC: 0
OK, at least that is consistent. There are about 250 old jobs in the queue, but I just enabled the submission of new jobs with higher priority, so you should start to get them in 20 minutes or so.
Joined: 13 Feb 15 · Posts: 1188 · Credit: 878,593 · RAC: 72
An endless job in a multi-core VM is very annoying. To get rid of it, I have to kill all the other jobs too.

===> [runRivet] Sat Jul 30 11:17:39 CEST 2016 [boinc ppbar jets 1960 37 - sherpa 1.4.2 default 75000 394]
. . . .
12000 events processed
Event 12100 ( 11m 26s elapsed / 59m 29s left ) -> ETA: Sat Jul 30 13:42
12100 events processed
Event 12200 ( 11m 37s elapsed / 59m 49s left ) -> ETA: Sat Jul 30 13:42
12200 events processed
Updating display...
Display update finished (37 histograms, 12000 events).
Updating display...
Display update finished (37 histograms, 12000 events).
Updating display...
Display update finished (37 histograms, 12000 events).
Updating display...
Display update finished (37 histograms, 12000 events).
Updating display...
Display update finished (37 histograms, 12000 events).
Updating display...
Display update finished (37 histograms, 12000 events).
Updating display...
Display update finished (37 histograms, 12000 events).
etc etc etc
Joined: 12 Sep 14 · Posts: 1070 · Credit: 334,882 · RAC: 0
An improvement for this situation is on its way. I just have to find some time to implement it.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0
===> [runRivet] Sun Jul 31 00:07:55 CEST 2016 [boinc pp uemb-soft 53 - - sherpa 2.1.1 default 1000 394]
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
etc. etc. etc.

00:07:53 +0200 2016-07-31 [INFO] New Job Starting in slot4
00:07:54 +0200 2016-07-31 [INFO] Condor JobID: 1347271 in slot4
00:07:59 +0200 2016-07-31 [INFO] MCPlots JobID: 32135745 in slot4

>>>>>> run-time > 8h 30min

Here is a list of jobs likely to fail or never end:
http://mcplots-dev.cern.ch/production.php?view=runs&rev=2016&display=unsucc
Joined: 13 Feb 15 · Posts: 1188 · Credit: 878,593 · RAC: 72
> An improvement for this situation is on its way. I just have to find some time to implement it.

The suggested solution will not solve this. The endless job is using a core, not idling. So far there is nothing to detect this kind of looping except my human eye.
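There is a machine-readable symptom in the job.out excerpts quoted above, though: the event counter in the "Display update finished" lines stops advancing while the updates keep coming. A minimal sketch of a detector built on that idea (the stall threshold and the command-line usage are assumptions, not part of the actual Theory wrapper):

import re
import sys

# A job is treated as "looping" if the event counter has not advanced
# over the last N display updates; threshold and log format are based
# on the job.out excerpts quoted in this thread.
STALL_UPDATES = 20

def looks_stuck(log_text: str, stall_updates: int = STALL_UPDATES) -> bool:
    events = [int(m.group(1)) for m in re.finditer(
        r"Display update finished \(\d+ histograms, (\d+) events\)", log_text)]
    if len(events) < stall_updates:
        return False
    tail = events[-stall_updates:]
    return len(set(tail)) == 1  # counter frozen across all recent updates

if __name__ == "__main__":
    with open(sys.argv[1]) as f:  # e.g. the slot's job.out
        print("stuck" if looks_stuck(f.read()) else "progressing")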
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0
I think I have a way to calculate the best termination time past 12h and minimize the time lost:

- Take the time of last change from all the StarterLog.slotX files.
- Find the time from the oldest one to the present.
- Take the sum of all the remaining slots' times to the present.
- If the sum is greater than the oldest ---> terminate.

This check should be done every 5 minutes or so after 12h of run time.

EDIT: It also needs a minimum trigger time, like 2h or so, below which nothing is terminated.

What do you guys think? Something like the sketch below.
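A minimal sketch of that watchdog, assuming the trigger is meant to fire once the stalled (oldest) slot's idle time outweighs the other slots' combined time since their last change; the StarterLog path, the thresholds, and the shutdown hook are all assumptions:

import glob
import os
import time

CHECK_AFTER = 12 * 3600  # only consider termination past 12h of VM run time
MIN_TRIGGER = 2 * 3600   # minimum trigger time: ignore stalls shorter than 2h
CHECK_EVERY = 5 * 60     # re-check every 5 minutes

# Hypothetical location; adjust to wherever the VM keeps its starter logs.
LOG_GLOB = "/var/log/condor/StarterLog.slot*"

def should_terminate(vm_runtime: float) -> bool:
    """Decide whether terminating the whole VM now minimises the time lost."""
    if vm_runtime < CHECK_AFTER:
        return False
    now = time.time()
    ages = sorted(now - os.path.getmtime(p) for p in glob.glob(LOG_GLOB))
    if len(ages) < 2:
        return False  # nothing to compare on a single-core VM
    oldest, others = ages[-1], ages[:-1]
    # Fire once the stalled slot has wasted more time than the other slots
    # would lose by having their current jobs killed along with the VM.
    return oldest >= MIN_TRIGGER and oldest > sum(others)

if __name__ == "__main__":
    start = time.time()
    while True:
        if should_terminate(time.time() - start):
            print("terminating VM")  # hand off to the wrapper's shutdown here
            break
        time.sleep(CHECK_EVERY)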
Joined: 12 Sep 14 · Posts: 1070 · Credit: 334,882 · RAC: 0
This is kind of similar to what I already suggested. I just need time to implement it. It is already on the task tracker. |
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0
Yes, but it is more specific about how to actually do it, and it is geared towards a single job that runs excessively long. (That is what we were trying to address, wasn't it?)
Joined: 13 Apr 15 · Posts: 138 · Credit: 2,969,210 · RAC: 0
Unusual to have a looping Pythia job:

04:23:58 +0100 2016-08-11 [INFO] New Job Starting in slot2
04:23:58 +0100 2016-08-11 [INFO] Condor JobID: 1506145 in slot2
04:24:03 +0100 2016-08-11 [INFO] MCPlots JobID: 32185396 in slot2

Input parameters:
mode=boinc beam=ppbar process=uemb-soft energy=1960 params=- specific=- generator=pythia8 version=8.165 tune=default-MBR nevts=100000 seed=402

33200 events processed
33300 events processed
Display update finished (4 histograms, 33000 events).
33400 events processed
Updating display...
Display update finished (4 histograms, 33000 events).

I only just spotted the 17+ hr runtime, so I looked inside and found the looper, with the other core sitting idle for 8 hrs. There are only a few minutes left until it terminates naturally at 18 hrs, so I'll just let it do that rather than resetting the VM.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0
Is it looping? It seems to make progress. (It might just be veeery slow) |
Joined: 13 Apr 15 · Posts: 138 · Credit: 2,969,210 · RAC: 0
Yes. The

Updating display...
Display update finished (4 histograms, 33000 events).

repeated until the 18 hr cutoff. It's gone now and I haven't had that job back (yet). I thought it unusual, as I think I've only seen Sherpas looping before.
Joined: 13 Feb 15 · Posts: 1188 · Credit: 878,593 · RAC: 72
> Yes. The
> Updating display...
> Display update finished (4 histograms, 33000 events).
> repeated until the 18 hr cutoff.

But you had increasing "events processed" counts in between . . . The display update is done at fixed time intervals, even when the next 1000 events have not been reached yet.