Message boards : Theory Application : Endless Theory job
maeax

Message 3415 - Posted: 20 May 2016, 8:40:22 UTC
Last modified: 20 May 2016, 8:51:22 UTC

I have the same looping Sherpa task in the Challenge as in the days before.

databridge-client.log:

[20/05/2016 09:29:31] ERROR: The queue file is gone
[20/05/2016 09:29:31] INFO: A recoverable error occured, will retry in a second
[20/05/2016 09:29:32] INFO: Fetching next job in queue
[20/05/2016 09:29:35] INFO: Starting job 2085677b-a46a-4aa0-aff6-9f464eef81b9

Challenge job.out:

test4theory-bd2e19ec-c251-4b35-b5ac-1ed8303a0d8b

===> [runRivet] Fri May 20 09:29:35 CEST 2016 [boinc ee zhad 133 - - sherpa 1.2.3 default 100000 220]

Setting environment...
MCGENERATORS=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_lcgcmt65
gcc = /cvmfs/sft.cern.ch/lcg/external/gcc/4.7.2/x86_64-slc6-gcc47-opt/bin/gcc
gcc version = 4.7.2
RIVET=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_lcgcmt65/rivet/2.4.0/x86_64-slc6-gcc47-opt
RIVET_REF_PATH=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_lcgcmt65/rivet/2.4.0/x86_64-slc6-gcc47-opt/share/Rivet
RIVET_ANALYSIS_PATH=/tmp/tmp.HbQUR6JHYi/analyses
ROOTSYS=/cvmfs/sft.cern.ch/lcg/app/releases/ROOT/5.34.19/x86_64-slc6-gcc47-opt/root

Input parameters:
mode=boinc
beam=ee
process=zhad
energy=133
params=-
specific=-
generator=sherpa
version=1.2.3
tune=default
nevts=100000
seed=220

Prepare temporary directories and files ...
workd=/tmp/tmp.HbQUR6JHYi
tmpd=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X
tmp_params=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X/generator.params
tmp_hepmc=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X/generator.hepmc
tmp_yoda=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X/generator.yoda
tmp_jobs=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X/jobs.log
tmpd_flat=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X/flat
tmpd_dump=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X/dump
tmpd_html=/tmp/tmp.HbQUR6JHYi/tmp/tmp.o6zSpA6r4X/html


Event 1900 ( 1m 32s elapsed / 1h 19m 44s left ) -> ETA: Fri May 20 11:23
1900 events processed
Phase_Space_Handler::OneEvent(): Point for '2_4__e-__e+__u__c__ub__cb' exceeds maximum by 0.343478.
Event 2000 ( 1m 37s elapsed / 1h 19m 39s left ) -> ETA: Fri May 20 11:23
2000 events processed
dumping histograms...
Event 2100 ( 1m 40s elapsed / 1h 18m 20s left ) -> ETA: Fri May 20 11:22
2100 events processed
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
Display update finished (55 histograms, 2000 events).

I gave it a 'destroy' to get a new task.

Edit: I saw that the time in the 1900... and 2000... lines is in the future!
Crystal Pellet
Volunteer tester

Message 3418 - Posted: 20 May 2016, 9:26:27 UTC - in response to Message 3415.  

[boinc ee zhad 133 - - sherpa 1.2.3 default 100000 220]

I saw that job at least ten times. I then reset the VM to get a new job, but that job refuses to leave the queue.

Edit: I saw that the time in the 1900... and 2000... lines is in the future!

Yes, of course that date-time group is in the future, because it is the 'ETA', which means Estimated Time of Arrival (or of completion, in this case).
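For illustration only (my own sketch, not the actual runRivet code; the function name and the numbers below are just taken from the log above): the ETA on those progress lines can be reproduced by scaling the elapsed time by the fraction of events still to go and adding that to the current wall-clock time.

from datetime import datetime, timedelta

# Hedged sketch: reproduce a progress line such as
#   "Event 1900 ( 1m 32s elapsed / 1h 19m 44s left ) -> ETA: Fri May 20 11:23"
# by extrapolating the elapsed time over the remaining events.
# eta_line() and its exact formula are illustrative, not the real implementation.
def eta_line(events_done, events_total, elapsed):
    remaining = elapsed * (events_total - events_done) / max(events_done, 1)
    eta = datetime.now() + remaining
    return "Event {} ( {} elapsed / {} left ) -> ETA: {:%a %b %d %H:%M}".format(
        events_done, elapsed, remaining, eta)

# Example: 1900 of 100000 events done after 1m 32s elapsed.
print(eta_line(1900, 100000, timedelta(minutes=1, seconds=32)))

So an ETA that sits hours ahead of the wall clock is exactly what you would expect while the job is still running.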
Rasputin42
Volunteer tester

Message 3438 - Posted: 21 May 2016, 8:16:21 UTC
Last modified: 21 May 2016, 8:35:47 UTC

Duplicate jobs are happening again.
The first two numbers show the slot number and the finished_x_log number.

EDIT: It looks like only jobs with "seed=268" are affected, as they were two days ago.

3,1===> [runRivet] Fri May 20 22:40:41 CEST 2016 [boinc pp uemb-hard 200 4 - pythia6 6.428 a 100000 273]
3,2===> [runRivet] Fri May 20 22:57:03 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.212 tune-1 100000 273]
3,3===> [runRivet] Sat May 21 00:11:38 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.205 tune-A14-CTEQL1 100000 268]
3,4===> [runRivet] Sat May 21 01:24:02 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.205 tune-A14-CTEQL1 100000 268]
3,5===> [runRivet] Sat May 21 02:32:15 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.428 375 100000 268]
3,6===> [runRivet] Sat May 21 03:15:21 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.428 396 100000 268]--------------
3,7===> [runRivet] Sat May 21 03:52:34 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.425 d6t 100000 268]
3,8===> [runRivet] Sat May 21 04:29:22 CEST 2016 [boinc ppbar z 1800 -,-,50,130 - pythia8 8.183 default 100000 268]
3,9===> [runRivet] Sat May 21 05:20:22 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.428 396 100000 268]--------------
3,10===> [runRivet] Sat May 21 05:57:19 CEST 2016 [boinc ppbar uemb-soft 900 - - pythia6 6.427 379 100000 268]
3,11===> [runRivet] Sat May 21 06:14:19 CEST 2016 [boinc ppbar z 1800 -,-,50,130 - pythia8 8.205 tune-AU2lox 100000 268]
3,12===> [runRivet] Sat May 21 07:19:05 CEST 2016 [boinc ppbar z 1800 -,-,50,130 - pythia8 8.183 default 100000 268]
3,13===> [runRivet] Sat May 21 08:10:47 CEST 2016 [boinc ppbar z 1800 -,-,50,130 - pythia8 8.183 default 100000 268]
3,14===> [runRivet] Sat May 21 09:03:13 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.425 dw 100000 268]
3,15===> [runRivet] Sat May 21 09:48:07 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.212 tune-monashstar 100000 268]
Crystal Pellet
Volunteer tester

Message 3440 - Posted: 21 May 2016, 9:58:02 UTC - in response to Message 3438.  

Duplicate jobs are happening again.

This thread is called 'Endless Theory job', but it seems your 'duplicates' are all finishing normally; is that right?
If so, it may be better to create a new thread about the duplicate job issue.
Rasputin42
Volunteer tester

Message 3441 - Posted: 21 May 2016, 10:05:03 UTC - in response to Message 3440.  

True, but on the other hand, a job that is run again and again and again is endless, isn't it?
Leonardo Cristella

Message 3443 - Posted: 21 May 2016, 10:20:18 UTC - in response to Message 3441.  

I agree with Crystal, but all the information is in this thread now, and hopefully the issue will be solved.
Leonardo Cristella

Message 3444 - Posted: 21 May 2016, 10:23:45 UTC - in response to Message 3438.  

On my side I see at most 26 "fresh" jobs (Condor JobID > 316094) out of 77 total jobs running.
Maybe you got some "old" jobs (Condor JobID < 316094) which were affected by the duplicate issue; can you confirm?
Rasputin42
Volunteer tester

Message 3454 - Posted: 21 May 2016, 19:36:05 UTC - in response to Message 3444.  

Apart from one, all the jobs I have done today have an ID < 316000.
My MC plot count has actually decreased!
Leonardo Cristella

Message 3456 - Posted: 21 May 2016, 20:05:26 UTC - in response to Message 3454.  

OK, at least it is consistent. There are about 250 old jobs in the queue, but I have just enabled the submission of new jobs with higher priority, so you should start to get them in 20 minutes or so.
Crystal Pellet
Volunteer tester

Message 3883 - Posted: 30 Jul 2016, 12:05:21 UTC

An endless job in a multi-core VM is very annoying. To get rid of it, I have to kill all the other jobs too.

===> [runRivet] Sat Jul 30 11:17:39 CEST 2016 [boinc ppbar jets 1960 37 - sherpa 1.4.2 default 75000 394]
.
.
.
.
12000 events processed
Event 12100 ( 11m 26s elapsed / 59m 29s left ) -> ETA: Sat Jul 30 13:42
12100 events processed
Event 12200 ( 11m 37s elapsed / 59m 49s left ) -> ETA: Sat Jul 30 13:42
12200 events processed
Updating display...
Display update finished (37 histograms, 12000 events).
Updating display...
Display update finished (37 histograms, 12000 events).
Updating display...
Display update finished (37 histograms, 12000 events).
Updating display...
Display update finished (37 histograms, 12000 events).
Updating display...
Display update finished (37 histograms, 12000 events).
Updating display...
Display update finished (37 histograms, 12000 events).
Updating display...
Display update finished (37 histograms, 12000 events).
etc etc etc
Laurence
Project administrator
Project developer
Project tester

Message 3896 - Posted: 30 Jul 2016, 21:32:37 UTC - in response to Message 3883.  

An improvement for this situation is on its way. I just have to find some time to implement it.
Rasputin42
Volunteer tester

Message 3900 - Posted: 31 Jul 2016, 6:46:29 UTC
Last modified: 31 Jul 2016, 7:15:16 UTC

===> [runRivet] Sun Jul 31 00:07:55 CEST 2016 [boinc pp uemb-soft 53 - - sherpa 2.1.1 default 1000 394]

Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...etc.etc.etc

00:07:53 +0200 2016-07-31 [INFO] New Job Starting in slot4
00:07:54 +0200 2016-07-31 [INFO] Condor JobID: 1347271 in slot4
00:7:59 +0200 2016-07-31 [INFO] MCPlots JobID: 32135745 in slot4

>>>>>>run-time > 8h30min


Here is a list of jobs that are likely to fail or never end:
http://mcplots-dev.cern.ch/production.php?view=runs&rev=2016&display=unsucc
Crystal Pellet
Volunteer tester

Message 3901 - Posted: 31 Jul 2016, 7:39:09 UTC - in response to Message 3896.  

An improvement for this situation is on its way. I just have to find some time to implement it.

The suggested solution will not solve this. The endlessly running job is using a core, not idling.
So far there is nothing to detect this kind of looping except my human eye.
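For what it's worth, here is a rough sketch of what an automated check could look like, assuming the job's console output is readable from inside the VM (the log path, poll interval and stall threshold below are all assumptions): watch the 'Display update finished' lines and flag the job once the reported event count has stopped increasing for too long. A genuinely slow job could of course be flagged as well, so this is only a heuristic.

import re
import time

# Assumed location of the job's console output and assumed thresholds.
LOG_PATH = "job.out"
POLL_S = 300           # re-read the log every 5 minutes
STALL_LIMIT_S = 3600   # flag the job if the count has been stuck for an hour

PATTERN = re.compile(r"Display update finished \(\d+ histograms, (\d+) events\)")

def latest_event_count(log_path):
    # Return the last event count reported by a 'Display update finished' line.
    count = 0
    with open(log_path, errors="replace") as fh:
        for line in fh:
            m = PATTERN.search(line)
            if m:
                count = int(m.group(1))
    return count

def watch(log_path=LOG_PATH):
    last = latest_event_count(log_path)
    since = time.time()
    while True:
        time.sleep(POLL_S)
        current = latest_event_count(log_path)
        if current > last:
            last, since = current, time.time()
        elif time.time() - since > STALL_LIMIT_S:
            print("Likely looper: event count stuck at", current)
            break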
Rasputin42
Volunteer tester

Message 4016 - Posted: 8 Aug 2016, 20:41:34 UTC
Last modified: 8 Aug 2016, 20:47:28 UTC

I think I have a way to calculate the best termination time past 12 h and to minimize the time lost.

Take the time of the last change of each starterlog.slotX.

Find the time from the oldest of these to the present.
Take the sum of all the remaining slot times to the present.
If the sum is greater than the oldest, terminate.

This check should be done every 5 minutes or so after 12 h of run-time.

EDIT: It also needs a minimum trigger time, 2 h or so, below which nothing is terminated.
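Roughly, as a sketch (the slot-log location, the thresholds and my reading of which quantity the 2 h trigger applies to are assumptions, not actual VM code):

import glob
import os
import time

LOG_GLOB = "/var/log/condor/StarterLog.slot*"  # assumed location of the slot logs
MIN_RUNTIME_S = 12 * 3600   # only start checking after 12 h of task run-time
MIN_TRIGGER_S = 2 * 3600    # below this, never terminate anything
CHECK_EVERY_S = 5 * 60      # repeat the check every 5 minutes

def should_terminate(task_runtime_s):
    if task_runtime_s < MIN_RUNTIME_S:
        return False
    now = time.time()
    # Seconds since each slot log last changed, oldest last.
    ages = sorted(now - os.path.getmtime(p) for p in glob.glob(LOG_GLOB))
    if len(ages) < 2:
        return False
    oldest, rest = ages[-1], sum(ages[:-1])
    # Terminate once the summed times of the remaining slots exceed the
    # oldest one, but only past the minimum trigger time.
    return oldest >= MIN_TRIGGER_S and rest > oldest

# A supervisor would call should_terminate() every CHECK_EVERY_S seconds and
# shut the VM down (or kill the stuck job) when it returns True.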


What do you guys think?
Laurence
Project administrator
Project developer
Project tester

Message 4017 - Posted: 8 Aug 2016, 21:17:11 UTC - in response to Message 4016.  

This is kind of similar to what I already suggested. I just need time to implement it. It is already on the task tracker.
Rasputin42
Volunteer tester

Message 4018 - Posted: 8 Aug 2016, 21:35:42 UTC - in response to Message 4017.  

Yes, but it is more specific about how to actually do it, and it is geared towards one job that runs excessively long. (That is what we were trying to address, wasn't it?)
Ray Murray

Message 4065 - Posted: 12 Aug 2016, 20:49:20 UTC

Unusual to have a looping Pythia job:

04:23:58 +0100 2016-08-11 [INFO] New Job Starting in slot2
04:23:58 +0100 2016-08-11 [INFO] Condor JobID: 1506145 in slot2
04:24:03 +0100 2016-08-11 [INFO] MCPlots JobID: 32185396 in slot2

Input parameters:
mode=boinc
beam=ppbar
process=uemb-soft
energy=1960
params=-
specific=-
generator=pythia8
version=8.165
tune=default-MBR
nevts=100000
seed=402

33200 events processed
33300 events processed
Display update finished (4 histograms, 33000 events).
33400 events processed
Updating display...
Display update finished (4 histograms, 33000 events).

I only just spotted the 17+ hour runtime, so I looked inside to find the looper, with the other core sitting idle for 8 hours. It's only a few minutes until it terminates naturally at 18 hours, so I'll just let it do that rather than resetting the VM.
Rasputin42
Volunteer tester

Message 4066 - Posted: 12 Aug 2016, 20:58:28 UTC
Last modified: 12 Aug 2016, 21:02:28 UTC

Is it looping?

It seems to make progress.

(It might just be veeery slow)
Ray Murray

Message 4067 - Posted: 13 Aug 2016, 8:45:08 UTC - in response to Message 4066.  
Last modified: 13 Aug 2016, 8:46:55 UTC

Yes. The

Updating display...
Display update finished (4 histograms, 33000 events).

repeated until the 18hr cutoff.
It's gone now and I haven't had that job back (yet)

I thought it unusual as I think I've only seen Sherpas looping before.
Crystal Pellet
Volunteer tester

Message 4068 - Posted: 13 Aug 2016, 9:17:51 UTC - in response to Message 4067.  
Last modified: 13 Aug 2016, 9:50:52 UTC

Yes. The

Updating display...
Display update finished (4 histograms, 33000 events).

repeated until the 18hr cutoff.
It's gone now and I haven't had that job back (yet)

I thought it unusual as I think I've only seen Sherpas looping before.

But you had an increasing number of events processed in between . . .

The display update is done at fixed time intervals, even when the next 1000 events have not been reached.
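In other words, something like this (a made-up illustration, not the real display code): the refresh runs on its own timer and simply reprints the last milestone it knows about, so the same count can appear several times in a row while the generator is still slowly making progress.

import time

# Illustration only: a timer-driven refresh that reprints the last known
# event milestone, independent of whether new events have arrived since.
events_done = 33000  # last milestone reported by the generator

def refresh_display(interval_s=2.0, updates=3):
    for _ in range(updates):
        time.sleep(interval_s)
        print("Updating display...")
        print("Display update finished (4 histograms, %d events)." % events_done)

refresh_display()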