Message boards : Theory Application : Endless Theory job

Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 854,677
RAC: 13
Message 3347 - Posted: 13 May 2016, 16:10:56 UTC

===> [runRivet] Fri May 13 16:43:28 CEST 2016 [boinc ee zhad 133 - - sherpa 1.2.3 default 100000 220]
...
...
1900 events processed
Phase_Space_Handler::OneEvent(): Point for '2_4__e-__e+__u__c__ub__cb' exceeds maximum by 0.343478.
Event 2000 ( 1m 1s elapsed / 50m 31s left ) -> ETA: Fri May 13 17:57
2000 events processed
dumping histograms...
Event 2100 ( 1m 5s elapsed / 51m 9s left ) -> ETA: Fri May 13 17:58
2100 events processed
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
Display update finished (55 histograms, 2000 events).
etc etc etc
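A log like the one above can be checked for this pattern automatically. The following is only a rough sketch: the function name `looks_stalled` and the threshold of four repeats are assumptions for illustration, not part of the project's tooling. It flags a log whose most recent display updates all report the same event count. A job that is still integrating can legitimately show an unchanged count for a while, so this is a hint rather than proof of an endless loop.

```python
import re

# Matches lines such as "Display update finished (55 histograms, 2000 events)."
DISPLAY_RE = re.compile(r"Display update finished \((\d+) histograms, (\d+) events\)")


def looks_stalled(log_lines, min_repeats=4):
    """Return True if the last `min_repeats` display updates show the same event count."""
    counts = []
    for line in log_lines:
        match = DISPLAY_RE.search(line)
        if match:
            counts.append(int(match.group(2)))
    if len(counts) < min_repeats:
        return False  # not enough display updates to judge
    return len(set(counts[-min_repeats:])) == 1


# Lines taken from the log excerpt above.
sample = [
    "2000 events processed",
    "dumping histograms...",
    "2100 events processed",
    "Updating display...",
    "Display update finished (55 histograms, 2000 events).",
    "Updating display...",
    "Display update finished (55 histograms, 2000 events).",
    "Updating display...",
    "Display update finished (55 histograms, 2000 events).",
    "Updating display...",
    "Display update finished (55 histograms, 2000 events).",
]

print(looks_stalled(sample))  # True
```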
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3356 - Posted: 14 May 2016, 5:38:58 UTC

Job has been running for 7h now and does not seem to be going anywhere.


===> [runRivet] Sat May 14 00:34:57 CEST 2016 [boinc ppbar uemb-soft 63 - - sherpa 2.1.1 default 1000 256]


00:34:57 +0200 2016-05-14 [INFO] New Job Starting
00:34:57 +0200 2016-05-14 [INFO] Condor JobID: 309701
00:35:02 +0200 2016-05-14 [INFO] MCPlots JobID: 30260365


Output_Phase::Output_Phase(): Set output interval 1000000000 events.
----------------------------------------------------------
-- SHERPA generates events with the following structure --
----------------------------------------------------------
Perturbative : Signal_Processes
Perturbative : Hard_Decays
Perturbative : Jet_Evolution:CSS
Perturbative : Lepton_FS_QED_Corrections:Photons
Perturbative : Multiple_Interactions:Amisic
Perturbative : Minimum_Bias:Off
Hadronization : Beam_Remnants
Hadronization : Hadronization:Ahadic
Hadronization : Hadron_Decays
Analysis : HepMC2
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).

.
.
.
about 400 of the same
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 854,677
RAC: 13
Message 3357 - Posted: 14 May 2016, 7:52:49 UTC
Last modified: 14 May 2016, 8:09:38 UTC

I got the same endless job again:

===> [runRivet] Sat May 14 09:21:16 CEST 2016 [boinc ee zhad 133 - - sherpa 1.2.3 default 100000 220]
...
...
Event 2000 ( 1m 13s elapsed / 1h 8s left ) -> ETA: Sat May 14 10:44
2000 events processed
dumping histograms...
Event 2100 ( 1m 18s elapsed / 1h 39s left ) -> ETA: Sat May 14 10:44
2100 events processed
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
etc etc etc

Edit: My mistake, it's the same job, but not the same project -- Summer Challenge 2015 and not vLHCathome-dev, but in the end the jobs are coming from the same pool.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3364 - Posted: 15 May 2016, 20:59:01 UTC
Last modified: 15 May 2016, 21:21:45 UTC

It seems the same job is run numerous times, one after the other.
Exact same parameters, exact same log length.
Up to 12 times.

Is that deliberate??

Prepare Rivet parameters ...
analysesNames=CDF_2000_S4155203


Same across even different tasks.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 854,677
RAC: 13
Message 3365 - Posted: 16 May 2016, 10:38:10 UTC

===> [runRivet] Mon May 16 11:50:59 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - sherpa 2.1.0 default 90000 264]
...
...
18.5146 pb +- ( 0.0356486 pb = 0.192543 % ) 300000 ( 631214 -> 53.3 % )
integration time: ( 7m 56s (7m 31s) elapsed / 16s (15s) left ) [12:01:08]
18.5087 pb +- ( 0.034976 pb = 0.188971 % ) 310000 ( 650036 -> 53.2 % )
integration time: ( 8m 11s (7m 45s) elapsed / 0s (0s) left ) [12:01:23]
2_3__j__j__e-__e+__j : 18.5087 pb +- ( 0.034976 pb = 0.188971 % ) exp. eff: 1.15398 %
reduce max for 2_3__j__j__e-__e+__j to 0.522748 ( eps = 0.001 )
Process_Group::CalculateTotalXSec(): Calculate xs for '2_4__j__j__e-__e+__j__j' (Comix)
Starting the calculation at 12:01:24. Lean back and enjoy ... .
Updating display...
Display update finished (0 histograms, 0 events).
Exception_Handler::GenerateStackTrace(..): Generating stack trace
{
}

Exception_Handler::SignalHandler: Signal (6) caught.
Cannot continue.
Exception_Handler::GenerateStackTrace(..): Generating stack trace
{
}

The above exception appears 32 times, and then only:

Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
etc etc etc
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3366 - Posted: 16 May 2016, 10:56:18 UTC - in response to Message 3365.  
Last modified: 16 May 2016, 11:03:51 UTC

I have the same type of job.

CPU load is very high, >2.29 (15 min average), and that is with 2 running cores.

With only one core the load must be EXTREMELY high.

EDIT: It appears to run the exact same job on 3 tasks simultaneously.
Ben Segal
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 12 Sep 14
Posts: 65
Credit: 544
RAC: 0
Message 3367 - Posted: 16 May 2016, 12:40:58 UTC - in response to Message 3364.  

It seems the same job is run numerous times, one after the other.
Exact same parameters, exact same log length.
Up to 12 times.

Is that deliberate??

Prepare Rivet parameters ...
analysesNames=CDF_2000_S4155203


Same across even different tasks.

We are looking at this. It is probably a Condor rerun when a job terminates with a nonzero code, and looping would do that.
maeax

Joined: 22 Apr 16
Posts: 674
Credit: 1,960,521
RAC: 1,063
Message 3368 - Posted: 16 May 2016, 19:04:12 UTC - in response to Message 3357.  
Last modified: 16 May 2016, 19:11:51 UTC

Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
Display update finished (55 histograms, 2000 events).
Updating display...
etc etc etc


On a Challenge task I had the same error this morning, after about 10 hours of running.
It was also a Sherpa job; other Sherpa jobs are running well.
In the Challenge there is DataBridge and no Condor.

After aborting this task, the machine has run other tasks well so far.

Edit: This was from midnight (German time) last night until 10 o'clock
in the morning.
Leonardo Cristella

Joined: 4 Mar 16
Posts: 31
Credit: 44,320
RAC: 0
Message 3369 - Posted: 16 May 2016, 22:48:31 UTC - in response to Message 3367.  

Dear all,
the problem of the same job running several times should be related to the transition between manual and automatic job submission/retrieval. Please let us know if you still see this happening.

Thank you,
Leonardo
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3371 - Posted: 17 May 2016, 12:24:36 UTC - in response to Message 3369.  
Last modified: 17 May 2016, 12:57:32 UTC

It is still happening.
I had 2 jobs in a row with identical parameters.

EDIT: And I have two separate tasks running the same job simultaneously.
Laurence
Project administrator
Project developer
Project tester

Joined: 12 Sep 14
Posts: 1067
Credit: 334,882
RAC: 4
Message 3372 - Posted: 17 May 2016, 12:55:07 UTC - in response to Message 3371.  

Our working hypothesis is that jobs stuck in an endless loop are killed when the VM reaches the 18h time limit. Condor sees this as a failed job and then resubmits it up to 5 times. We will disable this feature and let MCPlots handle it.
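To make the hypothesis concrete, here is a toy model of that resubmission behaviour. It is a sketch only and does not use any Condor API: `run_with_resubmits` and `looping_job` are hypothetical names, and the limit of 5 is simply the figure quoted above. A job that always ends with a nonzero exit code is run once and then resubmitted until the budget is exhausted.

```python
def run_with_resubmits(run_job, max_resubmits=5):
    """Run `run_job` until it exits with 0 or the resubmit budget is used up.

    Returns the list of exit codes, one per attempt.
    """
    exit_codes = []
    for _ in range(1 + max_resubmits):  # first run plus the allowed resubmits
        code = run_job()
        exit_codes.append(code)
        if code == 0:
            break  # a successful job is never resubmitted
    return exit_codes


def looping_job():
    # Stand-in for a job killed at the VM time limit: it never exits cleanly.
    return 1


attempts = run_with_resubmits(looping_job)
print(len(attempts))  # 6
```

Under these assumptions a looping job would run 6 times in total (the original run plus 5 resubmits), each time for the full time limit.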
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3373 - Posted: 17 May 2016, 12:58:24 UTC - in response to Message 3372.  

I had a job running 11 times in a row.
Leonardo Cristella

Joined: 4 Mar 16
Posts: 31
Credit: 44,320
RAC: 0
Message 3374 - Posted: 17 May 2016, 13:05:54 UTC - in response to Message 3373.  

Hi Rasputin,
can you please retrieve the job executable name from any of them?
It should be something like "2016-5xxxxx-2xx.tgz".
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3375 - Posted: 17 May 2016, 14:17:32 UTC - in response to Message 3374.  

Where would that be?

I checked all logs for ".tgz" and found nothing.
Leonardo Cristella

Joined: 4 Mar 16
Posts: 31
Credit: 44,320
RAC: 0
Message 3376 - Posted: 17 May 2016, 14:36:00 UTC - in response to Message 3375.  

Sorry, my typo: I meant ".run", not ".tgz".
By the way, you can find the three numbers as revision=2016, runid=5xxxxx, seed=2xx in some output file.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3377 - Posted: 17 May 2016, 15:26:40 UTC - in response to Message 3376.  
Last modified: 17 May 2016, 15:37:21 UTC

There is no "revision=" and no "runid=".

There is a "seed=" and "version=".

I have two separate tasks running the same job at the moment.

EDIT: bad example, tune is different


Input parameters:
mode=boinc
beam=ppbar
process=z
energy=1800
params=-,-,50,130
specific=-
generator=pythia6
version=6.428
tune=394
nevts=100000
seed=267


Prepare Rivet parameters ...
analysesNames=CDF_2000_S4155203


======================================================================

Input parameters:
mode=boinc
beam=ppbar
process=z
energy=1800
params=-,-,50,130
specific=-
generator=pythia6
version=6.428
tune=352
nevts=100000
seed=267



Prepare Rivet parameters ...
analysesNames=CDF_2000_S4155203
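A quick way to see which parameter differs between two such blocks is to parse the `key=value` lines and compare them. This is just a sketch: the helper names `parse_params` and `differing_keys` are made up for illustration, and the two blocks are copied from the logs above.

```python
def parse_params(block):
    """Parse 'key=value' lines into a dict; lines without '=' are ignored."""
    params = {}
    for line in block.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:
            params[key] = value
    return params


def differing_keys(a, b):
    """Return the sorted keys whose values differ (or exist in only one dict)."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


task_1 = parse_params("""Input parameters:
mode=boinc
beam=ppbar
process=z
energy=1800
params=-,-,50,130
specific=-
generator=pythia6
version=6.428
tune=394
nevts=100000
seed=267
""")

task_2 = parse_params("""Input parameters:
mode=boinc
beam=ppbar
process=z
energy=1800
params=-,-,50,130
specific=-
generator=pythia6
version=6.428
tune=352
nevts=100000
seed=267
""")

print(differing_keys(task_1, task_2))  # ['tune']
```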
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3378 - Posted: 17 May 2016, 15:49:42 UTC
Last modified: 17 May 2016, 15:58:27 UTC

EDIT: It appears that the log files of two BOINC tasks started at the same time are virtually identical, with just slight differences in time stamps.

Better examples: two jobs run on two different tasks at nearly the same time.

===> [runRivet] Tue May 17 14:28:20 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.428 363 100000 267]
===> [runRivet] Tue May 17 14:28:15 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.428 363 100000 267]

===> [runRivet] Tue May 17 16:37:50 CEST 2016 [boinc ppbar z 1800 -,-,50,130 - pythia6 6.428 394 100000 267]
===> [runRivet] Tue May 17 16:38:43 CEST 2016 [boinc ppbar z 1800 -,-,50,130 - pythia6 6.428 394 100000 267]
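Duplicates like these can be found mechanically by grouping the `runRivet` start lines on their bracketed parameter list. The sketch below assumes the start lines from several task logs have already been collected into one list; the function name `group_duplicates` and the regular expression are my own guesses at the line format shown above, not something taken from the project's scripts.

```python
import re
from collections import defaultdict

# "===> [runRivet] <timestamp> [boinc <parameters>]"
RUN_RE = re.compile(r"===> \[runRivet\] (.+?) \[(boinc [^\]]*)\]")


def group_duplicates(lines):
    """Map each parameter string that occurs more than once to its start timestamps."""
    groups = defaultdict(list)
    for line in lines:
        match = RUN_RE.search(line)
        if match:
            timestamp, params = match.groups()
            groups[params].append(timestamp)
    return {params: stamps for params, stamps in groups.items() if len(stamps) > 1}


start_lines = [
    "===> [runRivet] Tue May 17 14:28:20 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.428 363 100000 267]",
    "===> [runRivet] Tue May 17 14:28:15 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia6 6.428 363 100000 267]",
    "===> [runRivet] Tue May 17 16:37:50 CEST 2016 [boinc ppbar z 1800 -,-,50,130 - pythia6 6.428 394 100000 267]",
    "===> [runRivet] Tue May 17 16:38:43 CEST 2016 [boinc ppbar z 1800 -,-,50,130 - pythia6 6.428 394 100000 267]",
]

duplicates = group_duplicates(start_lines)
for params, stamps in duplicates.items():
    print(len(stamps), "runs of:", params)
```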
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3379 - Posted: 17 May 2016, 16:08:06 UTC

Here are the log file indexes of two DIFFERENT tasks:

[DIR] Parent Directory -
[ ] MasterLog 17-May-2016 15:13 1.9K
[ ] StartLog 17-May-2016 18:03 6.8K
[ ] StarterLog 17-May-2016 18:03 24K
[ ] finished_1.log 17-May-2016 14:28 45K
[ ] finished_2.log 17-May-2016 15:23 57K
[ ] finished_3.log 17-May-2016 16:05 54K
[ ] finished_4.log 17-May-2016 16:37 44K
[ ] finished_5.log 17-May-2016 17:14 58K
[ ] finished_6.log 17-May-2016 17:56 57K
[ ] finished_7.log 17-May-2016 18:03 70K
[ ] running.log 17-May-2016 18:05 40K
[ ] stderr.log 17-May-2016 18:03 630
[TXT] stdout.log 17-May-2016 18:03 1.8K

[DIR] Parent Directory -
[ ] MasterLog 17-May-2016 15:13 1.9K
[ ] StartLog 17-May-2016 18:04 6.8K
[ ] StarterLog 17-May-2016 18:04 24K
[ ] finished_1.log 17-May-2016 14:28 45K
[ ] finished_2.log 17-May-2016 15:24 57K
[ ] finished_3.log 17-May-2016 16:06 54K
[ ] finished_4.log 17-May-2016 16:38 44K
[ ] finished_5.log 17-May-2016 17:14 58K
[ ] finished_6.log 17-May-2016 17:57 57K
[ ] finished_7.log 17-May-2016 18:04 70K
[ ] running.log 17-May-2016 18:05 34K
[ ] stderr.log 17-May-2016 18:04 630
[TXT] stdout.log 17-May-2016 18:05 2.0K
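The two listings can also be compared programmatically. The sketch below is illustrative only: the names `parse_index` and `same_size_logs` and the row pattern are assumptions based on the listing format shown above. It reports which finished logs have the same size in both tasks, using a shortened copy of the two indexes.

```python
import re

# "[ ] finished_1.log 17-May-2016 14:28 45K" -> name and size
ROW_RE = re.compile(r"^\[.*?\]\s+(\S+)\s+\d{2}-\w{3}-\d{4}\s+\d{2}:\d{2}\s+(\S+)$")


def parse_index(listing):
    """Return {file name: size string} for each file row in a directory index."""
    sizes = {}
    for line in listing.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            sizes[match.group(1)] = match.group(2)
    return sizes


def same_size_logs(a, b):
    """Names of finished_*.log files that have the same size in both indexes."""
    shared = a.keys() & b.keys()
    return sorted(n for n in shared if n.startswith("finished_") and a[n] == b[n])


task_a = parse_index("""[DIR] Parent Directory -
[ ] finished_1.log 17-May-2016 14:28 45K
[ ] finished_2.log 17-May-2016 15:23 57K
[ ] finished_3.log 17-May-2016 16:05 54K
[ ] running.log 17-May-2016 18:05 40K
""")

task_b = parse_index("""[DIR] Parent Directory -
[ ] finished_1.log 17-May-2016 14:28 45K
[ ] finished_2.log 17-May-2016 15:24 57K
[ ] finished_3.log 17-May-2016 16:06 54K
[ ] running.log 17-May-2016 18:05 34K
""")

print(same_size_logs(task_a, task_b))
```

Matching sizes alone do not prove that the contents are identical, so this only narrows down which logs are worth comparing line by line.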
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3380 - Posted: 17 May 2016, 16:20:23 UTC
Last modified: 17 May 2016, 16:33:16 UTC

EDIT: (It looks to me as if there is a certain time window in which any user/computer requesting jobs gets only one specific job, regardless of whether it is a different task or even a different computer, as long as the request falls within that window?

I cannot test it with different computers, as I have only one.)

Here are two starting lines of jobs of the SAME TASK:

===> [runRivet] Tue May 17 18:03:43 CEST 2016 [boinc ppbar uemb-soft 63 - - pythia8 8.212 tune-AU2ct10 100000 267]
===> [runRivet] Tue May 17 17:56:21 CEST 2016 [boinc ppbar uemb-soft 63 - - pythia8 8.212 tune-AU2ct10 100000 267]
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3381 - Posted: 17 May 2016, 17:19:49 UTC
Last modified: 17 May 2016, 17:20:17 UTC

I am stopping computing for this sub-project, until some things are fixed.

I now have the same job running that I had yesterday, with the same MCPlots ID number and parameters, but a different Condor job ID number.

I really do not want to calculate the same job over and over.

This job appears to be mixing two actions together:
===> [runRivet] Tue May 17 16:12:35 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - sherpa 2.1.0 default 90000 264]

Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
2.44551 pb +- ( 0.0399697 pb = 1.63441 % ) 25000 ( 98162 -> 33.4 % )
full optimization: ( 8m (9m 41s) elapsed / 1h 31m 17s (1h 50m 28s) left ) [16:36:37]
Updating display...
Display update finished (0 histograms, 0 events).
2.42937 pb +- ( 0.03419 pb = 1.40736 % ) 30000 ( 112822 -> 34.1 % )
full optimization: ( 9m 32s (11m 34s) elapsed / 1h 29m 3s (1h 47m 54s) left ) [16:38:09]
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
2.41093 pb +- ( 0.0253617 pb = 1.05195 % ) 40000 ( 141922 -> 34.3 % )
full optimization: ( 12m 45s (15m 24s) elapsed / 1h 26m 6s (1h 43m 54s) left ) [16:41:21]
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
2.3967 pb +- ( 0.0206849 pb = 0.863057 % ) 50000 ( 172547 -> 32.6 % )
full optimization: ( 16m 6s (19m 16s) elapsed / 1h 23m 43s (1h 40m 12s) left ) [16:44:42]
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).
Updating display...
Display update finished (0 histograms, 0 events).