Message boards : Theory Application : Endless Theory job

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1185
Credit: 825,996
RAC: 1,062
Message 3382 - Posted: 17 May 2016, 19:32:25 UTC - in response to Message 3381.  

> This job appears to be mixing two actions together:
> ===> [runRivet] Tue May 17 16:12:35 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - sherpa 2.1.0 default 90000 264]

That's the same job I reported yesterday with the exception messages.
What you have shown so far looks fine.
The only useless action is the "updating display" during the initialization and optimization phases of a Sherpa job.
Histograms can only be created once a few thousand events have been processed.
But later in that job it starts looping after 32 identical exception messages.

maeax
Joined: 22 Apr 16
Posts: 671
Credit: 1,881,771
RAC: 7,145
Message 3386 - Posted: 18 May 2016, 11:36:53 UTC

The Challenge is running Sherpa tasks;
when they finish correctly, they take about 8,500 seconds (about two and a half hours).

I saw this today on two of my own PCs.

Other tasks in the Challenge (for example Herwig++) need up to half an hour.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1185
Credit: 825,996
RAC: 1,062
Message 3387 - Posted: 18 May 2016, 16:25:02 UTC - in response to Message 3386.  

> The Challenge is running Sherpa tasks;
> when they finish correctly, they take about 8,500 seconds (about two and a half hours).

The duration of a Sherpa job depends entirely on the parameters used.
In the past I've even seen Sherpa jobs running longer than 24 hours when nevts=100000.

A few hours ago I had a looping Sherpa again in a Challenge 2015 VM:

===> [runRivet] Wed May 18 14:20:15 CEST 2016 [boinc ee zhad 133 - - sherpa 1.2.3 default 100000 220]

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3390 - Posted: 19 May 2016, 8:30:25 UTC

Is there any progress yet on the issue of the same job being processed multiple times?

It would really be nice if the admins could, every once in a while, actually say something.
Just a few words, so that we know the admins are still alive.

We talked about improving communication a while back. It did not help.

Leonardo Cristella
Joined: 4 Mar 16
Posts: 31
Credit: 44,320
RAC: 0
Message 3391 - Posted: 19 May 2016, 9:03:48 UTC - in response to Message 3390.  

We have disabled automatic resubmission of Condor jobs for the moment.
We are not sure it was the cause of the duplicated jobs, so it would be good to hear back from you.

Many thanks.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3395 - Posted: 19 May 2016, 13:45:32 UTC
Last modified: 19 May 2016, 13:49:53 UTC

The very first two tasks I started are running identical jobs.

0===> [runRivet] Thu May 19 15:13:54 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.212 default-CD 100000 268]
4===> [runRivet] Thu May 19 15:13:54 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.212 default-CD 100000 268]

Output of the job wrapper may appear here.
15:13:54 +0200 2016-05-19 [INFO] New Job Starting
15:13:54 +0200 2016-05-19 [INFO] Condor JobID: 314816
15:13:59 +0200 2016-05-19 [INFO] MCPlots JobID: 30486463


Output of the job wrapper may appear here.
15:13:54 +0200 2016-05-19 [INFO] New Job Starting
15:13:54 +0200 2016-05-19 [INFO] Condor JobID: 314815
15:13:59 +0200 2016-05-19 [INFO] MCPlots JobID: 30486463
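
A minimal sketch, not from the thread, of how two runRivet lines like the ones above could be compared mechanically: only the bracketed parameter list is compared, while the slot prefix and timestamp are ignored. The function name `job_signature` is made up for illustration, and the regular expression assumes the line format shown in this post. Matching parameters are only a hint; a shared MCPlots JobID in the wrapper log is the stronger evidence.

```python
import re

# Matches "<slot>===> [runRivet] <timestamp> [<parameters>]".
# The slot prefix is optional because some log lines omit it.
LINE_RE = re.compile(
    r"^(?P<slot>\d*)===> \[runRivet\] (?P<stamp>.+?) \[(?P<params>[^\]]+)\]\s*$"
)

def job_signature(line):
    """Return the bracketed parameter list as a tuple, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return tuple(match.group("params").split())

a = "0===> [runRivet] Thu May 19 15:13:54 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.212 default-CD 100000 268]"
b = "4===> [runRivet] Thu May 19 15:13:54 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.212 default-CD 100000 268]"

print(job_signature(a) == job_signature(b))  # True: same parameters in both slots
```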

Leonardo Cristella
Joined: 4 Mar 16
Posts: 31
Credit: 44,320
RAC: 0
Message 3396 - Posted: 19 May 2016, 13:58:20 UTC - in response to Message 3395.  

So it looks like duplicate jobs appear only in different tasks, never in the same task.
Can you please retrieve the MCPlots job ID and the Condor job ID for them?
Do you see a "jobdata" file in the job execution directory?

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3397 - Posted: 19 May 2016, 14:02:03 UTC - in response to Message 3396.  
Last modified: 19 May 2016, 14:08:14 UTC

> So it looks like duplicate jobs appear only in different tasks, never in the same task.

Incorrect.


The second job in each task is identical to the one before it:

0===> [runRivet] Thu May 19 15:13:54 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.212 default-CD 100000 268]
4===> [runRivet] Thu May 19 15:13:54 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.212 default-CD 100000 268]
0===> [runRivet] Thu May 19 15:58:07 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.212 default-CD 100000 268]
4===> [runRivet] Thu May 19 15:57:32 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.212 default-CD 100000 268]


Output of the job wrapper may appear here.
15:13:54 +0200 2016-05-19 [INFO] New Job Starting
15:13:54 +0200 2016-05-19 [INFO] Condor JobID: 314816
15:13:59 +0200 2016-05-19 [INFO] MCPlots JobID: 30486463
15:58:03 +0200 2016-05-19 [INFO] Job finished with 0.
15:58:07 +0200 2016-05-19 [INFO] New Job Starting
15:58:07 +0200 2016-05-19 [INFO] Condor JobID: 314840
15:58:13 +0200 2016-05-19 [INFO] MCPlots JobID: 30486463


Output of the job wrapper may appear here.
15:13:54 +0200 2016-05-19 [INFO] New Job Starting
15:13:54 +0200 2016-05-19 [INFO] Condor JobID: 314815
15:13:59 +0200 2016-05-19 [INFO] MCPlots JobID: 30486463
15:57:28 +0200 2016-05-19 [INFO] Job finished with 0.
15:57:31 +0200 2016-05-19 [INFO] New Job Starting
15:57:32 +0200 2016-05-19 [INFO] Condor JobID: 314839
15:57:37 +0200 2016-05-19 [INFO] MCPlots JobID: 30486463
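
The duplication visible in these two wrapper logs (one MCPlots JobID running under several Condor JobIDs) can be detected automatically. The following is a rough sketch, assuming the log line format shown above; the helper names `pair_job_ids` and `find_duplicates` are hypothetical and not part of any BOINC or Condor tool.

```python
import re
from collections import defaultdict

CONDOR_RE = re.compile(r"Condor JobID:\s*(\d+)")
MCPLOTS_RE = re.compile(r"MCPlots JobID:\s*(\d+)")

def pair_job_ids(log_text):
    """Yield (condor_id, mcplots_id) pairs in the order they appear in one log."""
    condor_id = None
    for line in log_text.splitlines():
        condor = CONDOR_RE.search(line)
        if condor:
            condor_id = int(condor.group(1))
            continue
        mcplots = MCPLOTS_RE.search(line)
        if mcplots and condor_id is not None:
            yield condor_id, int(mcplots.group(1))
            condor_id = None

def find_duplicates(logs):
    """Map each MCPlots JobID seen more than once to the Condor JobIDs that ran it."""
    seen = defaultdict(list)
    for log_text in logs:
        for condor_id, mcplots_id in pair_job_ids(log_text):
            seen[mcplots_id].append(condor_id)
    return {m: c for m, c in seen.items() if len(c) > 1}

# Lines copied from the two wrapper logs quoted in this post.
slot0 = """\
15:13:54 +0200 2016-05-19 [INFO] Condor JobID: 314816
15:13:59 +0200 2016-05-19 [INFO] MCPlots JobID: 30486463
15:58:07 +0200 2016-05-19 [INFO] Condor JobID: 314840
15:58:13 +0200 2016-05-19 [INFO] MCPlots JobID: 30486463
"""
slot4 = """\
15:13:54 +0200 2016-05-19 [INFO] Condor JobID: 314815
15:13:59 +0200 2016-05-19 [INFO] MCPlots JobID: 30486463
15:57:32 +0200 2016-05-19 [INFO] Condor JobID: 314839
15:57:37 +0200 2016-05-19 [INFO] MCPlots JobID: 30486463
"""

print(find_duplicates([slot0, slot4]))
# {30486463: [314816, 314840, 314815, 314839]}
```

Here four distinct Condor jobs all map to the single MCPlots JobID 30486463, across both tasks and within each task.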

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3398 - Posted: 19 May 2016, 14:47:10 UTC

I finally got a different job.

0===> [runRivet] Thu May 19 15:13:54 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.212 default-CD 100000 268]
4===> [runRivet] Thu May 19 15:13:54 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.212 default-CD 100000 268]
0===> [runRivet] Thu May 19 15:58:07 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.212 default-CD 100000 268]
4===> [runRivet] Thu May 19 15:57:32 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.212 default-CD 100000 268]
0===> [runRivet] Thu May 19 16:41:51 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.210 default-CD 100000 268]
4===> [runRivet] Thu May 19 16:41:12 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - herwig++ 2.5.2 LHC-UE-EE-3-2760 100000 268]


The first number in each row is a slot number to tell the tasks apart.

Leonardo Cristella
Joined: 4 Mar 16
Posts: 31
Credit: 44,320
RAC: 0
Message 3399 - Posted: 19 May 2016, 14:54:17 UTC - in response to Message 3398.  

OK, thanks; we are investigating with the Condor experts.
If you can find the "jobdata" file in the job execution directory, it would be helpful to know the "runid" and "seed" values listed in its first lines.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3400 - Posted: 19 May 2016, 14:59:43 UTC - in response to Message 3399.  

Do you mean running.log?

Leonardo Cristella
Joined: 4 Mar 16
Posts: 31
Credit: 44,320
RAC: 0
Message 3401 - Posted: 19 May 2016, 15:03:51 UTC - in response to Message 3400.  

I don't know whether "runid" and "seed" are written to running.log, but they are certainly in "jobdata".
Wherever you find them is fine.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3402 - Posted: 19 May 2016, 15:07:09 UTC - in response to Message 3401.  

There is no such word in any of the "show graphics" log files.

Do you mean this:
Unpack data histograms...
dataFiles =
/cvmfs/sft.cern.ch/lcg/external/MCGenerators_lcgcmt65/rivet/2.4.0/x86_64-slc6-gcc47-opt/share/Rivet/D0_2010_S8671338.yoda
/cvmfs/sft.cern.ch/lcg/external/MCGenerators_lcgcmt65/rivet/2.4.0/x86_64-slc6-gcc47-opt/share/Rivet/D0_2010_S8821313.yoda
output = /var/lib/condor/execute/dir_4349/tmp/tmp.5sRZTQbz2H/flat
make: Entering directory `/var/lib/condor/execute/dir_4349/rivetvm'
g++ yoda2flat-split.cc -o yoda2flat-split.exe -Wfatal-errors -Wl,-rpath /cvmfs/sft.cern.ch/lcg/external/MCGenerators_lcgcmt65/yoda/1.5.5/x86_64-slc6-gcc47-opt/lib `/cvmfs/sft.cern.ch/lcg/external/MCGenerators_lcgcmt65/yoda/1.5.5/x86_64-slc6-gcc47-opt/bin/yoda-config --cppflags --libs`
make: Leaving directory `/var/lib/condor/execute/dir_4349/rivetvm'

Leonardo Cristella
Joined: 4 Mar 16
Posts: 31
Credit: 44,320
RAC: 0
Message 3403 - Posted: 19 May 2016, 15:11:56 UTC - in response to Message 3402.  

According to your output, the "jobdata" file should be in /var/lib/condor/execute/dir_4349/.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3404 - Posted: 19 May 2016, 15:15:28 UTC - in response to Message 3403.  

You should know that, as a volunteer, I cannot get into the VirtualBox VM.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3406 - Posted: 19 May 2016, 16:52:06 UTC

After a one-job break, the same task gets the same job again:

Output of the job wrapper may appear here.
15:13:54 +0200 2016-05-19 [INFO] New Job Starting
15:13:54 +0200 2016-05-19 [INFO] Condor JobID: 314815
15:13:59 +0200 2016-05-19 [INFO] MCPlots JobID: 30486463
15:57:28 +0200 2016-05-19 [INFO] Job finished with 0.
15:57:31 +0200 2016-05-19 [INFO] New Job Starting
15:57:32 +0200 2016-05-19 [INFO] Condor JobID: 314839
15:57:37 +0200 2016-05-19 [INFO] MCPlots JobID: 30486463
16:41:08 +0200 2016-05-19 [INFO] Job finished with 0.
16:41:12 +0200 2016-05-19 [INFO] New Job Starting
16:41:12 +0200 2016-05-19 [INFO] Condor JobID: 314866
16:41:17 +0200 2016-05-19 [INFO] MCPlots JobID: 30484435
17:17:10 +0200 2016-05-19 [INFO] Job finished with 0.
17:17:14 +0200 2016-05-19 [INFO] New Job Starting
17:17:14 +0200 2016-05-19 [INFO] Condor JobID: 314894
17:17:19 +0200 2016-05-19 [INFO] MCPlots JobID: 30484603
17:48:52 +0200 2016-05-19 [INFO] Job finished with 0.
17:48:56 +0200 2016-05-19 [INFO] New Job Starting
17:48:56 +0200 2016-05-19 [INFO] Condor JobID: 314922
17:49:01 +0200 2016-05-19 [INFO] MCPlots JobID: 30483844
18:44:47 +0200 2016-05-19 [INFO] Job finished with 0.
18:44:52 +0200 2016-05-19 [INFO] New Job Starting
18:44:52 +0200 2016-05-19 [INFO] Condor JobID: 314973
18:44:57 +0200 2016-05-19 [INFO] MCPlots JobID: 30484603
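
To make the "one-job break" concrete, the sketch below measures how many other jobs ran between two runs of the same MCPlots JobID. The list is the sequence of MCPlots JobIDs from the log above; the function name `repeat_gaps` is an illustrative assumption.

```python
def repeat_gaps(mcplots_ids):
    """For each repeated ID, return (index, job_id, jobs_in_between) relative to its previous run."""
    last_seen = {}
    gaps = []
    for index, job_id in enumerate(mcplots_ids):
        if job_id in last_seen:
            gaps.append((index, job_id, index - last_seen[job_id] - 1))
        last_seen[job_id] = index
    return gaps

# MCPlots JobIDs in the order they appear in the wrapper log of this task.
sequence = [30486463, 30486463, 30484435, 30484603, 30483844, 30484603]

for index, job_id, between in repeat_gaps(sequence):
    print(f"job #{index + 1}: MCPlots JobID {job_id} repeated after {between} other job(s)")
```

For this log, the second job repeats the first immediately, and the sixth job repeats the fourth with one other job in between.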

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3409 - Posted: 19 May 2016, 20:20:50 UTC

After running a number of jobs, more than 90% of all new jobs are repeats.

That is within a single computer only. I do not know how many times the same job has been run on other computers.
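
One possible way to put a number on such a claim is to count a job as a repeat whenever its MCPlots JobID has already appeared earlier on the same machine. This is a sketch under that assumed definition; `repeat_fraction` is a hypothetical helper, and the sample list is the short six-job sequence quoted earlier in this thread, not the longer run behind the 90% figure.

```python
def repeat_fraction(mcplots_ids):
    """Fraction of jobs whose MCPlots JobID already appeared earlier in the list."""
    seen = set()
    repeats = 0
    for job_id in mcplots_ids:
        if job_id in seen:
            repeats += 1
        seen.add(job_id)
    return repeats / len(mcplots_ids) if mcplots_ids else 0.0

# Six MCPlots JobIDs from the wrapper log posted earlier in this thread.
ids = [30486463, 30486463, 30484435, 30484603, 30483844, 30484603]
print(f"{repeat_fraction(ids):.0%} of jobs were repeats")
```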

Leonardo Cristella
Joined: 4 Mar 16
Posts: 31
Credit: 44,320
RAC: 0
Message 3410 - Posted: 19 May 2016, 20:32:37 UTC - in response to Message 3409.  

OK, I will try to submit jobs from a different place every time.
I will let you know when "fresh" jobs are available.

Thanks

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3411 - Posted: 19 May 2016, 20:38:17 UTC - in response to Message 3410.  

Thanks Leonardo.

Leonardo Cristella
Joined: 4 Mar 16
Posts: 31
Credit: 44,320
RAC: 0
Message 3412 - Posted: 19 May 2016, 21:54:16 UTC - in response to Message 3411.  

From Condor JobID 316094 onwards, jobs are submitted from a different location every 15 minutes.
Let's see if that solves the problem.
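
To check whether the change helps, duplicates could be counted only among jobs at or above that Condor JobID. A small sketch, assuming Condor JobIDs increase with submission order; `duplicates_since` is a hypothetical helper, and the sample IDs above the threshold are invented purely for illustration.

```python
FIRST_NEW_CONDOR_ID = 316094  # threshold quoted in the post above

def duplicates_since(pairs, first_condor_id=FIRST_NEW_CONDOR_ID):
    """Return MCPlots JobIDs run more than once among jobs at or above the threshold."""
    counts = {}
    for condor_id, mcplots_id in pairs:
        if condor_id >= first_condor_id:
            counts[mcplots_id] = counts.get(mcplots_id, 0) + 1
    return sorted(m for m, n in counts.items() if n > 1)

# (condor_id, mcplots_id) pairs. The first pair is from the thread;
# the other three are made-up values, not real job IDs.
pairs = [
    (314973, 30484603),
    (316094, 40000001),
    (316120, 40000002),
    (316151, 40000001),
]
print(duplicates_since(pairs))  # [40000001]
```

An empty list for real logs collected after the change would suggest the duplication has stopped, at least on that machine.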