Message boards : Theory Application : Endless Theory job
Author | Message |
---|---|
Joined: 13 Feb 15 Posts: 1188 Credit: 878,593 RAC: 72 |
This job appears to be mixing two actions together: That's the same job I reported yesterday with the exception messages. What you have shown looks fine so far. The only useless action is the "updating display" during the initialization and optimization phases of a Sherpa job, because histograms can only be created once a few thousand events have been processed. But later on, that job starts looping after 32 identical exception messages. |
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 0 |
Challenge is running Sherpa tasks; when they finish correctly they take about 8,500 seconds, roughly two and a half hours. I saw this today on two of my own PCs. Other tasks under Challenge (for example Herwig++) need up to half an hour. |
Joined: 13 Feb 15 Posts: 1188 Credit: 878,593 RAC: 72 |
Challenge is running Sherpa-Tasks, The duration of a Sherpa job depends entirely on the parameters used. In the past I've even seen Sherpa jobs run for longer than 24 hours when nevts=100000. A few hours ago I had a looping Sherpa job again in a Challenge 2015 VM: ===> [runRivet] Wed May 18 14:20:15 CEST 2016 [boinc ee zhad 133 - - sherpa 1.2.3 default 100000 220] |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Is there any progress on the issue of the same job being processed multiple times? It would really be nice if the admin could, every once in a while, actually say something. Just a few words, so that we know the admin is still alive. We talked about improving communication a while back. It did not help. |
Joined: 4 Mar 16 Posts: 31 Credit: 44,320 RAC: 0 |
We have disabled automatic resubmission of Condor jobs for the moment. We are not sure it was the cause of the duplicated jobs, so it would be good to hear back from you. Many thanks. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
The very first two tasks I started are running identical jobs: 0===> [runRivet] Thu May 19 15:13:54 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.212 default-CD 100000 268] 4===> [runRivet] Thu May 19 15:13:54 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.212 default-CD 100000 268] Output of the job wrapper may appear here. Output of the job wrapper may appear here. |
Joined: 4 Mar 16 Posts: 31 Credit: 44,320 RAC: 0 |
So it looks like duplicate jobs appear only in different tasks, never in the same task. Can you please retrieve the MC-plot ID number and the Condor job ID for them? Do you see a "jobdata" file in the job execution dir? |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
So it looks like duplicate jobs appear only in different tasks, never in the same task. Incorrect. The second job in each task is identical to the one before it: 0===> [runRivet] Thu May 19 15:13:54 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.212 default-CD 100000 268] Output of the job wrapper may appear here. Output of the job wrapper may appear here. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I finally got a different job. 0===> [runRivet] Thu May 19 15:13:54 CEST 2016 [boinc ppbar z 1960 -,-,50,120 - pythia8 8.212 default-CD 100000 268] The first number in each row is a slot number to tell the tasks apart. |
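For volunteers comparing slots by hand, a minimal Python sketch could flag repeated jobs automatically. It assumes the header format quoted in this thread (slot number, `===> [runRivet]`, timestamp, bracketed job descriptor). The function names `find_duplicates` and `repeat_fraction` are hypothetical and not part of any LHC@home tool. Two headers count as the same job only when the bracketed descriptor matches exactly; since the "runid" and "seed" values do not appear in these lines, this cannot distinguish a true duplicate from a rerun with a different seed.

```python
import re
from collections import defaultdict

# Assumed header format, as quoted in this thread:
# 0===> [runRivet] Thu May 19 15:13:54 CEST 2016 [boinc ppbar z 1960 ... 268]
# Group 1: slot number, group 2: timestamp, group 3: job descriptor.
HEADER = re.compile(r"(\d+)===> \[runRivet\] (.+?) \[(.+?)\]")


def find_duplicates(log_text):
    """Map each job descriptor seen more than once to its (slot, timestamp) pairs."""
    seen = defaultdict(list)
    for slot, stamp, descriptor in HEADER.findall(log_text):
        seen[descriptor].append((int(slot), stamp))
    # Keep only descriptors that appeared in more than one header line.
    return {d: runs for d, runs in seen.items() if len(runs) > 1}


def repeat_fraction(log_text):
    """Fraction of job headers whose descriptor already appeared earlier."""
    descriptors = [m[2] for m in HEADER.findall(log_text)]
    if not descriptors:
        return 0.0
    repeats = len(descriptors) - len(set(descriptors))
    return repeats / len(descriptors)
```

Pasting the "show graphics" text from several slots into one string and calling `find_duplicates` on it lists every descriptor that ran more than once, together with the slots it ran in. Matching on the whole descriptor keeps the check simple but strict: any change in generator, version, tune, or event count counts as a different job.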
Joined: 4 Mar 16 Posts: 31 Credit: 44,320 RAC: 0 |
OK, thanks, we are investigating with Condor experts. If you can find the "jobdata" file in the job execution dir, it would be helpful to know the "runid" and "seed" values listed in its first lines. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Do you mean running.log? |
Joined: 4 Mar 16 Posts: 31 Credit: 44,320 RAC: 0 |
I don't know if "runid" and "seed" are written in the running.log, but they are definitely in "jobdata". Wherever you find them is fine. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
There is no such word in any of the "show graphics" log files. Do you mean this: Unpack data histograms... dataFiles = /cvmfs/sft.cern.ch/lcg/external/MCGenerators_lcgcmt65/rivet/2.4.0/x86_64-slc6-gcc47-opt/share/Rivet/D0_2010_S8671338.yoda /cvmfs/sft.cern.ch/lcg/external/MCGenerators_lcgcmt65/rivet/2.4.0/x86_64-slc6-gcc47-opt/share/Rivet/D0_2010_S8821313.yoda output = /var/lib/condor/execute/dir_4349/tmp/tmp.5sRZTQbz2H/flat make: Entering directory `/var/lib/condor/execute/dir_4349/rivetvm' g++ yoda2flat-split.cc -o yoda2flat-split.exe -Wfatal-errors -Wl,-rpath /cvmfs/sft.cern.ch/lcg/external/MCGenerators_lcgcmt65/yoda/1.5.5/x86_64-slc6-gcc47-opt/lib `/cvmfs/sft.cern.ch/lcg/external/MCGenerators_lcgcmt65/yoda/1.5.5/x86_64-slc6-gcc47-opt/bin/yoda-config --cppflags --libs` make: Leaving directory `/var/lib/condor/execute/dir_4349/rivetvm' |
Joined: 4 Mar 16 Posts: 31 Credit: 44,320 RAC: 0 |
The "jobdata" file should be in /var/lib/condor/execute/dir_4349/, according to your output. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
You should know that, as a volunteer, I cannot get into the VirtualBox VM. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
After a one job break the same task gets the same job again: Output of the job wrapper may appear here. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
After running a number of jobs: more than 90% of all new jobs are repeats. That is within one computer alone. I do not know how many times the same job has been run on other computers. |
Joined: 4 Mar 16 Posts: 31 Credit: 44,320 RAC: 0 |
OK, I will try to submit jobs from a different place each time. I will let you know when "fresh" jobs are available. Thanks. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks Leonardo. |
Joined: 4 Mar 16 Posts: 31 Credit: 44,320 RAC: 0 |
From Condor JobID 316094 onwards, jobs are submitted from a different location every 15 minutes. Let's see if that solves the problem. |
©2025 CERN