Message boards : Number crunching : Expect errors eventually

Rasputin42 (Volunteer tester)
Message 1559 - Posted: 5 Jan 2016, 9:52:02 UTC

Here is one:
Job 3756, time about 07:15 UTC (08:15 local time).

Another job was started successfully about 2 minutes later.

cmsRun-stderr.log:

Beginning CMSSW wrapper script
slc6_amd64_gcc472 scramv1 CMSSW
Performing SCRAM setup...
Completed SCRAM setup
Retrieving SCRAM project...
Untarring /home/boinc/CMSRun/glide_IZnvNB/execute/dir_11108/sandbox.tar.gz
Completed SCRAM project
Executing CMSSW
cmsRun -j FrameworkJobReport.xml PSet.py

Here is starter.log:

01/05/16 05:20:19 (pid:3602) Create_Process succeeded, pid=3608
01/05/16 08:14:54 (pid:3602) Process exited, pid=3608, status=0
01/05/16 08:14:56 (pid:3602) Got SIGQUIT. Performing fast shutdown.
01/05/16 08:14:56 (pid:3602) ShutdownFast all jobs.
01/05/16 08:14:56 (pid:3602) **** condor_starter (condor_STARTER) pid 3602 EXITING WITH STATUS 0
01/05/16 08:14:57 (pid:11108) ******************************************************
01/05/16 08:14:57 (pid:11108) ** condor_starter (CONDOR_STARTER) STARTING UP
01/05/16 08:14:57 (pid:11108) ** /home/boinc/CMSRun/glide_IZnvNB/main/condor/sbin/condor_starter
01/05/16 08:14:57 (pid:11108) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
01/05/16 08:14:57 (pid:11108) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
01/05/16 08:14:57 (pid:11108) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
01/05/16 08:14:57 (pid:11108) ** $CondorPlatform: x86_64_RedHat5 $
01/05/16 08:14:57 (pid:11108) ** PID = 11108
01/05/16 08:14:57 (pid:11108) ** Log last touched 1/5 08:14:56
01/05/16 08:14:57 (pid:11108) ******************************************************
01/05/16 08:14:57 (pid:11108) Using config source: /home/boinc/CMSRun/glide_IZnvNB/condor_config
01/05/16 08:14:57 (pid:11108) config Macros = 215, Sorted = 215, StringBytes = 10700, TablesBytes = 7780
01/05/16 08:14:57 (pid:11108) CLASSAD_CACHING is OFF
01/05/16 08:14:57 (pid:11108) Daemon Log is logging: D_ALWAYS D_ERROR
01/05/16 08:14:57 (pid:11108) DaemonCore: command socket at <10.0.2.15:46929?noUDP>
01/05/16 08:14:57 (pid:11108) DaemonCore: private command socket at <10.0.2.15:46929>
01/05/16 08:14:58 (pid:11108) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9619 as ccbid 130.246.180.120:9619#83347
01/05/16 08:14:58 (pid:11108) Communicating with shadow <130.246.180.120:9818?noUDP&sock=2048_6ed1_78246>
01/05/16 08:14:58 (pid:11108) Submitting machine is "lcggwms02.gridpp.rl.ac.uk"
01/05/16 08:14:58 (pid:11108) setting the orig job name in starter
01/05/16 08:14:58 (pid:11108) setting the orig job iwd in starter
01/05/16 08:14:58 (pid:11108) Chirp config summary: IO false, Updates false, Delayed updates true.
01/05/16 08:14:58 (pid:11108) Initialized IO Proxy.
01/05/16 08:14:58 (pid:11108) Done setting resource limits
01/05/16 08:14:58 (pid:11108) FILETRANSFER: "/home/boinc/CMSRun/glide_IZnvNB/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
01/05/16 08:14:58 (pid:11108) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_IZnvNB/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_IZnvNB/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
01/05/16 08:15:02 (pid:11108) File transfer completed successfully.
01/05/16 08:15:02 (pid:11108) Job 966446.0 set to execute immediately
01/05/16 08:15:02 (pid:11108) Starting a VANILLA universe job with ID: 966446.0
01/05/16 08:15:02 (pid:11108) IWD: /home/boinc/CMSRun/glide_IZnvNB/execute/dir_11108
01/05/16 08:15:02 (pid:11108) Output file: /home/boinc/CMSRun/glide_IZnvNB/execute/dir_11108/_condor_stdout
01/05/16 08:15:02 (pid:11108) Error file: /home/boinc/CMSRun/glide_IZnvNB/execute/dir_11108/_condor_stderr
01/05/16 08:15:02 (pid:11108) Renice expr "0" evaluated to 0
01/05/16 08:15:02 (pid:11108) Using wrapper /home/boinc/CMSRun/glide_IZnvNB/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_IZnvNB/execute/dir_11108/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=3756 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_3756.txt --runAndLumis=job_lumis_3756.json --lheInputFiles=False --firstEvent=1126501 --firstLumi=11266 --lastEvent=1126801 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {}
01/05/16 08:15:02 (pid:11108) Running job as user (null)
01/05/16 08:15:02 (pid:11108) Create_Process succeeded, pid=11133
01/05/16 08:15:04 (pid:11108) Got SIGTERM. Performing graceful shutdown.
01/05/16 08:15:04 (pid:11108) ShutdownGraceful all jobs.
01/05/16 08:17:04 (pid:11108) ShutdownFast all jobs.
01/05/16 08:17:04 (pid:11108) Process exited, pid=11133, signal=9
01/05/16 08:17:04 (pid:11108) Last process exited, now Starter is exiting
01/05/16 08:17:04 (pid:11108) **** condor_starter (condor_STARTER) pid 11108 EXITING WITH STATUS 0
01/05/16 08:17:06 (pid:11516) ******************************************************
01/05/16 08:17:06 (pid:11516) ** condor_starter (CONDOR_STARTER) STARTING UP
01/05/16 08:17:06 (pid:11516) ** /home/boinc/CMSRun/glide_IZnvNB/main/condor/sbin/condor_starter
01/05/16 08:17:06 (pid:11516) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
01/05/16 08:17:06 (pid:11516) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
01/05/16 08:17:06 (pid:11516) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
01/05/16 08:17:06 (pid:11516) ** $CondorPlatform: x86_64_RedHat5 $
01/05/16 08:17:06 (pid:11516) ** PID = 11516
01/05/16 08:17:06 (pid:11516) ** Log last touched 1/5 08:17:04
01/05/16 08:17:06 (pid:11516) ******************************************************
01/05/16 08:17:06 (pid:11516) Using config source: /home/boinc/CMSRun/glide_IZnvNB/condor_config
01/05/16 08:17:06 (pid:11516) config Macros = 215, Sorted = 215, StringBytes = 10700, TablesBytes = 7780
01/05/16 08:17:06 (pid:11516) CLASSAD_CACHING is OFF
01/05/16 08:17:06 (pid:11516) Daemon Log is logging: D_ALWAYS D_ERROR
01/05/16 08:17:06 (pid:11516) DaemonCore: command socket at <10.0.2.15:38509?noUDP>
01/05/16 08:17:06 (pid:11516) DaemonCore: private command socket at <10.0.2.15:38509>
01/05/16 08:17:06 (pid:11516) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9619 as ccbid 130.246.180.120:9619#83348
01/05/16 08:17:06 (pid:11516) Communicating with shadow <130.246.180.120:9818?noUDP&sock=2048_6ed1_78249>
01/05/16 08:17:06 (pid:11516) Submitting machine is "lcggwms02.gridpp.rl.ac.uk"
01/05/16 08:17:06 (pid:11516) setting the orig job name in starter
01/05/16 08:17:06 (pid:11516) setting the orig job iwd in starter
01/05/16 08:17:06 (pid:11516) Chirp config summary: IO false, Updates false, Delayed updates true.
01/05/16 08:17:06 (pid:11516) Initialized IO Proxy.
01/05/16 08:17:06 (pid:11516) Done setting resource limits
01/05/16 08:17:06 (pid:11516) FILETRANSFER: "/home/boinc/CMSRun/glide_IZnvNB/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
01/05/16 08:17:06 (pid:11516) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_IZnvNB/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_IZnvNB/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
01/05/16 08:17:12 (pid:11516) File transfer completed successfully.
01/05/16 08:17:12 (pid:11516) Job 966449.0 set to execute immediately
01/05/16 08:17:12 (pid:11516) Starting a VANILLA universe job with ID: 966449.0
01/05/16 08:17:12 (pid:11516) IWD: /home/boinc/CMSRun/glide_IZnvNB/execute/dir_11516
01/05/16 08:17:12 (pid:11516) Output file: /home/boinc/CMSRun/glide_IZnvNB/execute/dir_11516/_condor_stdout
01/05/16 08:17:12 (pid:11516) Error file: /home/boinc/CMSRun/glide_IZnvNB/execute/dir_11516/_condor_stderr
01/05/16 08:17:12 (pid:11516) Renice expr "0" evaluated to 0
01/05/16 08:17:12 (pid:11516) Using wrapper /home/boinc/CMSRun/glide_IZnvNB/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_IZnvNB/execute/dir_11516/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=3761 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_3761.txt --runAndLumis=job_lumis_3761.json --lheInputFiles=False --firstEvent=1128001 --firstLumi=11281 --lastEvent=1128301 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {}
01/05/16 08:17:12 (pid:11516) Running job as user (null)
01/05/16 08:17:12 (pid:11516) Create_Process succeeded, pid=11523
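What the starter.log shows, incidentally: the first job (pid 3608) ran from 05:20:19 to 08:14:54 and exited with status 0, while its replacement (pid 11133) was created at 08:15:02, got SIGTERM two seconds later, and was killed with signal 9 at 08:17:04 once the graceful-shutdown window expired. A minimal sketch for pulling such lifetimes out of a starter.log (assuming exactly the line format above; the file path is an assumption):

import re
from datetime import datetime

# Pair "Create_Process succeeded, pid=N" lines with the matching
# "Process exited, pid=N, ..." lines to compute each job's lifetime.
line_re = re.compile(r"(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:\d+\) (.*)")
started = {}

with open("starter.log") as log:  # path is an assumption
    for raw in log:
        m = line_re.match(raw.strip())
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
        msg = m.group(2)
        c = re.search(r"Create_Process succeeded, pid=(\d+)", msg)
        e = re.search(r"Process exited, pid=(\d+), (status=-?\d+|signal=\d+)", msg)
        if c:
            started[c.group(1)] = ts
        elif e and e.group(1) in started:
            lifetime = ts - started.pop(e.group(1))
            print("pid %s ran for %s (%s)" % (e.group(1), lifetime, e.group(2)))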
Rasputin42 (Volunteer tester)
Message 1560 - Posted: 5 Jan 2016, 10:47:17 UTC

Yet another "runaway".
Exit code 8022, 5+ fails in a row.
ivan (Project administrator)
Message 1561 - Posted: 5 Jan 2016, 17:43:21 UTC - in response to Message 1560.  

Yet another "runaway".
Exit code 8022, 5+ fails in a row.

It's getting critical that we find a way of automatically detecting these hosts and notifying their owners. This looks like another one that was switched off for a few days.
Rasputin42 (Volunteer tester)
Message 1562 - Posted: 5 Jan 2016, 18:11:11 UTC

Is there a way for the server to block a host after, let's say, 3 consecutive failures and make the CMS BOINC task error out (or send a PM)?

I know that is not very nice, but I have no idea what the capabilities of the server are.
At least this way the volunteer will know that something went wrong at the BOINC level.
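For illustration only -- the thread doesn't establish what the BOINC server can actually do, and all names below are hypothetical -- the bookkeeping for such a rule would be simple:

# Hypothetical sketch of the suggested rule: track each host's run of
# consecutive failed jobs and flag it once a threshold is crossed
# (error out the task, or PM the owner). Not actual BOINC server code.

MAX_CONSECUTIVE_FAILS = 3
streaks = {}  # host_id -> current run of consecutive failures

def record_result(host_id, succeeded):
    """Return True when the host should be blocked or its owner notified."""
    if succeeded:
        streaks[host_id] = 0
        return False
    streaks[host_id] = streaks.get(host_id, 0) + 1
    return streaks[host_id] >= MAX_CONSECUTIVE_FAILS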
ivan (Project administrator)
Message 1564 - Posted: 6 Jan 2016, 11:08:13 UTC - in response to Message 1562.  

We're having a Cloud Services meeting today, I'm just now preparing slides. I hope I can remember all the problems of the last 4 weeks, but I'm putting in a link to this thread. :-)
Rasputin42 (Volunteer tester)
Message 1567 - Posted: 7 Jan 2016, 18:01:01 UTC

I had a job (1962) fail after 3 minutes.

cmsRun stdout:
Beginning CMSSW wrapper script
slc6_amd64_gcc472 scramv1 CMSSW
Performing SCRAM setup...
Completed SCRAM setup
Retrieving SCRAM project...
Untarring /home/boinc/CMSRun/glide_XdDkEm/execute/dir_11785/sandbox.tar.gz
Completed SCRAM project
Executing CMSSW
cmsRun -j FrameworkJobReport.xml PSet.py

The server lists: "Error returned without specification - Post-processing step failed".
How can it be "post-processing" if the job has not even started?


However, the end of the condor stdout reads:

==== CMSSW JOB Execution started at Thu Jan 7 16:40:35 2016 ====
2016-01-07 16:40:40,172:INFO:CMSSW:User files are
2016-01-07 16:40:40,172:INFO:CMSSW:User sandboxes are sandbox.tar.gz
2016-01-07 16:40:40,173:INFO:CMSSW:CMSSW configured for 1 cores
2016-01-07 16:40:40,173:INFO:CMSSW:Executing CMSSW step
2016-01-07 16:40:40,173:INFO:CMSSW:Runing SCRAM
2016-01-07 16:40:43,599:INFO:CMSSW:Running PRE scripts
2016-01-07 16:40:43,599:INFO:CMSSW:RUNNING SCRAM SCRIPTS
2016-01-07 16:40:43,600:INFO:CMSSW: Invoking command: python /home/boinc/CMSRun/glide_XdDkEm/execute/dir_11785/TweakPSet.py --location=/home/boinc/CMSRun/glide_XdDkEm/execute/dir_11785 --inputFile='job_input_file_list_1962.txt' --runAndLumis='job_lumis_1962.json' --firstEvent=588301 --lastEvent=588601 --firstLumi=5884 --firstRun=1 --seeding=AutomaticSeeding --lheInputFiles=False --oneEventMode=0 --eventsPerLumi=100

2016-01-07 16:40:48,747:INFO:CMSSW:Executing CMSSW. args: ['/bin/bash', '/home/boinc/CMSRun/glide_XdDkEm/execute/dir_11785/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', '']
/home/boinc/CMSRun/glide_XdDkEm/execute/dir_11785/condor_exec.exe: line 12: 12033 Killed sh ./CMSRunAnalysis.sh "$@" --oneEventMode=$CRAB_oneEventMode
./CMSRunAnalysis.sh: line 159: 12025 Killed python CMSRunAnalysis.py -r "`pwd`" "$@"
+ jobrc=137
+ set +x
== The job had an exit code of 137


This time it has an exit code that was not reported to the server.
If you need more info, please tell me.
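For reference, a shell exit code above 128 means the process died on a signal (the code minus 128), so 137 = 128 + 9, i.e. SIGKILL -- consistent with the two "Killed" lines above. In Python (3.5+):

import signal

code = 137
if code > 128:
    # Exit statuses above 128 encode 128 + the fatal signal's number.
    print("killed by", signal.Signals(code - 128).name)  # -> killed by SIGKILL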
Rasputin42 (Volunteer tester)
Message 1576 - Posted: 10 Jan 2016, 16:26:54 UTC

I have a job (1711), which continues to run.
However, Dashboard lists it as having failed (60311).

Why does it keep running?
ivan (Project administrator)
Message 1577 - Posted: 10 Jan 2016, 17:28:10 UTC - in response to Message 1576.  

I have a job (1711), which continues to run.
However, Dashboard lists it as having failed (60311).

Why does it keep running?

I don't have any logs back from it yet -- it may be in post-proc. Dashboard reckons it finished ~90 mins ago. Can you post the last 8 or 10 lines from the cmsRun_stdout file? Is it using any CPU, can you tell?
ivan (Project administrator)
Message 1578 - Posted: 10 Jan 2016, 17:55:34 UTC - in response to Message 1577.  

Well, Dashboard was wrong! It's corrected itself now, though; 1711 is now in "finished" state. The output has reached the Data-bridge and the last lines of my log file are:

====== Sun Jan 10 17:40:18 2016: cmscp.py FINISHING (status 0).
======== Stageout at Sun Jan 10 17:40:18 GMT 2016 FINISHING (short status 0) ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Sun Jan 10 17:40:18 GMT 2016 on 277-617-19427 with (short) status 0 ========
Local time: Sun Jan 10 18:40:18 CET 2016
Short exit status: 0
Job Running time in seconds: 6012

Rasputin42 (Volunteer tester)
Message 1579 - Posted: 10 Jan 2016, 18:00:22 UTC

It just finished normally.
No indication that anything went wrong.

The IP address listed is not mine, but I have had this before without any problem.

I find it rather strange that a job that was not finished seems to have been rerun by another machine, without starting a new attempt.
ivan (Project administrator)
Message 1580 - Posted: 10 Jan 2016, 20:13:00 UTC - in response to Message 1579.  

It just finished normally.
No indication that anything went wrong.

The IP address listed is not mine, but I have had this before without any problem.

I find it rather strange that a job that was not finished seems to have been rerun by another machine, without starting a new attempt.

This turns out to be a known problem that has been seen by other CMS production teams and is under active investigation -- well, at least the retention of the original IP through Condor retries is. (I got some good feedback from the two presentations I gave to the Collaboration last week; I've got another this week, but that's just to the UK members, so there won't be much technical feedback there.) I will have to do some digging to find out why Condor sometimes "loses" a job and reassigns it well within the 12-hour time-out I have set (the default was 20 minutes, so perhaps I'm setting the wrong parameter).
Rasputin42 (Volunteer tester)
Message 1581 - Posted: 10 Jan 2016, 20:33:52 UTC - in response to Message 1580.  

Thanks for the info, Ivan.
There are 10 out of 48 results that did not have my IP address, so roughly 20%.
All finished properly.
ivan (Project administrator)
Message 1582 - Posted: 10 Jan 2016, 21:11:24 UTC - in response to Message 1581.  

Thanks for the info, Ivan.
There are 10 out of 48 results that did not have my IP address, so roughly 20%.
All finished properly.

Good. It turns out that I had the wrong impression of how Condor handles retries. I thought every job error led to a retry, up to a maximum of three tries: I would see a dummy log file job_out.nnnn.0.txt until the job ran, when it would be replaced by the actual log. On a failure, another dummy job_out.nnnn.1.txt would be created and later replaced by the actual log; after a further failure a dummy job_out.nnnn.2.txt would be created and replaced in turn, with no more retries after that third failure.
However, if there's a time-out, no log file is returned; Condor just requeues the job, and when output finally arrives it overwrites the current dummy for that job. So jobs can time out several times -- e.g. when a volunteer switches the host off for too long, or when a job is still running as the 24-hour task limit expires -- and these retries aren't seen (in my logs at least), though some of the info does get relayed to Dashboard, leading to confusion...
Now, see my next post for a partial easing of this problem!
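As a minimal sketch of spotting jobs whose latest attempt is still a dummy, assuming the job_out.nnnn.N.txt naming just described (treating a near-empty file as a dummy is a heuristic, not something confirmed here):

import glob
import os
import re

# Group job_out.<job>.<retry>.txt files by job number and report whether
# the highest-numbered attempt still looks like an unfilled dummy.
attempts = {}
for path in glob.glob("job_out.*.txt"):
    m = re.match(r"job_out\.(\d+)\.(\d+)\.txt$", os.path.basename(path))
    if m:
        attempts.setdefault(int(m.group(1)), []).append((int(m.group(2)), path))

for job, files in sorted(attempts.items()):
    retry, latest = max(files)
    dummy = os.path.getsize(latest) < 100  # size threshold is an assumption
    print("job %d, attempt %d: %s" % (job, retry, "dummy" if dummy else "real log"))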
ivan (Project administrator)
Message 1583 - Posted: 10 Jan 2016, 21:22:11 UTC - in response to Message 1580.  

I will have to do some digging to find out why Condor sometimes "loses" a job and reassigns it well within the 12-hour time-out I have set (the default was 20 minutes, so perhaps I'm setting the wrong parameter).

Oops! Mea culpa, it would seem. I've been setting a Condor parameter LeaseDuration to 12 hours, to extend the time before Condor decides a job is lost. But I got it wrong: it should be JobLeaseDuration! It seems Condor lets you set anything (much like an environment variable in Linux or Windows) without it necessarily being a valid name... I'll have to go back to my email correspondence with Andrew from about six months ago to see where I got it wrong; it could just be my failing eyesight. (The default was two hours, by the way, so my memory is failing too...)

So, let's see what effect correcting that has on job efficiencies, wrong IP attributions, etc.
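For the record, on the submit side the knob is job_lease_duration (in seconds), which sets the JobLeaseDuration attribute on the job ad; 12 hours is 43200. A quick sanity check of what lease queued jobs actually carry -- a sketch assuming the htcondor Python bindings are available on the submit node:

import htcondor  # HTCondor Python bindings

schedd = htcondor.Schedd()
# A mis-spelled knob (e.g. LeaseDuration) would leave JobLeaseDuration
# at its default instead of the intended 43200 s.
for ad in schedd.query("true", ["ClusterId", "ProcId", "JobLeaseDuration"]):
    print(ad.get("ClusterId"), ad.get("ProcId"), ad.get("JobLeaseDuration"))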
Rasputin42 (Volunteer tester)
Message 1584 - Posted: 10 Jan 2016, 22:05:45 UTC

Please let us know when you change the parameter.
ivan (Project administrator)
Message 1585 - Posted: 10 Jan 2016, 22:29:45 UTC - in response to Message 1584.  
Last modified: 10 Jan 2016, 22:30:48 UTC

Please let us know when you change the parameter.

I have changed it already (2100 GMT); let us know if you see any significant change in behaviour.
Rasputin42 (Volunteer tester)
Message 1586 - Posted: 10 Jan 2016, 22:32:56 UTC - in response to Message 1585.  
Last modified: 10 Jan 2016, 22:39:53 UTC

I can see a failed job, code 60311.
Until now, such a job disappeared fairly quickly.
Now it should stay for 12 hours, right?

Some failed jobs do not get reissued after the 1st attempt.
Is that because there is an unrecoverable error in the job itself?
ivan (Project administrator)
Message 1587 - Posted: 10 Jan 2016, 22:53:28 UTC - in response to Message 1586.  

I can see a failed job, code 60311.
Until now, such a job disappeared fairly quickly.
Now it should stay for 12 hours, right?

Job 1824 on IP 75.117.107.58 -- Yes, I think so, and it might get picked up again by the original host; I can see if IPs match when it finally completes. Let's wait and see. I could try to force an instance myself on my Linux box tomorrow, but I'm going to be rather busy this week.

Some failed jobs do not get reissued after the 1st attempt.
Is that because there is an unrecoverable error in the job itself?

Something like that, but I haven't been able to get a handle on it myself. Something to keep in mind next time I'm talking to an HTCondor expert.
Rasputin42 (Volunteer tester)
Message 1588 - Posted: 10 Jan 2016, 23:00:35 UTC

I can see a failed job, code 60311.
Until now, such a job disappeared fairly quickly.
Now it should stay for 12 hours, right?


Apparently not. It got reissued (to the same host?).
ivan (Project administrator)
Message 1589 - Posted: 10 Jan 2016, 23:20:49 UTC - in response to Message 1588.  

I can see a failed job, code 60311.
Until now, such a job disappeared fairly quickly.
Now it should stay for 12 hours, right?

Apparently not. It got reissued (to the same host?).

Hmm, can't tell. Condor still has it with the dummy job_out.nnnn.0.txt, and node_state.txt has RetryCount = 0;.