Message boards :
Number crunching :
Expect errors eventually
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Here is one: Job 3756. Time: about 7.15 UTC (8.15 local time). Another job was started successfully about 2 minutes later.

cmsRun-stderr.log:
Beginning CMSSW wrapper script
slc6_amd64_gcc472 scramv1 CMSSW
Performing SCRAM setup...
Completed SCRAM setup
Retrieving SCRAM project...
Untarring /home/boinc/CMSRun/glide_IZnvNB/execute/dir_11108/sandbox.tar.gz
Completed SCRAM project
Executing CMSSW
cmsRun -j FrameworkJobReport.xml PSet.py

Here is starter.log:
01/05/16 05:20:19 (pid:3602) Create_Process succeeded, pid=3608
01/05/16 08:14:54 (pid:3602) Process exited, pid=3608, status=0
01/05/16 08:14:56 (pid:3602) Got SIGQUIT. Performing fast shutdown.
01/05/16 08:14:56 (pid:3602) ShutdownFast all jobs.
01/05/16 08:14:56 (pid:3602) **** condor_starter (condor_STARTER) pid 3602 EXITING WITH STATUS 0
01/05/16 08:14:57 (pid:11108) ******************************************************
01/05/16 08:14:57 (pid:11108) ** condor_starter (CONDOR_STARTER) STARTING UP
01/05/16 08:14:57 (pid:11108) ** /home/boinc/CMSRun/glide_IZnvNB/main/condor/sbin/condor_starter
01/05/16 08:14:57 (pid:11108) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
01/05/16 08:14:57 (pid:11108) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
01/05/16 08:14:57 (pid:11108) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
01/05/16 08:14:57 (pid:11108) ** $CondorPlatform: x86_64_RedHat5 $
01/05/16 08:14:57 (pid:11108) ** PID = 11108
01/05/16 08:14:57 (pid:11108) ** Log last touched 1/5 08:14:56
01/05/16 08:14:57 (pid:11108) ******************************************************
01/05/16 08:14:57 (pid:11108) Using config source: /home/boinc/CMSRun/glide_IZnvNB/condor_config
01/05/16 08:14:57 (pid:11108) config Macros = 215, Sorted = 215, StringBytes = 10700, TablesBytes = 7780
01/05/16 08:14:57 (pid:11108) CLASSAD_CACHING is OFF
01/05/16 08:14:57 (pid:11108) Daemon Log is logging: D_ALWAYS D_ERROR
01/05/16 08:14:57 (pid:11108) DaemonCore: command socket at <10.0.2.15:46929?noUDP>
01/05/16 08:14:57 (pid:11108) DaemonCore: private command socket at <10.0.2.15:46929>
01/05/16 08:14:58 (pid:11108) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9619 as ccbid 130.246.180.120:9619#83347
01/05/16 08:14:58 (pid:11108) Communicating with shadow <130.246.180.120:9818?noUDP&sock=2048_6ed1_78246>
01/05/16 08:14:58 (pid:11108) Submitting machine is "lcggwms02.gridpp.rl.ac.uk"
01/05/16 08:14:58 (pid:11108) setting the orig job name in starter
01/05/16 08:14:58 (pid:11108) setting the orig job iwd in starter
01/05/16 08:14:58 (pid:11108) Chirp config summary: IO false, Updates false, Delayed updates true.
01/05/16 08:14:58 (pid:11108) Initialized IO Proxy.
01/05/16 08:14:58 (pid:11108) Done setting resource limits
01/05/16 08:14:58 (pid:11108) FILETRANSFER: "/home/boinc/CMSRun/glide_IZnvNB/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
01/05/16 08:14:58 (pid:11108) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_IZnvNB/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_IZnvNB/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
01/05/16 08:15:02 (pid:11108) File transfer completed successfully.
01/05/16 08:15:02 (pid:11108) Job 966446.0 set to execute immediately
01/05/16 08:15:02 (pid:11108) Starting a VANILLA universe job with ID: 966446.0
01/05/16 08:15:02 (pid:11108) IWD: /home/boinc/CMSRun/glide_IZnvNB/execute/dir_11108
01/05/16 08:15:02 (pid:11108) Output file: /home/boinc/CMSRun/glide_IZnvNB/execute/dir_11108/_condor_stdout
01/05/16 08:15:02 (pid:11108) Error file: /home/boinc/CMSRun/glide_IZnvNB/execute/dir_11108/_condor_stderr
01/05/16 08:15:02 (pid:11108) Renice expr "0" evaluated to 0
01/05/16 08:15:02 (pid:11108) Using wrapper /home/boinc/CMSRun/glide_IZnvNB/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_IZnvNB/execute/dir_11108/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=3756 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_3756.txt --runAndLumis=job_lumis_3756.json --lheInputFiles=False --firstEvent=1126501 --firstLumi=11266 --lastEvent=1126801 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {}
01/05/16 08:15:02 (pid:11108) Running job as user (null)
01/05/16 08:15:02 (pid:11108) Create_Process succeeded, pid=11133
01/05/16 08:15:04 (pid:11108) Got SIGTERM. Performing graceful shutdown.
01/05/16 08:15:04 (pid:11108) ShutdownGraceful all jobs.
01/05/16 08:17:04 (pid:11108) ShutdownFast all jobs.
01/05/16 08:17:04 (pid:11108) Process exited, pid=11133, signal=9
01/05/16 08:17:04 (pid:11108) Last process exited, now Starter is exiting
01/05/16 08:17:04 (pid:11108) **** condor_starter (condor_STARTER) pid 11108 EXITING WITH STATUS 0
01/05/16 08:17:06 (pid:11516) ******************************************************
01/05/16 08:17:06 (pid:11516) ** condor_starter (CONDOR_STARTER) STARTING UP
01/05/16 08:17:06 (pid:11516) ** /home/boinc/CMSRun/glide_IZnvNB/main/condor/sbin/condor_starter
01/05/16 08:17:06 (pid:11516) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
01/05/16 08:17:06 (pid:11516) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
01/05/16 08:17:06 (pid:11516) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
01/05/16 08:17:06 (pid:11516) ** $CondorPlatform: x86_64_RedHat5 $
01/05/16 08:17:06 (pid:11516) ** PID = 11516
01/05/16 08:17:06 (pid:11516) ** Log last touched 1/5 08:17:04
01/05/16 08:17:06 (pid:11516) ******************************************************
01/05/16 08:17:06 (pid:11516) Using config source: /home/boinc/CMSRun/glide_IZnvNB/condor_config
01/05/16 08:17:06 (pid:11516) config Macros = 215, Sorted = 215, StringBytes = 10700, TablesBytes = 7780
01/05/16 08:17:06 (pid:11516) CLASSAD_CACHING is OFF
01/05/16 08:17:06 (pid:11516) Daemon Log is logging: D_ALWAYS D_ERROR
01/05/16 08:17:06 (pid:11516) DaemonCore: command socket at <10.0.2.15:38509?noUDP>
01/05/16 08:17:06 (pid:11516) DaemonCore: private command socket at <10.0.2.15:38509>
01/05/16 08:17:06 (pid:11516) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9619 as ccbid 130.246.180.120:9619#83348
01/05/16 08:17:06 (pid:11516) Communicating with shadow <130.246.180.120:9818?noUDP&sock=2048_6ed1_78249>
01/05/16 08:17:06 (pid:11516) Submitting machine is "lcggwms02.gridpp.rl.ac.uk"
01/05/16 08:17:06 (pid:11516) setting the orig job name in starter
01/05/16 08:17:06 (pid:11516) setting the orig job iwd in starter
01/05/16 08:17:06 (pid:11516) Chirp config summary: IO false, Updates false, Delayed updates true.
01/05/16 08:17:06 (pid:11516) Initialized IO Proxy.
01/05/16 08:17:06 (pid:11516) Done setting resource limits
01/05/16 08:17:06 (pid:11516) FILETRANSFER: "/home/boinc/CMSRun/glide_IZnvNB/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
01/05/16 08:17:06 (pid:11516) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_IZnvNB/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_IZnvNB/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
01/05/16 08:17:12 (pid:11516) File transfer completed successfully.
01/05/16 08:17:12 (pid:11516) Job 966449.0 set to execute immediately
01/05/16 08:17:12 (pid:11516) Starting a VANILLA universe job with ID: 966449.0
01/05/16 08:17:12 (pid:11516) IWD: /home/boinc/CMSRun/glide_IZnvNB/execute/dir_11516
01/05/16 08:17:12 (pid:11516) Output file: /home/boinc/CMSRun/glide_IZnvNB/execute/dir_11516/_condor_stdout
01/05/16 08:17:12 (pid:11516) Error file: /home/boinc/CMSRun/glide_IZnvNB/execute/dir_11516/_condor_stderr
01/05/16 08:17:12 (pid:11516) Renice expr "0" evaluated to 0
01/05/16 08:17:12 (pid:11516) Using wrapper /home/boinc/CMSRun/glide_IZnvNB/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_IZnvNB/execute/dir_11516/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=3761 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_3761.txt --runAndLumis=job_lumis_3761.json --lheInputFiles=False --firstEvent=1128001 --firstLumi=11281 --lastEvent=1128301 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {}
01/05/16 08:17:12 (pid:11516) Running job as user (null)
01/05/16 08:17:12 (pid:11516) Create_Process succeeded, pid=11523
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Yet another "runaway". Exit code 8022, 5+ fails in a row.
Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,540
Yet another "runaway".

It's getting critical that we find a way of automatically detecting and notifying these hosts. This looks like another one that was switched off for a few days.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Is there a way for the server to block a host after, let's say, 3 consecutive fails, and cause the CMS BOINC task to error out (or send a PM)? I know that is not very nice, but I have no idea what the capabilities of the servers are. At least this way the volunteer will know that something went wrong at the BOINC level.
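The guard being suggested is essentially "block any host whose recent results are all failures". As a rough illustration of the idea only, here is a hypothetical sketch; the data layout, threshold, and function names are all assumptions, not part of the actual BOINC server code:

```python
# Hypothetical sketch of the suggested server-side guard: flag hosts whose
# last N results in a row failed. Not real BOINC server API.
CONSECUTIVE_FAIL_LIMIT = 3

def hosts_to_block(results_by_host):
    """results_by_host maps host id -> list of outcomes, oldest first,
    where each outcome is 'ok' or 'fail'. Returns hosts whose current
    failure streak has reached the limit."""
    blocked = []
    for host, outcomes in results_by_host.items():
        streak = 0
        for outcome in outcomes:
            # a success resets the streak; a failure extends it
            streak = streak + 1 if outcome == "fail" else 0
        if streak >= CONSECUTIVE_FAIL_LIMIT:
            blocked.append(host)
    return blocked

if __name__ == "__main__":
    sample = {
        "host-a": ["ok", "fail", "fail", "fail"],  # 3 fails in a row -> block
        "host-b": ["fail", "ok", "fail"],          # streak broken -> keep
    }
    print(hosts_to_block(sample))  # ['host-a']
```

Whether the real scheduler exposes the hooks to act on such a flag (error out the task, send a PM) is exactly the open question in the post.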
Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,540
We're having a Cloud Services meeting today, I'm just now preparing slides. I hope I can remember all the problems of the last 4 weeks, but I'm putting in a link to this thread. :-)
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
I had a job (1962) fail after 3 minutes.

cmsRun stdout:
Beginning CMSSW wrapper script
slc6_amd64_gcc472 scramv1 CMSSW
Performing SCRAM setup...
Completed SCRAM setup
Retrieving SCRAM project...
Untarring /home/boinc/CMSRun/glide_XdDkEm/execute/dir_11785/sandbox.tar.gz
Completed SCRAM project
Executing CMSSW
cmsRun -j FrameworkJobReport.xml PSet.py

The server lists: "Error returned without specification - Post-processing step failed". How can it be "post-processing" if the job has not even started?

However, the end of condor stdout reads:
==== CMSSW JOB Execution started at Thu Jan 7 16:40:35 2016 ====
2016-01-07 16:40:40,172:INFO:CMSSW:User files are
2016-01-07 16:40:40,172:INFO:CMSSW:User sandboxes are sandbox.tar.gz
2016-01-07 16:40:40,173:INFO:CMSSW:CMSSW configured for 1 cores
2016-01-07 16:40:40,173:INFO:CMSSW:Executing CMSSW step
2016-01-07 16:40:40,173:INFO:CMSSW:Runing SCRAM
2016-01-07 16:40:43,599:INFO:CMSSW:Running PRE scripts
2016-01-07 16:40:43,599:INFO:CMSSW:RUNNING SCRAM SCRIPTS
2016-01-07 16:40:43,600:INFO:CMSSW: Invoking command: python /home/boinc/CMSRun/glide_XdDkEm/execute/dir_11785/TweakPSet.py --location=/home/boinc/CMSRun/glide_XdDkEm/execute/dir_11785 --inputFile='job_input_file_list_1962.txt' --runAndLumis='job_lumis_1962.json' --firstEvent=588301 --lastEvent=588601 --firstLumi=5884 --firstRun=1 --seeding=AutomaticSeeding --lheInputFiles=False --oneEventMode=0 --eventsPerLumi=100
2016-01-07 16:40:48,747:INFO:CMSSW:Executing CMSSW.
args: ['/bin/bash', '/home/boinc/CMSRun/glide_XdDkEm/execute/dir_11785/cmsRun-main.sh', '', 'slc6_amd64_gcc472', 'scramv1', 'CMSSW', 'CMSSW_6_2_0_SLHC26_patch3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', 'sandbox.tar.gz', '', '']
/home/boinc/CMSRun/glide_XdDkEm/execute/dir_11785/condor_exec.exe: line 12: 12033 Killed sh ./CMSRunAnalysis.sh "$@" --oneEventMode=$CRAB_oneEventMode
./CMSRunAnalysis.sh: line 159: 12025 Killed python CMSRunAnalysis.py -r "`pwd`" "$@"
+ jobrc=137
+ set +x
== The job had an exit code of 137

This time it has an exit code that was not reported to the server. If you need more info, please tell me.
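The exit code 137 is consistent with the "Killed" lines in the log: by shell convention a process that dies on a signal exits with 128 + the signal number, and signal 9 is SIGKILL. A quick way to decode it:

```python
import signal

# 137 is the value reported as "The job had an exit code of 137" above
exit_code = 137

# Shell convention: exit codes above 128 mean death by signal (128 + signo)
if exit_code > 128:
    sig = signal.Signals(exit_code - 128)
    print(sig.name)  # SIGKILL -> the process was forcibly killed, it did not crash on its own
```

So the job did not fail internally; something (the starter's fast shutdown, or an out-of-memory killer) sent it SIGKILL.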
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
I have a job (1711), which continues to run. However, Dashboard lists it as having failed (60311). Why does it keep running?
Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,540
I have a job (1711), which continues to run.

I don't have any logs back from it yet -- it may be in post-proc. Dashboard reckons it finished ~90 mins ago. Can you post the last 8 or 10 lines from the cmsRun_stdout file? Is it using any CPU, can you tell?
Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,540
Well, Dashboard was wrong! It's corrected itself now though; 1711 is now in "finished" state. The output has reached the Data-bridge, and the last lines of my log file are:

====== Sun Jan 10 17:40:18 2016: cmscp.py FINISHING (status 0). ========
Stageout at Sun Jan 10 17:40:18 GMT 2016 FINISHING (short status 0) ========
======== gWMS-CMSRunAnalysis.sh FINISHING at Sun Jan 10 17:40:18 GMT 2016 on 277-617-19427 with (short) status 0 ========
Local time: Sun Jan 10 18:40:18 CET 2016
Short exit status: 0
Job Running time in seconds: 6012
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
It just finished normally. No indication that anything went wrong. The IP address listed is not mine, but I have had this before without any problem. I find it rather strange that a job that was not finished seems to be rerun by another machine, without starting a new attempt.
Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,540
It just finished normally.

This turns out to be a known problem that has been seen by other CMS production teams and is under active investigation -- well, at least the retention of the original IP through Condor retries is. (I got some good feedback from the two presentations I gave to the Collaboration last week; I've got another this week, but that's just to the UK members so there won't be much technical feedback there.) I will have to do some digging around to investigate why Condor sometimes "loses" a job and reassigns it well within the 12-hour time-out I have set (the default was 20 minutes, perhaps I'm setting the wrong parameter).
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Thanks for the info, Ivan. There are 10 out of 48 results that did not have my IP address, so roughly 20%. All finished properly.
Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,540
Thanks for the Info, Ivan.

Good. It turns out that I had the wrong impression of how Condor works with its retries. I was under the impression that all job errors led to a retry, up to a maximum of three tries: I would see a dummy log file job_out.nnnn.0.txt until the job ran, when it would be replaced by the actual log. If there was a failure, another dummy job_out.nnnn.1.txt would be created, to be replaced in turn by the actual log; after a further failure a dummy job_out.nnnn.2.txt would be created and replaced by the actual log again, but in this case no further retries occur after a failure. However, if there's a time-out, no log file is returned; Condor just requeues the job, and when output finally arrives it overwrites the current dummy for that job. So we could have jobs time out several times, e.g. when a volunteer switches his host off for too long, or if there's a job running when the 24-hour task limit runs out, and these aren't seen (in my logs at least), but some of the info does get relayed to Dashboard -- leading to confusion... Now, see my next post for a partial easing of this problem!
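The log-file lifecycle described here can be checked mechanically: for each job number nnnn, the highest R among the job_out.nnnn.R.txt files on disk is the number of recorded retries. A small sketch, using the filename pattern from the post (the flat list of filenames is an assumption about the directory layout):

```python
import re
from collections import defaultdict

# Matches job_out.<job>.<retry>.txt, the pattern described in the post
LOG_RE = re.compile(r"job_out\.(\d+)\.(\d+)\.txt$")

def retry_counts(filenames):
    """Map job number -> highest retry index seen among its log files."""
    counts = defaultdict(int)
    for name in filenames:
        m = LOG_RE.search(name)
        if m:
            job, attempt = int(m.group(1)), int(m.group(2))
            counts[job] = max(counts[job], attempt)
    return dict(counts)

print(retry_counts([
    "job_out.1824.0.txt",              # still on its first attempt
    "job_out.1962.0.txt",
    "job_out.1962.1.txt",              # one retry recorded
]))  # {1824: 0, 1962: 1}
```

Of course, per the post, a time-out leaves no new log file at all, so this count only reflects retries that Condor actually logged, not silent requeues.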
Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,540
I will have to do some digging around to investigate why Condor sometimes "loses" a job and reassigns it well within the 12-hour time-out I have set (the default was 20 minutes, perhaps I'm setting the wrong parameter).

Oops! Mea culpa, it would seem. I've been setting a Condor parameter, LeaseDuration, to 12 hours, to extend the time before Condor thinks a job is lost. But I got it wrong: it should be JobLeaseDuration! It seems Condor lets you set anything (much like an environment variable in Linux or Windows) without it necessarily being a valid name... I'll have to go back to my email correspondence with Andrew about six months ago to see where I got it wrong; it could just be my failing eyesight. (The default was two hours, by the way, so my memory is failing too...) So, let's see what effect correcting that has on job efficiencies, wrong IP attributions, etc.
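For reference, the mistake described here might look something like the following in a submit description (an illustrative sketch, not the actual project configuration; the 43200 value is simply 12 hours in seconds). The point is that HTCondor silently accepts an unknown name as a custom job attribute instead of rejecting it:

```
# Wrong: defines a custom attribute that nothing reads, so the default
# lease still applies and the job is declared lost far too early
+LeaseDuration = 43200

# Right: the attribute the shadow/starter actually honour
+JobLeaseDuration = 43200
```

This "anything goes" behaviour is by design (custom attributes are a feature), which is why the typo went unnoticed for months.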
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Please let us know when you change the parameter.
Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,540
Please let us know, when you change the parameter.

I have changed it already (2100 GMT); let us know if you see any significant change in behaviour.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
I can see a failed job, code 60311. Until now, these disappeared fairly quickly; now it should stay for 12 h, right? Some failed jobs do not get reissued after the 1st attempt. Is that because there is an unrecoverable error in the job itself?
Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,540
I can see a failed job code 60311.

Job 1824 on IP 75.117.107.58 -- yes, I think so, and it might get picked up again by the original host; I can see if the IPs match when it finally completes. Let's wait and see. I could try to force an instance myself on my Linux box tomorrow, but I'm going to be rather busy this week.

Some failed jobs do not get reissued after the 1st attempt.

Something like that, but I haven't been able to get a handle on it myself. Something to keep in mind next time I'm talking to an HTCondor expert.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
I can see a failed job code 60311.

Apparently not. It got reissued (to the same host?).
Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,540
I can see a failed job code 60311.

Hmm, can't tell. Condor still has it with the dummy job_out.nnnn.0.txt, and node_state.txt has RetryCount = 0;.