Job retries on same host
Author | Message |
---|---|
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
Received job 412 for its third try here. Dashboard (I know, but it's all we've got...) shows the first two tries with the same IP. Either this is another Dashboard Strike or Condor(?) can send multiple tries to the same host; that surely can't be right. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
"Received job 412 for its third try here. Dashboard (I know, but it's all we've got...) shows the first two tries with the same IP. Either this is another Dashboard Strike or Condor(?) can send multiple tries to the same host; that surely can't be right." The thing here is that Condor doesn't know which host it's sending to. It maintains a queue or pool of jobs and when a glide-in job asks, "Give me a job", it sends the next in line. With only around 60 hosts taking jobs, the chance that you'll pick up a job that you already failed on is pretty high. Note these were jobs that ran to conclusion and failed in stage-out -- the experts will have to seek out the reasons for that. In fact that host/hosts is failing a lot of jobs, I'll send a PM. |
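The pull model described in the reply above can be sketched as a toy queue. This is only an illustration of why a retry can land on the same host, not HTCondor's actual matchmaking; the class `PullQueue`, its methods, and the host name `host-A` are hypothetical.

```python
from collections import deque


class PullQueue:
    """Toy pull-based job queue: hands the next job to whoever asks."""

    def __init__(self, job_ids):
        self.pending = deque(job_ids)
        self.history = {}  # job id -> hosts that have received it, in order

    def request_job(self, host):
        """A worker asks for work; the queue never checks who is asking."""
        if not self.pending:
            return None
        job = self.pending.popleft()
        self.history.setdefault(job, []).append(host)
        return job

    def report_failure(self, job):
        """A failed job goes back on the queue for another try."""
        self.pending.append(job)


# One busy host keeps asking and keeps failing the same job.
queue = PullQueue([412])
for _ in range(3):
    job = queue.request_job("host-A")
    queue.report_failure(job)

print(queue.history[412])
```

Because the queue keeps no record of which host failed which job, nothing stops all three tries of job 412 from going to the same requester.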
Joined: 13 Feb 15 Posts: 1217 Credit: 908,362 RAC: 1,393 |
"Note these were jobs that ran to conclusion and failed in stage-out -- the experts will have to seek out the reasons for that. In fact that host/hosts is failing a lot of jobs, I'll send a PM." This is one of my rare failed jobs. I can't work out the reason; there was nothing special about the host machine. Three seconds after the process started, it got a SIGTERM.
10/22/15 18:22:07 (pid:16640) Using wrapper /home/boinc/CMSRun/glide_BQwg9Z/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_BQwg9Z/execute/dir_16640/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb-testbed.cern.ch/crabcache --jobNumber=1494 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_1494.txt --runAndLumis=job_lumis_1494.json --lheInputFiles=False --firstEvent=37326 --firstLumi=1494 --lastEvent=37351 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {}
10/22/15 18:22:07 (pid:16640) Running job as user (null)
10/22/15 18:22:07 (pid:16640) Create_Process succeeded, pid=16644
10/22/15 18:22:10 (pid:16640) Got SIGTERM. Performing graceful shutdown.
10/22/15 18:22:10 (pid:16640) ShutdownGraceful all jobs.
10/22/15 18:22:10 (pid:16640) Process exited, pid=16644, signal=15
10/22/15 18:22:11 (pid:16640) Last process exited, now Starter is exiting
10/22/15 18:22:11 (pid:16640) **** condor_starter (condor_STARTER) pid 16640 EXITING WITH STATUS 0
10/22/15 18:22:12 (pid:16788) ******************************************************
10/22/15 18:22:12 (pid:16788) ** condor_starter (CONDOR_STARTER) STARTING UP
10/22/15 18:22:12 (pid:16788) ** /home/boinc/CMSRun/glide_BQwg9Z/main/condor/sbin/condor_starter
10/22/15 18:22:12 (pid:16788) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
10/22/15 18:22:12 (pid:16788) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
10/22/15 18:22:12 (pid:16788) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
10/22/15 18:22:12 (pid:16788) ** $CondorPlatform: x86_64_RedHat5 $
10/22/15 18:22:12 (pid:16788) ** PID = 16788
10/22/15 18:22:12 (pid:16788) ** Log last touched 10/22 18:22:11
10/22/15 18:22:12 (pid:16788) ******************************************************
10/22/15 18:22:12 (pid:16788) Using config source: /home/boinc/CMSRun/glide_BQwg9Z/condor_config
10/22/15 18:22:12 (pid:16788) config Macros = 212, Sorted = 212, StringBytes = 10685, TablesBytes = 7672
10/22/15 18:22:12 (pid:16788) CLASSAD_CACHING is OFF
10/22/15 18:22:12 (pid:16788) Daemon Log is logging: D_ALWAYS D_ERROR
10/22/15 18:22:12 (pid:16788) DaemonCore: command socket at <10.0.2.15:37293?noUDP>
10/22/15 18:22:12 (pid:16788) DaemonCore: private command socket at <10.0.2.15:37293>
10/22/15 18:22:12 (pid:16788) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#103316
10/22/15 18:22:12 (pid:16788) Communicating with shadow <130.246.180.120:9818?noUDP&sock=20016_d29f_67024>
10/22/15 18:22:12 (pid:16788) Submitting machine is "lcggwms02.gridpp.rl.ac.uk"
10/22/15 18:22:12 (pid:16788) setting the orig job name in starter
10/22/15 18:22:12 (pid:16788) setting the orig job iwd in starter
10/22/15 18:22:12 (pid:16788) Chirp config summary: IO false, Updates false, Delayed updates true.
10/22/15 18:22:12 (pid:16788) Initialized IO Proxy.
10/22/15 18:22:12 (pid:16788) Done setting resource limits
10/22/15 18:22:12 (pid:16788) FILETRANSFER: "/home/boinc/CMSRun/glide_BQwg9Z/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/22/15 18:22:12 (pid:16788) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_BQwg9Z/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_BQwg9Z/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/22/15 18:22:12 (pid:16788) Got SIGTERM. Performing graceful shutdown.
10/22/15 18:22:12 (pid:16788) ShutdownGraceful all jobs.
10/22/15 18:22:13 (pid:16788) ERROR "FileTransfer::UpLoadFiles called during active transfer! " at line 1159 in file /slots/12/dir_4417/userdir/src/condor_utils/file_transfer.cpp
10/22/15 18:22:13 (pid:16788) ShutdownFast all jobs.
10/22/15 18:22:14 (pid:16792) Failed to receive transfer queue response from schedd at <130.246.180.120:59704> for job 158485.0 (initial file /var/lib/condor/spool/5897/0/cluster155897.proc0.subproc0/CMSRunAnalysis.sh). |
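The "three seconds" figure can be checked mechanically from the timestamps in a starter log like the one above. The following is a minimal sketch, assuming every line of interest begins with a 17-character `MM/DD/YY HH:MM:SS` stamp as in this log; the helper names `timestamp` and `seconds_between` are made up for illustration.

```python
from datetime import datetime

STAMP_FORMAT = "%m/%d/%y %H:%M:%S"  # e.g. "10/22/15 18:22:07"


def timestamp(line):
    """Parse the leading 17-character timestamp of a log line."""
    return datetime.strptime(line[:17], STAMP_FORMAT)


def seconds_between(log_lines, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the
    next later line containing end_marker, or None if not found."""
    start = None
    for line in log_lines:
        if start is None:
            if start_marker in line:
                start = timestamp(line)
        elif end_marker in line:
            return (timestamp(line) - start).total_seconds()
    return None


# Three lines copied from the log in the post above.
sample = [
    "10/22/15 18:22:07 (pid:16640) Running job as user (null)",
    "10/22/15 18:22:07 (pid:16640) Create_Process succeeded, pid=16644",
    "10/22/15 18:22:10 (pid:16640) Got SIGTERM. Performing graceful shutdown.",
]

print(seconds_between(sample, "Create_Process succeeded", "Got SIGTERM"))
```

Run over a whole log file, the same helper could flag every job that was terminated within a few seconds of starting.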
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
Page 16 of http://research.cs.wisc.edu/htcondor/CondorWeek2011/presentations/zmiller-cw2011-data-placement.pdf describes what that plugin should return. That there's no response suggests that it's not there, or didn't get (down?)loaded properly, so we might need to look for clues further up the log. |
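One way to tell "plugin missing" apart from "plugin ran but printed nothing" is to run it by hand with the same `-classad` argument that appears in the log. Below is a hedged sketch of such a check; `probe_plugin` is a hypothetical helper, and the stand-in commands simply use the Python interpreter in place of a real plugin.

```python
import subprocess
import sys


def probe_plugin(command):
    """Run `command -classad` and return its stripped stdout.

    Returns None if the command could not be run at all (for example,
    the file is missing or not executable) and "" if it ran silently.
    """
    try:
        result = subprocess.run(
            list(command) + ["-classad"],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip()


# Stand-ins for a real plugin: one that prints something, one that is silent.
talkative = [sys.executable, "-c", "print('stand-in output')"]
silent = [sys.executable, "-c", "pass"]

print(repr(probe_plugin(talkative)))
print(repr(probe_plugin(silent)))
print(repr(probe_plugin(["/nonexistent/curl_plugin"])))
```

On a real host, the argument would be the `curl_plugin` path from the log; an empty string would match the "did not produce any output" message, while `None` would point to a missing or broken file.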
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
"Received job 412 for its third try here. Dashboard (I know, but it's all we've got...) shows the first two tries with the same IP. Either this is another Dashboard Strike or Condor(?) can send multiple tries to the same host; that surely can't be right." OK, got it, thanks. It just seems that with a pool of "workers" that is, or can be seen as, less reliable than the norm, some means of reducing the effect of "rogues" would be a good idea. I don't know what other VM projects do. "Note these were jobs that ran to conclusion and failed in stage-out -- the experts will have to seek out the reasons for that. In fact that host/hosts is failing a lot of jobs, I'll send a PM." Presumably these failures aren't the result of host failures. There are a lot of them. |
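One possible way of limiting the effect of "rogues" is for the dispatcher to remember which hosts have failed which jobs and to skip those pairings. The sketch below is a toy illustration of that idea, not something the project or HTCondor is known to do here; `next_job_for` and the host names are hypothetical.

```python
from collections import deque


def next_job_for(host, pending, failed_on):
    """Return the first pending job this host has not already failed.

    pending   -- deque of job ids waiting to run (modified in place)
    failed_on -- dict mapping job id -> set of hosts that failed it
    Returns None if every pending job has already failed on this host.
    """
    for _ in range(len(pending)):
        job = pending.popleft()
        if host not in failed_on.get(job, set()):
            return job
        pending.append(job)  # rotate past it and keep looking
    return None


pending = deque([412, 413])
failed_on = {412: {"host-A"}}

print(next_job_for("host-A", pending, failed_on))  # skips 412
print(next_job_for("host-B", pending, failed_on))  # may take 412
```

The trade-off is that a host can be left idle when the only remaining jobs are ones it has already failed, so with a small pool this costs some throughput.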
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
"Presumably these failures aren't the result of host failures. There are a lot of them." Well... A lot of those stage-out failures are due to one user but, to be fair, he's had a number of successes as well. There are three machines hiding behind that IP; we can't differentiate them yet. I've PM-ed him; if there's no response by tomorrow I'll try his e-mail instead. The "Unknown"s are a bit more problematic. We have copious logs, but neither the manpower nor the overall expertise to go through them all. For the moment, Condor retries are keeping our overall failures low but at the expense of some metric of "efficiency". |
Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 0 |
[checks Inbox, phew, not me] Just had a UPS fail and take 2 machines down; the one running a CMS job has restarted at 0% with 8+ hours elapsed already. I'll leave it running to see what happens. Apart from the odd job that 'disconnects' and leaves the percentage completed unchanged, as far as I can tell the jobs work (when there is work to do). Is there any easy way, after the 24-hour job has completed, of knowing whether everything went OK? If not, are there any plans to enable that? Checking whilst the jobs run, when testing things out now and again, is okay, but for long-term monitoring it isn't practical. I'm sure you don't want to turn into a job monitor sending emails out when necessary. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
"Just had a UPS fail and take 2 machines down; the one running a CMS job has restarted at 0% with 8+ hours elapsed already. I'll leave it running to see what happens." Not at present. If we can work out a way to get the host's IP into the VM's hostname (instead of localhost.localdomain) then you could use the procedure I posted earlier tonight to search for your hosts. "Checking whilst the jobs run, when testing things out now and again, is okay, but for long-term monitoring it isn't practical. I'm sure you don't want to turn into a job monitor sending emails out when necessary." Naturally! That's why we are trying to get as many bugs out as possible before we start to put on a hi-vis jacket and shout, "Hey! I'm over here! Come and join me!" |
Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 0 |
"Not at present. If we can work out a way to get the host's IP into the VM's hostname (instead of localhost.localdomain) then you could use the procedure I posted earlier tonight to search for your hosts." I was hoping for something a bit easier than that! |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
"Not at present. If we can work out a way to get the host's IP into the VM's hostname (instead of localhost.localdomain) then you could use the procedure I posted earlier tonight to search for your hosts." Me too? |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
"Presumably these failures aren't the result of host failures. There are a lot of them." Sooner or later we're going to need to know what host problem (if any) can cause these stage-out failures. Given that the actual application completes OK, error-free transfer of the result is all that's left for the host to do. If that's the problem, it should be visible to other projects. There will always be some errors (he writes, looking at a 4.2% rate on the few CMS jobs done so far, and 3.1% on vLHC; it would be nice if you could emulate the stuff they get out of CoPilot), and if you're getting a lower rate than Atlas, maybe that's as good as it gets. How does the retry rate compare with that of Atlas? |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
"If we can work out a way to get the host's IP into the VM's hostname (instead of localhost.localdomain) then you could use the procedure I posted earlier tonight to search for your hosts." It might be better to use the host ID; people might not want any details of their machines made public. |
Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
"If we can work out a way to get the host's IP into the VM's hostname (instead of localhost.localdomain) then you could use the procedure I posted earlier tonight to search for your hosts." Better would be the username. I'm running more than 10 hosts and would prefer to search by username. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
"Better would be the username." I believe we do have that, from information in the emulated floppy, fd0. I'll ask Laurence whether there's a place in one of our scripts to set hostname. |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
"If we can work out a way to get the host's IP into the VM's hostname (instead of localhost.localdomain) then you could use the procedure I posted earlier tonight to search for your hosts." It's the actual machine that I'd like to find. Either host ID or local IP would do, but host ID would be best. Anyway, they appear in cron.stdout:- 00:07:03 +0100 2015-10-22 [INFO] Volunteer: m (178) Host: 553 so you (CMS) already know... don't you? |
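The cron.stdout line quoted above is regular enough to pick apart with a regular expression. This is a small sketch based on that single sample line; reading the number in parentheses as a user ID is an assumption, and `parse_volunteer_line` is a made-up helper name.

```python
import re

# Pattern derived from the one sample line in the post above.
VOLUNTEER_PATTERN = re.compile(
    r"Volunteer: (?P<name>.+?) \((?P<user_id>\d+)\) Host: (?P<host_id>\d+)"
)


def parse_volunteer_line(line):
    """Extract volunteer name, user ID (assumed) and host ID, or return None."""
    match = VOLUNTEER_PATTERN.search(line)
    if match is None:
        return None
    return {
        "name": match["name"],
        "user_id": int(match["user_id"]),
        "host_id": int(match["host_id"]),
    }


line = "00:07:03 +0100 2015-10-22 [INFO] Volunteer: m (178) Host: 553"
print(parse_volunteer_line(line))
```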
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
"It's the actual machine that I'd like to find. Either host ID or local IP would do, but host ID would be best." That information doesn't get returned to us as far as I can see. However, I noticed today that the hostname is set to (for me) 9-22-13376 -- 9 is my userid, 22 is my host, and I'm not sure what 13376 is. I'm trying to see if that information gets back to us... Yes, it does: RemoteHost = "glidein_3641@9-22-13376" |
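Given the layout described in the post above, a RemoteHost value can be split into its parts as follows. This is a sketch built from the single example shown; the field names are illustrative, and the third number is kept as an opaque string because its meaning is not known.

```python
def parse_remote_host(remote_host):
    """Split a value like 'glidein_3641@9-22-13376' into its parts.

    Assumes the hostname is '<userid>-<hostid>-<something>' as in the
    example from the post; the last field is left uninterpreted.
    """
    slot, _, hostname = remote_host.partition("@")
    user_id, host_id, extra = hostname.split("-", 2)
    return {
        "slot": slot,
        "user_id": int(user_id),
        "host_id": int(host_id),
        "extra": extra,
    }


print(parse_remote_host("glidein_3641@9-22-13376"))
```

Grouping job records by `(user_id, host_id)` would then let a volunteer with many machines find the results for one particular host.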
©2025 CERN