Message boards : Number crunching : Job retries on same host
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 75
Message 1285 - Posted: 22 Oct 2015, 10:28:00 UTC

Received job 412 for its third try here. Dashboard (I know, but it's all we've got...) shows the first two tries with the same IP. Either this is another Dashboard Strike or Condor(?) can send multiple tries to the same host; that surely can't be right.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,945,813
RAC: 2,949
Message 1286 - Posted: 22 Oct 2015, 15:16:16 UTC - in response to Message 1285.  

Received job 412 for its third try here. Dashboard (I know, but it's all we've got...) shows the first two tries with the same IP. Either this is another Dashboard Strike or Condor(?) can send multiple tries to the same host; that surely can't be right.

The thing here is that Condor doesn't know which host it's sending to. It maintains a queue or pool of jobs, and when a glide-in job asks, "Give me a job", it sends the next in line. With only around 60 hosts taking jobs, the chance that you'll pick up a job that you already failed on is pretty high. Note that these were jobs that ran to conclusion and failed in stage-out; the experts will have to seek out the reasons for that. In fact, that host (or hosts) is failing a lot of jobs; I'll send a PM.
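A quick back-of-the-envelope sketch (my own illustration, not project code) shows why repeats are likely: if each attempt were matched to a uniformly random host among the active glide-ins, the repeat probability follows directly.

```python
# Back-of-the-envelope sketch, not CMS@Home code: assume each attempt is
# matched to a uniformly random host out of n_hosts active glide-ins.
def repeat_probability(n_hosts: int, retries: int) -> float:
    """Chance that at least one of `retries` further attempts lands on
    the host that already failed the job."""
    return 1.0 - ((n_hosts - 1) / n_hosts) ** retries

# With ~60 hosts, two retries revisit the original host about 3% of the time.
print(f"{repeat_probability(60, 2):.1%}")
```

Real Condor matchmaking isn't uniform, of course, but with a pool this small the order of magnitude is the point.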
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,116
Message 1287 - Posted: 22 Oct 2015, 16:55:09 UTC - in response to Message 1286.  

Note that these were jobs that ran to conclusion and failed in stage-out; the experts will have to seek out the reasons for that. In fact, that host (or hosts) is failing a lot of jobs; I'll send a PM.


One of my rare failed jobs; the reason is not understandable to me.
There was nothing special about the host machine. Three seconds after the process started, it got a SIGTERM.

10/22/15 18:22:07 (pid:16640) Using wrapper /home/boinc/CMSRun/glide_BQwg9Z/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_BQwg9Z/execute/dir_16640/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb-testbed.cern.ch/crabcache --jobNumber=1494 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_1494.txt --runAndLumis=job_lumis_1494.json --lheInputFiles=False --firstEvent=37326 --firstLumi=1494 --lastEvent=37351 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {}
10/22/15 18:22:07 (pid:16640) Running job as user (null)
10/22/15 18:22:07 (pid:16640) Create_Process succeeded, pid=16644
10/22/15 18:22:10 (pid:16640) Got SIGTERM. Performing graceful shutdown.

10/22/15 18:22:10 (pid:16640) ShutdownGraceful all jobs.
10/22/15 18:22:10 (pid:16640) Process exited, pid=16644, signal=15
10/22/15 18:22:11 (pid:16640) Last process exited, now Starter is exiting
10/22/15 18:22:11 (pid:16640) **** condor_starter (condor_STARTER) pid 16640 EXITING WITH STATUS 0
10/22/15 18:22:12 (pid:16788) ******************************************************
10/22/15 18:22:12 (pid:16788) ** condor_starter (CONDOR_STARTER) STARTING UP
10/22/15 18:22:12 (pid:16788) ** /home/boinc/CMSRun/glide_BQwg9Z/main/condor/sbin/condor_starter
10/22/15 18:22:12 (pid:16788) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
10/22/15 18:22:12 (pid:16788) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
10/22/15 18:22:12 (pid:16788) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
10/22/15 18:22:12 (pid:16788) ** $CondorPlatform: x86_64_RedHat5 $
10/22/15 18:22:12 (pid:16788) ** PID = 16788
10/22/15 18:22:12 (pid:16788) ** Log last touched 10/22 18:22:11
10/22/15 18:22:12 (pid:16788) ******************************************************
10/22/15 18:22:12 (pid:16788) Using config source: /home/boinc/CMSRun/glide_BQwg9Z/condor_config
10/22/15 18:22:12 (pid:16788) config Macros = 212, Sorted = 212, StringBytes = 10685, TablesBytes = 7672
10/22/15 18:22:12 (pid:16788) CLASSAD_CACHING is OFF
10/22/15 18:22:12 (pid:16788) Daemon Log is logging: D_ALWAYS D_ERROR
10/22/15 18:22:12 (pid:16788) DaemonCore: command socket at <10.0.2.15:37293?noUDP>
10/22/15 18:22:12 (pid:16788) DaemonCore: private command socket at <10.0.2.15:37293>
10/22/15 18:22:12 (pid:16788) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#103316
10/22/15 18:22:12 (pid:16788) Communicating with shadow <130.246.180.120:9818?noUDP&sock=20016_d29f_67024>
10/22/15 18:22:12 (pid:16788) Submitting machine is "lcggwms02.gridpp.rl.ac.uk"
10/22/15 18:22:12 (pid:16788) setting the orig job name in starter
10/22/15 18:22:12 (pid:16788) setting the orig job iwd in starter
10/22/15 18:22:12 (pid:16788) Chirp config summary: IO false, Updates false, Delayed updates true.
10/22/15 18:22:12 (pid:16788) Initialized IO Proxy.
10/22/15 18:22:12 (pid:16788) Done setting resource limits
10/22/15 18:22:12 (pid:16788) FILETRANSFER: "/home/boinc/CMSRun/glide_BQwg9Z/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/22/15 18:22:12 (pid:16788) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_BQwg9Z/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_BQwg9Z/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/22/15 18:22:12 (pid:16788) Got SIGTERM. Performing graceful shutdown.
10/22/15 18:22:12 (pid:16788) ShutdownGraceful all jobs.
10/22/15 18:22:13 (pid:16788) ERROR "FileTransfer::UpLoadFiles called during active transfer!
" at line 1159 in file /slots/12/dir_4417/userdir/src/condor_utils/file_transfer.cpp
10/22/15 18:22:13 (pid:16788) ShutdownFast all jobs.
10/22/15 18:22:14 (pid:16792) Failed to receive transfer queue response from schedd at <130.246.180.120:59704> for job 158485.0 (initial file /var/lib/condor/spool/5897/0/cluster155897.proc0.subproc0/CMSRunAnalysis.sh).
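In case it's useful for eyeballing longer logs, here is a small helper (my own sketch, not part of HTCondor) that pulls the SIGTERM events out of StarterLog-style lines like those above:

```python
import re

# Hypothetical helper, not part of HTCondor: scan StarterLog-style lines
# and report the timestamp and PID of each "Got SIGTERM" event.
LINE = re.compile(r"^(\S+ \S+) \(pid:(\d+)\) (.+)$")

def sigterm_events(log_text: str) -> list:
    events = []
    for line in log_text.splitlines():
        m = LINE.match(line)
        if m and "Got SIGTERM" in m.group(3):
            events.append((m.group(1), m.group(2)))  # (timestamp, pid)
    return events

sample = "10/22/15 18:22:10 (pid:16640) Got SIGTERM. Performing graceful shutdown."
print(sigterm_events(sample))  # [('10/22/15 18:22:10', '16640')]
```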
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,945,813
RAC: 2,949
Message 1289 - Posted: 22 Oct 2015, 18:55:59 UTC - in response to Message 1287.  

Page 16 of http://research.cs.wisc.edu/htcondor/CondorWeek2011/presentations/zmiller-cw2011-data-placement.pdf describes what that plugin should return. That there's no response suggests that it's not there, or didn't get (down?)loaded properly, so we might need to look for clues further up the log.
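Roughly, a plugin invoked with -classad is expected to print a small ClassAd describing itself. A minimal sketch of that handshake (attribute names follow the slides; this is illustrative, not the actual curl_plugin):

```python
import sys

# Illustrative sketch of the "-classad" handshake described in the slides;
# attribute names follow that presentation, not the real curl_plugin.
def classad_lines() -> list:
    return [
        'PluginVersion = "0.1"',
        'PluginType = "FileTransfer"',
        'SupportedMethods = "http,https"',
    ]

if __name__ == "__main__" and "-classad" in sys.argv:
    # Printing nothing here is exactly the failure the log above reports.
    print("\n".join(classad_lines()))
```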
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 75
Message 1290 - Posted: 22 Oct 2015, 19:04:58 UTC - in response to Message 1286.  

Received job 412 for its third try here. Dashboard (I know, but it's all we've got...) shows the first two tries with the same IP. Either this is another Dashboard Strike or Condor(?) can send multiple tries to the same host; that surely can't be right.

The thing here is that Condor doesn't know which host it's sending to. It maintains a queue or pool of jobs, and when a glide-in job asks, "Give me a job", it sends the next in line. With only around 60 hosts taking jobs, the chance that you'll pick up a job that you already failed on is pretty high.

OK, got it, thanks. It just seems that, with a pool of "workers" that is, or can be seen as, less reliable than the norm, some means of reducing the effect of "rogues" would be a good idea. I don't know what other VM projects do.

Note that these were jobs that ran to conclusion and failed in stage-out; the experts will have to seek out the reasons for that. In fact, that host (or hosts) is failing a lot of jobs; I'll send a PM.

Presumably these failures aren't the result of host failures. There are a lot of them.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,945,813
RAC: 2,949
Message 1292 - Posted: 22 Oct 2015, 20:13:34 UTC - in response to Message 1290.  

Presumably these failures aren't the result of host failures. There are a lot of them.

Tja... A lot of those stage-out failures are due to one user but, to be fair, he's had a number of successes as well. There are three machines hiding behind that IP; we can't differentiate them yet. I've PM-ed him; if there's no response by tomorrow I'll try his e-mail instead.
The "Unknown"s are a bit more problematic. We have copious logs, but neither the manpower nor the overall expertise to go through them all. For the moment, Condor retries are keeping our overall failures low, but at the expense of some metric of "efficiency".
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,871,767
RAC: 16,520
Message 1298 - Posted: 22 Oct 2015, 22:20:06 UTC - in response to Message 1292.  

[checks Inbox, phew, not me]

Just had a UPS fail and take 2 machines down; the one running a CMS job has restarted at 0% with 8+ hours elapsed already. I'll leave it running to see what happens.

Apart from the odd job that 'disconnects' and leaves the percentage completed unchanged, as far as I can tell the jobs work (when there is work to do). Is there any easy way, after the 24-hour job has completed, of knowing if everything went OK? If not, are there any plans to enable that?

Checking whilst the jobs run, when testing things out now and again, is okay, but for long-term monitoring it isn't practical. I'm sure you don't want to turn into a job monitor, sending out e-mails when necessary.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,945,813
RAC: 2,949
Message 1299 - Posted: 22 Oct 2015, 23:07:36 UTC - in response to Message 1298.  

Just had a UPS fail and take 2 machines down; the one running a CMS job has restarted at 0% with 8+ hours elapsed already. I'll leave it running to see what happens.

Apart from the odd job that 'disconnects' and leaves the percentage completed unchanged, as far as I can tell the jobs work (when there is work to do). Is there any easy way, after the 24-hour job has completed, of knowing if everything went OK? If not, are there any plans to enable that?
Not at present. If we can work out a way to get the host's IP into the VM's hostname (instead of localhost.localdomain) then you could use the procedure I posted earlier tonight to search for your hosts.
Checking whilst the jobs run, when testing things out now and again, is okay, but for long-term monitoring it isn't practical. I'm sure you don't want to turn into a job monitor, sending out e-mails when necessary.
Natürlich! That's why we are trying to get as many bugs out as possible before we start to put on a hi-vis jacket and shout, "Hey! I'm over here! Come and join me!"
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,871,767
RAC: 16,520
Message 1300 - Posted: 22 Oct 2015, 23:20:02 UTC - in response to Message 1299.  

Not at present. If we can work out a way to get the host's IP into the VM's hostname (instead of localhost.localdomain) then you could use the procedure I posted earlier tonight to search for your hosts.


I was hoping for something a bit easier than that!
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,945,813
RAC: 2,949
Message 1301 - Posted: 22 Oct 2015, 23:23:53 UTC - in response to Message 1300.  

Not at present. If we can work out a way to get the host's IP into the VM's hostname (instead of localhost.localdomain) then you could use the procedure I posted earlier tonight to search for your hosts.

I was hoping for something a bit easier than that!

Me too?
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 75
Message 1302 - Posted: 22 Oct 2015, 23:38:01 UTC - in response to Message 1292.  
Last modified: 22 Oct 2015, 23:53:46 UTC

Presumably these failures aren't the result of host failures. There are a lot of them.

Tja... A lot of those stage-out failures are due to one user but, to be fair, he's had a number of successes as well. There are three machines hiding behind that IP; we can't differentiate them yet. I've PM-ed him; if there's no response by tomorrow I'll try his e-mail instead.
The "Unknown"s are a bit more problematic. We have copious logs, but neither the manpower nor the overall expertise to go through them all. For the moment, Condor retries are keeping our overall failures low, but at the expense of some metric of "efficiency".

Sooner or later we're going to need to know what host problem (if any) can cause these stage-out failures. Given that the actual application completes OK, error-free transfer of the result is all that's left for the host to do. If that's the problem, it should be visible to other projects. There will always be some errors (he writes, looking at a 4.2% rate on the few CMS jobs done so far; I get 3.1% on vLHC. It would be nice if you could emulate the stuff they get out of CoPilot), and if you're getting a lower rate than Atlas, maybe that's as good as it gets. How does the retry rate compare with that of Atlas?
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 75
Message 1303 - Posted: 23 Oct 2015, 0:04:42 UTC - in response to Message 1299.  

If we can work out a way to get the host's IP into the VM's hostname (instead of localhost.localdomain) then you could use the procedure I posted earlier tonight to search for your hosts.

It might be better to use the host ID; people might not want any details of their machines made public.
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 1305 - Posted: 23 Oct 2015, 4:41:31 UTC - in response to Message 1303.  

If we can work out a way to get the host's IP into the VM's hostname (instead of localhost.localdomain) then you could use the procedure I posted earlier tonight to search for your hosts.

It might be better to use the host ID; people might not want any details of their machines made public.

Better would be the username

I'm running more than 10 hosts and would prefer to search for the username
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,945,813
RAC: 2,949
Message 1307 - Posted: 23 Oct 2015, 7:55:21 UTC - in response to Message 1305.  

Better would be the username
I'm running more than 10 hosts and would prefer to search for the username
I believe we do have that, from information in the emulated floppy, fd0. I'll ask Laurence whether there's a place in one of our scripts to set hostname.
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 75
Message 1315 - Posted: 23 Oct 2015, 13:07:16 UTC - in response to Message 1305.  
Last modified: 23 Oct 2015, 13:22:55 UTC

If we can work out a way to get the host's IP into the VM's hostname (instead of localhost.localdomain) then you could use the procedure I posted earlier tonight to search for your hosts.

It might be better to use the host ID; people might not want any details of their machines made public.

Better would be the username

I'm running more than 10 hosts and would prefer to search for the username


It's the actual machine that I'd like to find. Either host ID or local IP would do but host ID would be best.

Anyway, they appear in cron.stdout:

00:07:03 +0100 2015-10-22 [INFO] Volunteer: m (178) Host: 553

so you (CMS) already know... don't you?
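For what it's worth, the volunteer and host IDs can be pulled out of that cron.stdout line with a simple pattern (my own sketch, not project tooling):

```python
import re

# Sketch, not project code: extract the volunteer name, volunteer ID, and
# BOINC host ID from a cron.stdout line of the form quoted above.
PAT = re.compile(r"Volunteer: (\S+) \((\d+)\) Host: (\d+)")

line = "00:07:03 +0100 2015-10-22 [INFO] Volunteer: m (178) Host: 553"
m = PAT.search(line)
print(m.groups())  # ('m', '178', '553')
```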
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,945,813
RAC: 2,949
Message 1316 - Posted: 23 Oct 2015, 16:02:39 UTC - in response to Message 1315.  

It's the actual machine that I'd like to find. Either host ID or local IP would do but host ID would be best.

Anyway, they appear in cron.stdout:-

00:07:03 +0100 2015-10-22 [INFO] Volunteer: m (178) Host: 553

so you (CMS) already know... don't you?

That information doesn't get returned to us as far as I can see. However, I noticed today that the hostname is set to (for me) 9-22-13376 -- 9 is my userid, 22 is my host, and I'm not sure what 13376 is. I'm trying to see if that information gets back to us...
Yes, it does: RemoteHost = "glidein_3641@9-22-13376"
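Assuming the hostname really is <userid>-<hostid>-<something> (the third field is still unidentified), splitting the RemoteHost value is straightforward. A hypothetical parser, my own sketch rather than project code:

```python
# Hypothetical parser for the RemoteHost value quoted above, assuming the
# hostname format is <userid>-<hostid>-<unknown>; the third field's meaning
# is not established here, so it is kept as an opaque string.
def parse_remote_host(remote_host: str) -> dict:
    slot, _, host = remote_host.partition("@")
    userid, hostid, extra = host.split("-", 2)
    return {"slot": slot, "userid": int(userid), "hostid": int(hostid), "extra": extra}

print(parse_remote_host("glidein_3641@9-22-13376"))
# {'slot': 'glidein_3641', 'userid': 9, 'hostid': 22, 'extra': '13376'}
```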
