Message boards : Number crunching : Job retries on same host
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 75
Message 1285 - Posted: 22 Oct 2015, 10:28:00 UTC

Received job 412 for its third try here. Dashboard (I know, but it's all we've got...) shows the first two tries with the same IP. Either this is another Dashboard Strike or Condor(?) can send multiple tries to the same host; that surely can't be right.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,945,813
RAC: 2,949
Message 1286 - Posted: 22 Oct 2015, 15:16:16 UTC - in response to Message 1285.  

Received job 412 for its third try here. Dashboard (I know, but it's all we've got...) shows the first two tries with the same IP. Either this is another Dashboard Strike or Condor(?) can send multiple tries to the same host; that surely can't be right.

The thing here is that Condor doesn't know which host it's sending to. It maintains a queue or pool of jobs, and when a glide-in job asks, "Give me a job", it sends the next in line. With only around 60 hosts taking jobs, the chance that you'll pick up a job that you already failed on is pretty high. Note that these were jobs that ran to conclusion and failed in stage-out; the experts will have to seek out the reasons for that. In fact, that host (or hosts) is failing a lot of jobs; I'll send a PM.
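A quick back-of-the-envelope sketch (my own illustration, not project code) shows why repeats are likely: if each attempt were matched to a uniformly random host among the active glide-ins, the repeat probability follows directly.

```python
# Back-of-the-envelope sketch, not CMS@Home code: assume each attempt is
# matched to a uniformly random host out of n_hosts active glide-ins.
def repeat_probability(n_hosts: int, retries: int) -> float:
    """Chance that at least one of `retries` further attempts lands on
    the host that already failed the job."""
    return 1.0 - ((n_hosts - 1) / n_hosts) ** retries

# With ~60 hosts, two retries revisit the original host about 3% of the time.
print(f"{repeat_probability(60, 2):.1%}")
```

Real Condor matchmaking isn't uniform, of course, but with a pool this small the order of magnitude is the point.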
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,116
Message 1287 - Posted: 22 Oct 2015, 16:55:09 UTC - in response to Message 1286.  

Note that these were jobs that ran to conclusion and failed in stage-out; the experts will have to seek out the reasons for that. In fact, that host (or hosts) is failing a lot of jobs; I'll send a PM.


One of my rare failed jobs; the reason is not understandable to me.
There was nothing special about the host machine. Three seconds after the process started, it got a SIGTERM.

10/22/15 18:22:07 (pid:16640) Using wrapper /home/boinc/CMSRun/glide_BQwg9Z/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_BQwg9Z/execute/dir_16640/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb-testbed.cern.ch/crabcache --jobNumber=1494 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_1494.txt --runAndLumis=job_lumis_1494.json --lheInputFiles=False --firstEvent=37326 --firstLumi=1494 --lastEvent=37351 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {}
10/22/15 18:22:07 (pid:16640) Running job as user (null)
10/22/15 18:22:07 (pid:16640) Create_Process succeeded, pid=16644
10/22/15 18:22:10 (pid:16640) Got SIGTERM. Performing graceful shutdown.

10/22/15 18:22:10 (pid:16640) ShutdownGraceful all jobs.
10/22/15 18:22:10 (pid:16640) Process exited, pid=16644, signal=15
10/22/15 18:22:11 (pid:16640) Last process exited, now Starter is exiting
10/22/15 18:22:11 (pid:16640) **** condor_starter (condor_STARTER) pid 16640 EXITING WITH STATUS 0
10/22/15 18:22:12 (pid:16788) ******************************************************
10/22/15 18:22:12 (pid:16788) ** condor_starter (CONDOR_STARTER) STARTING UP
10/22/15 18:22:12 (pid:16788) ** /home/boinc/CMSRun/glide_BQwg9Z/main/condor/sbin/condor_starter
10/22/15 18:22:12 (pid:16788) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
10/22/15 18:22:12 (pid:16788) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
10/22/15 18:22:12 (pid:16788) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
10/22/15 18:22:12 (pid:16788) ** $CondorPlatform: x86_64_RedHat5 $
10/22/15 18:22:12 (pid:16788) ** PID = 16788
10/22/15 18:22:12 (pid:16788) ** Log last touched 10/22 18:22:11
10/22/15 18:22:12 (pid:16788) ******************************************************
10/22/15 18:22:12 (pid:16788) Using config source: /home/boinc/CMSRun/glide_BQwg9Z/condor_config
10/22/15 18:22:12 (pid:16788) config Macros = 212, Sorted = 212, StringBytes = 10685, TablesBytes = 7672
10/22/15 18:22:12 (pid:16788) CLASSAD_CACHING is OFF
10/22/15 18:22:12 (pid:16788) Daemon Log is logging: D_ALWAYS D_ERROR
10/22/15 18:22:12 (pid:16788) DaemonCore: command socket at <10.0.2.15:37293?noUDP>
10/22/15 18:22:12 (pid:16788) DaemonCore: private command socket at <10.0.2.15:37293>
10/22/15 18:22:12 (pid:16788) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#103316
10/22/15 18:22:12 (pid:16788) Communicating with shadow <130.246.180.120:9818?noUDP&sock=20016_d29f_67024>
10/22/15 18:22:12 (pid:16788) Submitting machine is "lcggwms02.gridpp.rl.ac.uk"
10/22/15 18:22:12 (pid:16788) setting the orig job name in starter
10/22/15 18:22:12 (pid:16788) setting the orig job iwd in starter
10/22/15 18:22:12 (pid:16788) Chirp config summary: IO false, Updates false, Delayed updates true.
10/22/15 18:22:12 (pid:16788) Initialized IO Proxy.
10/22/15 18:22:12 (pid:16788) Done setting resource limits
10/22/15 18:22:12 (pid:16788) FILETRANSFER: "/home/boinc/CMSRun/glide_BQwg9Z/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/22/15 18:22:12 (pid:16788) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_BQwg9Z/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_BQwg9Z/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/22/15 18:22:12 (pid:16788) Got SIGTERM. Performing graceful shutdown.
10/22/15 18:22:12 (pid:16788) ShutdownGraceful all jobs.
10/22/15 18:22:13 (pid:16788) ERROR "FileTransfer::UpLoadFiles called during active transfer!
" at line 1159 in file /slots/12/dir_4417/userdir/src/condor_utils/file_transfer.cpp
10/22/15 18:22:13 (pid:16788) ShutdownFast all jobs.
10/22/15 18:22:14 (pid:16792) Failed to receive transfer queue response from schedd at <130.246.180.120:59704> for job 158485.0 (initial file /var/lib/condor/spool/5897/0/cluster155897.proc0.subproc0/CMSRunAnalysis.sh).
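In case it's useful for eyeballing longer logs, here is a small helper (my own sketch, not part of HTCondor) that pulls the SIGTERM events out of StarterLog-style lines like those above:

```python
import re

# Hypothetical helper, not part of HTCondor: scan StarterLog-style lines
# and report the timestamp and PID of each "Got SIGTERM" event.
LINE = re.compile(r"^(\S+ \S+) \(pid:(\d+)\) (.+)$")

def sigterm_events(log_text: str) -> list:
    events = []
    for line in log_text.splitlines():
        m = LINE.match(line)
        if m and "Got SIGTERM" in m.group(3):
            events.append((m.group(1), m.group(2)))  # (timestamp, pid)
    return events

sample = "10/22/15 18:22:10 (pid:16640) Got SIGTERM. Performing graceful shutdown."
print(sigterm_events(sample))  # [('10/22/15 18:22:10', '16640')]
```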
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,945,813
RAC: 2,949
Message 1289 - Posted: 22 Oct 2015, 18:55:59 UTC - in response to Message 1287.  

Page 16 of http://research.cs.wisc.edu/htcondor/CondorWeek2011/presentations/zmiller-cw2011-data-placement.pdf describes what that plugin should return. That there's no response suggests that it's not there, or didn't get (down?)loaded properly, so we might need to look for clues further up the log.
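Roughly, a plugin invoked with -classad is expected to print a small ClassAd describing itself. A minimal sketch of that handshake (attribute names follow the slides; this is illustrative, not the actual curl_plugin):

```python
import sys

# Illustrative sketch of the "-classad" handshake described in the slides;
# attribute names follow that presentation, not the real curl_plugin.
def classad_lines() -> list:
    return [
        'PluginVersion = "0.1"',
        'PluginType = "FileTransfer"',
        'SupportedMethods = "http,https"',
    ]

if __name__ == "__main__" and "-classad" in sys.argv:
    # Printing nothing here is exactly the failure the log above reports.
    print("\n".join(classad_lines()))
```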
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 75
Message 1290 - Posted: 22 Oct 2015, 19:04:58 UTC - in response to Message 1286.  

Received job 412 for its third try here. Dashboard (I know, but it's all we've got...) shows the first two tries with the same IP. Either this is another Dashboard Strike or Condor(?) can send multiple tries to the same host; that surely can't be right.

The thing here is that Condor doesn't know which host it's sending to. It maintains a queue or pool of jobs, and when a glide-in job asks, "Give me a job", it sends the next in line. With only around 60 hosts taking jobs, the chance that you'll pick up a job that you already failed on is pretty high.

OK, got it, thanks. It just seems that, with a pool of "workers" that is, or can be seen as, less reliable than the norm, some means of reducing the effect of "rogues" would be a good idea. I don't know what other VM projects do.

Note that these were jobs that ran to conclusion and failed in stage-out; the experts will have to seek out the reasons for that. In fact, that host (or hosts) is failing a lot of jobs; I'll send a PM.

Presumably these failures aren't the result of host failures. There are a lot of them.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,945,813
RAC: 2,949
Message 1292 - Posted: 22 Oct 2015, 20:13:34 UTC - in response to Message 1290.  

Presumably these failures aren't the result of host failures. There are a lot of them.

Tja... A lot of those stage-out failures are due to one user but, to be fair, he's had a number of successes as well. There are three machines hiding behind that IP; we can't differentiate them yet. I've PM-ed him; if there's no response by tomorrow I'll try his e-mail instead.
The "Unknown"s are a bit more problematic. We have copious logs, but neither the manpower nor the overall expertise to go through them all. For the moment, Condor retries are keeping our overall failures low, but at the expense of some metric of "efficiency".
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,871,767
RAC: 16,520
Message 1298 - Posted: 22 Oct 2015, 22:20:06 UTC - in response to Message 1292.  

[checks Inbox, phew, not me]

Just had a UPS fail and take 2 machines down; the one running a CMS job has restarted at 0% with 8+ hours elapsed already. I'll leave it running to see what happens.

Apart from the odd job that 'disconnects' and leaves the percentage completed unchanged, as far as I can tell the jobs work (when there is work to do). Is there any easy way, after the 24-hour job has completed, of knowing if everything went OK? If not, are there any plans to enable that?

Checking whilst the jobs run, when testing things out now and again, is okay, but for long-term monitoring it isn't practical. I'm sure you don't want to turn into a job monitor, sending out e-mails when necessary.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,945,813
RAC: 2,949
Message 1299 - Posted: 22 Oct 2015, 23:07:36 UTC - in response to Message 1298.  

Just had a UPS fail and take 2 machines down; the one running a CMS job has restarted at 0% with 8+ hours elapsed already. I'll leave it running to see what happens.

Apart from the odd job that 'disconnects' and leaves the percentage completed unchanged, as far as I can tell the jobs work (when there is work to do). Is there any easy way, after the 24-hour job has completed, of knowing if everything went OK? If not, are there any plans to enable that?
Not at present. If we can work out a way to get the host's IP into the VM's hostname (instead of localhost.localdomain) then you could use the procedure I posted earlier tonight to search for your hosts.
Checking whilst the jobs run, when testing things out now and again, is okay, but for long-term monitoring it isn't practical. I'm sure you don't want to turn into a job monitor, sending out e-mails when necessary.
Natürlich! That's why we are trying to get as many bugs out as possible before we start to put on a hi-vis jacket and shout, "Hey! I'm over here! Come and join me!"
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,871,767
RAC: 16,520
Message 1300 - Posted: 22 Oct 2015, 23:20:02 UTC - in response to Message 1299.  

Not at present. If we can work out a way to get the host's IP into the VM's hostname (instead of localhost.localdomain) then you could use the procedure I posted earlier tonight to search for your hosts.


I was hoping for something a bit easier than that!
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,945,813
RAC: 2,949
Message 1301 - Posted: 22 Oct 2015, 23:23:53 UTC - in response to Message 1300.  

Not at present. If we can work out a way to get the host's IP into the VM's hostname (instead of localhost.localdomain) then you could use the procedure I posted earlier tonight to search for your hosts.

I was hoping for something a bit easier than that!

Me too?
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 75
Message 1302 - Posted: 22 Oct 2015, 23:38:01 UTC - in response to Message 1292.  
Last modified: 22 Oct 2015, 23:53:46 UTC

Presumably these failures aren't the result of host failures. There are a lot of them.

Tja... A lot of those stage-out failures are due to one user but, to be fair, he's had a number of successes as well. There are three machines hiding behind that IP; we can't differentiate them yet. I've PM-ed him; if there's no response by tomorrow I'll try his e-mail instead.
The "Unknown"s are a bit more problematic. We have copious logs, but neither the manpower nor the overall expertise to go through them all. For the moment, Condor retries are keeping our overall failures low, but at the expense of some metric of "efficiency".

Sooner or later we're going to need to know what host problem (if any) can cause these stage-out failures. Given that the actual application completes OK, error-free transfer of the result is all that's left for the host to do. If that's the problem, it should be visible to other projects. There will always be some errors (he writes, looking at a 4.2% rate on the few CMS jobs done so far; I get 3.1% on vLHC. It would be nice if you could emulate the stuff they get out of CoPilot), and if you're getting a lower rate than Atlas, maybe that's as good as it gets. How does the retry rate compare with that of Atlas?
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 75
Message 1303 - Posted: 23 Oct 2015, 0:04:42 UTC - in response to Message 1299.  

If we can work out a way to get the host's IP into the VM's hostname (instead of localhost.localdomain) then you could use the procedure I posted earlier tonight to search for your hosts.

It might be better to use the host ID; people might not want any details of their machines made public.
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 1305 - Posted: 23 Oct 2015, 4:41:31 UTC - in response to Message 1303.  

If we can work out a way to get the host's IP into the VM's hostname (instead of localhost.localdomain) then you could use the procedure I posted earlier tonight to search for your hosts.

It might be better to use the host ID; people might not want any details of their machines made public.

Better would be the username

I'm running more than 10 hosts and would prefer to search for the username
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,945,813
RAC: 2,949
Message 1307 - Posted: 23 Oct 2015, 7:55:21 UTC - in response to Message 1305.  

Better would be the username
I'm running more than 10 hosts and would prefer to search for the username
I believe we do have that, from information in the emulated floppy, fd0. I'll ask Laurence whether there's a place in one of our scripts to set hostname.
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 75
Message 1315 - Posted: 23 Oct 2015, 13:07:16 UTC - in response to Message 1305.  
Last modified: 23 Oct 2015, 13:22:55 UTC

If we can work out a way to get the host's IP into the VM's hostname (instead of localhost.localdomain) then you could use the procedure I posted earlier tonight to search for your hosts.

It might be better to use the host ID; people might not want any details of their machines made public.

Better would be the username

I'm running more than 10 hosts and would prefer to search for the username


It's the actual machine that I'd like to find. Either host ID or local IP would do but host ID would be best.

Anyway, they appear in cron.stdout:

00:07:03 +0100 2015-10-22 [INFO] Volunteer: m (178) Host: 553

so you (CMS) already know... don't you?
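For what it's worth, the volunteer and host IDs can be pulled out of that cron.stdout line with a simple pattern (my own sketch, not project tooling):

```python
import re

# Sketch, not project code: extract the volunteer name, volunteer ID, and
# BOINC host ID from a cron.stdout line of the form quoted above.
PAT = re.compile(r"Volunteer: (\S+) \((\d+)\) Host: (\d+)")

line = "00:07:03 +0100 2015-10-22 [INFO] Volunteer: m (178) Host: 553"
m = PAT.search(line)
print(m.groups())  # ('m', '178', '553')
```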
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,945,813
RAC: 2,949
Message 1316 - Posted: 23 Oct 2015, 16:02:39 UTC - in response to Message 1315.  

It's the actual machine that I'd like to find. Either host ID or local IP would do but host ID would be best.

Anyway, they appear in cron.stdout:-

00:07:03 +0100 2015-10-22 [INFO] Volunteer: m (178) Host: 553

so you (CMS) already know... don't you?

That information doesn't get returned to us as far as I can see. However, I noticed today that the hostname is set to (for me) 9-22-13376 -- 9 is my userid, 22 is my host, and I'm not sure what 13376 is. I'm trying to see if that information gets back to us...
Yes, it does: RemoteHost = "glidein_3641@9-22-13376"
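Assuming the hostname really is <userid>-<hostid>-<something> (the third field is still unidentified), splitting the RemoteHost value is straightforward. A hypothetical parser, my own sketch rather than project code:

```python
# Hypothetical parser for the RemoteHost value quoted above, assuming the
# hostname format is <userid>-<hostid>-<unknown>; the third field's meaning
# is not established here, so it is kept as an opaque string.
def parse_remote_host(remote_host: str) -> dict:
    slot, _, host = remote_host.partition("@")
    userid, hostid, extra = host.split("-", 2)
    return {"slot": slot, "userid": int(userid), "hostid": int(hostid), "extra": extra}

print(parse_remote_host("glidein_3641@9-22-13376"))
# {'slot': 'glidein_3641', 'userid': 9, 'hostid': 22, 'extra': '13376'}
```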
