

Timeouts, backend kills and resurrects
Albert <pound(at)alacra.com>
2009-06-19 19:49:00
Over the past few months we've noticed that the killing and resurrection
of backends was not happening on time.  We also noticed that some
backends were being killed even though they were alive: because of some
network-related hiccups (we're still investigating them), pound would
kill the backend.

I've spent the last couple of days debugging pound code.  I found that
there weren't really any problems with the pound code itself, but the
combination of our configuration and the pound code was causing these
problems.

First, let me quickly describe our configuration for relevant variables:
TimeOut 180 -- very high, but we have some HTTP requests which can take 
a while to complete
Alive 15

Some of the backends override TimeOut with 15 sec, but in general they
should all be at 180 sec.  We also have an HAPort for each of our
backends (separate from the regular HTTP port).  We have a custom app
which disables the HAPort for a server when we need to take a backend
offline, reset the HTTP service, etc.  Otherwise, the HAPort is always
listening for pound's Host-Alive checks.

We've run into a problem where one of our backends died, and stayed that 
way for a while.  This caused pound to run its resurrection code 
(do_resurect) every 3 minutes (our default TimeOut value).  We tracked 
it down to the part of the code where pound is trying to connect to the 
server in do_resurect(), and waits for 3 minutes before timing out.  As 
it waits, and since there is only 1 thread running do_resurect, the rest 
of the servers are not being checked every 15 seconds, as intended by 
"Alive" value.

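To put rough numbers on the stall described above, here is a minimal
sketch in C.  It is a simplified model, not pound's actual code: the
struct and both function names are made up for illustration, and it
assumes every dead backend makes the single checker thread wait out the
full connect timeout while live backends answer immediately.

```c
#include <stddef.h>

/* Hypothetical model of one backend as seen by the checker thread. */
struct probe {
    int alive;  /* 1 if the backend accepts the connect right away */
};

/* Seconds one full sweep takes when a single thread probes the backends
 * one after another and each dead backend blocks it for connect_timeout
 * seconds (assumption: connects to dead hosts time out, not get refused). */
static long sweep_seconds(const struct probe *b, size_t n, long connect_timeout)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        if (!b[i].alive)
            total += connect_timeout;
    return total;
}

/* Lower bound on the time between two checks of the same backend: the
 * next sweep cannot start before the previous one has finished, and it
 * is not started more often than every `alive` seconds. */
static long effective_interval(long sweep, long alive)
{
    return sweep > alive ? sweep : alive;
}
```

With one dead backend, a 180 second timeout and Alive 15, this model
gives a check interval of at least 180 seconds instead of 15.  With a
5 second connect timeout the interval would stay at 15.
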
The problem, as I see it, is a lack of a separate variable for "Connect 
TimeOut" vs "Time-Out for read/gets".  Currently, pound uses the same 
variable for both connecting and waiting on read/gets.  The "Connect 
TimeOut" can be an optional variable, with the default value of regular 
TimeOut.
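
A rough sketch of what this could look like, using plain POSIX calls.
This is not pound's connect_nb; both function names here are
hypothetical, and "connect_to <= 0 means not configured" is an
assumption made for the example.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical: pick the connect deadline, falling back to the regular
 * TimeOut when no separate connect timeout was configured (<= 0). */
static int effective_connect_timeout(int connect_to, int time_out)
{
    return connect_to > 0 ? connect_to : time_out;
}

/* Connect with its own deadline in seconds.  Returns 0 on success and
 * -1 on error or timeout (errno is set).  The socket is left in
 * non-blocking mode.  For brevity, EINTR from poll() is treated as a
 * failure rather than retried. */
static int connect_with_timeout(int fd, const struct sockaddr *addr,
                                socklen_t len, int seconds)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;

    if (connect(fd, addr, len) == 0)
        return 0;               /* connected immediately */
    if (errno != EINPROGRESS)
        return -1;              /* hard failure, e.g. refused */

    /* Wait until the socket becomes writable or the deadline passes. */
    struct pollfd p = { .fd = fd, .events = POLLOUT };
    int r = poll(&p, 1, seconds * 1000);
    if (r == 0) {
        errno = ETIMEDOUT;
        return -1;
    }
    if (r < 0)
        return -1;

    /* Writable does not mean connected: fetch the real result. */
    int err = 0;
    socklen_t elen = sizeof err;
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &elen) < 0)
        return -1;
    if (err != 0) {
        errno = err;
        return -1;
    }
    return 0;
}
```

The resurrection code could then call something like
connect_with_timeout(fd, addr, len, effective_connect_timeout(conn_to,
time_out)) with a short deadline, while reads keep using the long
TimeOut.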

We also have a related issue with the way pound kills backends when
connect_nb to the regular "Port" of a backend fails during an HTTP
request.  As I mentioned above, we've seen network hiccups where connect
calls time out even though the backend is fine and another connect at
the same time goes through.  This has caused pound to kill the backend
during an HTTP request when the connect fails (and this happens 3
minutes after the initial call to connect_nb, during which time a bunch
of other requests have completed).  I was wondering: in the case where
an HAPort exists, should pound kill a backend if the HAPort says it's
alive?

I believe that in such a setup (where HAPort is defined), when
connect_nb inside thr_http fails, pound should either:
1. Do nothing with the backend (let do_resurect take the backend offline
if it's dead), and get the next backend from the list of available
servers, or
2. Check HAPort to see if the backend is alive, and take appropriate
action, or
3. Retry the connect_nb, and if it fails again, take the backend
offline, or
4. Track the failures, and once they reach some threshold value (e.g. 5
consecutive failures), take the backend offline.

The last one is a bit complicated, but would make sure the backend is
eventually taken out of the pool if HAPort is still responding but the
HTTP service is not.  On the other hand, if HAPort exists, then it's
really the responsibility of the application running HAPort to do such
checks, and to refuse connections on the HAPort if the HTTP service is
dead (so one of the first 2 options would make more sense).
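
For what it's worth, the threshold idea (option 4) does not need much
state.  A minimal sketch in C, where the struct, the function names and
the threshold of 5 are all hypothetical and not taken from pound's
BACKEND handling:

```c
/* Hypothetical per-backend health state for illustration only. */
struct backend_health {
    int consecutive_failures;  /* failed connects since the last success */
    int offline;               /* 1 once the backend has been taken out */
};

/* Assumed threshold, matching the "5 consecutive failures" example. */
#define FAIL_THRESHOLD 5

/* A successful connect clears the failure streak. */
static void note_connect_ok(struct backend_health *h)
{
    h->consecutive_failures = 0;
}

/* Record a failed connect.  Returns 1 if the backend is (now) offline,
 * 0 if it should stay in the pool and the request should simply move
 * on to the next available backend. */
static int note_connect_failure(struct backend_health *h)
{
    if (++h->consecutive_failures >= FAIL_THRESHOLD)
        h->offline = 1;
    return h->offline;
}
```

Since many request threads would touch the same counter, a real version
would have to update it under the same lock that protects the backend
state.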

Maybe there is a simpler and more elegant solution for this type of 
condition, but I believe it needs to be handled differently than it is 
right now. 

In summary, we'd like to see:
1. A separate ConnectTimeOut variable to be used on connects.  TimeOut 
would be used for read/gets, and also for connects if ConnectTimeOut is 
not defined.
2. Don't automatically kill a backend, inside thr_http, if connect_nb fails.

Albert