bogus BackEnd dead and resurrect messages
spoke ma <lstest1(at)yahoo.com>
2004-07-11 16:12:29
Hello,

First off, I'd like to thank Robert and everyone who
helped out.  Pound is a great product - I'm using it
to make my LVS load balancer do double duty as an SSL
accelerator for http://www.lightspoke.com - a free
online database service that delivers web-based
business applications in 60 seconds.

Now here is my problem - I'm getting periodic dead /
resurrect messages from Pound.  If I check every 30
seconds, I might get a dead/resurrect message every 5 -
15 minutes.  Once this message appears, all user
sessions on the dead/resurrect server are broken because
Pound figures the server is dead and re-routes them to
another server which has no session state for these
users.

So as a stopgap measure, I've turned off checking
completely via
CheckURL 0
and this seems to stop the dead/resurrect messages.

Unlike Brook Stevens in his April post about the same
problem, I do not have any long-running processes - so
this shouldn't be a timeout issue.

I've tried:
1. using a different HA-PORT option to do the checking,
2. larger alive periods,
3. increasing timeouts with the Server and Client
parameters.
None of these seems to work.  With longer checking periods, I go
longer without a dead/resurrect but then eventually I
hit one and lose session state.

Any help you can give would be greatly appreciated.

Thank you in advance!



I'm running RedHat 9 kernel v2.4
I should note that my kernel was upgraded with the
ultramonkey (LVS) from 20-8 -> 30.9
[root(at)hualian store]# rpm -qa|grep kernel
kernel-pcmcia-cs-3.1.31-13
kernel-source-2.4.20-8
kernel-2.4.20-30.9.um.2



Here is a config file:

User xx
Group xxx
RootJail /xxxxxx/xxxxx
ListenHTTP xxx.xxx.xxx.xxx,81
ListenHTTPS xxx.xxx.xxx.xxx,443 /xxxxx/xxxx/xx.pem

UrlGroup ".*"
BackEnd 192.168.57.80,8080,5
BackEnd 192.168.57.81,8080,5
BackEnd 192.168.57.82,8080,5
Session IP 100800
EndGroup

LogLevel 0
#Alive 18000
Server 300
Client 30
CheckURL 0




Here is my log:

Jun 15 18:10:00 hualian pound: error flush to 216.127.43.100: Broken pipe
Jun 15 18:14:44 hualian pound: BackEnd 192.168.57.80 is dead
Jun 15 18:14:44 hualian pound: BackEnd 192.168.57.80 resurrect
Jun 15 18:15:00 hualian pound: error flush to 216.127.43.100: Broken pipe
Jun 15 18:20:00 hualian pound: error copy response body: Broken pipe
Jun 15 18:25:00 hualian pound: error copy response body: Broken pipe
Jun 15 18:28:14 hualian pound: BackEnd 192.168.57.80 is dead
Jun 15 18:28:14 hualian pound: BackEnd 192.168.57.80 resurrect
Jun 15 18:30:00 hualian pound: error flush to 216.127.43.100: Broken pipe
Jun 15 18:35:00 hualian pound: error flush to 216.127.43.100: Broken pipe
Jun 15 18:40:01 hualian pound: error flush to 216.127.43.100: Broken pipe
Jun 15 18:45:00 hualian pound: error flush to 216.127.43.100: Broken pipe
Jun 15 18:46:36 hualian pound: BackEnd 192.168.57.80 is dead
Jun 15 18:46:36 hualian pound: BackEnd 192.168.57.80 resurrect
Jun 15 18:48:06 hualian pound: BackEnd 192.168.57.80 is dead
Jun 15 18:48:06 hualian pound: BackEnd 192.168.57.80 resurrect
Jun 15 18:50:00 hualian pound: error flush to 216.127.43.100: Broken pipe
Jun 15 18:53:27 hualian pound: BackEnd 192.168.57.80 is dead
Jun 15 18:53:27 hualian pound: BackEnd 192.168.57.80 resurrect
Jun 15 18:55:00 hualian pound: error flush to 216.127.43.100: Broken pipe
Jun 15 18:55:57 hualian pound: BackEnd 192.168.57.80 is dead
Jun 15 18:55:57 hualian pound: BackEnd 192.168.57.80 resurrect

--
Matthew Ma
Lightspoke (www.lightspoke.com) == Web Apps in 60 Seconds



		[...]

Re: bogus BackEnd dead and resurrect messages
Robert Segall <roseg(at)apsis.ch>
2004-07-12 14:20:38
On Sunday 11 July 2004 16.12, spoke ma wrote:[...]

CheckURL has nothing to do with dead back-ends. If you enable CheckURL then 
some requests will not go through at all. This may have the side-effect of 
reducing the load on your back-ends, and thus avoiding a time-out.
[...]

The HA_PORT number is of no consequence.
[...]

Actually you want shorter periods, so dead back-ends come back online faster.
[...]

Increase the Server, possibly decrease the Client.
[...]

If you already have Server 300 the problem is NOT with a back-end not 
responding, but most likely that a new connection cannot be opened to a 
back-end.
[...]

Interestingly enough, only 192.168.57.80 seems to have the problem - the other 
back-ends are OK. I suggest two things to try:

- decrease the priority for this server (e.g. BackEnd 192.168.57.80,8080,1), 
so it will take less load.

- check the system configuration for possible resource starvation (number of 
allowed sockets, file descriptors, network time-outs, etc).[...]
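
To confirm that only one back-end is affected, the "is dead" lines can be tallied per address. The following is a minimal Python sketch, assuming syslog lines in the format quoted earlier in this thread. The function name and the sample text are illustrative and not part of Pound.

```python
import re
from collections import Counter

# Matches entries such as
#   "Jun 15 18:14:44 hualian pound: BackEnd 192.168.57.80 is dead"
DEAD_RE = re.compile(r"pound: BackEnd (\S+)\s+is dead")


def count_dead_events(log_text):
    """Return a Counter mapping back-end address to its number of 'is dead' entries."""
    # Mail clients may wrap long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return Counter(DEAD_RE.findall(flat))


# Small excerpt in the same shape as the posted log (wrapped as in the mail).
SAMPLE = """\
Jun 15 18:14:44 hualian pound: BackEnd 192.168.57.80
is dead
Jun 15 18:14:44 hualian pound: BackEnd 192.168.57.80
resurrect
Jun 15 18:28:14 hualian pound: BackEnd 192.168.57.80
is dead
"""

if __name__ == "__main__":
    for backend, n in count_dead_events(SAMPLE).most_common():
        print(backend, n)
```

Run against the full syslog file, a heavily skewed count for one address would support looking at that machine first.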

Re: bogus BackEnd dead and resurrect messages
Robert Segall <roseg(at)apsis.ch>
2004-07-12 18:15:20
On Monday 12 July 2004 17.58, you wrote:[...]

No - it just leaves Alive at the default value (30 seconds). Alive is the time 
interval at which to check for the server coming back on-line. A server is 
declared dead if a connect(), read() or write() to it fails (possibly because 
of a time-out).
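
To see whether new connections to a back-end are slow or failing, a single timed connect can be run from the Pound machine. This is a rough Python sketch of such a probe, not Pound's own code: the function name and the default timeout are assumptions, and it only exercises connect(), not read() or write().

```python
import socket
import time


def probe_connect(host, port, timeout=5.0):
    """Attempt one TCP connect; return (ok, seconds_elapsed)."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        ok = False
    return ok, time.monotonic() - start


# Example call, using a back-end address and port from the posted config:
#   ok, elapsed = probe_connect("192.168.57.80", 8080)
#   print("connected" if ok else "failed", round(elapsed, 3))
```

Running the probe in a loop during busy periods and logging the elapsed time may show whether connects occasionally stall on the affected back-end.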
[...]

Are you by any chance running 1.7? There were some known issues with it - make 
sure you run -current.
[...]

ldirectord checks that the machine is up, not the reaction time for a Web 
server. Not quite the same thing.

I would still check on resource availability on the back-ends - stuff like 
sockets and network buffers.
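
One resource that is easy to inspect is the per-process limit on open file descriptors, which also caps the number of sockets a process can hold. Below is a small Python sketch using the standard `resource` module (Unix only). The function names are illustrative, and the values reported are those of the process running the script, so it would need to run as the same user and under the same limits as the web server to be meaningful.

```python
import resource


def fd_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = fd_limits()
    print("open files: soft =", describe(soft), "hard =", describe(hard))
```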
[...]

That is typical of resource starvation: Pound got a time-out from the 
back-end, declared the back-end to be dead, no further requests were 
forwarded and the next resurrection check therefore succeeded. In a 
high-traffic situation a fraction of a second may make a significant 
difference.
[...]

That is usually an indication of a (too) slow response time - thus the server 
is considered to be dead.
[...]

This doesn't mean much - it just records the requests that were answered 
successfully.
[...]

If you use 1.7 - upgrade to -current and check again.

If you use -current - check on the DNS resolution (slow name resolution may 
cause problems) and time-outs on connect(). I would also check on the Pound 
machine itself, as a timeout may occur due to resource starvation there as 
well (packet queues, thread starvation).
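
Name resolution time on the Pound machine can be measured directly. This is a minimal Python sketch, assuming the system resolver is what matters here; the function name is hypothetical, and a numeric address such as those in the posted config will normally resolve without a DNS lookup.

```python
import socket
import time


def time_resolution(name, port=80):
    """Resolve name via the system resolver and return the seconds it took.

    Raises socket.gaierror if the name cannot be resolved.
    """
    start = time.monotonic()
    socket.getaddrinfo(name, port, type=socket.SOCK_STREAM)
    return time.monotonic() - start


# Example call with a host name used on the machine in question:
#   print(round(time_resolution("www.lightspoke.com"), 3))
```

Repeated measurements are more informative than a single one, since an occasional slow lookup is the case of interest.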

Please keep this on the list - I am sure others are interested in it as
well.[...]
