Re: [Zope] Re: zope unresponsive

27 Feb 2007


      On 2/27/07, Paul Williams <pwilliams@diamonddata.com> wrote:
...
Tres Seaver wrote:
...
Paul Williams wrote:
...
Ok, here is what we have.  I did a netstat on both machines, client and
server.  The client sees an established connection and the server does
not.  In the server log there is a disconnect.  As far as hardware
between them, there is a switch (dell powerconnect 6024).  Web Server
Directors might get hold of it but there are no hops on traceroute.
Traceroute only shows the client machine and the server machine.
So the client is just continuously polling the connection but getting
nothing back.
That sounds like some weird kernel / networking problem to me:  I don't
see how Zope could be able to keep calling 'select' on a socket after
the other side has closed it.
We agree.  This is a strange situation that none of us have seen before.
However, we have until tomorrow to do something and replacing hardware
is not feasible.
...
Is there any possibility that some kind of failover / IP takeover has
happened, such that the storage server now running is not the same host
/ instance as the one to which the clients originally connected?  Are
you using LVS + heartbeat, or some kind of hardware load balancer to
manage such redundancy?
We do have Web Services Directors that do load balancing, but in this
particular case, the storage server is not set up for load balancing.  I
am not aware of any features that make the zodb capable of clustering
except for replication services offered through zope.
We are not sure whether the traffic is going to the Web Services
Directors or not.  Even if it is, there are thousands of settings and
there is no one available who knows what to change.
The storage server is a simple nas server with a static ip address.
...
...
What we are thinking about doing is changing the code in
zrpc/connection.py to close the connection in wait (line 638 zope
version 2.9.5) if the wait time gets too large or the poll has happened
too many times.
We are great at plone development, but have very little backend zope
development.  Would someone please advise me as to whether this is going
to cause more problems?
According to the log message you posted earlier in the thread, your
appservers are spewing thousands of log messages from the connection's
'pending' method, although your deadlock debugger output shows the one
thread blocked on 'select' inside of the connection's 'wait' method.
There should be lots of log messages at TRACE level for the wait call,
including a doubling / backoff of the delay value from 1 ms to 1 sec.
Do you see those log messages, as well?
These messages are there.  You can see the time doubling.  This is where
we were thinking of breaking the connection once the delay gets to a
certain point and making zope reconnect.
This solves our hung connection problem, we think.  However, I am hoping
someone can let me know if I am breaking something else by doing this.
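
For illustration only, here is a rough sketch of the idea being discussed:
poll a socket with a delay that doubles from 1 ms up to 1 sec, and give up
once the total wait or the number of polls gets too large, so the caller can
close the connection and reconnect.  This is not the code in
zrpc/connection.py; the names wait_for_reply, WaitTimedOut, max_total and
max_polls are hypothetical, and the limits are arbitrary.

```python
import select
import socket
import time


class WaitTimedOut(Exception):
    """No reply arrived within the allowed time or number of polls."""


def wait_for_reply(sock, max_total=30.0, max_polls=50):
    """Poll sock for readability with a doubling delay (1 ms up to 1 sec).

    Hypothetical helper, not part of ZEO.  Returns True once the socket is
    readable; raises WaitTimedOut when either limit is exceeded so the
    caller can close the socket and reconnect.
    """
    delay = 0.001
    start = time.monotonic()
    polls = 0
    while True:
        readable, _, _ = select.select([sock], [], [], delay)
        if readable:
            return True
        polls += 1
        waited = time.monotonic() - start
        if polls >= max_polls or waited >= max_total:
            raise WaitTimedOut(
                "no reply after %d polls (%.3f sec)" % (polls, waited)
            )
        # Back off: double the delay, capped at one second.
        delay = min(delay * 2, 1.0)


if __name__ == "__main__":
    # A local socket pair stands in for the client/server connection.
    client, server = socket.socketpair()
    try:
        wait_for_reply(client, max_total=5.0, max_polls=5)
    except WaitTimedOut as exc:
        print("giving up, would close and reconnect:", exc)
    finally:
        client.close()
        server.close()
```

Whether closing the connection at that point is safe depends on what the
caller does with requests that were still outstanding, which is the open
question in this thread.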
I don't remember if you already mentioned it. However: did you try
to monitor the traffic outgoing and incoming? I mean, setting some
iptables rules and/or using something like tcpdump to monitor what is
going on here?

Regards
Marco


-- 
Marco Bizzarri
http://iliveinpisa.blogspot.com/