[Zope] Re: zope unresponsive

Tue Feb 27 09:43:30 EST 2007

On 2/27/07, Paul Williams <pwilliams at diamonddata.com> wrote:
>
>
> Tres Seaver wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > Paul Williams wrote:
> >> Ok, here is what we have.  I did a netstat on both machines, client and
> >> server.  The client sees an established connection and the server does
> >> not.  In the server log there is a disconnect.  As far as hardware
> >> between them, there is a switch (Dell PowerConnect 6024).  Web Server
> >> Directors might get hold of it but there are no hops on traceroute.
> >> Traceroute only shows the client machine and the server machine.
> >>
> >> So the client is just continuously polling the connection but getting
> >> nothing back.
> >
> > That sounds like some weird kernel / networking problem to me:  I don't
> > see how Zope could be able to keep calling 'select' on a socket after
> > the other side has closed it.
>
> We agree.  This is a strange situation that none of us have seen before.
>
> However, we have until tomorrow to do something and replacing hardware
> is not feasible.
>
>
> >
> > Is there any possibility that some kind of failover / IP takeover has
> > happened, such that the storage server now running is not the same host
> > / instance as the one to which the clients originally connected?  Are
> > you using LVS + heartbeat, or some kind of hardware load balancer to
> > manage such redundancy?
>
> We do have Web Services Directors that do load balancing, but in this
> particular case, the storage server is not set up for load balancing.  I
> am not aware of any features that make the ZODB capable of clustering
> except for replication services offered through Zope.
>
> We are not sure whether the traffic is going to the Web Services
> Directors or not.  Even if it is, there are thousands of settings and
> there is no one available who knows what to change.
>
>
> The storage server is a simple NAS server with a static IP address.
>
> >
> >> What we are thinking about doing is changing the code in
> >> zrpc/connection.py to close the connection in wait (line 638, Zope
> >> version 2.9.5) if the wait time gets too large or the poll has happened
> >> too many times.
> >>
> >> We are great at Plone development, but have very little experience
> >> with backend Zope development.  Would someone please advise me as to
> >> whether this is going to cause more problems?
> >
> > According to the log message you posted earlier in the thread, your
> > appservers are spewing thousands of log messages from the connection's
> > 'pending' method, although your deadlock debugger output shows the one
> > thread blocked on 'select' inside of the connection's 'wait' method.
> > There should be lots of log messages at TRACE level for the wait call,
> > including a doubling / backoff of the delay value from 1 ms to 1 sec.
> > Do you see those log messages, as well?
>
> These messages are there.  You can see the time doubling.  This is where
> we were thinking of breaking the connection once the delay gets to a
> certain point, and making Zope reconnect.
>
> This solves our hung connection problem, we think.  However, I am hoping
> someone can let me know if I am breaking something else by doing this.
>
>

I don't remember if you already mentioned it. However: did you try
to monitor the outgoing and incoming traffic? I mean, setting some
iptables rules and/or using something like tcpdump to see what is
actually going over the wire between the client and the storage server?
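
As for the change to 'wait' that you describe: I cannot say whether it
is safe inside zrpc/connection.py, but the idea itself (keep doubling
the poll delay, and give up and close the connection once the total wait
gets too large) could be sketched roughly like this. This is only an
illustration in modern Python, not the ZEO code and not a patch for it:
wait_for_reply, ReplyTimeout, poll, close and max_wait are all names I
made up for the example.

```python
import time


class ReplyTimeout(Exception):
    """Hypothetical error: no reply arrived within the allowed time."""


def wait_for_reply(poll, close, max_wait=30.0, first_delay=0.001,
                   max_delay=1.0, clock=time.monotonic):
    """Poll for a reply with a doubling delay; close and give up after max_wait.

    poll(delay) -> reply or None: stand-in for the select/poll step.
    close(): stand-in for closing the hung connection so that the
    caller can reconnect.
    """
    deadline = clock() + max_wait
    delay = first_delay
    while True:
        reply = poll(delay)
        if reply is not None:
            return reply
        if clock() >= deadline:
            # Waited too long: drop the connection instead of polling forever.
            close()
            raise ReplyTimeout("no reply after %.1f seconds" % max_wait)
        if delay < max_delay:
            # Back off: 1 ms, 2 ms, 4 ms, ... capped at max_delay.
            delay = min(delay * 2, max_delay)
```

The caller would catch ReplyTimeout and reconnect. What such a sketch
cannot tell you is what happens to a request that was already in flight
when the connection is dropped, so that is the part I would test before
putting anything like it into production.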

Regards
Marco

-- 
Marco Bizzarri
http://iliveinpisa.blogspot.com/