[Zope] Scaling problems, or something else?

Wed May 9 10:41:22 EDT 2007

On Wed, May 09, 2007 at 12:07:54PM +0200, Gaute Amundsen wrote:
> Just a quick call for ideas on this problem we have...
> 
> Setup: 
> Zope 2.7.5 (~180 sites) -> apache (-> varnish for 1 high profile site)
> 
> Most noticeable symptoms: 
> Takes 30 sec. or more to serve a page, or times out.
> Sporadic problem, but always during general high load.
> Lasts less than 1 hour. 
> Restarting zope does not help.
> Lots of apache processes in '..reading..' state
> Apache accesses and volume are down.
> Server load is unchanged, and < 2.0
> Apache processes are way up (~250 against <40)
> Netstat "established" connections are WAY up (~650 against < 50)

The increase in netstat connections and apache processes indicates
lots of simultaneous traffic, but it's interesting that Apache
accesses are down.  Since hits are logged only on completion, it may
be that many of the requests are hung.
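
To watch whether connections are piling up while completed hits drop,
something like the rough sketch below could be run every minute or so
during an incident.  It only tallies TCP states from netstat output.
The function name count_tcp_states is made up for illustration, and it
assumes the Linux column layout of `netstat -tan` (state in the sixth
column).

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Tally TCP connection states from the text printed by `netstat -tan`.

    Assumes the Linux layout: Proto Recv-Q Send-Q Local Foreign State.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and non-TCP rows (e.g. udp rows have no state column).
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            counts[fields[5]] += 1
    return counts


# Possible usage (not executed here, since it needs netstat installed):
#   import subprocess
#   out = subprocess.run(["netstat", "-tan"],
#                        capture_output=True, text=True).stdout
#   print(count_tcp_states(out).most_common())
```

A steadily growing ESTABLISHED count alongside a falling access-log
rate would be consistent with requests that are accepted but never
finish.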

> Is this zope hitting some sort of limit and just letting Apache hang? 
> Would setting up ZEO on the same box make a difference,

ZEO doesn't buy you any performance unless you have multiple Zope
clients reading from it, and a load balancer in front.  This will help
IF your application is CPU-bound, which yours is not (I assume by
server load you mean CPU).

ZEO can actually *hurt* if you're IO-bound, because it adds network
overhead to ZODB reads and writes. It's very bad if you have large
Image or File objects (which you probably shouldn't have in the ZODB
anyway).

> or would it be better 
> to extend varnish coverage?

Probably a good idea anyway... but you want to find out what the
problem really is.

> What would you do to pinpoint the problem?

I'd first try hitting Zope directly (bypassing Apache) during the
problem to see if the slowdown is there.  If so, I'd then try either
of these:

- DeadlockDebugger, which dumps a traceback of every Zope thread and
  so may show what they are stuck on.
  http://www.zope.org/Members/nuxeo/Products/DeadlockDebugger

- Enable Zope's trace log and use requestprofiler.py to see if there
  is a pattern to the requests that trigger the problem.  E.g. maybe
  all your Zope worker threads are waiting on some slow IO task. See
  the logger section of zope.conf.
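
As a rough illustration of the kind of pattern to look for in the
trace log, here is a minimal sketch (not a replacement for
requestprofiler.py).  It assumes each trace line looks like
"CODE REQUEST_ID YYYY-MM-DDTHH:MM:SS details", with code B when a
request begins and E when it ends; check that against your own log
before trusting it.  The name summarize_trace is hypothetical, and the
code is written for a current Python 3, not the Python that runs
Zope 2.7.

```python
from datetime import datetime

STAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"  # assumed timestamp layout


def summarize_trace(lines):
    """Return (finished, unfinished) from an iterable of trace log lines.

    finished:   list of (seconds, description), slowest first
    unfinished: dict of request id -> (start time, description) for
                requests that have a B line but no matching E line
    """
    started = {}
    finished = []
    for line in lines:
        parts = line.split(None, 3)
        if len(parts) < 3:
            continue  # not a trace line we understand
        code, req_id, stamp = parts[:3]
        detail = parts[3].strip() if len(parts) > 3 else ""
        try:
            when = datetime.strptime(stamp, STAMP_FORMAT)
        except ValueError:
            continue  # unexpected timestamp, skip the line
        if code == "B":
            started[req_id] = (when, detail)
        elif code == "E" and req_id in started:
            begin, what = started.pop(req_id)
            finished.append(((when - begin).total_seconds(), what))
    finished.sort(reverse=True)
    return finished, started
```

If the unfinished requests during an incident all point at the same
few URLs, that is a good hint about which code is blocking the worker
threads.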

-- 

Paul Winkler
http://www.slinkp.com