[Zope] Scaling problems, or something else?
Gaute Amundsen
gaute at div.org
Wed May 9 20:10:40 EDT 2007
On Wednesday 09 May 2007 16:41, Paul Winkler wrote:
> On Wed, May 09, 2007 at 12:07:54PM +0200, Gaute Amundsen wrote:
> > Just a quick call for ideas on this problem we have...
> >
> > Setup:
> > Zope 2.7.5 (~180 sites) -> apache (-> varnish for 1 high profile site)
> >
> > Most noticeable symptoms:
> > Takes 30 sec. or more to serve a page, or times out.
> > Sporadic problem, but always during general high load.
> > Lasts less than 1 hour.
> > Restarting zope does not help.
> > Lots of apache processes in '..reading..' state
> > Apache accesses and volume are down.
> > Server load is unchanged, and < 2.0
> > Apache processes are way up (~250 against <40)
> > Netstat "established" connections are WAY up (~650 against < 50)
>
> The increase in netstat connections and apache processes indicates
> lots of simultaneous traffic, but it's interesting that Apache
> accesses are down. Since hits are logged only on completion, it may be
> that many of the requests are hung.
>
That was my reasoning too.
> > Is this zope hitting some sort of limit and just letting Apache hang?
> > Would setting up ZEO on the same box make a difference,
>
> ZEO doesn't buy you any performance unless you have multiple Zope
> clients reading from it, and a load balancer in front. This will help
> IF your application is CPU-bound, which yours is not (I assume by
> server load you mean CPU).
So there is no other possible limit in a zope instance than IO or CPU?
If CPU were the limiting factor, I would see the 2 python processes running at 90%
and dozens of httpd's taking up the rest?
Can you think of any good parameters I can get at with a small script that
would be good to graph with all the rest to shed some light on this?
(We are using munin.)
Something out of Control_Panel/Database/main/manage_activity perhaps?
Is there a way to get that data out without going through port 8080?
How about something out of /proc/`cat /home/zope/sites/site1/var/Z2.pid`/XXX?
Need to read up on procfs I guess.
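As a sketch of the procfs idea (assuming Linux, and the Z2.pid path mentioned above; the helper names here are made up), the per-process numbers worth graphing are `Threads` and `VmRSS` from `/proc/<pid>/status`:

```python
# Sketch only: read thread count and resident memory for the Zope
# process out of /proc/<pid>/status (Linux procfs).
def parse_status(text):
    """Turn the 'Key: value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def zope_stats(pidfile="/home/zope/sites/site1/var/Z2.pid"):
    """Return Threads and VmRSS for the pid recorded in Z2.pid."""
    # Z2.pid may hold more than one pid; take the first.
    pid = open(pidfile).read().split()[0]
    with open("/proc/%s/status" % pid) as f:
        fields = parse_status(f.read())
    return {"threads": fields.get("Threads"),
            "vmrss": fields.get("VmRSS")}
```

Wrapping `zope_stats()` in a munin plugin would then just be a matter of printing one `fieldname.value` line per number.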
> ZEO can actually *hurt* if you're IO-bound, because it adds network
> overhead to ZODB reads and writes. It's very bad if you have large
> Image or File objects (which you probably shouldn't have in the ZODB
> anyway).
>
Good to hear. I was not particularly relishing the thought of the necessary
load balancing on that single box either :-/
> > or would it be better
> > to extend varnish coverage?
>
> Probably a good idea anyway... but you want to find out what the
> problem really is.
>
> > What would you do to pinpoint the problem?
>
> I'd first try hitting Zope directly during the problem to see if the
> slowdown is there. If so, I'd then try either:
>
Should be possible with lynx on localhost.
Have done that before for other purposes; should have thought of that.
Maybe I will start logging the response time directly like that!
Hm.. good idea :)
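Logging the response time from localhost could look something like this sketch (the URL and the munin field name are assumptions; a munin plugin just prints `field.value N` lines):

```python
# Sketch: time a request against Zope directly (bypassing apache) and
# print the result in munin's "field.value N" format.
import time
import urllib.request

ZOPE_URL = "http://localhost:8080/"  # example URL; point at a real page

def response_time(url, timeout=30):
    """Seconds taken to fetch url, or None on error/timeout."""
    start = time.time()
    try:
        urllib.request.urlopen(url, timeout=timeout).read()
    except OSError:
        return None
    return time.time() - start

def munin_report(seconds):
    """One munin value line; 'U' marks an unknown/failed sample."""
    value = "U" if seconds is None else "%.3f" % seconds
    return "responsetime.value %s" % value

if __name__ == "__main__":
    print(munin_report(response_time(ZOPE_URL)))
```

Reporting `U` on timeout means the graph shows a gap during exactly the episodes being hunted, which is itself informative.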
> - DeadlockDebugger may be informative.
> http://www.zope.org/Members/nuxeo/Products/DeadlockDebugger
>
Sounds a little drastic on a production server, but it may still come to that..
Ought to test it out on another server I guess.
> - Enable Zope's trace log and use requestprofiler.py to see if there
> is a pattern to the requests that trigger the problem. Eg. maybe
> all your zope worker threads are waiting on some slow IO task. See
> the logger section of zope.conf.
That looks interesting, except that it can take 15 minutes or more to restart
zope when load is at its worst. I could try it outside of peak hours, I guess.
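For later reference, the trace log is enabled via the logger section of zope.conf; from memory the stanza looks roughly like this (the path is an example, and the default zope.conf ships a commented-out sample to copy, so check that rather than trusting this sketch):

```
<logger trace>
  level WARN
  <logfile>
    path /home/zope/sites/site1/log/trace.log
    format %(message)s
  </logfile>
</logger>
```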
Thanks for the input.
Really helped get me unstuck, as you can see :)
Regards
Gaute