On Wednesday 09 May 2007 16:41, Paul Winkler wrote:
On Wed, May 09, 2007 at 12:07:54PM +0200, Gaute Amundsen wrote:
Just a quick call for ideas on this problem we have...
Setup: Zope 2.7.5 (~180 sites) -> apache (-> varnish for 1 high profile site)
Most noticeable symptoms:
- Takes 30 sec. or more to serve a page, or times out.
- Sporadic problem, but always during general high load. Lasts less than 1 hour.
- Restarting Zope does not help.
- Lots of apache processes in '..reading..' state.
- Apache accesses and volume are down.
- Server load is unchanged, and < 2.0.
- Apache processes are way up (~250 against < 40).
- Netstat "established" connections are WAY up (~650 against < 50).
The increase in netstat connections and apache processes indicates lots of simultaneous traffic, but it's interesting that Apache accesses is down. Since hits are logged only on completion, it may be that many of the requests are hung.
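For keeping an eye on those counts over time, a minimal sketch of counting the established connections programmatically, e.g. to graph alongside the apache process count (assumes net-tools `netstat -tn` output; the helper name is made up):

```python
# Sketch: count ESTABLISHED connections out of `netstat -tn` output.
# Assumes net-tools netstat; the function name is made up for illustration.
import subprocess

def count_established(netstat_output):
    """Count lines whose last column is the ESTABLISHED state."""
    return sum(1 for line in netstat_output.splitlines()
               if line.split() and line.split()[-1] == "ESTABLISHED")

if __name__ == "__main__":
    try:
        out = subprocess.check_output(["netstat", "-tn"]).decode()
        print("established.value %d" % count_established(out))
    except OSError:
        pass  # netstat not available on this box
```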
That was my reasoning too.
Is this Zope hitting some sort of limit and just letting Apache hang? Would setting up ZEO on the same box make a difference,
ZEO doesn't buy you any performance unless you have multiple Zope clients reading from it, and a load balancer in front. This will help IF your application is CPU-bound, which yours is not (I assume by server load you mean CPU).
So there is no other possible limit in a Zope instance than IO or CPU? If CPU was the limiting factor, I would see the 2 python processes running at 90% and dozens of httpd's taking up the rest? Can you think of any good parameters I can get at with a small script that would be good to graph with all the rest, to shed some light on this? (we are using munin) Something out of Control_Panel/Database/main/manage_activity perhaps? Is there a way to get that data out without going through port 8080? How about something out of /proc/`cat /home/zope/sites/site1/var/Z2.pid`/XXX? Need to read up on procfs I guess.
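Along those lines, a sketch of what a munin-style probe reading /proc directly might look like, pulling thread and file-descriptor counts for the Zope process without touching port 8080 (the pid-file path is the one from this thread; /proc layout is Linux-specific):

```python
# Sketch: read thread and open-fd counts for the Zope process out of
# /proc, so they can be graphed without an HTTP request to Zope itself.
# The pid-file path is the one mentioned above; adjust per instance.
import os

PIDFILE = "/home/zope/sites/site1/var/Z2.pid"

def zope_proc_stats(pidfile=PIDFILE):
    """Return (thread_count, open_fd_count) for the pid in pidfile."""
    pid = open(pidfile).read().split()[0]
    threads = 0
    # /proc/<pid>/status is "Key:<tab>value" lines; "Threads:" is one of them
    for line in open("/proc/%s/status" % pid):
        if line.startswith("Threads:"):
            threads = int(line.split()[1])
    # every open socket/file shows up as an entry in /proc/<pid>/fd
    fds = len(os.listdir("/proc/%s/fd" % pid))
    return threads, fds

if __name__ == "__main__":
    import sys
    # demo: pass a pid-file path on the command line
    if len(sys.argv) > 1:
        threads, fds = zope_proc_stats(sys.argv[1])
        print("zope_threads.value %d" % threads)
        print("zope_fds.value %d" % fds)
```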
ZEO can actually *hurt* if you're IO-bound, because it adds network overhead to ZODB reads and writes. It's very bad if you have large Image or File objects (which you probably shouldn't have in the ZODB anyway).
Good to hear. I was not particularly relishing the thought of the necessary load balancing on that single box either :-/
or would it be better to extend varnish coverage?
Probably a good idea anyway... but you want to find out what the problem really is.
What would you do to pinpoint the problem?
I'd first try hitting Zope directly during the problem to see if the slowdown is there. If so, I'd then try either:
Should be possible with lynx on localhost. Have done that before for other purposes; should have thought of that. Maybe I will start logging the response time directly like that! Hm.. good idea :)
- DeadlockDebugger may be informative. http://www.zope.org/Members/nuxeo/Products/DeadlockDebugger
Sounds a little drastic on a production server, but it may still come to that.. Ought to test it out on another server I guess.
- Enable Zope's trace log and use requestprofiler.py to see if there is a pattern to the requests that trigger the problem. Eg. maybe all your zope worker threads are waiting on some slow IO task. See the logger section of zope.conf.
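For reference, the trace log is enabled with a logger section roughly like this in zope.conf (from memory; there should be a commented-out example in the skeleton zope.conf that ships with the instance, so check that for the exact syntax):

```
<logger trace>
  level WARN
  <logfile>
    path $INSTANCE/log/trace.log
    format %(message)s
  </logfile>
</logger>
```

requestprofiler.py lives in the utilities directory of the Zope software home and can then be pointed at the resulting trace.log.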
That looks interesting, except that it can take 15 minutes or more to restart Zope when load is at its worst. I could try it outside of peak hours I guess. Thanks for the input. Really helped get me unstuck, as you can see :) Regards Gaute