[ZWeb] DISCUSS: Monitoring Zope.org

Simon Coles simon@nipltd.com
Mon, 4 Sep 2000 13:48:29 +0100


>As many of you know, Zope.org has been sluggish and unresponsive at
>times this week.  For the most part, this is related to our decision to
>use it for quickly baking important software like ZEO.  As an aside, the
>irony is that ZEO, once baked, will make Zope.org more immune to
>downtime.  Go figure.

Ah, the price of progress :-)

>Anyway, we should have a discussion related to this question:
>
>"How can the community find out the health of Zope.org when things are
>flaky?"
>
>I think it would be pretty useful if the community could get self-help
>on answering the "what's wrong" question.  Imagine this exchange:

Yes, but of course the next question after they get an answer to "Is 
Zope.org down?" will be "What can I do to get what I need?". I'll 
deal with the second question later...

[snip community telling each other what's up]

>With that in mind, I have a specific proposal to help.  I think we
>should:
>
>1) Zope.org sits behind Apache using mod_proxy for integration.  We
>should find out what is the timeout for proxy connections.  If it is
>configurable, we should dial it _way_ down (e.g. 20 seconds).  If Zope
>doesn't respond, get it to say so in a reasonable period of time.
>
>2) Next, we should hack the Apache error page to be meaningful.  For
>instance:

Yeah, we do that here, works well. At some point we may get it to 
send a mail out whenever it has to send that page out on the 
principle that seeing the extent of the problem helps our admin 
people prioritise :-).

More generally, we monitor site health. There are a number of tools 
around, but I ended up writing my own when we were having
performance problems (with Domino :-) and I wanted to get some 
metrics. Nothing fancy, I just wanted to solve the immediate problem.

So I have some code which, at random intervals, measures the time a
number of sites take to respond to an HTTP request. Then you graph
these and check the correlation between them (currently this is done
manually). Picking the sites correctly can tell you if it's a problem
with Internet connectivity, with Apache, with Zope, or with the
machine itself etc.
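
To give a feel for the idea, here is a minimal sketch in Python. It is
not the code mentioned above; the function names, the 20 second
timeout and the wait bounds are all assumptions made up for
illustration. It times a fetch of each URL at random intervals, and
computes a Pearson correlation between two series of response times
(a failed fetch is recorded as None and would need filtering out
before correlating).

```python
import http.client
import math
import random
import time
import urllib.request


def time_request(url, timeout=20.0):
    """Return the seconds taken to fetch url, or None if the fetch fails."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
    except (OSError, http.client.HTTPException):
        # Covers DNS failures, refused connections, timeouts and HTTP errors.
        return None
    return time.monotonic() - start


def sample(urls, rounds, min_wait=30.0, max_wait=300.0):
    """Probe every URL `rounds` times, sleeping a random interval between rounds."""
    results = {url: [] for url in urls}
    for round_number in range(rounds):
        for url in urls:
            results[url].append(time_request(url))
        if round_number < rounds - 1:
            time.sleep(random.uniform(min_wait, max_wait))
    return results


def pearson(xs, ys):
    """Pearson correlation of two equal-length series of response times."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A flat series carries no information; report no correlation.
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

If an unrelated site and Zope.org slow down together (correlation near
1), suspect the network path from the monitoring machine; if only
Zope.org slows down, suspect something at the Zope.org end.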

This could probably be modified to a page which gives an indication 
of the Zope.org 'weather', not only at a particular instant but also 
over a period of time. It could also page people :-)
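
As a rough sketch of how such a 'weather' reading might be derived
(the labels, the 5 second threshold and the 50% failure cut-off are
all invented for illustration, not anything we run today):

```python
from statistics import median


def weather(samples, slow_after=5.0):
    """Summarise response times (seconds; None = no response) as 'weather'."""
    if not samples:
        return "unknown"
    answered = [s for s in samples if s is not None]
    failed_share = 1 - len(answered) / len(samples)
    if failed_share >= 0.5:
        # At least half of the probes got no answer at all.
        return "stormy"
    if failed_share > 0 or median(answered) > slow_after:
        # Some failures, or the typical response is slow.
        return "cloudy"
    return "sunny"
```

Calling it on the last few samples gives the weather right now;
calling it on a day's or a week's worth gives the longer-term view.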

I think it defeats the object to run these tools from the same 
network as the server, as you aren't testing from a customer's point 
of view. We could run it from here if that helps - we're a typical 
distance away from you in Internet geography.


On the "What to do when Zope.org is down" front, it really is
disruptive for people when they can't get the stuff they need. I know
long term it's being fixed, but that doesn't help us between now and
when it's nice and robust. Zope.org really is an important tool when
you're developing Zope sites, and when it's down it's almost as painful
as losing one of our own servers - certainly within the NIP office we
know when Zope.org is down, and stuff gets delayed because of it.

For a temporary solution, how about taking a (say) weekly copy of
Zope.org's contents and putting it on a separate server, read only?
Tell a few people about it, and then when Zope.org is down, those
people in the know can point the community at the temporary site.
Maybe password-protect it and change the password regularly to stop
people using that rather than the main site.

Yeah, this is a hack and it sucks. I'm sure a lot of the normal 
functionality of Zope.org won't work. But it would be quick to do, is 
better than nothing, and would help us continue working.



Simon
-- 
--------- My opinions are my own, NIP's opinions are theirs ----------
Simon J. Coles                                 Email: simon@nipltd.com
New Information Paradigms                  Work Phone: +44 1344 753703
http://www.nipltd.com/                     Work Fax:   +44 1344 753742
=============== Life is too precious to take seriously ===============