[ZWeb] DISCUSS: Monitoring Zope.org
Simon Coles
simon@nipltd.com
Mon, 4 Sep 2000 13:48:29 +0100
>As many of you know, Zope.org has been sluggish and unresponsive at
>times this week. For the most part, this is related to our decision to
>use it for quickly baking important software like ZEO. As an aside, the
>irony is that ZEO, once baked, will make Zope.org more immune to
>downtime. Go figure.
Ah, the price of progress :-)
>Anyway, we should have a discussion related to this question:
>
>"How can the community find out the health of Zope.org when things are
>flaky?"
>
>I think it would be pretty useful if the community could get self-help
>on answering the "what's wrong" question. Imagine this exchange:
Yes, but of course the next question after they get an answer to "Is
Zope.org down?" will be "What can I do to get what I need?". I'll
deal with the second question later...
[snip community telling each other what's up]
>With that in mind, I have a specific proposal to help. I think we
>should:
>
>1) Zope.org sits behind Apache using mod_proxy for integration. We
>should find out what is the timeout for proxy connections. If it is
>configurable, we should dial it _way_ down (e.g. 20 seconds). If Zope
>doesn't respond, get it to say so in a reasonable period of time.
>
>2) Next, we should hack the Apache error page to be meaningful. For
>instance:
Yeah, we do that here, works well. At some point we may get it to
send a mail out whenever it has to send that page out on the
principle that seeing the extent of the problem helps our admin
people prioritise :-).
More generally, we monitor site health. There are a number of tools
around, but I ended up writing my own when we were having
performance problems (with Domino :-) and I wanted to get some
metrics. Nothing fancy, I just wanted to solve the immediate problem.
So I have some code which at random intervals monitors the time a
number of sites take to respond to an http request. Then you graph
these and check the correlation between them (currently this is done
manually). Picking the sites correctly can tell you if it's a problem
with Internet connectivity, with Apache, with Zope, or with the
machine itself etc.
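For illustration, here is a minimal sketch of that kind of monitor. It is not the code described above: the function names (time_request, sample, pearson), the 20 second timeout, and the wait bounds are all assumptions made for the example. It times an HTTP fetch for each site at random intervals and computes a plain Pearson correlation between two series of response times.

```python
import math
import random
import time
import urllib.request


def time_request(url, timeout=20.0, opener=urllib.request.urlopen):
    """Return the seconds taken to fetch url, or None if the request failed.

    The opener parameter is a hypothetical hook so the function can be
    exercised without touching the network.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            response.read()
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None
    return time.monotonic() - start


def sample(urls, rounds, min_wait=30.0, max_wait=300.0,
           timer=time_request, sleep=time.sleep):
    """Time every url once per round, sleeping a random interval between rounds."""
    history = {url: [] for url in urls}
    for i in range(rounds):
        for url in urls:
            history[url].append(timer(url))
        if i < rounds - 1:
            sleep(random.uniform(min_wait, max_wait))
    return history


def pearson(xs, ys):
    """Pearson correlation of two equal-length series of response times.

    Failed requests (None) must be dropped from both series beforehand.
    Returns 0.0 when either series is constant.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

The choice of sites is what makes the correlation useful. If a static page served by Apache on the same machine slows down together with the Zope-served page, the machine or the network is the likelier culprit; if only the Zope page slows down, look at Zope.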
This could probably be modified to a page which gives an indication
of the Zope.org 'weather', not only at a particular instant but also
over a period of time. It could also page people :-)
I think it defeats the object to run these tools from the same
network as the server, as you aren't testing from a customer's point
of view. We could run it from here if that helps - we're a typical
distance away from you in Internet geography.
On the "What to do when Zope.org is down" front, it really is
disruptive for people when they can't get the stuff they need. I know
long term it's being fixed, but that doesn't help us between now and
when it's nice and robust. Zope.org really is an important tool when
you're developing Zope sites, and when it's down it's almost as painful
as losing one of our own servers - certainly within the NIP office we
know when Zope.org is down, and stuff gets delayed because of it.
For a temporary solution, how about taking a (say) weekly copy of
Zope.org's contents and putting it on a separate server, read-only?
Tell a few people about it, and then when Zope.org is down, those
people in the know can point the community at the temporary site.
Maybe password it and change the password regularly to prevent people
using that rather than the main site.
Yeah, this is a hack and it sucks. I'm sure a lot of the normal
functionality of Zope.org won't work. But it would be quick to do, is
better than nothing, and would help us continue working.
Simon
--
--------- My opinions are my own, NIP's opinions are theirs ----------
Simon J. Coles Email: simon@nipltd.com
New Information Paradigms Work Phone: +44 1344 753703
http://www.nipltd.com/ Work Fax: +44 1344 753742
=============== Life is too precious to take seriously ===============