[Zope] Summary : server-down / debugging

chas panda@skinnyhippo.com
Tue, 14 Dec 1999 14:29:41 +0800


Exactly a week ago, I had my server go down for 24-36 hrs
as described in a couple of "Urgent - server down" threads.

First and foremost, I would like to say a most sincere
thank you to everyone for helping. 

Secondly, my apologies for not posting a follow-up earlier
but I wanted to make sure that things really had been 
resolved first; I also wanted to try to work out the cause
of the problem.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

---------
Refresh :
---------

1) FreeBSD 3.2, Zope 2.01 PCGI to Apache, MySQL
   My Zope server kept "spinning" (consuming 90-100% CPU)
   Connecting via the web resulted in the error message :

	"Temporarily Unavailable 	
	The resource you requested is temporarily unavailable - 
	please try again later. 
	(116) unable to connect, fd=3 "
   
   Stopping and starting Zope allowed the server to run again
   for anything between 1 minute and 8 hrs (depending on load),
   until the problem reappeared.

   
2) Decided to reinstall Zope 2.10 
   Spinning disappeared, now it seemed to hang - 0% load.
   Error message displayed to users was (regardless of URL) :

	Zope Error
	Zope has encountered an error while publishing this resource. 
	Error Type: NameError
	Error Value: name_param

   Within the admin interface, the error message was different :

	Zope Error
	Zope has encountered an error while publishing this resource. 
	Error Type: NameError
	Error Value: type

   Unfortunately, I failed to get tracebacks before making changes
   that kept the server up.


----------
Solution :
----------
I started to undo the latest changes (thanks for the undo tab !). 
First thing I did was remove a property that was on one folder (not
the root folder). Since doing this, the server has not encountered
any subsequent problem. 

We're still not sure why this should be so because URLs which did 
not require acquisition or any methods/properties from the folder 
in question had been giving error messages previously. That's why
I'd never tried this before - it seemed so unnatural and unlikely
to be the cause.

I get the feeling (and I realize that this is as untechnical as
slapping the side of your TV/monitor in order to correct the screen 
display), that the Zope object database is sort of in different 
states and just needed to be changed. The reason that I say this is
that we witnessed further changes in behaviour caused by nothing
more than making minor, unrelated changes to the database. 

For example, 5 days ago, we found that we could not pack the Zope 
database (Data.fs), even though the server was apparently up and
running again.  The traceback was :

  Traceback (innermost last):
    File /var/zope/51go-p2/lib/python/ZPublisher/Publish.py, line 214, in
publish_module
    File /var/zope/51go-p2/lib/python/ZPublisher/Publish.py, line 179, in
publish
    File /var/zope/51go-p2/lib/python/Zope/__init__.py, line 202, in
zpublisher_exception_hook
    (Object: Fake)
    File /var/zope/51go-p2/lib/python/ZPublisher/Publish.py, line 165, in
publish
    File /var/zope/51go-p2/lib/python/ZPublisher/mapply.py, line 160, in
mapply
    (Object: manage_pack)
    File /var/zope/51go-p2/lib/python/ZPublisher/Publish.py, line 102, in
call_object
    (Object: manage_pack)
    File /var/zope/51go-p2/lib/python/App/ApplicationManager.py, line 338,
in manage_pack
    (Object: ElementWithAttributes)
    File /var/zope/51go-p2/lib/python/ZODB/DB.py, line 514, in pack
    File /var/zope/51go-p2/lib/python/ZODB/FileStorage.py, line 867, in pack
    (Object: /var/zope/51go-p2/var/Data.fs)
  IOError: (see above)

Yet yesterday, I tried to pack the database once more and it worked
fine. The only difference between yesterday and 5 days ago were that 
we'd added a few more DTML methods/documents to the database.


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-------------
Conclusions :
-------------

As you can see, I don't really have any conclusions - I'm not sure
why we encountered these problems.  

The biggest lesson we learned were all of the extra debugging options
that were available (many of which we were totally unaware even existed).
My advice is that, if you are not yet aware of how to use them, test
these out before you really need them in a crisis scenario.


Accumulated from posts by Michel, Brian Lloyd, Jim Fulton :
-----------------------------------------------------------
  o providing details on your environment (OS, version, zope 
    version in use). Have there been any recent changes in
    the environment?

  o symptom description - does CPU go to 100%? If so, does 
    there seem to be any pattern to the requests preceding
    the hang that you can determine (perhaps by looking at
    the server log)? Can you regularly reproduce it using
    a contrived test (like using ab or another benchmarking
    tool on a particular url or sequence of urls)? Does the
    problem happen only when accessing the system via pcgi
    or ZServer (or both)?

  o run in debug mode (using -D) and send any funny looking
    or unusual messages that you see

  o provide any unexplained tracebacks that you see from your 
    site. If you aren't in debug mode, then you'll need to view 
    the document source to see them,

  o Look at the "debug" screen at Control_Panel/manage_debug.
    This can be used to spot memory leaks and stuck database 
    connections. If you think you may see something suspicious,
    save the document src and send it to us.

  o The Zope event log can be very helpful. See doc/LOGGING.txt,

  o If the above doesn't show anything, then a more detailed log may be
created:

    Add this stanza to your z2.py start script. 

    # turn on debug logging
    from ZServer import DebugLogger
    logfile=os.path.join(INSTANCE_HOME,'var/debug.log')
    DebugLogger.log=DebugLogger.DebugLogger(logfile).log

    It should be insertted after Zope is imported, e.g. after this line:

    exec "import "+MODULE in {}

    You'll want to watch the debug log file since it gets large quickly. 
    You might also want to read the docstrings in ZServer/DebugLogger.py 
    for more information about the debug log format.
   
    In particular, we want to look for requests that don't complete
    or for apparent leaking requests.

    Use the debugger (Michel posted a good run-down on this, attached
    below for convenience)


  o Keep backing up Data.fs regularly so that you can roll back to
    a usable version.

  o Paid support from DC

  o /Control_Panel/manage_profile

    Yeah that's another little hidden beauty.  Both 'manage_debug' and
    'manage_profile' have been hidden from view, I don't know the reason
    why.  I think it would be a good idea to show these links in the control
    panel if ZOPE_DEBUG_MODE is on.



Stephen Bishop also suggested this neat trick :
-----------------------------------------------
>For a test, start up your server in single thread mode. When the problem
>next occurs, the last page accessed should be the cause of your problem.
>If the standard log doesn't do the trick (medusa might log incomming requests
>before they are passed off to Zope handlers), you can probably put your own 
>logging information in zhttp_channel.work in ZServer/HTTPServer.py unless
>you work out how to turn on the built in debug logging.



This is how you get into the debugger.
--------------------------------------
Shut down your Zope.

Go to the lib/python directory

Run python:

[michel@korak python]$ python1.5.2
Python 1.5.2 (#1, Jul  5 1999, 14:47:37)  [GCC egcs-2.91.66
19990314/Linux (egcs- on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam

import Zope and ZPublisher

Py$ import ZPublisher, Zope

Now you can issue requests to Zope just like a web browser:

Py$ ZPublisher.Zope('index_html')

will call and return the 'index_html' object in the root folder.

You can enter the debugger with the d=1 argument:

Py$ ZPublisher.Zope('index_html', d=1)
* Type "s<cr>c<cr>" to jump to beginning of real publishing process.
* Then type c<cr> to jump to the beginning of the URL traversal
  algorithm.
* Then type c<cr> to jump to published object call.
> <string>(0)?()
pdb> 

This fires you into the python debugger, pdb. First off, hit 's' to step
into the debugger.

 For your convinience, two breakpoints are set for you.  Hit 'c' once to
continue to the first breakpoint, which is at the beginning of the URL
traversal to find your object.  I ususaly skip beyond this as it is
really only useful to look at the request before it traversed and to
debug ZPublisher.

Hit 'c' again to continue past the first breakpoint to the second, which
is rigt before the object you are looking for is called.  This is most
useful to debug application specific logic.

__________________________________________________________________



Apologies if I've omitted anything/anyone - again, thanks 
to _everybody_ who helped. 


chas