[Zope] Zope woes continue - server going down regularly

Wed, 08 Dec 1999 01:56:30 +0800

First and foremost, most sincere thanks to Jim for the detailed
advice/help, with the invaluable tips on debugging. I'm no closer
to actually finding out what the problem is but at least feel
less helpless ... I can at least try to work out what is going
wrong.  

Comments follow..

Jim Fulton wrote :
>> I installed and set up Zope 2.10 (using PCGI behind Apache) and
>> transferred my site over to it. It *seemed* to be holding up OK.
>> CPU never went up to 90%. But I noticed that several times I'd
>> get the fd=3 error.
>> 
>> Then within 2 hours of use, I get this error when external
>> users try to access the site :
>> 
>>     Zope Error
>>     Zope has encountered an error while publishing this resource.
>>     Error Type: NameError
>>     Error Value: name_param
>
>Could you provide a traceback?
> 
>> Even from the administration interface, I can't access anything.
>> Everything returns :
>> 
>>     Zope Error
>>     Zope has encountered an error while publishing this resource.
>>     Error Type: NameError
>>     Error Value: type
>
>Again, a traceback would be helpful.

I'm now in the somewhat ironic, frustrating (and slightly embarrassing)
situation of having a well-behaved server. It's doing everything properly
without a problem. I'm trying very hard to reproduce the errors - running
linkbot through the site to check every URL, running 'ab' to increase
load ... but to no avail. Earlier, I simply had to restart the server 
and let it run for 10 minutes and it was a safe bet that the errors would
occur. The server's been running fine for the past 2 hrs now.  

>> To this end, does anybody have any ideas :
>> 1) where to get better debugging information to help identify the
>>    source of the problem.
>
>- Obviously something changed.  You were running fine for 2 months, 
>  then started having problems.  You can work forward by analysing
>  changes, or you could work backward by debugging.
>
>- Tracebacks are helpful. If you aren't in debug mode, then 
>  you'll need to view the document source to see them.

Will send as soon as I can get the server to play up again.

>- The Zope event log can be very helpful. See doc/LOGGING.txt.
>
>- Look at the "debug" screen at Control_Panel/manage_debug.
>  This can be used to spot memory leaks and stuck database connections.

Wow - *that* was a discovery. Why isn't it linked in from the Control_Panel ?
I had to type in the URL to get to it. Then again, I didn't really know what
I was looking for there.

>- If the above doesn't show anything, then a more detailed log may be
>  created:
>
>  Add this stanza to your z2.py start script. 
>
>  # turn on debug logging
>  from ZServer import DebugLogger
>  logfile=os.path.join(INSTANCE_HOME,'var/debug.log')
>  DebugLogger.log=DebugLogger.DebugLogger(logfile).log
>
>  It should be inserted after Zope is imported, e.g. after this line:
>
>  exec "import "+MODULE in {}

Thank you very much for this detailed description of
how to set it up :)

>  You'll want to watch the debug log file since it gets large quickly. 
>  You might also want to read the docstrings in ZServer/DebugLogger.py 
>  for more information about the debug log format.
>   
>  In particular, we want to look for requests that don't complete
>  or for apparent leaking requests.

OK.
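
For my own notes, here is a rough sketch of how I might scan the debug
log for requests that start but never finish. It assumes each line has
the form "<code> <request id> <time> <data>", with code B marking the
beginning of a request and E the end of the sent output. That is only my
reading of the docstrings in ZServer/DebugLogger.py, so treat the format
as an assumption. The function name and the sample lines are made up,
and the script is written for a current Python rather than the one
bundled with Zope.

```python
def find_unfinished(lines):
    """Return {request_id: data} for requests with a B line but no E line.

    Assumed line format (not verified): <code> <request id> <time> <data>
    """
    pending = {}
    for line in lines:
        parts = line.split(None, 3)
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        code, request_id = parts[0], parts[1]
        data = parts[3].strip() if len(parts) > 3 else ""
        if code == "B":
            # request began: remember its method and path
            pending[request_id] = data
        elif code == "E":
            # output fully sent: the request completed
            pending.pop(request_id, None)
    return pending


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = [
        "B 1001 1999-12-08T01:00:00 GET /index_html",
        "I 1001 1999-12-08T01:00:00 0",
        "A 1001 1999-12-08T01:00:01 200 5120",
        "E 1001 1999-12-08T01:00:01",
        "B 1002 1999-12-08T01:00:02 GET /stuck_page",
        "I 1002 1999-12-08T01:00:02 0",
    ]
    for request_id, data in find_unfinished(sample).items():
        print(request_id, data)
```

Against the real file it would be called as
find_unfinished(open("var/debug.log")), run from the instance directory.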

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Jim wrote :
>> It's been running fine for the past 2 months.
>> Then today we've been getting these error messages
>> almost non-stop :
>
>It would be interesting to compare the ZODB from the day
>(or two) before the trouble started to the current one.
>Similarly to look at the undo log.

I've checked the undo log. There were only 6 entries for the past
48 hours (it was the weekend). I've backed up Data.fs and tried
reversing them. This coincides with the server's 2 hrs of uptime.

Btw, I'm not sure if it's related or not, but I find that I can't
pack any of the databases. (Control_Panel -> Database -> Pack)
The traceback is :

Traceback (innermost last):
  File /var/zope/51go-p2/lib/python/ZPublisher/Publish.py, line 214, in publish_module
  File /var/zope/51go-p2/lib/python/ZPublisher/Publish.py, line 179, in publish
  File /var/zope/51go-p2/lib/python/Zope/__init__.py, line 202, in zpublisher_exception_hook
    (Object: Fake)
  File /var/zope/51go-p2/lib/python/ZPublisher/Publish.py, line 165, in publish
  File /var/zope/51go-p2/lib/python/ZPublisher/mapply.py, line 160, in mapply
    (Object: manage_pack)
  File /var/zope/51go-p2/lib/python/ZPublisher/Publish.py, line 102, in call_object
    (Object: manage_pack)
  File /var/zope/51go-p2/lib/python/App/ApplicationManager.py, line 338, in manage_pack
    (Object: ElementWithAttributes)
  File /var/zope/51go-p2/lib/python/ZODB/DB.py, line 514, in pack
  File /var/zope/51go-p2/lib/python/ZODB/FileStorage.py, line 867, in pack
    (Object: /var/zope/51go-p2/var/Data.fs)
IOError: (see above)

Even chmod'ing  Data.fs to 666 doesn't help.  Unfortunately I 
can't tell you when this started to occur b/c I've never packed 
the DB until today (which explains why it's so big !).
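
One thing I still want to rule out is the directory rather than the file
itself. This is a guess on my part: I'm assuming that packing writes a
working copy next to Data.fs, so the var directory would need to be
writable by the Zope process and have roughly as much free space as the
current file. I have not confirmed this in FileStorage.py. A small check
along those lines (the function name is my own, and it needs a current
Python) could look like:

```python
import os
import shutil


def pack_preflight(data_fs_path):
    """Report whether the directory holding Data.fs looks usable for a pack.

    Assumption (unverified): pack needs write access to the directory and
    free space comparable to the size of the existing file.
    """
    directory = os.path.dirname(os.path.abspath(data_fs_path))
    size = os.path.getsize(data_fs_path)
    free = shutil.disk_usage(directory).free
    return {
        "dir_writable": os.access(directory, os.W_OK),
        "file_size": size,
        "free_bytes": free,
        "enough_space": free >= size,
    }
```

It has to be run as the same user the Zope process runs as, otherwise
the writability answer says nothing about what Zope itself can do.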

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[ Michel to Thomas regarding Stability : ] 

Michel wrote :
>Just an aside, we do have excellent paid support.  It is not
>unreasonably expensive and you get to talk directly to us and get a
>prompt report of our analysis of your problem.

We will indeed contact you directly regarding this, for future
contingency. I'm too used to being in 'startup-with-no-budget'
mode where I have to do everything myself, so it never occurred
to me before.

> Well, good back ups and all that are an excellent strategy.  Further, I
> almost wonder if you had a backup of the site when it didn't have this
> problem, if you fired a copy of that up on another port and the problem
> went away.  As it stands now, it could be a bug in Zope, or something
> changed on your system.

I've been trying just that too - the older database has no errors.

> The most important part of debugging is reproduction.  Reproducing the
> problem in a controlled way can be the hardest part but it is the most
> important.

Yes, as somebody who often gets people coming to him saying "it's broken, 
I don't know why/how or what I did and no, I can't make it happen again", 
I'm fully aware of the importance of repeatability. I've been holding out 
before posting to the list, esp. in reply to Jim's very helpful message, 
because I really need/want to be able to reproduce the errors.
It's precisely this inability to reproduce it that leaves me feeling 
helpless - a bit  like when Windows software crashes for no reason 
whatsoever.

> As a stopgap so that you can get some sleep, you can set up a cron
> script to reset the server every now and then either nicely (call the
> server's manage_shutdown) or mean (send it a signal).

This is the method that I will indeed use if this problem persists 
in the morning. If I can, I'll actually do a test (using ZClient or
urllib) to first check if the server is responding correctly before
doing a restart. Until then, I have 6 hrs to see if I can reproduce
some errors.
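
A first sketch of the check-before-restart idea, to be run from cron.
It is written against urllib.request from a current Python, not the
urllib or ZClient shipped with Zope. The test is simply "the page loads
and does not contain the text 'Zope Error'", which is my own choice of
criterion. The URL and restart command in the note below are
placeholders, not my real setup.

```python
import subprocess
import urllib.request


def server_is_healthy(url, fetch=urllib.request.urlopen, timeout=30):
    """Return True if the page loads and does not contain a Zope error."""
    try:
        with fetch(url, timeout=timeout) as response:
            body = response.read().decode("latin-1")
    except Exception:
        # connection refused, timeout, HTTP error status, and so on
        return False
    return "Zope Error" not in body


def check_and_restart(url, restart_command, fetch=urllib.request.urlopen):
    """Run restart_command only when the health check fails.

    Returns True if a restart was attempted, False otherwise.
    """
    if server_is_healthy(url, fetch=fetch):
        return False
    subprocess.call(restart_command)
    return True
```

From cron this would be called with something like
check_and_restart("http://localhost:8080/", ["/path/to/restart-script"]),
where both arguments are hypothetical. The fetch parameter exists only
so the check can be tried out without a live server.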