[Zope] Zope woes continue - server going down regularly
chas
panda@skinnyhippo.com
Wed, 08 Dec 1999 01:56:30 +0800
First and foremost, most sincere thanks to Jim for the detailed
advice/help, with the invaluable tips on debugging. I'm no closer
to actually finding out what the problem is but at least feel
less helpless ... I can at least try to work out what is going
wrong.
Comments follow..
Jim Fulton wrote :
>> I installed and set up Zope 2.10 (using PCGI behind Apache) and
>> transfered my site over it. It *seemed* to be holding up OK.
>> CPU never went up to 90%. But I noticed that several times I'd
>> get the fd=3 error.
>>
>> Then within 2 hours of use, I get this error when external
>> users try to access the site :
>>
>> Zope Error
>> Zope has encountered an error while publishing this resource.
>> Error Type: NameError
>> Error Value: name_param
>
>Could you provide a traceback?
>
>> Even from the administration interface, I can't access anything.
>> Everything returns :
>>
>> Zope Error
>> Zope has encountered an error while publishing this resource.
>> Error Type: NameError
>> Error Value: type
>
>Again, a traceback would be helpful.
I'm now in the somewhat ironic, frustrating (and slightly embarassing)
situation of having a well-behaved server. It's doing everything properly
without a problem. I'm trying very hard to reproduce the errors - running
linkbot through the site to check every URL, running 'ab' to increase
load ... but to no avail. Earlier, I simply had to restart the server
and let it run for 10 minutes and it was a safe bet that the errors would
occur. The server's been running fine for the past 2 hrs now.
>> To this end, does anybody have any ideas :
>> 1) where to get better debugging information to help identify the
>> source of the problem.
>
>- Obviously something changed. You were running fine for 2 months,
> than started having problems. You can work forward by analysing
> changes, or you could work backward by debugging.
>
>- Tracebacks are helpful. If you aren't in debug mode, then
> you'll need to view the document source to see them,
Will send as soon as I can get the server to play up again.
>- The Zope event log can be very helpful. See doc/LOGGING.txt,
>
>- Look at the "debug" screen at Control_Panel/manage_debug.
> This can be used to spot memory leaks and stuck database connections.
Wow - *that* was a discovery. Why isn't it linked in from the Control_Panel ?
I had to type in the URL to get to it. Then again, I didn't really know what
I was looking for there.
>- If the above doesn't show anything, then a more detailed log may be
created:
>
> Add this stanza to your z2.py start script.
>
> # turn on debug logging
> from ZServer import DebugLogger
> logfile=os.path.join(INSTANCE_HOME,'var/debug.log')
> DebugLogger.log=DebugLogger.DebugLogger(logfile).log
>
> It should be insertted after Zope is imported, e.g. after this line:
>
> exec "import "+MODULE in {}
Thank you very much for this detailed description of
how to set it up :)
> You'll want to watch the debug log file since it gets large quickly.
> You might also want to read the docstrings in ZServer/DebugLogger.py
> for more information about the debug log format.
>
> In particular, we want to look for requests that don't complete
> or for apparent leaking requests.
OK.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jim wrote :
>> It's been running fine for the past 2 months.
>> Then today we've been getting these error messages
>> almost non-stop :
>
>It would be interesting to compare the ZODB from the day
>(or two) before the trouble started to the current one.
>Similarly to look at the undo log.
I've checked the undo log. There were only 6 entries for the past
48 hours (it was the weekend). I've backed up Data.fs and tried
reversing them. This also has coincided with the uptime of 2 hrs.
Btw, I'm not sure if it's related or not, but I find that I can't
pack any of the databases. (Control_Panel -> Database -> Pack)
The traceback is :
Traceback (innermost last):
File /var/zope/51go-p2/lib/python/ZPublisher/Publish.py, line 214, in
publish_module
File /var/zope/51go-p2/lib/python/ZPublisher/Publish.py, line 179, in
publish
File /var/zope/51go-p2/lib/python/Zope/__init__.py, line 202, in
zpublisher_exception_hook
(Object: Fake)
File /var/zope/51go-p2/lib/python/ZPublisher/Publish.py, line 165, in
publish
File /var/zope/51go-p2/lib/python/ZPublisher/mapply.py, line 160, in mapply
(Object: manage_pack)
File /var/zope/51go-p2/lib/python/ZPublisher/Publish.py, line 102, in
call_object
(Object: manage_pack)
File /var/zope/51go-p2/lib/python/App/ApplicationManager.py, line 338, in
manage_pack
(Object: ElementWithAttributes)
File /var/zope/51go-p2/lib/python/ZODB/DB.py, line 514, in pack
File /var/zope/51go-p2/lib/python/ZODB/FileStorage.py, line 867, in pack
(Object: /var/zope/51go-p2/var/Data.fs)
IOError: (see above)
Even chmod'ing Data.fs to 666 doesn't help. Unfortunately I
can't tell you when this started to occur b/c I've never packed
the DB until today (which explains why it's so big !).
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[ Michel to Thomas regarding Stability : ]
Michel wrote :
>Just an aside, we do have excellent paid support. It is not
>unreasonably expensive and you get to talk directly to us and get a
>prompt report of our analysis of your problem.
We will indeed contact you directly regarding this, for future
contingency. I'm too used to being in 'startup-with-no-budget'
mode where I have to do everything myself, so it never occured
to me before.
> Well, good back ups and all that are an excellent strategy. Further, I
> almost wonder if you had a backup of the site when it didn't have this
> problem, if you fired a copy of that up on another port and the problem
> went away. As it stands now, it could be a bug in Zope, or something
> changed on your system.
I've been trying just that too - the older database has no errors.
> The most important part of debugging is reproduction. Reproducing the
> problem in a controled way can be the hardest part but it is the most
> important.
Yes, as somebody who often gets people coming to him saying "it's broken,
I don't know why/how or what I did and no, I can't make it happen again",
I'm fully aware of the importance of repeatability. I've been holding out
before posting to the list, esp. in reply to Jim's very helpful message,
because I really need/want to be able to reproduce the errors.
It's precisely this inability to reproduce it that leaves me feeling
helpless - a bit like when Windows software crashes for no reason
whatsoever.
> As a stopgap so that you can get some sleep, you can set up a cron
> script to reset the server every now and then either nicely (call the
> servers manage_shutdown) or mean (send it a signal).
This is the method that I will indeed use if this problem persists
in the morning. If I can, I'll actually do a test (using ZClient or
urllib) to first check if the server is responding correctly before
doing a restart. Until then, I have 6 hrs to see if I can reproduce
some errors.