Hello all, Zope 2.6.4 has got stuck several times on my server. I have Debian sarge with the zope 2.6.4-1 package and python 2.2.3-7 (I know Zope 2.6 should use python2.1, but the debian package uses 2.2). I don't use ZEO. The problems started recently but I cannot pin down any specific event that triggered it. It may be related to upgrading zope from 2.6.2, or a Plone upgrade (I now have 2.0RC6, used to have RC3), or a kernel upgrade. The machine is a 2-processor Xeon currently running a SMP kernel (kernel-image-2.4.25-1-686-smp). Zope runs with the default 4 threads. Zope is running behind Apache. The symptom is that zope stops responding to requests, so the requests just hang until timeout. The processes are running, but a kill -TERM won't kill them. kill -9 helps. I haven't yet been able to repeat this in debug mode because the event is fairly rare. Also I haven't debugged the locked up processes with gdb, but will do so the next time it happens following instructions I've seen on this list. However, if you have any advice on how to further investigate this problem, I'd love to hear it. Below are some excerpts from Z2-detailed.log that show the sequence of events leading to lockup. It seems that two requests come in at the same time, and a crash happens in the middle of processing. I'm grateful for any help you can provide, -Osma Z2-detailed.log.1:A 190481452 2004-03-16T12:40:24 200 7607 Z2-detailed.log.1:E 190481452 2004-03-16T12:40:24 Z2-detailed.log.1:B 190559468 2004-03-16T12:40:24 GET /VirtualHostBase/http/prm.mbc.fi:80/prmmedia/VirtualHostRoot//tintr/ploneIEFixes.css Z2-detailed.log.1:I 190559468 2004-03-16T12:40:24 0 Z2-detailed.log.1:A 190559468 2004-03-16T12:40:24 200 1702 Z2-detailed.log.1:E 190559468 2004-03-16T12:40:24 Z2-detailed.log.1:B 181840020 2004-03-16T12:40:26 GET /VirtualHostBase/http/prm.mbc.fi:80/prmmedia/VirtualHostRoot//lpl/uutiset Z2-detailed.log.1:I 181840020 2004-03-16T12:40:26 0 Z2-detailed.log.1:B 190159428 2004-03-16T12:40:26 GET /VirtualHostBase/http/prm.mbc.fi:80/prmmedia/VirtualHostRoot//tintr/asukkaan_opas Z2-detailed.log.1:I 190159428 2004-03-16T12:40:26 0 Z2-detailed.log.1:U 000000000 2004-03-16T12:40:27 System startup Z2-detailed.log.1:A 204515844 2004-03-16T15:00:20 200 1702 Z2-detailed.log.1:E 204515844 2004-03-16T15:00:20 Z2-detailed.log.1:B 173323884 2004-03-16T15:00:23 POST /VirtualHostBase/http/prm.mbc.fi:80/prmmedia/VirtualHostRoot// Z2-detailed.log.1:I 173323884 2004-03-16T15:00:23 312 Z2-detailed.log.1:A 173323884 2004-03-16T15:00:23 200 718 Z2-detailed.log.1:E 173323884 2004-03-16T15:00:23 Z2-detailed.log.1:B 207153412 2004-03-16T15:00:23 POST /VirtualHostBase/http/prm.mbc.fi:80/prmmedia/VirtualHostRoot//aks/foorumi1/toijala/forum_add_topic_form Z2-detailed.log.1:I 207153412 2004-03-16T15:00:23 1492 Z2-detailed.log.1:B 207618796 2004-03-16T15:00:24 GET /VirtualHostBase/http/prm.mbc.fi:80/prmmedia/VirtualHostRoot//tintr Z2-detailed.log.1:I 207618796 2004-03-16T15:00:24 0 Z2-detailed.log.1:U 000000000 2004-03-16T15:00:24 System startup Z2-detailed.log.1:B 184582692 2004-03-17T11:31:50 GET /VirtualHostBase/http/prm.mbc.fi:80/prmmedia/VirtualHostRoot//lpl/navtree-item-active.gif Z2-detailed.log.1:I 184582692 2004-03-17T11:31:50 0 Z2-detailed.log.1:A 184582692 2004-03-17T11:31:50 500 1749 Z2-detailed.log.1:E 184582692 2004-03-17T11:31:50 Z2-detailed.log.1:B 186258404 2004-03-17T11:31:50 GET /VirtualHostBase/http/prm.mbc.fi:80/prmmedia/VirtualHostRoot//lpl/bg_footer.gif Z2-detailed.log.1:I 186258404 2004-03-17T11:31:50 0 Z2-detailed.log.1:A 186258404 2004-03-17T11:31:50 500 1749 Z2-detailed.log.1:E 186258404 2004-03-17T11:31:50 Z2-detailed.log.1:A 185418812 2004-03-17T11:31:50 500 709 Z2-detailed.log.1:E 185418812 2004-03-17T11:31:50 Z2-detailed.log.1:U 000000000 2004-03-17T11:31:50 System startup Z2-detailed.log.1:A 174372980 2004-03-20T18:03:58 304 316 Z2-detailed.log.1:E 174372980 2004-03-20T18:03:58 Z2-detailed.log.1:B 174079364 2004-03-20T18:04:03 GET /VirtualHostBase/http/prm.mbc.fi:80/prmmedia/VirtualHostRoot//tintr Z2-detailed.log.1:I 174079364 2004-03-20T18:04:03 0 Z2-detailed.log.1:B 175770628 2004-03-20T18:04:03 GET /VirtualHostBase/http/prm.mbc.fi:80/prmmedia/VirtualHostRoot//tintr/asukkaan_opas Z2-detailed.log.1:I 175770628 2004-03-20T18:04:03 0 Z2-detailed.log.1:A 174079364 2004-03-20T18:04:03 200 24993 Z2-detailed.log.1:E 174079364 2004-03-20T18:04:03 Z2-detailed.log.1:B 176679788 2004-03-20T18:04:03 GET /VirtualHostBase/http/prm.mbc.fi:80/prmmedia/VirtualHostRoot//tintr/asukkaan_opas Z2-detailed.log.1:I 176679788 2004-03-20T18:04:03 0 Z2-detailed.log.1:U 000000000 2004-03-20T18:04:04 System startup Z2-detailed.log.2.gz:A 243458436 2004-03-11T17:29:34 200 1700 Z2-detailed.log.2.gz:E 243458436 2004-03-11T17:29:34 Z2-detailed.log.2.gz:B 210607836 2004-03-11T17:29:34 GET /VirtualHostBase/http/prm.mbc.fi:80/prmmedia/VirtualHostRoot//nkst/plone_javascripts.js Z2-detailed.log.2.gz:I 210607836 2004-03-11T17:29:34 0 Z2-detailed.log.2.gz:A 210607836 2004-03-11T17:29:34 304 316 Z2-detailed.log.2.gz:E 210607836 2004-03-11T17:29:34 Z2-detailed.log.2.gz:B 243686164 2004-03-11T17:29:36 GET /VirtualHostBase/http/prm.mbc.fi:80/prmmedia/VirtualHostRoot//nkst/tarjouspyynto Z2-detailed.log.2.gz:I 243686164 2004-03-11T17:29:36 0 Z2-detailed.log.2.gz:B 243745428 2004-03-11T17:29:36 GET /VirtualHostBase/http/prm.mbc.fi:80/prmmedia/VirtualHostRoot//nkst/toimi_nain/paatos/paatos_kaikki Z2-detailed.log.2.gz:I 243745428 2004-03-11T17:29:36 0 Z2-detailed.log.2.gz:U 000000000 2004-03-11T17:29:36 System startup Z2-detailed.log.2.gz:B 181253292 2004-03-12T10:21:24 GET /VirtualHostBase/http/prm.mbc.fi:80/prmmedia/VirtualHostRoot//aks/document_icon.gif Z2-detailed.log.2.gz:I 181253292 2004-03-12T10:21:24 0 Z2-detailed.log.2.gz:A 181253292 2004-03-12T10:21:24 304 299 Z2-detailed.log.2.gz:E 181253292 2004-03-12T10:21:24 Z2-detailed.log.2.gz:B 180156052 2004-03-12T10:21:24 GET /VirtualHostBase/http/prm.mbc.fi:80/prmmedia/VirtualHostRoot//aks/arkisto/bar_1_10.gif Z2-detailed.log.2.gz:I 180156052 2004-03-12T10:21:24 0 Z2-detailed.log.2.gz:A 180156052 2004-03-12T10:21:24 304 298 Z2-detailed.log.2.gz:E 180156052 2004-03-12T10:21:24 Z2-detailed.log.2.gz:A 178613132 2004-03-12T10:21:24 500 709 Z2-detailed.log.2.gz:E 178613132 2004-03-12T10:21:24 Z2-detailed.log.2.gz:U 000000000 2004-03-12T10:21:25 System startup -- *** Osma Suominen *** ozone@iki.fi *** http://www.iki.fi/ozone/ ***
Osma Suominen wrote at 2004-3-22 14:20 +0200:
... Zope is running behind Apache. The symptom is that zope stops responding to requests, so the requests just hang until timeout. The processes are running, but a kill -TERM won't kill them. kill -9 helps.
I have had this today. Fortunately, in a reproducable setup. The analysis revealed: Zope got a SIGSEGV because the C runtime stack overflew (application bug). One of the Zope threads died as a result of the SIGSEGV. The others entered an unhealthy state: process "1" became their parent, they could only be attached by "gdb" as user "root", they held Zope's ports open, they could only be killed by "kill -9". Shane shed some light on the situation: Python blocks signals in threads other than the main thread. Thus, only the main thread receives signals (other than SIGSEGV) This is documented behaviour but arguably wrong for SIGSEGV (and some further signals). -- Dieter
On Mon, 22 Mar 2004, Dieter Maurer wrote:
Zope got a SIGSEGV because the C runtime stack overflew (application bug).
One of the Zope threads died as a result of the SIGSEGV. The others entered an unhealthy state: process "1" became their parent, they could only be attached by "gdb" as user "root", they held Zope's ports open, they could only be killed by "kill -9".
Thank you for the information! Can you tell a bit more about what was wrong? Can this be triggered by bugs in Python scripts, ZPT or DTML? Or was it a bug in C extension code? I think I'm going to have a hard time tracking this down, so any clues could help. I'm considering updating Zope to 2.7.0 (from 2.6.4) with Python 2.3.3 in the hope that it might help. Is this futile?
Shane shed some light on the situation:
Python blocks signals in threads other than the main thread. Thus, only the main thread receives signals (other than SIGSEGV)
This is documented behaviour but arguably wrong for SIGSEGV (and some further signals).
Should this issue be considered a bug in either Python or Zope? It doesn't seem like very robust behavior. If it only happens when debugging then that's passable, but for me it happens on an almost-production setup during normal use. Of course if it's just a bad C extension (I don't have many, just PIL, egenix-mx-base and egenix-mx-experimental) then there's not that much Zope or Python can do about it. -Osma -- *** Osma Suominen *** ozone@iki.fi *** http://www.iki.fi/ozone/ ***
Osma Suominen wrote at 2004-3-23 12:05 +0200:
On Mon, 22 Mar 2004, Dieter Maurer wrote:
Zope got a SIGSEGV because the C runtime stack overflew (application bug). ... Can you tell a bit more about what was wrong?
It was a bug in my application which caused the C runtime stack to overflow.
Can this be triggered by bugs in Python scripts, ZPT or DTML?
It can be caused by anything that recurses. Usually, the Python interpreter raises a "RuntimeError, maximal recursion depth exceeded" before an overflow of the runtime stack but I had set the recursion limit very generously.
... I'm considering updating Zope to 2.7.0 (from 2.6.4) with Python 2.3.3 in the hope that it might help. Is this futile?
I am using Zope 2.7.0. This means, Zope 2.7.0 can hang very similar as you have described it for Zope 2.6.
Shane shed some light on the situation:
Python blocks signals in threads other than the main thread. Thus, only the main thread receives signals (other than SIGSEGV)
This is documented behaviour but arguably wrong for SIGSEGV (and some further signals).
Should this issue be considered a bug in either Python or Zope?
I think it is a Python bug. I posted in comp.lang.python. I do not yet know what people replied. -- Dieter
On Tue, 23 Mar 2004, Dieter Maurer wrote:
It can be caused by anything that recurses. Usually, the Python interpreter raises a "RuntimeError, maximal recursion depth exceeded" before an overflow of the runtime stack but I had set the recursion limit very generously.
Did you increase the recursion limit yourself? As I understand it, it defaults to a sane, system-dependent value so the SIGSEGV should never happen unless you tweak the limit yourself.
I am using Zope 2.7.0. This means, Zope 2.7.0 can hang very similar as you have described it for Zope 2.6.
That's bad news. But at least I know that won't help so I can try something else. I haven't seen the problem for some time now, which makes it kinda hard to debug.
I think it is a Python bug. I posted in comp.lang.python. I do not yet know what people replied.
Can't see any replies so far, but I'll follow the thread. Thanks for you input! -Osma -- *** Osma Suominen *** ozone@iki.fi *** http://www.iki.fi/ozone/ ***
Osma Suominen wrote at 2004-3-24 01:56 +0200:
On Tue, 23 Mar 2004, Dieter Maurer wrote:
It can be caused by anything that recurses. Usually, the Python interpreter raises a "RuntimeError, maximal recursion depth exceeded" before an overflow of the runtime stack but I had set the recursion limit very generously.
Did you increase the recursion limit yourself? As I understand it, it defaults to a sane, system-dependent value so the SIGSEGV should never happen unless you tweak the limit yourself.
I wrote "I had set the recursion limit ...". This is the answer to your question. You got it before you asked... -- Dieter
On Wed, 24 Mar 2004, Dieter Maurer wrote:
I wrote "I had set the recursion limit ...". This is the answer to your question. You got it before you asked...
Sorry. You weren't very explicit so I had to double-check. I've found some clues about what is going on. It seems I've been bitten by the lack of thread safety in Gadfly. With a suitable test using apache-bench to make concurrent requests that modify the database, I'm able to cause Zope to hang the same way you did with infinite recursion. With the same page, but the database query disabled, there are no problems. I guess I'll have to switch to a "real" database. But it's worrying that lack of thread safety in Gadfly is not mentioned almost anywhere. I was able to find some mailing list archives from 1999 and a rant in German, but nothing in the Gadfly docs or Zope Book, which has some examples of Gadfly usage. I added a comment into the Zope Book. -Osma -- *** Osma Suominen *** ozone@iki.fi *** http://www.iki.fi/ozone/ ***
Osma Suominen wrote:
I've found some clues about what is going on. It seems I've been bitten by the lack of thread safety in Gadfly.
Gladfly really should be stamped with bigger warnings :-/ It's only there for testing, and should NEVER be used in anything approching production code... Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk
participants (3)
-
Chris Withers -
Dieter Maurer -
Osma Suominen