RE: [Zope] highly available Zope thread; our hanging problem
Hi, I'd like to comment on this, and summarise some references below. Much of this discussion took place on the zope-dev list (see references), so I'm cc'ing the zope-dev list. You might also wish to add to the Wiki: http://www.zope.org/Members/tseaver/Projects/HighlyAvailableZope/.
-----Original Message----- From: Brian Takashi Hooper [mailto:brian@garage.co.jp] Sent: 06 June 2000 12:11 To: zope@zope.org Subject: [Zope] highly available Zope thread; our hanging problem
Hi all -
I was looking at the discussion from April that was posted on the HighlyAvailableZope Wiki about problems with Zope hanging; we had a similar situation here at Digital Garage which seemed to be alleviated by changing the zombie_timeout to be really short (like, 1 minute). Before changing the zombie_timeout, the server would periodically hang and not give any responses to requests, sometimes recovering after a short time.
Some questions at this point:

1. Were you running with multiple threads, and if so, how many?
2. If you were using multiple threads, would *all* the threads periodically hang, or was the hanging isolated to a single thread at a time?
3. Could you possibly comment on the operating system used?
4. Which zombie_timeout did you twiddle -- the one in zhttp_channel in ZServer.py, or the one in http_channel in medusa/http_server.py?
At this point, I don't have anything more than just an empirical observation - changing this parameter seemed to help our server. Has anyone else noticed anything similar, or can explain this observation?
Concerning the zombie_timeout suggestion, here are some references when I posed the question of whether reducing the value would be beneficial: Amos Lattier wrote in http://lists.zope.org/pipermail/zope-dev/2000-April/004194.html:
The ZServer zombie stuff is to get rid of zombie client connections, not zombie publishing threads. These are quite different beasts.
Michel Pelletier wrote in http://lists.zope.org/pipermail/zope-dev/2000-April/004229.html:
What the zombie timeout means is that after a publishing thread gets done answering a request, the socket may not go away. This may be for a number of reasons: the client 'hung' and is not 'putting down the phone after the conversation is over' (so to speak), or network troubles may prevent the connection from closing properly. This means that there is a 'zombie' connection lying around. This zombie will probably end up going away on its own, but if not, ZServer will kill it after a period of time.
The only resource lying around during the life of a zombie is a tiny little unused open socket; the Mack truck of a Zope thread that served the request for the zombie socket does not 'hang' for that entire period of time, but goes on after it has completed the request to serve other requests.
Amos is correct in that these problems are almost always at the Application level, and not at the ZServer level. The fact that Pavlos can prevent hanging by inserting a print statement in the asyncore loop[*] is suspicious, but we do not have enough information yet to point fingers anywhere.
[* references http://lists.zope.org/pipermail/zope/2000-April/023697.html]

I'd be _very_ interested in hearing more on this! Our Zope installation has been pretty stable of late (isn't it strange that, when you want to find out what's causing things to break, they play nice?), with an uptime of thirty-something days, but I'm still very keen to get to the bottom of this, since I don't believe it was some ephemeral problem.

hth, and thanks again!

-- Marcus
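Pulling these descriptions together: the zombie sweep is a periodic pass over the open channels that closes any socket older than its zombie_timeout. Here is a minimal, self-contained sketch of the idea, modeled loosely on medusa's kill_zombies; the class, names, and default value below are my placeholders, not the actual ZServer code:

```python
import time

class Channel:
    """Stand-in for a medusa channel; only the fields the sweep needs."""
    zombie_timeout = 100 * 60  # seconds; placeholder default

    def __init__(self, creation_time):
        self.creation_time = creation_time
        self.closed = False

    def close(self):
        self.closed = True

def kill_zombies(channels, now=None):
    """Close every channel that has been open longer than its zombie_timeout."""
    now = now if now is not None else time.time()
    for channel in channels:
        if now - channel.creation_time > channel.zombie_timeout:
            channel.close()
```

Lowering zombie_timeout only makes this sweep reap idle sockets sooner; as Amos and Michel point out above, it never touches the publishing threads themselves.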
On Tue, 6 Jun 2000 15:19:29 +0200 Marcus Collins <mcollins@sunesi.com> wrote:
Hi,
I'd like to comment on this, and summarise some references below. Much of this discussion took place on the zope-dev list (see references), so I'm cc'ing the zope-dev list. You might also wish to add to the Wiki: http://www.zope.org/Members/tseaver/Projects/HighlyAvailableZope/.

OK, that's a good suggestion!
-----Original Message----- From: Brian Takashi Hooper [mailto:brian@garage.co.jp] Sent: 06 June 2000 12:11 To: zope@zope.org Subject: [Zope] highly available Zope thread; our hanging problem
Hi all -
I was looking at the discussion from April that was posted on the HighlyAvailableZope Wiki about problems with Zope hanging; we had a similar situation here at Digital Garage which seemed to be alleviated by changing the zombie_timeout to be really short (like, 1 minute). Before changing the zombie_timeout, the server would periodically hang and not give any responses to requests, sometimes recovering after a short time.
Some questions at this point: 1. Were you running with multiple threads, and if so, how many?
Yes; Zope is set to run with 16 threads (-t 16), and we've increased the pool_size parameter in ZODB/DB.py to 16 also (guess this is all right... :-P )
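For what it's worth, the reason to keep pool_size in step with -t is that each publishing thread checks one ZODB connection out of the pool while it serves a request; with fewer connections than threads, some threads end up queueing for a connection. A toy model of that relationship (my names, not ZODB's actual code):

```python
from collections import deque

def threads_left_waiting(thread_count, pool_size):
    """Toy model: each 'thread' needs one pooled connection per request.
    Returns how many threads would block waiting for a free connection."""
    pool = deque(range(pool_size))      # available connections
    busy = []
    for _ in range(thread_count):
        if pool:
            busy.append(pool.popleft())  # this thread got a connection
    return thread_count - len(busy)      # threads stuck waiting
```

With 16 threads and pool_size raised to 16, no thread has to wait; leaving pool_size at a smaller default would make the extra threads block.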
2. If you were using multiple threads, would *all* the threads periodically hang, or was the hanging isolated to a single thread at a time?
All the threads hang. One interesting thing: we looked at vmstat, and whenever the system is having trouble, the number of system calls drops dramatically. When the server is doing well it's normally up in the 1000s, but when it's in trouble there are only 20-30 system calls per second, and they're all either lwp_* calls or polls.
3. Could you possibly comment on the operating system used?
Solaris 2.6, on Netras. Our Zope is still v2.1.4.
4. Which zombie_timeout did you twiddle -- the one in the zhttp_channel in ZServer.py, or that in http_channel in medusa/http_server.py?
The one in zhttp_channel. As far as I can tell, since zhttp_channels are actually used instead of http_channels, the value in zhttp_channel is the one that matters. The kill_zombies method, and the code that calls it, are inherited from the medusa code... kill_zombies looks at the timeout value of all the channels in the select list, and since all of those instances happen to be zhttp_channels in the case of Zope, they all use the zhttp_channel timeout.
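The lookup described here is ordinary Python class-attribute inheritance: kill_zombies reads channel.zombie_timeout, and because every live channel is a zhttp_channel, the subclass attribute shadows the one defined on medusa's http_channel. A stripped-down illustration (the values below are placeholders, not the real defaults):

```python
class http_channel:                  # stands in for medusa/http_server.py
    zombie_timeout = 100 * 60        # placeholder base-class value

class zhttp_channel(http_channel):   # stands in for ZServer's subclass
    zombie_timeout = 60              # this is the value kill_zombies sees

# Every channel in the select list is a zhttp_channel, so the sweep
# always reads the subclass value, never the base-class one.
channel = zhttp_channel()
```

This is why editing the number in ZServer.py takes effect while the one in medusa/http_server.py is effectively dead code for Zope's channels.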
At this point, I don't have anything more than just an empirical observation - changing this parameter seemed to help our server. Has anyone else noticed anything similar, or can explain this observation?
Concerning the zombie_timeout suggestion, here are some references when I posed the question of whether reducing the value would be beneficial:
Amos Lattier wrote in http://lists.zope.org/pipermail/zope-dev/2000-April/004194.html:
The ZServer zombie stuff is to get rid of zombie client connections, not zombie publishing threads. These are quite different beasts.
Michel Pelletier wrote in http://lists.zope.org/pipermail/zope-dev/2000-April/004229.html:
What the zombie timeout means is that after a publishing thread gets done answering a request, the socket may not go away. This may be for a number of reasons: the client 'hung' and is not 'putting down the phone after the conversation is over' (so to speak), or network troubles may prevent the connection from closing properly. This means that there is a 'zombie' connection lying around. This zombie will probably end up going away on its own, but if not, ZServer will kill it after a period of time.
The only resource lying around during the life of a zombie is a tiny little unused open socket; the Mack truck of a Zope thread that served the request for the zombie socket does not 'hang' for that entire period of time, but goes on after it has completed the request to serve other requests.
Amos is correct in that these problems are almost always at the Application level, and not at the ZServer level. The fact that Pavlos can prevent hanging by inserting a print statement in the asyncore loop[*] is suspicious, but we do not have enough information yet to point fingers anywhere.
[* references http://lists.zope.org/pipermail/zope/2000-April/023697.html]
Yeah, I saw this... like I said, I haven't gathered enough information yet to be able to say anything that sounds like an explanation; all I have is a vague experimental observation. I found out about the mpstat command on Solaris (didn't know about it before); it gives you info on thread activity and multiprocessor behavior, so maybe I can get some more info from that. Hmm.

--Brian Hooper
participants (2):
- Brian Takashi Hooper
- Marcus Collins