RE: [Zope] zope_msg.log Message
Hi guys - Have you sent anything about this to support@digicool.com? If not, please do send a report that describes your environment and how to reproduce the problem. Stability problems are a high priority for us, and if we have an easily reproducible one here, I'd really like Michel to look into it if he hasn't already... Thanks!

Brian Lloyd brian@digicool.com
Software Engineer, 540.371.6909
Digital Creations http://www.digicool.com
-----Original Message----- From: chas [mailto:panda@skinnyhippo.com] Sent: Wednesday, March 22, 2000 8:44 PM To: Tony Rossignol; Kevin Littlejohn; zope@zope.org Subject: Re: [Zope] zope_msg.log Message
I can reproduce it instantly with a large enough page - stop the request mid-way. Closing the socket to the client while Zope is serving the page up seems to be what causes it to go into its loop, then die.
Is this only when accessing via FastCGI, or does it happen when using ZServer as well?
Just for the record, I don't use FastCGI and have seen Zope die under the same circumstances; i.e. by disconnecting the browser mid-download of a (usually long) page. Fortunately, that 'client' has been me via the administration screens, but it does seem like a rather trivial DoS. But it seems that this only happens with certain types of pages - I'm not sure which.
Unfortunately, I can't remember if this was when we were using PCGI or Proxy-Pass, but it was definitely one of the two.
Hearing that others are now experiencing the same gives me hope: this could explain the crashes that we've been plagued with for months (though I fear any solution will come too late to prevent our switch to an alternative platform).
chas
ps. Weird thing: the most stable Zope installation I have is on NT. It never, ever crashes, despite being the development machine. (I wouldn't say that load is causing my production server to die, because it happens under low load too.)
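For what it's worth, the reproduction described above can be modelled outside Zope entirely. This is a minimal stand-in sketch (a throwaway local HTTP server, not a real Zope instance): serve a large page, then close the client socket mid-download, which is exactly what produces a 'broken pipe' on the server side.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAYLOAD = b"x" * 1_000_000  # stands in for a "large enough page"

class BigPage(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", str(len(PAYLOAD)))
        self.end_headers()
        try:
            self.wfile.write(PAYLOAD)
        except (BrokenPipeError, ConnectionResetError):
            pass  # the server-side symptom: a 'broken pipe' mid-response

    def log_message(self, *args):
        pass  # silence request logging for the demo

server = HTTPServer(("127.0.0.1", 0), BigPage)
threading.Thread(target=server.serve_forever, daemon=True).start()

client = socket.create_connection(server.server_address, timeout=5)
client.sendall(b"GET / HTTP/1.0\r\nHost: localhost\r\n\r\n")
first_chunk = client.recv(4096)   # read just the start of the response...
client.close()                    # ...then hang up mid-download, like stopping the browser
server.shutdown()

aborted_early = len(first_chunk) < len(PAYLOAD)
```

A server that handles the resulting write error cleanly just logs it and moves on; the thread's question is why Zope sometimes does worse than that.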
_______________________________________________ Zope maillist - Zope@zope.org http://lists.zope.org/mailman/listinfo/zope ** No cross posts or HTML encoding! ** (Related lists - http://lists.zope.org/mailman/listinfo/zope-announce http://lists.zope.org/mailman/listinfo/zope-dev )
Brian Lloyd wrote:
Hi guys -
Have you sent anything about this to support@digicool.com? If not, please do send a report that describes your environment and how to reproduce the problem. Stability problems are a high priority for us, and if we have an easily reproducible one here, I'd really like Michel to look into it if he hasn't already...
Aye carumba, I haven't even been following this thread; not the most useful subject line!

Ok, it looks like there are two issues: 1) some sort of FCGI issue which I do not have the capacity to reproduce, and 2) the same stability issue I am currently looking into.

The problem with 2) is that no one can reproduce it reliably, including myself (this may not be true after I look at something someone sent me today; I'll keep everyone updated). I have pounded a half dozen sandboxes with ab for hours and gotten nowhere. No crashes. Not a single burp. I'm, of course, still looking into this, but I need more data from the folks out there.

First of all, if you suspect that your Zope is dying and restarting a lot, run it in debug mode like this (Unix only, sorry):

bash$ strace -f -e trace=none -e signal=all python z2.py -D

See 'man strace' for the gory details; obviously tune the parameters to z2.py to suit your needs. What we are looking for here are SIGSEGVs. You'll probably see a lot of SIGUSR1s; these are normal. If you can get a SIGSEGV to happen reliably, _please_ let me know.

-Michel
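If you do capture the strace output to a file (e.g. by redirecting stderr), the lines to pull out are the SIGSEGV deliveries. A small sketch of the sifting; the sample lines below are hand-made stand-ins, not output from a real Zope process.

```python
# Hand-made stand-in for strace signal output; real traces will differ
# in detail, but the signal name appears on the line either way.
SAMPLE_TRACE = """\
[pid  9182] --- SIGUSR1 (User defined signal 1) ---
[pid  9183] --- SIGSEGV (Segmentation fault) ---
[pid  9182] --- SIGUSR1 (User defined signal 1) ---
"""

def segv_lines(trace_text):
    """Return only the lines reporting SIGSEGV; SIGUSR1 lines are normal."""
    return [line for line in trace_text.splitlines() if "SIGSEGV" in line]

hits = segv_lines(SAMPLE_TRACE)
```

The pid on the matching line tells you which of Zope's processes took the fault, which is exactly the data point being asked for here.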
Michel Pelletier wrote:
The problem with 2) is that no one can reproduce it reliably (this may not be true after I look at something someone sent me today, I'll keep everyone updated) including myself. I have pounded a half dozen sandboxes with ab for hours and gotten nowhere. No crashes. Not a single burp.
We have very erratic crashes; the only commonality we can find is volume. When requests are low the system is awesome, but once load starts increasing we see frequent restarts, once an hour at least. This could be due to people experiencing slower response times and breaking the connection to attempt a reload. From what some other people have been saying, a 'broken pipe' error in zope_msg.log may cause all the threads to die off. The problem here is I generated 30-40 'broken pipe' errors and could not crash Zope.

When you run ab against it, are you hitting a wide variety of objects or the same page over and over? I've had Zope perform great accessing the same page repeatedly but show some strain once you vary its diet a bit, especially when cached SQL results are involved.

What server method are you using on the sandboxes? We use a combination of ZServer and FastCGI connections for our traffic, with all /manage going through FastCGI and the actual public traffic using a mix of FastCGI & ZServer.
bash$ strace -f -e trace=none -e signal=all python z2.py -D
I can't get this to work. When I try it, all the messages come on the screen like they should, each service that is up, etc., but the darn thing will not answer requests. As soon as I run the exact same command w/o strace it works. I've also tried to attach strace directly to the already running processes, and no luck.

As soon as I'm able to collect more info I'll forward it to you. Is there anywhere else I should be posting this information?

--
tonyr@ep.newtimes.com
Director of Web Technology
New Times, Inc.
On Fri, 24 Mar 2000, Tony Rossignol wrote:
As soon as I'm able to collect more info I'll forward it to you. Is there anywhere else I should be posting this information?
Please send any new info to all of us. I could not get strace to work either, but what is puzzling is that I will get the problem even on a no-load site. Pavlos
Tony Rossignol wrote:
Michel Pelletier wrote:
The problem with 2) is that no one can reproduce it reliably (this may not be true after I look at something someone sent me today, I'll keep everyone updated) including myself. I have pounded a half dozen sandboxes with ab for hours and gotten nowhere. No crashes. Not a single burp.
We have very erratic crashes; the only commonality we can find is volume. When requests are low the system is awesome, but once load starts increasing we see frequent restarts, once an hour at least. This could be due to people experiencing slower response times and breaking the connection to attempt a reload. From what some other people have been saying, a 'broken pipe' error in zope_msg.log may cause all the threads to die off. The problem here is I generated 30-40 'broken pipe' errors and could not crash Zope.
When you run ab against it, are you hitting a wide variety of objects or the same page over and over?
I'm using ab to nail one page and wget to pound the whole site, including the management interface.
I've had Zope perform great accessing the same page repeatedly but show some strain once you vary its diet a bit, especially when cached SQL results are involved.
What server method are you using on the sandboxes? We use a combination of ZServer and FastCGI connections for our traffic, with all /manage going through FastCGI and the actual public traffic using a mix of FastCGI & ZServer.
I've been testing strictly ZServer.
bash$ strace -f -e trace=none -e signal=all python z2.py -D
I can't get this to work. When I try it, all the messages come on the screen like they should, each service that is up, etc., but the darn thing will not answer requests.
Odd. Works for me.
As soon as I run the exact same command w/o strace it works. I've also tried to attach strace directly to the already running processes, and no luck.
:/
As soon as I'm able to collect more info I'll forward it to you. Is there anywhere else I should be posting this information?
The list. Just keep cc'ing me. -Michel
On Fri, 24 Mar 2000, Michel Pelletier wrote:
As soon as I'm able to collect more info I'll forward it to you. Is there anywhere else I should be posting this information?
The list. Just keep cc'ing me.
Some good news at last ...

When I set DEBUG in asyncore.py to 1 so I could view the lists going into select, ZServer stabilised and hasn't crashed since. Smells like a race condition, and somehow the extra time it takes to print the list contents stabilises things. Still, I cannot understand how the child process causes the supervising (zdaemon) process to die too. Pavlos
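A dead client socket interacts badly with a select loop in a way that matches the 'goes into its loop' symptom from earlier in the thread: once the peer closes, select() reports the fd readable forever, and a loop that never removes it just spins. A toy model of that behaviour (stand-in code, not ZServer's actual loop, and not necessarily the bug here):

```python
import select
import socket

# One end plays the server, the other the browser that hangs up.
server_side, browser_side = socket.socketpair()
browser_side.close()  # the client aborts mid-download

spins = 0
for _ in range(5):  # a buggy loop would iterate forever at 100% CPU
    readable, _, _ = select.select([server_side], [], [], 0)
    if server_side in readable:
        data = server_side.recv(4096)
        if data == b"":      # EOF: the fd should be closed and dropped...
            spins += 1       # ...but this loop never drops it, so it spins
server_side.close()
```

A closed-by-peer socket is permanently "readable" with an empty read, so any select loop that forgets to discard such fds busy-spins, and extra per-iteration work (like DEBUG printing) would at least slow the spin down.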
Pavlos Christoforou wrote:
On Fri, 24 Mar 2000, Michel Pelletier wrote:
As soon as I'm able to collect more info I'll forward it to you. Is there anywhere else I should be posting this information?
The list. Just keep cc'ing me.
Some good news at last ...
When I set DEBUG in asyncore.py to 1 so I could view the lists going into select, ZServer stabilised and hasn't crashed since. Smells like a race condition, and somehow the extra time it takes to print the list contents stabilises things.
This might be a separate problem from The Mysterious Segment Violation. Can race conditions cause segfaults? I guess they can, like any other piece of code, but I would expect a race condition to just spin the process. Can someone who I've spoken with about their SIGSEGV problem reproduce Pavlos' cure?
Still, I cannot understand how the child process causes the supervising (zdaemon) process to die too.
This makes me think it's a different problem also. I get the feeling you should be able to reproduce this problem on a fresh checkout on your platform, since it's so low level. Can you check that? -Michel
On Sat, 25 Mar 2000, Michel Pelletier wrote:

This might be a separate problem from The Mysterious Segment Violation. Can race conditions cause segfaults? I guess they can, like any other piece of code, but I would expect a race condition to just spin the process. Can someone who I've spoken with about their SIGSEGV problem reproduce Pavlos' cure?

I set DEBUG=1 in $ZOPEHOME/ZServer/medusa/asyncore.py, deleted asyncore.pyc, and restarted my Zope, but had SIGSEGV nevertheless. Can I somehow verify that DEBUG did anything at all?

peter.
--
peter sabaini, mailto: sabaini@niil.at
On Mon, 27 Mar 2000, Peter Sabaini wrote:
deleted asyncore.pyc and restarted my Zope but had SIGSEGV nevertheless. Can I somehow verify that DEBUG did anything at all?
You should be running Zope with the -D option and no -Z option, i.e. not as a daemon. Then you will be able to see the connections that come in and out. Seems, though, that there are two (or more) problems. Pavlos
participants (5)
- Brian Lloyd
- Michel Pelletier
- Pavlos Christoforou
- Peter Sabaini
- Tony Rossignol