Okay, Chris, that's a lot to chew on. Thanks very much for your patience and time.

-- 
Michael Fraase
ARTS & FARCES LLC
mfraase@farces.com
www.farces.com
PGP Fingerprint: 3D85 F3F4 9E65 4949 176A 260C CB47 190D C864 9A96
-----Original Message-----
From: Chris McDonough [mailto:chrism@zope.com]
Sent: Tuesday, November 20, 2001 3:39 PM
To: mfraase@farces.com; 'Chris Withers'
Cc: zope@zope.org
Subject: Re: [Zope] Urgent help needed: Zope falls over under moderate load
> I guess I'm confused. Everything that *could* be cached *was* cached. And no, I don't run a caching server or a proxy server or anything else in front of Zope. I'm a writer, not a programmer.
OK, fair enough.
But your profession still doesn't absolve you from needing to cache more in order to survive a Slashdotting. ;-) Either that or you'll need to start developing your site with static pages only. That'd work too.
> The /. piece hit about 1:00 AM. By 1:01 AM Zope had folded like a cheap suit. It's still going down about every 40 minutes or so.
>
> Now remember, my outbound bandwidth is limited to 512Kb.
If a 512Kb/s pipe is filled with 300-byte requests, that translates (ignoring latency and response traffic) into a potential inbound rate of about 213 requests per second. That's still a lot of requests. For comparison, Slashdot gets about 180 requests/sec at normal peak load. The 512Kb/s isn't much of a throttle.
And this is assuming that your inbound bandwidth is limited to 512Kb/s.. you only mentioned your outbound in this mail. If inbound is higher, it's even more of a problem.
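For anyone who wants to check the arithmetic, here is a rough sketch of it. It assumes 1 Kb = 1000 bits and the 300-byte request size used in the estimate; the function name is made up for illustration, and latency and response traffic are ignored.

```python
# Back-of-the-envelope estimate of how many requests per second fit down a link.
# Assumptions: 1 Kb = 1000 bits, fixed-size requests, no latency or overhead.

def max_requests_per_second(link_kbps: float, request_bytes: int) -> float:
    """Upper bound on inbound requests/s for a link of `link_kbps` kilobits/s."""
    bytes_per_second = link_kbps * 1000 / 8
    return bytes_per_second / request_bytes

rate = max_requests_per_second(512, 300)
print(f"{rate:.0f} requests/s")  # prints "213 requests/s"
```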
> Am I correct in my understanding that Zope can't handle even 512Kb of demand without some technical doohickey in front of it so it doesn't fall down?
Your pipe is fat enough to allow lots of requests in, and what you're serving is probably sufficiently complex to be very slow. Squishdot is really not known for its speed.
"Raw" Zope itself could almost certainly handle it, however, if what you were returning is a DTML method that said "<html>this is a simple page</html>". But this isn't what you're returning; Squishdot has a big say in what shows up.
> No offense intended, but I think two internal Squishdot pages meet the definition of pretty dang simple.
Maybe conceptually it's simple, but apps like Squishdot do lots of stuff in order to generate these pages. For fun, you should try to set up a "barebones" Squishdot with the default homepage, and hit it repeatedly with a load generator like Apache's "ab". Then try the same thing with a Zope page that is "<html>Hello!</html>". You will see a big difference. On an 850MHz box at ZC, I can get Zope to serve about 152 requests/s with the simple page.
Anybody want to try this with an out of the box Squishdot homepage? Or a Squishdot story page? The guy from the KDE dot (http://dot.kde.org) claimed he could only get about 2 requests/second out of a Squishdot home page. After setting up caching properly, he was able to get about 2000.
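If "ab" isn't handy, a crude stand-in can be sketched with the Python standard library. This is only an illustration: it sends requests one at a time, so it measures sequential throughput and not the concurrent load that "ab" can generate, and the function name is hypothetical.

```python
import time
import urllib.request

def requests_per_second(url: str, total: int = 100) -> float:
    """Fetch `url` `total` times, one after another, and return the average rate."""
    start = time.perf_counter()
    for _ in range(total):
        with urllib.request.urlopen(url) as response:
            response.read()  # drain the body so the whole page is actually served
    elapsed = time.perf_counter() - start
    return total / elapsed
```

Calling it once against the simple page and once against the Squishdot homepage (for example `requests_per_second("http://localhost:8080/hello")`, where the host, port, and path are placeholders for your own setup) should give two numbers you can compare directly.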
> And why does it fall over anyway? This just doesn't make any sense to me. I can see it getting slow and timing out, but giving up completely and just bailing? What's that about? Explain it to me like I'm an intelligent, non-technical friend. Thanks.
The biggest "bang for the buck" solution is caching. Assuming that you had no problems *before* the slashdotting, that will solve your problem because it will cause Zope to need to serve far fewer requests, closer to the number of requests you normally get. And this is (I assume) the outcome that you actually want. I highly recommend setting up a caching proxy in front of Zope if this sort of load will be recurring. It's way faster and cheaper than trying to understand the problem deeply. ;-) Most commercial sites are developed using this principle, AFAICT.
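To put a rough number on "far fewer requests", here is a sketch of the cache arithmetic. It assumes every cache hit is answered by the proxy without touching Zope, and it plugs in two figures from this thread purely as an illustration: the 213 requests/s inbound ceiling and the roughly 2 requests/s reported for an uncached Squishdot home page. The function names are hypothetical.

```python
def backend_rate(inbound_rps: float, hit_ratio: float) -> float:
    """Requests per second that miss the cache and still reach Zope."""
    return inbound_rps * (1.0 - hit_ratio)

def required_hit_ratio(inbound_rps: float, backend_capacity_rps: float) -> float:
    """Smallest cache hit ratio that keeps the backend at or below its capacity."""
    if inbound_rps <= backend_capacity_rps:
        return 0.0  # the backend can already keep up without a cache
    return 1.0 - backend_capacity_rps / inbound_rps

print(f"{required_hit_ratio(213, 2):.1%}")  # prints "99.1%"
```

In other words, under those assumptions nearly every request would have to be served from the cache, which is why the pages need to be cacheable in the first place.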
But if you're as interested in understanding the phenomenon as you are in solving the problem and you'd like to help the current Squishdot maintainer and ZC improve their products' behavior under load, it'd be necessary to know more details about how it was failing under load and what happened during the failures. I would be interested in these results. It could be a memory leak, it could be a Zope bug, a Squishdot bug, it could be just about anything. You need forensic information and you need to let it fail under load in order to get it.
Usually, you can get this info by turning on "big M" logging (by passing "-M detailed.log" at the end of your start.bat script, maybe). On Linux, I'd recommend also using the ForensicLogger product (see http://www.zope.org/Members/mcdonc) to gather more details such as memory utilization and CPU utilization; it doesn't work on Windows, however. If you're willing to do this, let it fail under load, then send the log with the failure in it to me and I will try to analyze it.
Note that you *might* be able to make use of the AutoLance product at http://www.zope.org/Members/mcdonc to autorestart your machine for you if you've got a memory leak.
HTH,
- C