On Mon, 2002-03-04 at 08:48, Matthew T. Kromer wrote: ...
This issue has been discussed again and again,
I would like to clarify my idea and your comments will be very appreciated.
Suppose we want to provide a server which is:
1) Hosting 1,000,000 members' profiles. Each member's disk quota is 5 MB,
which means we need at least 5,000 GB (5 TB) of disk space.
2) Assume concurrent access to a URL is 1,000 requests per second.
3) Assume all the requests retrieve dynamic content.
4) We want to leverage the power of Zope which means all the pages should be
rendered by zope.
All the pages, sure, but you *really* should consider using an Image server if you will have any images. I agree with Matthew that your load estimate is low.
Having 5 TB of disk space usually means some very high-powered RAID gear; my personal favorite is the EMC Symmetrix units; I think you would probably want at least two of those to provide your coverage. Estimated cost for this is about $5,000,000 (but *very* dependent on EMC's pricing strategies).
You do realize that per-user disk quotas will *not* work with Zope, right (no, not aimed at you, Matt)? I do not know of any in-Zope user storage quotas, so if you want that, you may have to roll your own or contract out for it. Someone please prove me wrong on that one. ;^)

Well, as someone who works in this particular field, I have a slight bias in product choice (but then again, remember I *test* these things for a living), but I do have some comments about the hardware/filesystem layout. For something this size, I would highly recommend a Linux setup against a set of VA7400 units from HP. Much less expensive than EMC (you should be able to pick up a unit with ~7.7 TB for around 250,000-750,000 USD or so, depending on the reseller, etc.), and IMO, more useful. In fact, if you have the budget for the EMCs noted above, I would get two VA7400 units, fully configured, and run them in a Linux RAID-1 array for extra redundancy. :) (No, HP doesn't officially support that configuration, but I have done it, and it is fairly sweet. ;) )

I would recommend you use XFS filesystems (five of them at 1 TB each, actually), and then arrange your mounting hierarchy to account for that.
You could get by for less, by distributing each disk with each CPU (the breadrack approach.)
There are other ways, too. :)
1000 requests/second isn't terribly high; Zope installations have done 400/sec with no problem. However, these are situations where Zope is being heavily cached; less than 10% of the requests are actually being rendered by Zope. So, if you wanted no caching (i.e., everything is
We are talking a ZEO/Zope setup, right? What were the Zope Server node specs, and how many? :^) (yes, I am "writing a book" ;^) )
completely dynamic content), my estimate is you would need something like 100 1 GHz Intel Pentium III-class machines to perform that amount of dynamic rendering. If each of those machines had a 50 GB disk drive, you'd theoretically have your 5 TB of disk space. At a rough commercial cost of $4,000 per unit (probably a bit high), that's only $400,000.
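For what it's worth, that sizing is simple back-of-envelope arithmetic; a quick sketch, using only the machine count, per-node disk, and per-unit cost from the paragraph above:

```python
# Back-of-envelope check of the 100-machine cluster sizing above.
machines = 100
disk_per_machine_gb = 50      # GB of local disk per node
cost_per_machine_usd = 4000   # rough commercial cost per unit

total_disk_tb = machines * disk_per_machine_gb / 1000.0
total_cost_usd = machines * cost_per_machine_usd

print(total_disk_tb)   # 5.0  (the 5 TB target)
print(total_cost_usd)  # 400000
```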
As a practical matter, you'd then need some pretty hefty load balancing servers; at least two, possibly more.
However, that begs the question of how you organize that much disk space. It's not an easy task. Whether or not you use an RDBMS is irrelevant until you can work out a strategy for using the disk space scattered amongst all of those machines.
You could use the FibreChannel VA7400s, with a switch, to provide shared storage for a pair of ZODB servers, one in standby mode, for ZSS failover capability. Then, you could hook up with RLX Technologies and purchase a few fully packed blade chassis units. Four full 324ex systems would provide you with 96 Zope servers to load balance across. With each of them talking to a failover-ready ZODB Storage Server, the issue of arranging disk space is avoided (after configuring the server). IMO, this should easily handle about 1600 requests/sec (rps).

This solution would be much cheaper, as you can obtain a pair of full RLX 324ex units for slightly more (or less, depending on the deal) than 40,000 USD. In addition, such a setup would provide hot replacement and building of additional servers. Each node on the Zope server cluster could run Apache locally. With the control tower software capabilities in the RLX system, you would build the system once and push the image out to the remaining systems. (See, I'm not HP-biased, just product-biased in arrays ;^). HP makes blade units, but IMO, they suck.)

A total of six full 324 units would run ~250,000 USD (I just did a web check, and that comes to 252,219 USD; each cluster node was 1/4 GB RAM, with an 800 MHz proc), and provide 288 Zope servers, which should be able to handle the load quite well. If a given Zope server could push 25 rps, that provides you with approximately 7,200 rps of capacity. If you assume your code is complex, resulting in lower rps rates, we can figure 15 rps per server, resulting in ~4,300 rps of capacity (or roughly 11 billion requests/month). If you implement a TUX-based image server, this should handle your load quite well.

With the money saved by using VA7400s versus Symmetrix arrays, and the rather low cost of the cluster blades, you can scale your server farm quite a bit, in theory. IMO, the unknown here is how well the ZEO Storage Server would perform with that many clients.
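Those throughput figures follow from straight multiplication; here is the arithmetic, using the per-server rates assumed above (25 rps for simple pages, 15 rps for complex code):

```python
# Rough capacity estimate for 6 full RLX 324 chassis (48 blades each).
servers = 6 * 48                 # 288 Zope servers
best_case_rps = servers * 25     # simple pages
worst_case_rps = servers * 15    # complex code
monthly_requests = worst_case_rps * 60 * 60 * 24 * 30

print(best_case_rps)      # 7200
print(worst_case_rps)     # 4320
print(monthly_requests)   # 11197440000  (~11 billion/month)
```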
Matt, any idea? The ZSS, IMO, seems to be the real bottleneck in massively scaled ZEO/Zope solutions.
What is most unrealistic about this scenario is your assumptions about the member base, and its ratio to expected activity. One million users may only generate 1,000 requests/sec, but they certainly could generate a lot more. In fact, a critical strategy for large systems like this is anticipating "peak demand" events.

Let's say you send an e-mail out to all million people, telling them to log in and check out a particular URL. That timed event will generate a demand curve that is not evenly distributed over time; in fact, it is usually very front-loaded. Within about 5 minutes, more than 10% of the user base will probably respond. This is a raw rate of about 333 requests/sec, but that presumes that the single URL is the only thing they load; usually, a page contains images and other content (style sheets, etc.) which also must be fetched. Pages with a high art content can have 25 elements or more on them. That pushes the request rate up to 8,333 requests/sec; way out of the 1,000 requests/sec bound.
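The peak-demand numbers work out like this (the 10% response within 5 minutes and 25 elements per page are the assumptions stated above):

```python
# Peak-demand arithmetic for the mass-mailing scenario above.
users = 1000000
responders = users * 0.10        # 10% respond within the window
window_seconds = 5 * 60          # 5-minute front-loaded burst
page_rps = responders / window_seconds          # page loads per second
elements_per_page = 25           # images, style sheets, etc.
total_rps = page_rps * elements_per_page        # total HTTP requests/sec

print(round(page_rps))   # 333
print(round(total_rps))  # 8333
```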
Here, one would have to look at a separate server for images. Ask yourself this: is there a compelling reason to store the images in the ZODB at that level of demand? One could put the images on a single TUX server and improve the performance of the cluster dramatically. If the load were beyond TUX's capacity (on a 2.4 kernel, TUX has been able to achieve over 12,000 transactions/sec), one could use the two ZSS nodes to serve up the images using TUX, since they share the same storage. They would both mount that filesystem in read-only mode. If the images *had* to be in the ZODB, I suppose one could use cache management to cache the images, but for something at that scale, the TUX route may be the better option, IMO.
The principles I would like to verify are:
1) Some database (RDBMS) should be used instead of FileStorage for ZODB.
Your database needs/capabilities should be the determination here. I have a ZODB in use that is approaching the 1TB mark.
2) The ZEO should be used for constructing a cluster computing.
Without question, yes.
3) The Apache should be the front end instead of ZServer.
4) The PCGI should be the connection between Apache and Zope.
I would challenge that one. It is my understanding that PCGI slows things down; I'd have to look back into the archives to verify, however. I moved to mod_proxy myself, and I seem to recall it being much faster than PCGI.
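For illustration, a minimal mod_proxy sketch for fronting Zope with Apache; the hostname is a placeholder, and Zope's usual HTTP port of 8080 is an assumption here, so adjust to your setup:

```apache
# Hedged sketch: proxy all requests through Apache to a local
# Zope/ZServer instance instead of using PCGI.
<VirtualHost *:80>
    ServerName www.example.com
    ProxyPass        / http://localhost:8080/
    ProxyPassReverse / http://localhost:8080/
</VirtualHost>
```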
5) I shouldn't create static instances in the ZODB but query the
external database.
Now, this is not the way I understood the earlier comment regarding storing the Zope DB in an RDBMS. The use of an external SQL DB as opposed to ZODB is a system design issue, IMO.
6) The "Cache" of Zope is useless since all the responses are dynamically rendered.
Not completely, unless the site would have no images, or all images were stored separately, as suggested above. Furthermore, there are other caches that can cache the results of more intensive methods, where the data may change, but somewhat infrequently. This speeds things up considerably.
By the way, how much will this kind of system cost, regardless of the hardware?
Can't say; it is entirely dependent on the hardware, since that is what you are paying for -- unless you hire a consultant. Hypothetically speaking, if you went the HP/RLX route, you could easily handle your (revised) load expectations for under 2,000,000 USD, IMO. That, and a bit of effort and know-how, should provide a system with failover ZSS, redundant arrays (using a single array would dramatically decrease costs, as would getting a good deal on the VA7400s), and a monstrous-sized cluster (though not physically; we are talking two standard 42U racks here for the whole thing, one if you go with a single array). Your ongoing power costs (if you are physically hosting your datacenter, of course) are quite low, given the capacity.

Hope this helps. Besides, it is time to get back to my app development for the night. :)

Bill

-- 
Bill Anderson
Linux in Boise Club            http://www.libc.org
Amateurs built the Ark, professionals built the Titanic.
Amateurs build Linux, professionals build Windows(tm).