On Mon, 2002-03-04 at 08:48, Matthew T. Kromer wrote: ...
This issue has been discussed again and again,
I would like to clarify my idea and your comments will be very appreciated.
Suppose we want to provide a server which is:
1) Hosting 1,000,000 members' profiles. Each member's disk quota is 5 MB,
which means we need at least 5,000 GB (5 TB) of disk space.
2) Assume concurrent access to a URL is 1,000 requests per second.
3) Assume all the requests retrieve dynamic content.
4) We want to leverage the power of Zope which means all the pages should be
rendered by zope.
All the pages, sure, but you *really* should consider using an Image server if you will have any images. I agree with Matthew that your load estimate is low.
Having 5 TB of disk space usually means some very high-powered RAID gear; my personal favorite is the EMC Symmetrix units; I think you would probably want at least two of those to provide your coverage. Estimated cost for this is about $5,000,000 (but *very* dependent on EMC's pricing strategies).
You do realize that per-user disk quotas will *not* work with Zope, right (no, not aimed at you, Matt)? I do not know of any in-Zope user storage quotas, so if you want that, you may have to roll your own or contract out for it. Someone please prove me wrong on that one. ;^)

Well, as someone who works in this particular field, I have a slight bias in product choice (but then again, remember I *test* these things for a living), but I do have some comments about the hardware/filesystem layout. For something this size, I would highly recommend a Linux setup against a set of VA7400 units from HP. Much less expensive than EMC (you should be able to pick up a unit with ~7.7 TB for around 250,000-750,000 USD or so, depending on the reseller, etc.), and IMO, more useful. In fact, if you have the budget for the EMCs noted above, I would get two VA7400 units, fully configured, and run them in a Linux RAID-1 array for extra redundancy. :) (No, HP doesn't officially support that configuration, but I have done it, and it is fairly sweet. ;) )

I would recommend you use XFS filesystems (five of them at 1 TB each, actually), and then arrange your mounting hierarchy to account for that.
You could get by for less, by distributing each disk with each CPU (the breadrack approach.)
There are other ways, too. :)
1000 requests/second isn't terribly high; Zope installations have done 400/sec with no problem. However, these are situations where Zope is being heavily cached; less than 10% of the requests are actually being rendered by Zope. So, if you wanted no caching (i.e., everything is
We are talking a ZEO/Zope setup, right? What were the Zope Server node specs, and how many? :^) (yes, I am "writing a book" ;^) )
completely dynamic content), my estimate is you would need something like 100 1 GHz Intel Pentium III-class machines to perform that amount of dynamic rendering. If each of those machines had a 50 GB disk drive, you'd theoretically have your 5 TB of disk space. At a rough commercial cost of $4,000 per unit (probably a bit high), that's only $400,000.
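For what it's worth, that sizing is simple back-of-envelope arithmetic; a quick sketch, using only the machine count, per-node disk, and per-unit cost from the paragraph above:

```python
# Back-of-envelope check of the 100-machine cluster sizing above.
machines = 100
disk_per_machine_gb = 50      # GB of local disk per node
cost_per_machine_usd = 4000   # rough commercial cost per unit

total_disk_tb = machines * disk_per_machine_gb / 1000.0
total_cost_usd = machines * cost_per_machine_usd

print(total_disk_tb)   # 5.0  (the 5 TB target)
print(total_cost_usd)  # 400000
```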
As a practical matter, you'd then need some pretty hefty load balancing servers; at least two, possibly more.
However, that begs the question of how you organize that much disk space. It's not an easy task. Whether or not you use an RDBMS is irrelevant until you can work out a strategy for using the disk space scattered amongst all of those machines.
You could use the FibreChannel VA7400s, with a switch, to provide shared storage for a pair of ZODB servers, one in standby mode, for ZSS failover capability. Then, you could hook up with RLX Technologies and purchase a few fully packed blade chassis units. Four full 324ex systems would provide you with 96 Zope servers to load balance across. With each of them talking to a failover-ready ZODB Storage Server, the issue of arranging disk space is avoided (after configuring the server). IMO, this should easily handle about 1600 requests/sec (rps).

This solution would be much cheaper, as you can obtain a pair of full RLX 324ex units for slightly more (or less, depending on the deal) than 40,000 USD. In addition, such a setup would provide hot replacement and building of additional servers. Each node on the Zope server cluster could run Apache locally. With the control tower software capabilities in the RLX system, you would build the system once and push the image out to the remaining systems. (See, I'm not HP-biased, just product-biased in arrays ;^). HP makes blade units, but IMO, they suck.)

A total of six full 324 units would run ~250,000 USD (I just did a web check, and that comes to 252,219 USD; each cluster node was 1/4 GB RAM, with an 800 MHz proc), and provide 288 Zope servers, which should be able to handle the load quite well. If a given Zope server could push 25 rps, that provides you with approximately 7,200 rps of capacity. If you assume your code is complex, resulting in lower rps rates, we can figure 15 rps per server, resulting in ~4,300 rps of capacity (or roughly 11 billion requests/month). If you implement a TUX-based image server, this should handle your load quite well.

With the money saved by using VA7400s versus Symmetrix arrays, and the rather low cost of the cluster blades, you can scale your server farm quite a bit, in theory. IMO, the unknown here is how well the ZEO Storage Server would perform with that many clients.
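Those throughput figures follow from straight multiplication; here is the arithmetic, using the per-server rates assumed above (25 rps for simple pages, 15 rps for complex code):

```python
# Rough capacity estimate for 6 full RLX 324 chassis (48 blades each).
servers = 6 * 48                 # 288 Zope servers
best_case_rps = servers * 25     # simple pages
worst_case_rps = servers * 15    # complex code
monthly_requests = worst_case_rps * 60 * 60 * 24 * 30

print(best_case_rps)      # 7200
print(worst_case_rps)     # 4320
print(monthly_requests)   # 11197440000  (~11 billion/month)
```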
Matt, any idea? The ZSS, IMO, seems to be the real bottleneck in massively scaled ZEO/Zope solutions.
What is most unrealistic about this scenario is your assumptions about the member base, and its ratio to expected activity. One million users may only generate 1,000 requests/sec, but they certainly could generate a lot more. In fact, a critical strategy for large systems like this is anticipating "peak demand" events.

Let's say you send an e-mail out to all million people, telling them to log in and check out a particular URL. That timed event will generate a demand curve that is not evenly distributed over time; in fact, it is usually very front-loaded. Within about 5 minutes, more than 10% of the user base will probably respond. This is a raw rate of about 333 requests/sec, but that presumes that the single URL is the only thing they load; usually, a page contains images and other content (style sheets, etc.) which also must be fetched. Pages with a high art content can have 25 elements or more on them. That pushes the request rate up to 8,333 requests/sec; way out of the 1,000 requests/sec bound.
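The peak-demand numbers work out like this (the 10% response within 5 minutes and 25 elements per page are the assumptions stated above):

```python
# Peak-demand arithmetic for the mass-mailing scenario above.
users = 1000000
responders = users * 0.10        # 10% respond within the window
window_seconds = 5 * 60          # 5-minute front-loaded burst
page_rps = responders / window_seconds          # page loads per second
elements_per_page = 25           # images, style sheets, etc.
total_rps = page_rps * elements_per_page        # total HTTP requests/sec

print(round(page_rps))   # 333
print(round(total_rps))  # 8333
```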
Here, one would have to look at a separate server for images. Ask yourself this: is there a compelling reason to store the images in the ZODB at that level of demand? One could put the images on a single TUX server and improve the performance of the cluster dramatically. If the load were beyond TUX's capacity (on a 2.4 kernel, TUX has been able to achieve over 12,000 transactions/sec), one could use the two ZSS nodes to serve up the images using TUX, since they share the same storage. They would both mount that filesystem in read-only mode. If the images *had* to be in the ZODB, I suppose one could use cache management to cache the images, but for something at that scale, the TUX route may be the better option, IMO.
The principles I would like to verify are:
1) Some database (RDBMS) should be used instead of FileStorage for ZODB.
Your database needs/capabilities should be the determination here. I have a ZODB in use that is approaching the 1TB mark.
2) The ZEO should be used for constructing a cluster computing.
Without question, yes.
3) The Apache should be the front end instead of ZServer.
4) The PCGI should be the connection between Apache and Zope.
I would challenge that one. It is my understanding that PCGI slows things down; I'd have to look back into the archives to verify, however. I moved to mod_proxy myself, and I seem to recall it being much faster than PCGI.
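For illustration, a minimal mod_proxy sketch for fronting Zope with Apache; the hostname is a placeholder, and Zope's usual HTTP port of 8080 is an assumption here, so adjust to your setup:

```apache
# Hedged sketch: proxy all requests through Apache to a local
# Zope/ZServer instance instead of using PCGI.
<VirtualHost *:80>
    ServerName www.example.com
    ProxyPass        / http://localhost:8080/
    ProxyPassReverse / http://localhost:8080/
</VirtualHost>
```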
5) I shouldn't create static instances in the ZODB but query the
external database.
Now, this is not the way I understood the earlier comment regarding storing the Zope DB in an RDBMS. The use of an external SQL DB as opposed to ZODB is a system design issue, IMO.
6) The "Cache" of Zope is useless since all the responses are dynamically rendered.
Not completely, unless the site would have no images, or all images were stored separately, as suggested above. Furthermore, there are other caches that can cache the results of more intensive methods, where the data may change, but somewhat infrequently. This speeds things up considerably.
By the way, how much will this kind of system cost, regardless of the hardware?
Can't say; it is entirely dependent on the hardware, since that is what you are paying for -- unless you hire a consultant. Hypothetically speaking, if you went the HP/RLX route, you could easily handle your (revised) load expectations for under 2,000,000 USD, IMO. That, and a bit of effort and know-how, should provide a system with failover ZSS, redundant arrays (using a single array would dramatically decrease costs, as would getting a good deal on the VA7400s), and a monstrous-sized cluster (though not physically; we are talking two standard 42U racks here for the whole thing, one if you go with a single array). Your ongoing power costs (if you are physically hosting your datacenter, of course) are quite low, given the capacity.

Hope this helps. Besides, it is time to get back to my app development for the night. :)

Bill

-- 
Bill Anderson
Linux in Boise Club            http://www.libc.org
Amateurs built the Ark, professionals built the Titanic.
Amateurs build Linux, professionals build Windows(tm).