[Zope] Folder with one million Documents?

Joachim Werner joe@iuveno-net.de
Sun, 27 Jan 2002 03:14:24 +0100


Hi!

Just my 2 eurocents:

> I am developing a simple DMS. Up to now I use a python product with a
> BTreeFolder which
> contains all the documents. Every document gets an ID with
> DateTime().millis(). There will
> be up to 50 users working at the same time. And in the end I will have
> up to 3 million documents.
>
> Is there a better class than BTreeFolder for such mass storage?

If it is mainly large documents (like MS Office or PDF files) you are trying
to manage, the fastest way of handling this is to use the filesystem for
storage and serving. You could do the cataloging in Zope and keep link
objects in a Zope tree that point to the actual files (and yes, if it is MANY
objects, BTrees will be a good idea). These link objects could also hold the
metadata.
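
Roughly what I mean by "link objects", just as a sketch in plain Python (not
a finished Zope product: a plain dict stands in for the BTreeFolder, and the
class, attribute and URL names are made up for illustration):

    import os
    import time

    class FileLink:
        """Points at a file on disk and carries the metadata to catalog."""

        def __init__(self, path, title, author):
            self.path = path                   # where Apache/Samba serves it from
            self.title = title
            self.author = author
            self.size = os.path.getsize(path)  # taken from the filesystem
            self.created = time.time()

        def download_url(self, base="http://files.example.com/docs"):
            # Hand out a URL that Apache serves directly instead of
            # streaming the bytes through Zope.
            return "%s/%s" % (base, os.path.basename(self.path))

    # In Zope this container would be a BTreeFolder; a plain dict stands in.
    links = {}

    def add_document(links, path, title, author):
        link = FileLink(path, title, author)
        links[os.path.basename(path)] = link   # id = file name in this sketch
        return link

The catalog would then index the small link objects (title, author, date),
while the file bytes themselves never have to pass through Zope.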

For the actual file serving, you'd use Apache (or if you can, SMB via Samba
in the Intranet).

I did some benchmarks of Zope's input/output performance a couple of months
ago. On a rather old Solaris machine (which has great I/O throughput, but
rather poor CPU performance), Apache could serve files almost at "wire
speed", so the Ethernet card was the bottleneck. Zope took much longer and
consumed a lot of system resources.

However, there is one important caveat with using Apache + filesystem or
Samba: you'll have to make sure that the files are secured by Apache, as
Zope cannot protect them at the filesystem level with the Zope security
engine.

I can't really see how an RDBMS would help you with performance. You'd need
something professional, like Oracle (though PostgreSQL might do the job,
too), and those servers eat RAM for breakfast and like fast CPUs. Of course,
if you can spend a lot of money, Oracle will be the only solution that scales
across multiple servers. Then you could have almost any performance level you
need, but at a price.

It's a good question whether ZEO could help. As long as you have only one
main DB, certainly not: that DB will always be the bottleneck for write
access. But of course you could set up separate DBs on separate servers,
e.g. one per department. Those servers could do their own indexing, and a
centralized index server could retrieve the index information from them. ZEO
would also help with search requests, as the index objects will be cached in
the ZEO clients.
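
Very roughly, the "centralized index" could just fan a query out to the
departmental indexes and merge whatever comes back. A plain-Python sketch;
catalog.search() here is a placeholder, not the real ZCatalog API:

    def central_search(department_catalogs, query):
        """Query every departmental index and return one merged hit list."""
        merged = []
        for dept, catalog in department_catalogs.items():
            for hit in catalog.search(query):   # placeholder API
                merged.append((dept, hit))
        # Sort by whatever relevance or date field the indexes provide.
        merged.sort(key=lambda item: item[1].get("date", 0), reverse=True)
        return merged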

But I'd do some benchmarks first. Even with 50 concurrent users, you'll
probably never have all 50 posting documents at the same moment. Uploading a
document will certainly not be the problem. The most time-consuming task
will be the online indexing, so you'll probably have to forget about that
and do delayed batch indexing at night instead.
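
The batch variant could be as simple as this (again just a sketch: uploads
only append a document id to a queue, and catalog.index_document() and
commit() stand in for whatever the real indexing and transaction calls are):

    def nightly_index(queue, catalog, commit, batch_size=500):
        """Index queued documents in batches so no transaction gets huge."""
        done = 0
        while queue:
            batch = queue[:batch_size]
            del queue[:batch_size]
            for doc_id in batch:
                catalog.index_document(doc_id)  # placeholder for the real call
            commit()                            # one transaction per batch
            done += len(batch)
        return done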

Joachim