Folder with one million Documents?
Hi!

I am developing a simple DMS. Up to now I use a Python product with a BTreeFolder which contains all the documents. Every document gets an ID from DateTime().millis(). There will be up to 50 users working at the same time, and in the end I will have up to 3 million documents.

Is there a better class than BTreeFolder for such mass storage?

For the curious, here is one result of the benchmarks. I benchmarked it with httperf:
+ 5 requests per second
+ 10 MBit connection between client and server
+ every request creates a document

Number of documents: 2159
httperf.exe: warning: open file limit > FD_SETSIZE; limiting max. # of open files to FD_SETSIZE
httperf.exe --timeout=5 --client=0/1 --server=prophet --port=8080 --uri=/a/benchmarks/create_new_doc --rate=5 --send-buffer=4096 --recv-buffer=16384 --add-header='Authorization: Basic em9wZTp6b3Bl\n' --num-conns=1000 --num-calls=1
Maximum connect burst length: 1
Total: connections 1000 requests 1000 replies 571 test-duration 204.814 s
Connection rate: 4.9 conn/s (204.8 ms/conn, <=26 concurrent connections)
Connection time [ms]: min 110.0 avg 977.2 max 5257.0 median 289.5 stddev 1315.9
Connection time [ms]: connect 0.5
Connection length [replies/conn]: 1.000
Request rate: 4.9 req/s (204.8 ms/req)
Request size [B]: 120.0
Reply rate [replies/s]: min 0.0 avg 2.9 max 5.2 stddev 2.2 (40 samples)
Reply time [ms]: response 976.4 transfer 0.0
Reply size [B]: header 216.0 content 79.0 footer 0.0 (total 295.0)
Reply status: 1xx=0 2xx=562 3xx=0 4xx=0 5xx=9
CPU time [s]: user 83.96 system 117.15 (user 41.0% system 57.2% total 98.2%)
Net I/O: 1.4 KB/s (0.0*10^6 bps)
Errors: total 429 client-timo 429 socket-timo 0 connrefused 0 connreset 0
Errors: fd-unavail 0 addrunavail 0 ftab-full 0 other 0
Number of documents: 3075
--end: Fri Jan 25 14:25:34 2002
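One caveat worth noting: DateTime().millis() alone is a risky ID source, because with up to 50 concurrent users two documents created in the same millisecond would collide. A minimal sketch of one way around this (a hypothetical helper, not part of the original product) combines the millisecond timestamp with a process-local counter:

```python
import itertools
import threading
import time

# Hypothetical helper, not from the original product: a per-process counter
# guarantees uniqueness even when two documents arrive in the same millisecond.
_counter = itertools.count()
_lock = threading.Lock()

def make_doc_id():
    """Return a fixed-width, collision-resistant document ID string."""
    with _lock:
        serial = next(_counter)
    millis = int(time.time() * 1000)
    # Zero-padded fields keep the IDs lexically sortable within one process.
    return "%013d-%06d" % (millis, serial)
```

Because both fields are zero-padded to a fixed width, the IDs sort in creation order within a single process, which is convenient as a BTreeFolder key.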
Hi! Just my 2 eurocents:
I am developing a simple DMS. Up to now I use a python product with a BTreeFolder which contains all the documents. Every document gets an ID with DateTime().millis(). There will be up to 50 users working at the same time. And in the end I will have up to 3 million documents.
Is there a better class than BTreeFolder for such mass storage?
If it is mainly large documents (like MS Office or PDF files) you are trying to manage, the fastest way of handling this is to use the filesystem for storage and serving. You could do the cataloging in Zope and hold link objects to the actual files in a Zope tree (and yes, if it is MANY objects, BTrees will be a good idea). These link objects could also manage the metadata. For the actual file serving, you'd use Apache (or, on an intranet, SMB via Samba if you can).

I did some benchmarks of Zope's input/output performance a couple of months ago. On a rather old Solaris machine (which has great I/O throughput but rather poor CPU performance), Apache could serve files almost at "wire speed", so the Ethernet card was the bottleneck. Zope took much longer and consumed a lot of system resources.

However, there is one important caveat with using Apache + filesystem or Samba: you'll have to make sure that the files are secured by Apache, as Zope cannot protect them on the filesystem level with the Zope security engine.

I can't really see how an RDBMS would help you with performance. You'd need something professional, like Oracle (though PostgreSQL might do the job, too), and those servers eat RAM for breakfast and like fast CPUs. Of course, if you can spend a lot of money, Oracle will be the only solution that scales onto multiple servers. Then you could have almost any performance level you need, but at a price.

It's a good question whether ZEO could help. As long as you have only one main DB, certainly not: that one will always be the bottleneck for write access. But of course you could put up separate DBs on separate servers, e.g. have a server for each department. Those servers could do their own indexing, and a centralized index server could retrieve the index information from them. ZEO would also help with search requests, as the index objects will be cached in the ZEO clients. But I'd do some benchmarks first.
Even if you have 50 concurrent users, you'll probably not have 50 users posting docs at the same moment. Uploading a document will certainly not be the problem. The most time-consuming task will be the online indexing, so probably you'll have to forget about it and do a delayed batch indexing at night. Joachim
--- Joachim Werner <joe@iuveno-net.de> wrote:
Hi!
Just my 2 eurocents:
I am developing a simple DMS. Up to now I use a python product with a BTreeFolder which contains all the documents. Every document gets an ID with DateTime().millis(). There will be up to 50 users working at the same time. And in the end I will have up to 3 million documents.
Is there a better class than BTreeFolder for such mass storage?
If it is mainly large documents (like MS Office or PDF files) you are trying to manage, the fastest way of handling this is using the filesystem for storage and serving. You could do the cataloging in Zope and hold link objects to the actual files in a Zope tree (and yes, if it is MANY objects, BTrees will be a good idea). These links could also manage the metadata.
I thoroughly agree. Having developed a DMS myself, my cut-off point (which is really just engineering intuition more than anything) was at about 5000 documents: above that, it is best to store the documents directly in the file system. Since the DMS I developed (DocumentLibrary) targets fewer than 5000 documents, I went for the simpler route of storing them in a BTreeFolder.

To make an effective FS storage system, you will have to write code that processes uploads and places them in an arbitrary hierarchy. Putting 3 million documents in one FS directory will just plain fail on most filesystems, and at best will perform dismally. You'll need to devise a way for the system to subdivide the documents across a shallow hierarchy of directories, something like Squid does with its cache directories.

For serving the files you could use Apache, but I might be tempted to try something simpler, like micro httpd or TUX or another light-weight server. I agree that serving static binaries is not ZServer's strong suit. I guess that choice will depend on the frequency and size of downloads.

Another thought: store the files in the FS and proxy them through Zope, like ExtFile does, then put Squid in front of Zope to cache them so that they are only served from Zope the first time. Then you don't have to worry about what stuff is getting served from where.

BTW: If you do set up any nifty FS storage solution, I would be interested in seeing it for a future version of DocumentLibrary.

Good Luck!

-Casey

__________________________________________________
Do You Yahoo!? Great stuff seeking new owners in Yahoo! Auctions! http://auctions.yahoo.com
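The Squid-style subdivision described above can be sketched in a few lines. This is a hypothetical helper (the name shard_path and its parameters are not from DocumentLibrary or Squid), assuming a hash of the document ID picks the subdirectories so files spread evenly across a shallow tree:

```python
import hashlib
import os

def shard_path(root, doc_id, levels=2, width=2):
    """Map a document ID onto a shallow directory hierarchy so that no
    single directory ever holds millions of files (similar in spirit to
    Squid's two-level cache layout).

    With levels=2 and width=2 there are 256 * 256 = 65536 leaf directories,
    so 3 million documents average fewer than 50 files per directory.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, doc_id)
```

The mapping is deterministic, so no index is needed to locate a file: the same ID always hashes to the same directory.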
Casey Duncan wrote:
--- Joachim Werner <joe@iuveno-net.de> wrote:
Hi!
Just my 2 eurocents:
I am developing a simple DMS. Up to now I use a python product with a BTreeFolder which contains all the documents. Every document gets an ID with DateTime().millis(). There will be up to 50 users working at the same time. And in the end I will have up to 3 million documents.
Is there a better class than BTreeFolder for such mass storage?
I thoroughly agree. Having developed a DMS myself, My cut-off point (which is really just an engineering intuition more than anything) was at about 5000 documents, it would be best to store them directly in the file system.
Unfortunately my documents are not static files. They are Python classes: they contain fields for metadata and sometimes binary files. Since my documents are not files, I think it is not useful to use a filesystem. The next idea would be to have a database (maybe MySQL, because it is fast) with one table containing two columns: an ID and the pickled object. Or you could use Berkeley DB for this.
BTW: If you do set up any nifty FS storage solution, I would be interested in seeing it for future version of DocumentLibrary.
This list has helped me a lot. I will try to give something back and write a little howto if I get a good solution.

thomas
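The two-column (ID, pickled object) table suggested above can be sketched as follows, using sqlite3 purely as a stand-in for MySQL or Berkeley DB; the PickleTable class and its methods are hypothetical:

```python
import pickle
import sqlite3

class PickleTable:
    """Minimal sketch of the 'one table, two columns' idea:
    an ID column plus a BLOB holding the pickled Python object."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS docs (id TEXT PRIMARY KEY, data BLOB)")

    def put(self, doc_id, obj):
        # Serialize the object and upsert it under its ID.
        blob = pickle.dumps(obj)
        self.conn.execute(
            "INSERT OR REPLACE INTO docs VALUES (?, ?)", (doc_id, blob))
        self.conn.commit()

    def get(self, doc_id):
        # Fetch and unpickle, or return None when the ID is unknown.
        row = self.conn.execute(
            "SELECT data FROM docs WHERE id = ?", (doc_id,)).fetchone()
        return pickle.loads(row[0]) if row else None
```

Note that this is essentially what the ZODB itself does internally (pickles keyed by object ID), which is why a later reply in this thread argues against rolling it by hand.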
Hi!
Unfortunately my documents are not static files. They are Python classes. They contain fields for metadata and sometimes binary files. Since my documents are not files I think it is not useful to use a filesystem.
But I guess the large parts are always in the binaries, and binaries can be stored as files. Another issue is serving the objects. Will you have to return a combination of the binary and the properties to the client? If yes, you could still go with the caching approach that was suggested, or save pre-rendered objects to the filesystem.

The next idea would be to have a database (maybe MySQL, because it is fast) with one table containing two columns: an ID and the pickled object.
Even if there are other opinions on this list: I can't believe that MySQL can be THAT efficient with large files. As a matter of fact, it stores large binaries as files, so it can't be more efficient than the file-system approach. MySQL will also be a problem if you have a lot of concurrent reads and writes. It is really fast with a few clients, but with many clients PostgreSQL is supposed to scale better.
Or you could use Berkeley DB for this.
The BerkeleyDB implementation of the ZODB is not really any faster AFAIK, just more flexible WRT building DBs that are "packless", non-undoing and the like.

Another comment: why can many large docs slow down Zope? Because it will try to do things like objectValues(), which wakes up all the children of a folderish object. So if you can avoid these calls (i.e. avoid or customize the ObjectManager API) and use BTreeFolders, many of the problems will probably go away.

I don't know whether the ZCatalog is more or less efficient than doing the indexing in an RDBMS. Probably the whole thing really has to be benchmarked properly. If we end up with the RDBMS solution being most efficient for indexing, we might end up with a combination of all three:

- Zope and ZODB for glueing everything together
- an RDBMS for indexing, and probably for storing the properties, too
- the file system + Apache for serving large files

Joachim
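Joachim's point about objectValues() waking up every child can be illustrated with a small stand-in class (plain Python, not the real Zope or BTreeFolder API; LazyFolder and its method names are hypothetical):

```python
class LazyFolder:
    """Stand-in for a BTreeFolder whose children are loaded on demand.

    The `loads` counter makes the cost visible: objectValues()-style access
    wakes every child at once, while a generator wakes only what you consume.
    """

    def __init__(self):
        self._ids = set()
        self.loads = 0  # how many children have been "woken up"

    def add(self, doc_id):
        self._ids.add(doc_id)

    def object_values(self):
        # Materializes every child object -- expensive with millions of docs.
        return [self._load(i) for i in self._ids]

    def iter_objects(self):
        # Lazy alternative: children are loaded one at a time, on demand.
        for i in sorted(self._ids):
            yield self._load(i)

    def _load(self, doc_id):
        self.loads += 1
        return {"id": doc_id}
```

With 3 million children the difference between the two access patterns is the difference between touching one object and touching all of them.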
--- Thomas Guettler <zopestoller@thomas-guettler.de>
Unfortunately my documents are not static files. They are python classes. They contain fields for meta data and sometimes binary files.
Then my suggestion would be to store the object instances in a BTree of some kind (my suggestion would be an IOBTree, which would just index them with an integer), but keep the binary data separated out in external files. It would be straightforward to write a wrapper class that manipulates the external binaries, storing only the file path in the ZODB. Storing pickles in an external database yourself makes no sense; that's what the ZODB already does.
thomas
participants (3)
- Casey Duncan
- Joachim Werner
- Thomas Guettler