RE: [Zope] Folder with one million Documents?
This will be taxing on Zope, so you need to be patient enough to optimize your application a bit. BTreeFolder works well for this, provided you are willing to bypass the ObjectManager APIs and read/write to BTreeFolder._tree directly, or use BTreeFolder._setOb() and BTreeFolder._getOb() instead of ObjectManager._getObject()...

You also will REALLY need some nice hardware. I would suggest the fastest box you can get with LOTS of RAM. I would look at something along the lines of a Dual Athlon 2000+ (P-rated, not MHz) box with 3-4 GB RAM, and a striped RAID volume of fast disks.

I have a BTreeFolder-derived folder and have populated it with about a third-of-a-million Cataloged objects, with each object using an underlying relational datastore, and about 8 Cataloged indexes, mostly field indexes that index the result of a relational query. Bulk adding these objects from an RDB datasource with cataloging takes about 2-3 hours on a P4 1.4GHz, and right now with my application, the Catalog is broken until I run a bulk reindex from the Advanced tab of the Catalog, which takes another 2 hours.

I don't think BTreeFolder is a problem, but I would suspect that reindexing a Catalog with 3 million documents with full-text search setups would take you over 10-15 hours on a fast computer, longer if complex filtering of document formats is involved.

Sean

-----Original Message-----
From: Thomas Guettler [mailto:zopestoller@thomas-guettler.de]
Sent: Friday, January 25, 2002 6:56 AM
To: zope@zope.org
Subject: [Zope] Folder with one million Documents?

Hi!

I am developing a simple DMS. Up to now I have used a Python product with a BTreeFolder which contains all the documents. Every document gets an ID from DateTime().millis(). There will be up to 50 users working at the same time, and in the end I will have up to 3 million documents.

Is there a better class than BTreeFolder for such mass storage?
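A note on the ID scheme described above: with up to 50 users creating documents at once, two requests can land in the same millisecond, so an ID taken from DateTime().millis() alone may collide. The following is only a minimal sketch of one way to guard against that. It assumes a plain dict as a stand-in for the folder, uses time.time() in place of DateTime().millis(), and the helper name new_document_id() is hypothetical (it is not part of Zope or BTreeFolder).

```python
import time


def new_document_id(existing_ids, now_millis=None):
    """Return a millisecond-timestamp ID that is not yet in existing_ids.

    Hypothetical helper: existing_ids is any container supporting `in`
    (here a dict standing in for the folder). now_millis can be passed
    in explicitly to make the behaviour reproducible.
    """
    if now_millis is None:
        now_millis = int(time.time() * 1000)
    candidate = now_millis
    # Bump the candidate until it is unused, so two documents created
    # in the same millisecond still get distinct IDs.
    while str(candidate) in existing_ids:
        candidate += 1
    return str(candidate)


# Simulate three documents created within the same millisecond.
folder = {}
for title in ("first", "second", "third"):
    doc_id = new_document_id(folder, now_millis=1011938160000)
    folder[doc_id] = title

print(sorted(folder))
```

This sketch does not model ZODB transactions. In a real Zope application, concurrent transactions do not see each other's uncommitted objects, so a check like this would only cover documents that are already committed.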
For the curious, here is one result of the benchmarks. I benchmarked it with httperf:

+5 requests per second
+10MBit connection between client and server
+every request creates a document

Number of documents: 2159

httperf.exe: warning: open file limit > FD_SETSIZE; limiting max. # of open files to FD_SETSIZE
httperf.exe --timeout=5 --client=0/1 --server=prophet --port=8080 --uri=/a/benchmarks/create_new_doc --rate=5 --send-buffer=4096 --recv-buffer=16384 --add-header='Authorization: Basic em9wZTp6b3Bl\n' --num-conns=1000 --num-calls=1
Maximum connect burst length: 1

Total: connections 1000 requests 1000 replies 571 test-duration 204.814 s

Connection rate: 4.9 conn/s (204.8 ms/conn, <=26 concurrent connections)
Connection time [ms]: min 110.0 avg 977.2 max 5257.0 median 289.5 stddev 1315.9
Connection time [ms]: connect 0.5
Connection length [replies/conn]: 1.000

Request rate: 4.9 req/s (204.8 ms/req)
Request size [B]: 120.0

Reply rate [replies/s]: min 0.0 avg 2.9 max 5.2 stddev 2.2 (40 samples)
Reply time [ms]: response 976.4 transfer 0.0
Reply size [B]: header 216.0 content 79.0 footer 0.0 (total 295.0)
Reply status: 1xx=0 2xx=562 3xx=0 4xx=0 5xx=9

CPU time [s]: user 83.96 system 117.15 (user 41.0% system 57.2% total 98.2%)
Net I/O: 1.4 KB/s (0.0*10^6 bps)

Errors: total 429 client-timo 429 socket-timo 0 connrefused 0 connreset 0
Errors: fd-unavail 0 addrunavail 0 ftab-full 0 other 0

Number of documents: 3075
--end: Fri Jan 25 14:25:34 2002

_______________________________________________
Zope maillist - Zope@zope.org
http://lists.zope.org/mailman/listinfo/zope
** No cross posts or HTML encoding! **
(Related lists -
http://lists.zope.org/mailman/listinfo/zope-announce
http://lists.zope.org/mailman/listinfo/zope-dev )
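The headline figures in the httperf report are easier to read as ratios. The short calculation below uses only numbers copied from the report above; the variable names are illustrative, and it assumes a modern Python 3 interpreter. It shows that about 57.1% of the 1000 requests were answered at all, about 56.2% succeeded with a 2xx status, and the rest hit the 5-second client timeout.

```python
# Figures copied from the httperf report above.
requests = 1000
replies = 571
status_2xx = 562
status_5xx = 9
client_timeouts = 429
duration_s = 204.814

# Sanity checks: the report's own totals are consistent with each other.
assert status_2xx + status_5xx == replies
assert replies + client_timeouts == requests

reply_ratio = replies / requests          # share of requests answered
success_ratio = status_2xx / requests     # share answered with 2xx
reply_throughput = replies / duration_s   # replies per second, whole run

print(f"answered: {reply_ratio:.1%}")
print(f"successful (2xx): {success_ratio:.1%}")
print(f"replies per second over the whole run: {reply_throughput:.2f}")
```

The whole-run reply throughput computed here is slightly lower than the "avg 2.9" in the report, which httperf derives from its periodic samples rather than from the total duration.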
On Fri, Jan 25, 2002 at 01:53:15PM -0800, sean.upton@uniontrib.com wrote:
> This will be taxing on Zope, so you need to be patient enough to optimize your application a bit. BTreeFolder works well for this, provided you are willing to bypass the ObjectManager APIs and read/write to BTreeFolder._tree directly, or use BTreeFolder._setOb() and BTreeFolder._getOb() instead of ObjectManager._getObject()...
Thank you for this information. I will try it on Monday.
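To illustrate why going straight to the tree can matter at this scale, here is a rough, self-contained sketch. It does not use Zope at all: ListingFolder and TreeFolder are hypothetical stand-in classes, and a dict stands in for the BTree. The sketch assumes (my understanding of ObjectManager, not something stated in this thread) that the regular folder API maintains a per-folder listing that is rewritten on every add, whereas the _setOb()/_getOb() path only touches the mapping.

```python
import time


class ListingFolder:
    """Stand-in for a folder that keeps a tuple of id records.

    The tuple is rebuilt on every add, so each add costs time
    proportional to the number of objects already stored.
    """

    def __init__(self):
        self._objects = ()
        self._store = {}

    def add(self, id, ob):
        self._objects = self._objects + ({"id": id},)  # full copy per add
        self._store[id] = ob


class TreeFolder:
    """Stand-in for a BTreeFolder-like container: one mapping, no listing."""

    def __init__(self):
        self._tree = {}  # a real BTreeFolder would hold a BTree here

    def _setOb(self, id, ob):
        self._tree[id] = ob

    def _getOb(self, id, default=None):
        return self._tree.get(id, default)


def timed_fill(add, count):
    """Add `count` small documents through `add` and return elapsed seconds."""
    start = time.perf_counter()
    for i in range(count):
        add("doc%d" % i, {"title": "Document %d" % i})
    return time.perf_counter() - start


count = 5000
listing = ListingFolder()
tree = TreeFolder()
print("tuple-backed listing: %.3f s" % timed_fill(listing.add, count))
print("mapping only:         %.3f s" % timed_fill(tree._setOb, count))
print(tree._getOb("doc42"))
```

The absolute timings depend on the machine; the point is only that the per-add cost of the listing approach grows with the folder size, while the mapping approach does not have that extra bookkeeping.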
> You also will REALLY need some nice hardware. I would suggest the fastest box you can get with LOTS of RAM. I would look at something along the lines of a Dual Athlon 2000+ (P-rated, not MHz) box with 3-4 GB RAM, and a striped RAID volume of fast disks.
OK
> I have a BTreeFolder-derived folder and have populated it with about a third-of-a-million Cataloged objects, with each object using an underlying relational datastore, and about 8 Cataloged indexes, mostly field indexes that index the result of a relational query;
Do you use the relational datastore for performance, or because the RDBMS was there before you decided to use Zope? Development with a Python product is very fast; I would prefer it if it worked without an RDBMS.
> I don't think BTreeFolder is a problem, but I would suspect that reindexing a Catalog with 3 million documents with full-text search setups would take you over 10-15 hours on a fast computer, longer if complex filtering of document formats is involved.
The documents are Python classes derived from Folder. Only a few of them really contain files. Most of them just use PropertyManager.

Is it possible to do the cataloging on a different machine? This would reduce the load on the primary server. I don't think ZEO is a solution because there is a lot of write access.

Thank you for your answers. I think I will write a small HOWTO once I get it working.

thomas

--
Thomas Guettler <guettli@thomas-guettler.de>
http://www.thomas-guettler.de
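As a rough cross-check of the reindexing estimate quoted in this thread, Sean's own figures (about a third of a million objects, about 2 hours for a bulk reindex on a P4 1.4GHz) can be scaled up to 3 million documents. The calculation below assumes linear scaling, which is only an assumption: the measured Catalog used mostly field indexes, and full-text indexing or different hardware would change the result.

```python
# Figures from Sean's message; linear scaling is an assumption.
indexed_objects = 1_000_000 / 3   # "about a third-of-a-million"
reindex_hours = 2.0               # bulk reindex time he reported
target_documents = 3_000_000      # Thomas's expected final size

# Throughput implied by the reported reindex.
objects_per_second = indexed_objects / (reindex_hours * 3600)

# Time for the target size if throughput stayed the same.
estimated_hours = target_documents / indexed_objects * reindex_hours

print(f"implied throughput: {objects_per_second:.0f} objects/s")
print(f"linear estimate for 3 million documents: {estimated_hours:.0f} hours")
```

Under that assumption the estimate comes out at about 18 hours, which is in line with the "over 10-15 hours" guess above.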