For huge inserts like that, have you looked at more modern alternatives such as Tokyo Cabinet or MongoDB? I heard about an experiment to transfer 20 million text blobs into a Tokyo Cabinet database. The first 10 million inserts were super fast, but after that it started to take up to a second to insert each item. I'm not familiar with how good they are, but I know they both have indexing, and I'm confident they both have good Python APIs. Or watch Bob Ippolito's PyCon 2009 talk, "Drop ACID".
2009/4/27 Hedley Roos <hedleyroos@gmail.com>:
I've followed this thread with interest since I have a Zope site with tens of millions of entries in BTrees. It scales well, but it requires many tricks to make it work.
Roche Compaan wrote these great pieces on ZODB, Data.fs size and scalability at http://www.upfrontsystems.co.za/Members/roche/where-im-calling-from/catalog-... and http://www.upfrontsystems.co.za/Members/roche/where-im-calling-from/fat-does... .
My own in-house product is similar to Google Analytics. I have to use a cascading BTree structure (a BTree of BTrees of BTrees) to handle the volume. This is because BTrees do slow down the more items they contain. That is not a ZODB limitation or flaw - it is just how they work.
My structure allows for fast inserts, but it also allows aggregation of data. So if my lowest level of BTrees stores hits for a particular hour, then the containing BTree always knows exactly how many hits were made in a day. I update all parent BTrees as soon as an item is inserted. The cost of this operation is O(1) for every parent. These are all details, but every single one influenced my design.
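To illustrate, here is a minimal sketch of that cascading-counter idea. All names are hypothetical, and plain dicts stand in for BTrees.OOBTree objects so the example runs on its own; in a real ZODB application you would use persistent BTrees (and Length objects for conflict-free counters) instead:

```python
# Hypothetical cascading hit counter: a tree of nodes where each parent
# keeps a running total that is bumped in O(1) on every insert.
# Plain dicts stand in for OOBTrees here to keep the sketch self-contained.

class Node:
    """One level of the cascade: children keyed by time unit, plus a total."""
    def __init__(self):
        self.total = 0      # aggregate count, kept current on every insert
        self.children = {}  # stand-in for an OOBTree

def record_hit(root, day, hour):
    """Insert one hit; every ancestor's total is updated in O(1)."""
    day_node = root.children.setdefault(day, Node())
    hour_node = day_node.children.setdefault(hour, Node())
    hour_node.total += 1
    day_node.total += 1
    root.total += 1

root = Node()
record_hit(root, "2009-04-27", 13)
record_hit(root, "2009-04-27", 14)
# The day node already knows the daily total without scanning its hours:
print(root.children["2009-04-27"].total)  # 2
```

The point of the design is that aggregates are never computed by walking the lowest-level trees; they are maintained incrementally at insert time.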
What is important is that you cannot just use the ZCatalog to index tens of millions of items, since every index is a single BTree and will thus slow down the larger it gets. So you must roll your own to fit your problem domain.
Data warehousing is probably a good idea as well.
My problem domain allows me to defer inserts, so I have a queuerunner that commits larger transactions in batches. This is better than lots of small writes. This may of course not fit your model.
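A queue runner of that sort can be sketched in a few lines. The class and parameter names below are hypothetical; the commit callable stands in for a real transaction.commit() in ZODB terms:

```python
# Hypothetical queue runner: buffer deferred inserts and commit them in
# batches, trading a little latency for far fewer, larger transactions.

class QueueRunner:
    def __init__(self, batch_size=1000, commit=None):
        self.batch_size = batch_size
        self.queue = []
        # In ZODB this callback would write the batch and call
        # transaction.commit(); here it is a no-op stand-in.
        self.commit = commit or (lambda batch: None)
        self.commits = 0

    def insert(self, item):
        self.queue.append(item)
        if len(self.queue) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write out whatever is queued as one transaction."""
        if self.queue:
            self.commit(self.queue)
            self.commits += 1
            self.queue = []

runner = QueueRunner(batch_size=100)
for i in range(250):
    runner.insert(i)
runner.flush()  # drain the remainder
print(runner.commits)  # 3 commits instead of 250 small writes
```

Batching also reduces ConflictError retries, since fewer concurrent transactions touch the same BTree buckets.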
Familiarize yourself with TreeSets and set operations in Python (union etc.) since those tools form the backbone of cataloguing.
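For example, a catalogue query is essentially set algebra over per-term result sets. Built-in Python sets stand in for TreeSets in this sketch (the index names and contents are made up); in ZODB you would use e.g. BTrees.IIBTree.IITreeSet together with that module's union() and intersection() functions, but the logic is identical:

```python
# Two hypothetical indexes mapping a term to the set of matching document ids.
# Built-in sets stand in for IITreeSets to keep the sketch self-contained.
index_author = {"hedley": {1, 2, 5}, "roche": {3, 5}}
index_year = {2009: {2, 3, 5, 8}}

# Query: documents by "hedley" from 2009 -> intersect the two result sets.
result = index_author["hedley"] & index_year[2009]
print(sorted(result))  # [2, 5]

# Query: documents by either author -> union of their result sets.
either = index_author["hedley"] | index_author["roche"]
print(sorted(either))  # [1, 2, 3, 5]
```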
Hedley _______________________________________________ Zope maillist - Zope@zope.org http://mail.zope.org/mailman/listinfo/zope ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope-dev )
-- Peter Bengtsson, work www.fry-it.com home www.peterbe.com hobby www.issuetrackerproduct.com