[ZODB-Dev] Advice on ZODB with large datasets
AFoglia at princeton.com
Wed Jun 18 12:03:10 EDT 2008
We have a large dataset of 650,000+ records that I'd like to examine
easily in Python. I have figured out how to put this into a ZODB file
that totals 4 GB in size. But I'm new to ZODB and very large databases,
and have a few questions.
1. The data is in an IOBTree, so I can access each item once I know the
key, but to get the list of keys I tried:
    scores = root['scores']
    ids = [id for id in scores.iterkeys()]
This seems to require the entire tree to be loaded into memory, which
takes more RAM than I have.
If I instead avoid the list comprehension and use an actual loop, I can
explicitly call cacheMinimize every n records, and keep the memory
reasonable.
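A minimal sketch of the loop-based approach described above. The function name `collect_keys` and the `every` parameter are hypothetical, chosen for illustration; `conn` is assumed to be the open ZODB connection and `tree` the IOBTree (any object with a `keys()` method works here).

```python
def collect_keys(tree, conn, every=10000):
    """Return all keys of `tree` as a list, shrinking the cache as we go.

    `tree` is assumed to be a BTree-like mapping and `conn` a ZODB
    connection; `every` (hypothetical knob) is how many keys to read
    between calls to conn.cacheMinimize().
    """
    ids = []
    # enumerate from 1 so the first cache minimize happens after
    # `every` keys rather than after the very first one
    for n, key in enumerate(tree.keys(), 1):
        ids.append(key)
        if n % every == 0:
            # Ask the connection to release unmodified objects it has
            # loaded so far, keeping memory use bounded.
            conn.cacheMinimize()
    return ids
```

With the names from this post, the call would look like `ids = collect_keys(root['scores'], conn)`, where `conn` is whatever connection `root` was obtained from.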
So, how and when does the cache normally get minimized? Should I just
avoid list comprehensions and explicitly clean the cache the way I'm
doing, or are there any tricks to minimize the RAM usage?
2. Obviously I should save my list of keys in the database. I'd also
like to have other indexes. It appears the usual technique is to use
ZCatalog <http://www.blazingthings.com/dev/zcatalog.html>. Am I
correct? Is there any good documentation on how to use that with ZODB?
(All the examples I can find were either on using the catalog from
within Zope, or on using the catalog in a purely standalone manner.) Are
there any concerns I should be aware of for using it with large datasets?
3. Are there any guides on how to tune my ZODB usage? I had to dig
around for a while to realize I should be using BTrees and the
cacheMinimize method. Are there any other knobs I should know about?
So far, I've simply read the data from an XML file and converted it.
I've set the cache size to 1000, and every 10000 entries, I commit the
transaction, and minimize the caches. The conversion takes about 60
hours to run and uses roughly half my memory, which is acceptable, but
if I can tune it to be faster at the cost of slightly more memory, I'd
be happier. (The performance is roughly O(N^2), although halfway
through it's closer to O(N^2.7).)
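A minimal sketch of the batching pattern just described (commit and minimize the caches every 10000 entries), under some assumptions: `load_records`, `records`, and `batch` are hypothetical names, `records` is assumed to yield `(key, value)` pairs already parsed from the XML, and `commit` is whatever callable commits the current transaction (for example `transaction.commit`).

```python
def load_records(records, tree, conn, commit, batch=10000):
    """Store (key, value) pairs in `tree`, committing in batches.

    After every `batch` entries the transaction is committed and the
    connection cache is minimized. Returns the number of entries stored.
    """
    count = 0
    for count, (key, value) in enumerate(records, 1):
        tree[key] = value
        if count % batch == 0:
            commit()               # write the batch to storage
            conn.cacheMinimize()   # then release the loaded objects
    # Final commit so a trailing partial batch is not lost.
    commit()
    return count
```

The cache is minimized only after the commit because modified objects cannot be released from memory until they have been written.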
Thanks in advance.