[ZODB-Dev] Re: Advice on ZODB with large datasets

Fri Jun 20 11:17:02 EDT 2008

Laurence Rowe wrote:
> It's helpful to post your responses to the mailing list, that way when
> someone else has a similar problem in the future they'll be able to
> find the information.
> 
> Inheriting from Persistent is also necessary to control the
> granularity of the database. Persistent objects are saved as separate
> `records` by ZODB. Other objects do not have a _p_oid attribute and
> have to be saved as part of their parent record.

I made the changes yesterday and there was a huge benefit.  Originally, 
every entry was a plain Python dictionary stored as a value in an 
IOBTree.  The only change I made was from

scores[article['key']] = article

to

scores[article['key']] = PersistentMapping(article)

(where scores is the IOBTree).

My cache size is 1000 items, and after every 10000 entries I commit the 
transaction, clear the caches, and garbage collect.  At the end I pack 
the database to drop the history.

I'm dealing with a 20GB XML file with 670000+ entries.  The original 
version took about 2 1/4 days to run.  The new version takes about 
6 1/2 hours.  The dict version behaves as O(N^2) (or worse), while the 
PersistentMapping version is a steady O(N).  The dict version is 
slightly faster below 100,000 items, but only by about 10 minutes.

The RAM usage for the dictionary version slowly increased to about 18 
GB, while the PersistentMapping version stayed nearly constant, slowly 
increasing from 646 MB at 10000 records to 803 MB.  (These numbers 
include the Python interpreter and everything else in the process.)

The final, packed versions are roughly the same size (4.24 GB for the 
dict version, 4.29 GB for the PersistentMapping version).  A greater 
gain is seen in the history: the pre-packing size is 91 GB for the dict 
version, versus 4.6 GB for the PersistentMapping version.

Most importantly, I can open up the database and do simple things like 
get the number of entries and all the ids much quicker and with little 
memory usage.

Thanks for the help.

Now, my next step is to figure out how to best index this, for which I 
plan to use zc.catalog.  Its SetIndex seems to be best for my situation.