[ZODB-Dev] Re: Advice on ZODB with large datasets
AFoglia at princeton.com
Fri Jun 20 11:17:02 EDT 2008
Laurence Rowe wrote:
> It's helpful to post your responses to the mailing list, that way when
> someone else has a similar problem in the future they'll be able to
> find the information.
>
> Inheriting from Persistent is also necessary to control the
> granularity of the database. Persistent objects are saved as separate
> `records` by ZODB. Other objects do not have a _p_oid attribute and
> have to be saved as part of their parent record.
I made the changes yesterday and there was a huge benefit. Originally,
all entries were plain Python dictionaries stored as values of an
IOBTree. The only change I made was from
scores[article['key']] = article
to
scores[article['key']] = PersistentMapping(article)
(where scores is the IOBTree).
My cache size is 1000 items, and after every 10000 entries I commit the
transaction, clear the caches, and garbage collect. At the end I pack
the database to drop the history.
I'm dealing with a 20GB XML file with 670000+ entries. The original
version took about 2 1/4 days to run; the new version takes about 6 1/2
hours. The dict version behaves as O(N^2) (or worse), while the
PersistentMapping version is a steady O(N). The dict version is slightly
faster for fewer than 100,000 items, but only by about 10 minutes or so.
The RAM usage for the dictionary version slowly increased to about 18
GB, while the PersistentMapping version stayed nearly constant, slowly
increasing from 646 MB at 10000 records to 803 MB. (These numbers
include the Python interpreter and everything else in the process.)
The final, packed versions are roughly the same size (4.24 GB for the
dict version, 4.29 GB for the PersistentMapping). The bigger difference
is in the history: before packing, the file is 91 GB for the dict
version, versus 4.6 GB for the PersistentMapping version.
Most importantly, I can open up the database and do simple things like
get the number of entries and all the ids much more quickly and with
little memory usage.
Thanks for the help.
Now, my next step is to figure out how to best index this, for which I
plan to use zc.catalog. Its SetIndex seems to be best for my situation.