[ZODB-Dev] [BTrees] considerable optimization potential
in"intersection"
Tim Peters
tim at zope.com
Thu Aug 5 00:47:16 EDT 2004
[Andreas Jung]
> The vastly different sizes is the key point for this optimization. In
> this case I have the knowledge on the application level that the first
> BTree will never be larger than several thousand items compared to
> several hundred thousand items in the seconds BTree. Making this
> optimization decision on the BTree level might not work since you can not
> determine the size of a BTree without reading all keys which is totally
> inefficient
That's correct. A BTree has no idea how big it is, and no way to find out
for sure short of loading every bucket in the tree. You can get a (possibly
very) poor guess by scanning some small number of parent nodes "top down",
and making lots of dubious "average case" assumptions.
The algorithms in the paper I referenced generally use strategies that
really don't *need* to know the sizes in advance, although the paper doesn't
consider that case (I did).
> (btw. do btrees not implement __len__ through a counter?).
I'm afraid I'd have to speak German to parse that question <wink>.
__len__ is implemented by loading every bucket in the tree, and adding
together the size of each bucket.
A global "number of items in the whole tree" member would need to be updated
on every delete and insert, and conflicts would be rampant as a result. The
way things are, multiple transactions mucking with distinct buckets
generally don't interfere with each other at all.
You're aware of Length.py, right? That supplies a persistent counter class
with appropriate conflict resolution (never fails). You have to keep that
in synch with a BTree yourself, telling the Length object you associate with
a given BTree how many items you're adding to, or deleting from, the tree.
That's one way to get the length cheaply, but at some effort, and in a way
that's easy to get out of synch with the true current size.
With enough effort, I think that could get folded into the BTree
implementation. Nobody up my mgmt chain has suggested that's a priority,
though <wink>.
More information about the ZODB-Dev
mailing list