[Yair Benita]
Reading this answer, I understand that anything I store should be persistent, even if it's a list I don't plan to edit.
[Tim Peters]
I wouldn't say that. For example, for _most_ applications it would be foolish to create a subclass of Persistent to store an integer, as opposed to just storing an integer directly. I can conceive of (unlikely!) applications where there may be advantages to storing integers as persistent objects, though.
[Tres Seaver]
As, for instance, where the integer changes much more frequently than the other attributes, which are large enough that re-storing them just because the integer attribute changed is painful.
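A minimal sketch of that pattern (Counter, Document, and record_hit here are made up for illustration, not part of ZODB):

    import persistent

    class Counter(persistent.Persistent):
        """Tiny persistent sub-object holding just the hot integer."""
        def __init__(self):
            self.value = 0

    class Document(persistent.Persistent):
        def __init__(self, body):
            self.body = body        # large, rarely changes
            self.hits = Counter()   # small, changes constantly

    def record_hit(doc):
        # Setting an attribute on the Counter marks only the Counter's
        # own database record as changed; the Document, with its large
        # body, is not re-pickled at commit time.
        doc.hits.value += 1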
Yup, that's a possible reason. Another recently popped up, which I'll exaggerate to make the point: you have 100,000 distinct integer ids, and 10,000 objects each with a (Python) list containing 10,000 of those ids. If you load those all into memory, Python will allocate space for 10,000 * 10,000 = 100 million integer objects, and that will consume more than a gigabyte of RAM. But if integers are stored as one unique persistent object per unique integer, it can't require more than 100,000 distinct persistent integers in memory (because that's the total number of distinct integer ids). The RAM difference is a factor of about 1000 (ignoring that it takes more RAM to hold a persistent wrapper than to hold a straight integer).

I'll note that IISets avoid this problem via a different route: they hold their integers as raw bits, not as Python integer objects. When you extract an element from an IISet, a Python integer object is created on the fly to wrap the bits.
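A rough sketch of the sharing idea (PInt, pint_for, and the registry are hypothetical, not anything ZODB ships):

    import persistent

    class PInt(persistent.Persistent):
        """Hypothetical persistent wrapper around one integer id."""
        def __init__(self, value):
            self.value = value

    # At most one PInt per distinct id.  In a real application the
    # registry would itself live in the database (say, an IOBTree
    # mapping id -> PInt) so the sharing survives across connections.
    _registry = {}

    def pint_for(i):
        obj = _registry.get(i)
        if obj is None:
            obj = _registry[i] = PInt(i)
        return obj

    # 10,000 lists of 10,000 ids then hold 100 million references,
    # but at most 100,000 distinct PInt objects ever exist in RAM.

And the IISet route, for contrast (IISet is real; the values are arbitrary):

    from BTrees.IIBTree import IISet

    ids = IISet()
    ids.insert(42)        # kept as raw C ints inside the set's buckets
    first = ids.minKey()  # a Python int object is created on the fly here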
Making the attribute a persistent sub-object also eliminates the chance of a ConflictError based on changes to the other attributes.
I didn't follow that one. If other attributes change, they can trigger conflict errors, right?
This is the use case which drives BTrees.Length, right?
The important part of that is its conflict resolution method, which keeps track of the correct final size of a BTree in the face of concurrent mutations. BTrees don't keep track of their own size because every addition or deletion would have to percolate the change in size back up to the root of the BTree, and we'd get conflict errors on the root object then. As is, most additions and deletions change only the leaf Bucket node where the mutation takes place, giving mutation often-useful spatial locality in the face of concurrent mutations.

I wish we could do better than that, though: from what I see, most people don't realize that len(some_BTree) takes time linear in the number of elements, and sucks the entire BTree into RAM. The rest seem to have trouble, at least at first, using BTrees.Length correctly. I suppose that's what you get when a scheme is driven by pragmatic implementation compromises instead of by semantic necessity.

Given enough pain, it should be possible to hide the BTrees.Length strategy under the covers, although I'm not sure the increase in storage size could be justified to users who have mastered the details of doing it manually (the problem being that many uses for BTrees never care to ask for the size, so wouldn't want to pay the extra overhead of keeping track of size efficiently).
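The usual pattern looks something like this (add and remove are made-up helper names; BTrees.Length itself is real):

    from BTrees.OOBTree import OOBTree
    from BTrees.Length import Length

    tree = OOBTree()
    size = Length()   # starts at 0; its conflict resolution sums deltas

    def add(key, value):
        # insert() returns 1 only if the key was actually added
        if tree.insert(key, value):
            size.change(1)

    def remove(key):
        del tree[key]
        size.change(-1)

    # size() is O(1) and loads one small object; len(tree) would
    # walk, and load, every bucket in the tree.
    print(size())

Two transactions that concurrently call size.change(1) don't raise a ConflictError, because Length's conflict resolution adds the deltas together. The part people tend to get wrong is that every code path mutating the tree has to remember to update the Length alongside it.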