[ZODB-Dev] Re: self.length._p_deactivate() and MVCC

Tim Peters tim at zope.com
Sun May 2 22:48:37 EDT 2004


[Casey Duncan]
> ...
> Wids are assigned in ascending order to allow the document word
> lists to be compressed better, I think it assumes popular words
> will tend to get lower wids.

It doesn't assume that, but it does rely on it <wink>.  Really, it's exactly
the cheap hack it appears to be.  If, say, a word occurs in 90% of all
documents, uniformly distributed, then there's a 90% chance that it appears
in the first document to get indexed, and then it's likely to get a small
"id".  Alas, it's typical that words appearing *only* in the first document
to be indexed also get small ids.
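
To make that concrete, here is a minimal sketch of first-seen-order id
assignment.  The function name `assign_wids` and the toy documents are
made up for illustration; this is not the actual lexicon code, just the
same idea in a few lines.

```python
def assign_wids(documents):
    """Assign ascending integer ids to words in first-seen order.

    Hypothetical helper for illustration only; not the real lexicon API.
    """
    wids = {}
    for doc in documents:
        for word in doc.split():
            if word not in wids:
                # Next unused id; ids start at 1 and never have gaps.
                wids[word] = len(wids) + 1
    return wids


docs = [
    "the zyzzyva ate the quokka",   # first document indexed
    "the cat sat",
    "the dog ran",
]
wids = assign_wids(docs)
print(wids)
```

The common word "the" gets wid 1, as hoped, but the one-off words
"zyzzyva" and "quokka" also land on small wids just because they were in
the first document, while "cat" and "dog" get larger ones.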

Replacing words with integer ids is effective at compression overall, and
also speeds searches, but the finer distinctions among wids of different
sizes may be more trouble than they're actually worth.  It would be a little
smarter to reserve the smallest ids for words that appear more than once in
a document; but then the wid assignments would "have gaps" too, and that
would complicate other logic.
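
As a rough illustration of why wid size matters to compression at all,
here is a sketch that assumes a generic 7-bits-per-byte variable-length
integer encoding.  That encoding is an assumption chosen for the
example, not necessarily the one the index uses, and `encode_varint` and
`encoded_size` are hypothetical names.

```python
def encode_varint(n):
    """Encode a non-negative int with 7 payload bits per byte.

    Assumed, generic scheme for illustration: the high bit of each byte
    is set when more bytes follow.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # more bytes follow
        else:
            out.append(low)         # final byte
            return bytes(out)


def encoded_size(wid_list):
    """Total bytes needed to store a document's word list as varints."""
    return sum(len(encode_varint(w)) for w in wid_list)


print(encoded_size([1, 2, 3]))              # three small wids
print(encoded_size([20000, 20001, 20002]))  # three large wids
```

Under this assumed encoding a wid below 128 costs one byte and a wid of
20000 costs three, so a document made mostly of low-wid words stores its
word list in about a third of the space.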