abel deuring wrote:
A text index (class SearchIndex.UnTextIndex) is definitely a cause of bloating, if you use CatalogAware objects. An UnTextIndex maintains for
Right... if you don't use CatalogAware, however, and don't unindex before reindexing an object, you should see huge bloat savings, because the only things that are supposed to be updated then are the indexes and metadata whose data has changed.
each word a list of the documents in which that word appears. So if a document to be indexed contains, say, 100 words, 100 IIBTrees (containing mappings documentId -> word score) will be updated (see UnTextIndex.insertForwardIndexEntry). If you have a larger number of documents, these mappings may be quite large: assume 10,000 documents, and assume that you have 10 words which appear in 30% of all documents. Each of the IIBTrees for these words then contains 3,000 entries. (Ok, one can try to keep the number of frequent words low by using a "good" stop word list, but at least for German such a list is quite difficult to build. And one can argue that many "not too frequent" words should be indexed anyway, in order to allow more precise phrase searches.) I don't know the details of how data is stored inside the BTrees, so I can give only a rough estimate of the memory requirements: with 32-bit integers, we have at least 8 bytes per IIBTree entry (documentId and score), so each of the 10 BTrees for the "frequent words" has a minimum size of 3000*8 = 24,000 bytes.
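The arithmetic above can be written out explicitly. This is just the back-of-envelope calculation from the paragraph, with the same assumed numbers (10,000 documents, 30% frequency, 8 bytes per entry), not a measurement of actual ZODB storage:

```python
# Back-of-envelope estimate of the size of one "frequent word" IIBTree.
# All numbers are the assumptions from the discussion, not measurements.
ENTRY_BYTES = 8            # 32-bit documentId + 32-bit score per entry
documents = 10000
frequent_word_ratio = 0.30  # the word appears in 30% of all documents

entries_per_word = int(documents * frequent_word_ratio)  # 3000 entries
bytes_per_word = entries_per_word * ENTRY_BYTES          # 24000 bytes

print(entries_per_word, bytes_per_word)  # 3000 24000
```

Real BTrees add per-bucket and pickling overhead on top of this, so 24,000 bytes is a lower bound.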
If you now add a new document containing 5 of these frequent words, 5 larger BTrees will be updated. [Chris, let me know if I'm now going to tell nonsense...] I assume that the entire updated BTrees (120,000 bytes) will be appended to the ZODB (ignoring the less frequent words), even if the document contains only 1 kB of text.
Nah... I don't think so. At least I hope not! Each bucket in a BTree is a separate persistent object, so only the sum of the data in the updated buckets will be appended to the ZODB. So if you add an item to a BTree, you don't add 24,000+ bytes for each update. You just add the amount of space taken up by the bucket... unfortunately I don't know exactly how much this is, but I'd imagine it's pretty close to the data size with only a little overhead.
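The difference can be illustrated with a toy model. This is not the real BTrees code; it just models a mapping split into fixed-capacity buckets that are each persisted separately, with an assumed bucket capacity, to show why an insert rewrites far less than the whole tree:

```python
# Toy model of bucket-level persistence (not actual BTrees internals).
# A tree's entries live in fixed-size buckets; committing an insert
# rewrites only the bucket the new entry lands in.
BUCKET_CAPACITY = 60   # entries per bucket -- an assumption for illustration
ENTRY_BYTES = 8        # documentId + score, as estimated above

def bytes_written_per_insert(total_entries):
    """Bytes rewritten when one entry is added to a tree of this size."""
    # Entries already sitting in the bucket that receives the new entry:
    in_bucket = total_entries % BUCKET_CAPACITY
    return (in_bucket + 1) * ENTRY_BYTES

whole_tree = 3000 * ENTRY_BYTES             # 24000 bytes if the full tree were rewritten
one_bucket = bytes_written_per_insert(3000)  # at most one bucket's worth

print(whole_tree, one_bucket)
```

In this model an insert never writes more than one bucket (here at most 480 bytes), versus 24,000 bytes for rewriting the whole mapping, which is the point of storing buckets as separate persistent objects.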
This is the reason why I'm working on some kind of "lazy cataloging". My approach is to use a Python class (or base class, if ZClasses are involved) which has a method manage_afterAdd. This method looks for superValues of a type like "lazyCatalog" (derived from ZCatalog), and inserts self.getPhysicalPath() into the update list of each found "lazyCatalog".
Later, a "lazyCatalog" can index all objects in this list. Then the bloating happens either in RAM (without subtransactions), or in a temporary file, if you use subtransactions.
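The lazy cataloging idea might be sketched like this. The method names manage_afterAdd, superValues, and getPhysicalPath come from the discussion above; the classes here are standalone stand-ins, not actual Zope code, so the real implementation would differ:

```python
# Sketch of "lazy cataloging": objects queue themselves for indexing on
# add, and the catalog indexes the queued paths later, in one batch.

class LazyCatalog:
    """Stand-in for a ZCatalog subclass with an update list."""
    meta_type = 'lazyCatalog'

    def __init__(self):
        self.update_list = []   # paths of objects awaiting indexing
        self.indexed = []

    def queue(self, path):
        if path not in self.update_list:
            self.update_list.append(path)

    def index_pending(self):
        # In Zope, this is where the expensive BTree updates would
        # happen -- batched, instead of once per object addition.
        while self.update_list:
            self.indexed.append(self.update_list.pop(0))

class LazilyCataloged:
    """Stand-in for the base class with manage_afterAdd."""
    def __init__(self, path, catalogs):
        self._path = path
        self._catalogs = catalogs  # stand-in for superValues('lazyCatalog')

    def getPhysicalPath(self):
        return self._path

    def manage_afterAdd(self):
        # Instead of indexing immediately, just record our path.
        for catalog in self._catalogs:
            catalog.queue(self.getPhysicalPath())

catalog = LazyCatalog()
doc = LazilyCataloged(('', 'docs', 'doc1'), [catalog])
doc.manage_afterAdd()       # cheap: appends one path
catalog.index_pending()     # expensive work deferred to here
print(catalog.indexed)      # [('', 'docs', 'doc1')]
```

The point of the design is that adding an object only touches the small update list; the large index BTrees are only written when index_pending runs, so the write bloat is concentrated in one deferred step.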
OK, another approach which better fits your (Giovanni's) needs might be to use a database other than the ZODB, but I'm afraid that even then "instant indexing" will be an expensive process if you have a large number of documents.
Another option is to use a session manager, and update the catalog at session-end. - C