Giovanni Maruzzelli wrote:
Hello Zopistas,
we are developing a Zope 2.3.3 (Python 1.5.2) application that will add, index, and reindex some tens of thousands of objects (ZClasses that are DTMLDocument on steroids) on some twenty properties each day, while the absolute number of cataloged objects keeps growing (think of content management for a big portal, where lots of content is added and modified each day, and all the old content remains as a searchable archive and as material to recycle in the future).
In some respects this seems similar to the task Erik Enge tackled a couple of weeks ago.
We first derived from CatalogAware, then switched to managing the cataloging, uncataloging, and recataloging ourselves.
The ZODB still bloats at far too fast a pace.
***Maybe there's something obvious we missed***, but when you have some 4,000 objects in the catalog, adding and cataloging one more object grows the ZODB by roughly a couple of megabytes, even though the object is some 1 KB of text plus a dozen boolean, datetime, and string properties. If we pack the ZODB, Data.fs returns to an almost normal size, so the bloat lives in the transaction records, as tranalyzer.py confirms.
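For concreteness, here is a minimal sketch of measuring that growth and packing a FileStorage from Python. It uses the modern standalone ZODB API rather than Zope 2.3.3's (which packs from the Control_Panel), and the 'Data.fs' path is an assumption:

    # Sketch: measure Data.fs before and after packing away old
    # transaction records.  Modern ZODB API; assumed file name.
    import os
    from ZODB import FileStorage, DB

    path = 'Data.fs'
    before = os.path.getsize(path)

    db = DB(FileStorage.FileStorage(path))
    db.pack(days=0)        # keep no historical transaction records
    db.close()

    print('%s: %d -> %d bytes' % (path, before, os.path.getsize(path)))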
Any hints on how to manage something like this? We use text indexes, field indexes, and keyword indexes (TextIndex on string properties, FieldIndex on booleans and datetimes, KeywordIndex on strings). Maybe one kind of index should be avoided?
Erik, any thoughts?
We have almost decided to switch to the BerkeleyDB storage (the Minimal one) to get rid of the bloat, and we are testing with it, but it seems to have been discontinued for lack of demand.
Can anyone shed light on it? Is it production grade?
Giovanni, I experienced similar problems trying to catalog ~200,000 objects with ~500 MB of text. Using CatalogAware objects will indeed lead to a "really fat" database, and using "find objects" for a ZCatalog requires considerable resources.

A text index (more precisely, the class UnTextIndex) works, as far as I understood it, this way:

1. The method UnTextIndex.index_object splits the text into single words, using the method [Globbing]Lexicon.Splitter.
2. UnTextIndex.index_object looks up the wordID (an integer) of each word in the lexicon. If a word is not yet listed in the lexicon, it is added to the lexicon.
3. All wordIDs are inserted into self._index, which maps wordIDs to the list of documents containing that word.
4. The "unindex" BTree, which maps the documentIds to the list of all words appearing in a document, is updated.

(A toy model of these structures follows at the end of this message.)

If you are adding only one CatalogAware object during a transaction, this is quite expensive: even if the indexed object contains only one new word, the entire lexicon needs to be updated. In my tests with the 200,000 objects (containing ordinary German texts) the lexicon contained ~1 million words. (BTW, I have not had a very close look into the contents of the lexicon, so I don't yet know exactly why it is so large. But I noticed quite a few entries like "38-jährige", "42-jährige" ("NN-year-old"). So a configurable splitter method might do a lot to reduce the size of the lexicon; see the splitter sketch below.) Hence, the above-mentioned step 2 alone can result in a really bloated database.

A solution might be a kind of "lazy catalog awareness": instead of mangling a new object through one or more catalogs when it is created, the object could be added to a list of objects to be cataloged later (also sketched below). This way, the transaction that inserts a new object would become much "cheaper". I'm working on this, but right now it is quite messy. (I'm new to Python and Zope, and hence I'm stumbling over a few, hmmm, trip-wires...)

But even using such "lazy catalog awareness", you might get into trouble. Using the ZCatalog's "find objects" function, I hit the limits of my Linux box: 640 MB RAM were not enough...

As I see it, the main problem is that UnTextIndex.index_object tries to do all the work at once: updating the lexicon _and_ self._index _and_ self._unindex. So I tried to separate these tasks by writing the data to be stored in self._index (wordId, documentId, score) into a pipe. This pipe is connected to sort(1). After all objects have been "scanned", the pipe is closed, the sorted results are read back, and self._index is updated. This way, Zope needed "only", uuhh, somewhere around 200 or 300 MB RAM.

A few weeks ago I posted this (admittedly not fully cooked) patch to this list, but have not yet received any response.

Abel
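To make the four steps above concrete, here is a toy model of the structures involved, with plain dicts standing in for the BTrees. The names are illustrative sketches of the behaviour Abel describes, not the real UnTextIndex code:

    import re

    class ToyTextIndex:
        def __init__(self):
            self._lexicon = {}    # word -> wordID
            self._index = {}      # wordID -> {documentId: score}
            self._unindex = {}    # documentId -> list of wordIDs

        def index_object(self, docid, text):
            # step 1: split the text into single words
            words = re.findall(r'[a-zA-Z]+', text.lower())
            # step 2: look up each wordID; unknown words grow the lexicon,
            # which is why one new word dirties the whole shared structure
            wids = [self._lexicon.setdefault(w, len(self._lexicon))
                    for w in words]
            # step 3: map each wordID to the documents containing it
            for wid in wids:
                self._index.setdefault(wid, {})[docid] = wids.count(wid)
            # step 4: the "unindex" maps the document back to its words,
            # so the object can be uncataloged later
            self._unindex[docid] = wids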
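On the splitter remark: a splitter that requires tokens to start with a letter would collapse "38-jährige" and "42-jährige" into the single lexicon entry "jährige". A sketch, with an assumed pattern and minimum length, not the stock [Globbing]Lexicon.Splitter:

    import re

    def split(text, min_len=2):
        # tokens must begin with a letter, so the leading "38-" is
        # skipped and only "jährige" survives into the lexicon
        words = re.findall(r'[^\W\d_][\w-]*', text.lower())
        return [w for w in words if len(w) >= min_len]

    print(split('Der 38-jährige und der 42-jährige Mann'))
    # -> ['der', 'jährige', 'und', 'der', 'jährige', 'mann']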
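The "lazy catalog awareness" could start out as simple as this sketch. DeferredCatalog and its attributes are hypothetical; only catalog_object() is the real ZCatalog method:

    class DeferredCatalog:
        def __init__(self, catalog):
            self.catalog = catalog   # the real ZCatalog
            self.pending = []        # paths of objects awaiting indexing

        def defer(self, path):
            # called where CatalogAware would call the catalog directly;
            # appending to a list is far cheaper than updating the lexicon
            self.pending.append(path)

        def flush(self, resolve):
            # 'resolve' maps a path back to its object
            # (e.g. unrestrictedTraverse); one transaction, whole batch
            for path in self.pending:
                self.catalog.catalog_object(resolve(path), path)
            self.pending = []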
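And the pipe into sort(1), sketched with the modern subprocess module rather than the actual posted patch; the dict again stands in for the _index BTree:

    import subprocess

    def rebuild_index(triples):
        # triples: iterable of (wordId, documentId, score) from the scan
        sort = subprocess.Popen(['sort', '-n'],
                                stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE, text=True)
        for wid, did, score in triples:
            sort.stdin.write('%d %d %d\n' % (wid, did, score))
        sort.stdin.close()       # all objects scanned; let sort(1) run

        index = {}               # wordId -> {documentId: score}
        for line in sort.stdout: # read the sorted results back
            wid, did, score = map(int, line.split())
            index.setdefault(wid, {})[did] = score
        sort.wait()
        return index

Grouping by wordId after the external sort means self._index is touched once per word instead of once per posting, which is where the memory saving comes from.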