Giovanni Maruzzelli wrote:
Hello Zopistas,
we are developing a Zope 2.3.3 (Python 1.5.2) application that will add, index, and reindex some tens of thousands of objects (ZClasses that are DTMLDocument on steroids) on some twenty properties each day, while the absolute number of cataloged objects keeps growing (think of content management for a big portal, where lots of content is added and modified each day, and all the old content remains as a searchable archive and as material to recycle in the future).
In some respects this seems similar to the task Erik Enge tackled a couple of weeks ago.
We first derived from CatalogAware, then switched to managing the cataloging, uncataloging, and recataloging ourselves.
The ZODB still bloats at far too fast a pace.
***Maybe there's something obvious we missed***, but when you have some 4,000 objects in the catalog, adding and cataloging one more object grows the ZODB by roughly a couple of megabytes, even though the object is some 1 KB of text plus a dozen boolean, datetime, and string properties. If we pack the ZODB, Data.fs returns to an almost normal size, so the bloat lives in the transaction records, as tranalyzer.py confirms.
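For concreteness, here is a minimal sketch of measuring that growth and packing a FileStorage from Python. It uses the modern standalone ZODB API rather than Zope 2.3.3's (which packs from the Control_Panel), and the 'Data.fs' path is an assumption:

    # Sketch: measure Data.fs before and after packing away old
    # transaction records.  Modern ZODB API; assumed file name.
    import os
    from ZODB import FileStorage, DB

    path = 'Data.fs'
    before = os.path.getsize(path)

    db = DB(FileStorage.FileStorage(path))
    db.pack(days=0)        # keep no historical transaction records
    db.close()

    print('%s: %d -> %d bytes' % (path, before, os.path.getsize(path)))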
Any hints on how to manage something like this? We use text indexes, field indexes, and keyword indexes (TextIndex on string properties, FieldIndex on booleans and datetimes, KeywordIndex on strings). Maybe one kind of index should be avoided?
Erik, any thoughts?
We have almost decided to switch to the BerkeleyDB storage (the Minimal one) to get rid of the bloat, and we are testing with it, but it seems to have been discontinued for lack of demand.
Can anyone shed light on it? Is it production grade?
Giovanni, I experienced similar problems trying to catalog ~200,000 objects with ~500 MB of text. Using CatalogAware objects will indeed lead to a "really fat" database, and using "find objects" for a ZCatalog requires considerable resources.

A text index (more precisely, the class UnTextIndex) works, as far as I understood it, this way:

1. The method UnTextIndex.index_object splits the text into single words, using the method [Globbing]Lexicon.Splitter.
2. UnTextIndex.index_object looks up the wordID (an integer) of each word in the lexicon. If a word is not yet listed in the lexicon, it is added to the lexicon.
3. All wordIDs are inserted into self._index, which maps wordIDs to the list of documents containing that word.
4. The "unindex" BTree, which maps the documentIds to the list of all words appearing in a document, is updated.

(A toy model of these structures follows at the end of this message.)

If you are adding only one CatalogAware object during a transaction, this is quite expensive: even if the indexed object contains only one new word, the entire lexicon needs to be updated. In my tests with the 200,000 objects (containing ordinary German texts) the lexicon contained ~1 million words. (BTW, I have not had a very close look into the contents of the lexicon, so I don't yet know exactly why it is so large. But I noticed quite a few entries like "38-jährige", "42-jährige" ("NN-year-old"). So a configurable splitter method might do a lot to reduce the size of the lexicon; see the splitter sketch below.) Hence, the above-mentioned step 2 alone can result in a really bloated database.

A solution might be a kind of "lazy catalog awareness": instead of mangling a new object through one or more catalogs when it is created, the object could be added to a list of objects to be cataloged later (also sketched below). This way, the transaction that inserts a new object would become much "cheaper". I'm working on this, but right now it is quite messy. (I'm new to Python and Zope, and hence I'm stumbling over a few, hmmm, trip-wires...)

But even using such "lazy catalog awareness", you might get into trouble. Using the ZCatalog's "find objects" function, I hit the limits of my Linux box: 640 MB RAM were not enough...

As I see it, the main problem is that UnTextIndex.index_object tries to do all the work at once: updating the lexicon _and_ self._index _and_ self._unindex. So I tried to separate these tasks by writing the data to be stored in self._index (wordId, documentId, score) into a pipe. This pipe is connected to sort(1). After all objects have been "scanned", the pipe is closed, the sorted results are read back, and self._index is updated. This way, Zope needed "only", uuhh, somewhere around 200 or 300 MB RAM.

A few weeks ago I posted this (admittedly not fully cooked) patch to this list, but have not yet received any response.

Abel
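To make the four steps above concrete, here is a toy model of the structures involved, with plain dicts standing in for the BTrees. The names are illustrative sketches of the behaviour Abel describes, not the real UnTextIndex code:

    import re

    class ToyTextIndex:
        def __init__(self):
            self._lexicon = {}    # word -> wordID
            self._index = {}      # wordID -> {documentId: score}
            self._unindex = {}    # documentId -> list of wordIDs

        def index_object(self, docid, text):
            # step 1: split the text into single words
            words = re.findall(r'[a-zA-Z]+', text.lower())
            # step 2: look up each wordID; unknown words grow the lexicon,
            # which is why one new word dirties the whole shared structure
            wids = [self._lexicon.setdefault(w, len(self._lexicon))
                    for w in words]
            # step 3: map each wordID to the documents containing it
            for wid in wids:
                self._index.setdefault(wid, {})[docid] = wids.count(wid)
            # step 4: the "unindex" maps the document back to its words,
            # so the object can be uncataloged later
            self._unindex[docid] = wids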
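On the splitter remark: a splitter that requires tokens to start with a letter would collapse "38-jährige" and "42-jährige" into the single lexicon entry "jährige". A sketch, with an assumed pattern and minimum length, not the stock [Globbing]Lexicon.Splitter:

    import re

    def split(text, min_len=2):
        # tokens must begin with a letter, so the leading "38-" is
        # skipped and only "jährige" survives into the lexicon
        words = re.findall(r'[^\W\d_][\w-]*', text.lower())
        return [w for w in words if len(w) >= min_len]

    print(split('Der 38-jährige und der 42-jährige Mann'))
    # -> ['der', 'jährige', 'und', 'der', 'jährige', 'mann']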
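The "lazy catalog awareness" could start out as simple as this sketch. DeferredCatalog and its attributes are hypothetical; only catalog_object() is the real ZCatalog method:

    class DeferredCatalog:
        def __init__(self, catalog):
            self.catalog = catalog   # the real ZCatalog
            self.pending = []        # paths of objects awaiting indexing

        def defer(self, path):
            # called where CatalogAware would call the catalog directly;
            # appending to a list is far cheaper than updating the lexicon
            self.pending.append(path)

        def flush(self, resolve):
            # 'resolve' maps a path back to its object
            # (e.g. unrestrictedTraverse); one transaction, whole batch
            for path in self.pending:
                self.catalog.catalog_object(resolve(path), path)
            self.pending = []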
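And the pipe into sort(1), sketched with the modern subprocess module rather than the actual posted patch; the dict again stands in for the _index BTree:

    import subprocess

    def rebuild_index(triples):
        # triples: iterable of (wordId, documentId, score) from the scan
        sort = subprocess.Popen(['sort', '-n'],
                                stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE, text=True)
        for wid, did, score in triples:
            sort.stdin.write('%d %d %d\n' % (wid, did, score))
        sort.stdin.close()       # all objects scanned; let sort(1) run

        index = {}               # wordId -> {documentId: score}
        for line in sort.stdout: # read the sorted results back
            wid, did, score = map(int, line.split())
            index.setdefault(wid, {})[did] = score
        sort.wait()
        return index

Grouping by wordId after the external sort means self._index is touched once per word instead of once per posting, which is where the memory saving comes from.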