abel deuring wrote:
A text index (class SearchIndex.UnTextIndex) is definitely a cause of bloating, if you use CatalogAware objects. An UnTextIndex maintains for
Right... if you don't use CatalogAware, however, and don't unindex before reindexing an object, you should see huge bloat savings, because the only things that are supposed to be updated then are the indexes and metadata whose data has changed.
each word a list of the documents in which that word appears. So if a document to be indexed contains, say, 100 words, 100 IIBTrees (containing mappings documentId -> word score) will be updated (see UnTextIndex.insertForwardIndexEntry). If you have a larger number of documents, these mappings may be quite large: assume 10,000 documents, and assume that you have 10 words which appear in 30% of all documents. Each of the IIBTrees for these words then contains 3,000 entries. (Ok, one can try to keep the number of frequent words low by using a "good" stop word list, but at least for German such a list is quite difficult to build. And one can argue that many "not too frequent" words should be indexed anyway, in order to allow more precise phrase searches.) I don't know the details of how data is stored inside the BTrees, so I can give only a rough estimate of the memory requirements: with 32-bit integers, we have at least 8 bytes per IIBTree entry (documentId and score), so each of the 10 BTrees for the "frequent words" has a minimum size of 3000*8 = 24,000 bytes.
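The arithmetic above can be written out explicitly. This is just the back-of-envelope calculation from the paragraph, with the same assumed numbers (10,000 documents, 30% frequency, 8 bytes per entry), not a measurement of actual ZODB storage:

```python
# Back-of-envelope estimate of the size of one "frequent word" IIBTree.
# All numbers are the assumptions from the discussion, not measurements.
ENTRY_BYTES = 8            # 32-bit documentId + 32-bit score per entry
documents = 10000
frequent_word_ratio = 0.30  # the word appears in 30% of all documents

entries_per_word = int(documents * frequent_word_ratio)  # 3000 entries
bytes_per_word = entries_per_word * ENTRY_BYTES          # 24000 bytes

print(entries_per_word, bytes_per_word)  # 3000 24000
```

Real BTrees add per-bucket and pickling overhead on top of this, so 24,000 bytes is a lower bound.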
If you now add a new document containing 5 of these frequent words, 5 larger BTrees will be updated. [Chris, let me know if I'm now going to tell nonsense...] I assume that the entire updated BTrees (120,000 bytes) will be appended to the ZODB (ignoring the less frequent words), even if the document contains only 1 kB of text.
Nah... I don't think so. At least I hope not! Each bucket in a BTree is a separate persistent object, so only the sum of the data in the updated buckets will be appended to the ZODB. So if you add an item to a BTree, you don't add 24,000+ bytes for each update. You just add the amount of space taken up by the bucket... unfortunately I don't know exactly how much this is, but I'd imagine it's pretty close to the data size with only a little overhead.
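The difference can be illustrated with a toy model. This is not the real BTrees code; it just models a mapping split into fixed-capacity buckets that are each persisted separately, with an assumed bucket capacity, to show why an insert rewrites far less than the whole tree:

```python
# Toy model of bucket-level persistence (not actual BTrees internals).
# A tree's entries live in fixed-size buckets; committing an insert
# rewrites only the bucket the new entry lands in.
BUCKET_CAPACITY = 60   # entries per bucket -- an assumption for illustration
ENTRY_BYTES = 8        # documentId + score, as estimated above

def bytes_written_per_insert(total_entries):
    """Bytes rewritten when one entry is added to a tree of this size."""
    # Entries already sitting in the bucket that receives the new entry:
    in_bucket = total_entries % BUCKET_CAPACITY
    return (in_bucket + 1) * ENTRY_BYTES

whole_tree = 3000 * ENTRY_BYTES             # 24000 bytes if the full tree were rewritten
one_bucket = bytes_written_per_insert(3000)  # at most one bucket's worth

print(whole_tree, one_bucket)
```

In this model an insert never writes more than one bucket (here at most 480 bytes), versus 24,000 bytes for rewriting the whole mapping, which is the point of storing buckets as separate persistent objects.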
This is the reason why I'm working on some kind of "lazy cataloging". My approach is to use a Python class (or base class, if ZClasses are involved) which has a method manage_afterAdd. This method looks for superValues of a type like "lazyCatalog" (derived from ZCatalog), and inserts self.getPhysicalPath() into the update list of each found "lazyCatalog".
Later, a "lazyCatalog" can index all objects in this list. Then the bloating happens either in RAM (without subtransactions), or in a temporary file, if you use subtransactions.
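The lazy cataloging idea might be sketched like this. The method names manage_afterAdd, superValues, and getPhysicalPath come from the discussion above; the classes here are standalone stand-ins, not actual Zope code, so the real implementation would differ:

```python
# Sketch of "lazy cataloging": objects queue themselves for indexing on
# add, and the catalog indexes the queued paths later, in one batch.

class LazyCatalog:
    """Stand-in for a ZCatalog subclass with an update list."""
    meta_type = 'lazyCatalog'

    def __init__(self):
        self.update_list = []   # paths of objects awaiting indexing
        self.indexed = []

    def queue(self, path):
        if path not in self.update_list:
            self.update_list.append(path)

    def index_pending(self):
        # In Zope, this is where the expensive BTree updates would
        # happen -- batched, instead of once per object addition.
        while self.update_list:
            self.indexed.append(self.update_list.pop(0))

class LazilyCataloged:
    """Stand-in for the base class with manage_afterAdd."""
    def __init__(self, path, catalogs):
        self._path = path
        self._catalogs = catalogs  # stand-in for superValues('lazyCatalog')

    def getPhysicalPath(self):
        return self._path

    def manage_afterAdd(self):
        # Instead of indexing immediately, just record our path.
        for catalog in self._catalogs:
            catalog.queue(self.getPhysicalPath())

catalog = LazyCatalog()
doc = LazilyCataloged(('', 'docs', 'doc1'), [catalog])
doc.manage_afterAdd()       # cheap: appends one path
catalog.index_pending()     # expensive work deferred to here
print(catalog.indexed)      # [('', 'docs', 'doc1')]
```

The point of the design is that adding an object only touches the small update list; the large index BTrees are only written when index_pending runs, so the write bloat is concentrated in one deferred step.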
OK, another approach which better fits your (Giovanni's) needs might be to use a database other than the ZODB, but I'm afraid that even then "instant indexing" will be an expensive process if you have a large number of documents.
Another option is to use a session manager, and update the catalog at session-end. - C