Zcatalog bloat problem (berkeleydb is a solution?)
Hello Zopistas,

we are developing a Zope 2.3.3 (Python 1.5.2) application that will add, index and reindex some tens of thousands of objects (ZClasses that are DTMLDocuments on steroids) on some twenty properties each day, while the absolute number of cataloged objects keeps growing (think of content management for a big portal, where each day lots of content is added and modified, and all the old content remains as a searchable archive and as material to recycle in the future).

In some respects this seems similar to the task Erik Enge ran into a couple of weeks ago.

We first derived from CatalogAware, then switched to managing the cataloging, uncataloging and recataloging ourselves. The ZODB still bloats at far too fast a pace.

***Maybe there's something obvious we missed***, but when you have some four thousand objects in the catalog, adding and cataloging one more object grows the ZODB by a couple of megabytes (while the object is about 1 kB of text plus a dozen boolean, datetime and string properties). If we pack the ZODB, Data.fs returns to an almost normal size, so the bloat is made up of the transactions, as tranalyzer.py confirms.

Any hints on how to manage something like this? We use textindexes, fieldindexes and keywordindexes (textindex on string properties, fieldindexes on booleans and datetimes, keywordindex on strings). Is one kind of index best avoided?

Erik, any thoughts?

We are almost decided on switching to the Berkeley DB storage (the Minimal one) to get rid of the bloating, and we are testing with it, but it seems to have been discontinued for lack of demand. Any light on it? Is it production grade?

-giovanni
(I trimmed the CC list) On Mon, 25 Jun 2001 14:34:55 +0200, "Giovanni Maruzzelli" <maruzz@open4.it> wrote:
***Maybe there's something obvious we missed***, but when you have some four thousand objects in the catalog, adding and cataloging one more object grows the ZODB by a couple of megabytes (while the object is about 1 kB of text plus a dozen boolean, datetime and string properties). If we pack the ZODB, Data.fs returns to an almost normal size, so the bloat is made up of the transactions, as tranalyzer.py confirms.
I have some patches to tranalyzer that dump the class name of every object written by each transaction. This helped me track down a similar bug where I was modifying more objects than I thought. Any interest?

Toby Dickenson
tdickenson@geminidataloggers.com
"GM" == Giovanni Maruzzelli <maruzz@open4.it> writes:
GM> We are almost decided to switch to berkeleydb storage (the
GM> Minimal one) to get rid of the bloating, we are testing with
GM> it, but it seems to be discontinued because of a lack of
GM> requests.
GM> Any light on it? Is it production grade?

There are currently two versions of non-undo, non-versioning Berkeley-based storages. One is called Packless.py and one is called Minimal.py. Packless was written first, and is truly packless; it uses reference counting to get rid of unused objects. Minimal isn't quite packless yet, since the reference counting hasn't been added. OTOH, Minimal shares as much implementation as possible with Full, so it shares e.g. the more robust backing commit log file stuff. A goal, although not likely for 1.0, is to merge the best features of Minimal and Packless into a single storage. I don't think it's a lot of work to do this, but Jim hasn't flagged it as having a very high priority.

-Barry
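The reference-counting idea behind Packless can be sketched as follows (a conceptual toy, not Packless's actual implementation or on-disk format; all names here are invented for illustration). When an object's count drops to zero it is reclaimed immediately, so no separate pack step is ever needed:

```python
# Toy reference-counting object store, illustrating why a storage built
# this way is "truly packless": garbage is reclaimed as soon as the last
# reference to it goes away, instead of during a pack pass.

class RefCountedStore:
    def __init__(self):
        self.objects = {}    # oid -> (data, set of oids it references)
        self.refcounts = {}  # oid -> number of live references

    def store(self, oid, data, refs=()):
        self.objects[oid] = (data, set(refs))
        # the creator holds one reference to the new object
        self.refcounts[oid] = self.refcounts.get(oid, 0) + 1
        for r in refs:
            self.refcounts[r] = self.refcounts.get(r, 0) + 1

    def decref(self, oid):
        self.refcounts[oid] -= 1
        if self.refcounts[oid] == 0:
            # reclaim immediately, and release everything it referenced
            data, refs = self.objects.pop(oid)
            del self.refcounts[oid]
            for r in refs:
                self.decref(r)

store = RefCountedStore()
store.store('child', 'payload-c')
store.store('root', 'payload-r', refs=('child',))
store.decref('root')            # root dies; child survives (one ref left)
print(sorted(store.objects))    # ['child']
```

A pack-based storage, by contrast, must periodically walk the whole object graph to find unreachable records, which is exactly the step Packless avoids.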
Giovanni Maruzzelli wrote:
Hello Zopistas,
we are developing a Zope 2.3.3 (Python 1.5.2) application that will add, index and reindex some tens of thousands of objects (ZClasses that are DTMLDocuments on steroids) on some twenty properties each day, while the absolute number of cataloged objects keeps growing (think of content management for a big portal, where each day lots of content is added and modified, and all the old content remains as a searchable archive and as material to recycle in the future).
In some respects this seems similar to the task Erik Enge ran into a couple of weeks ago.
We first derived from CatalogAware, then switched to managing the cataloging, uncataloging and recataloging ourselves.
The ZODB still bloats at far too fast a pace.
***Maybe there's something obvious we missed***, but when you have some four thousand objects in the catalog, adding and cataloging one more object grows the ZODB by a couple of megabytes (while the object is about 1 kB of text plus a dozen boolean, datetime and string properties). If we pack the ZODB, Data.fs returns to an almost normal size, so the bloat is made up of the transactions, as tranalyzer.py confirms.
Any hints on how to manage something like this? We use textindexes, fieldindexes and keywordindexes (textindex on string properties, fieldindexes on booleans and datetimes, keywordindex on strings). Is one kind of index best avoided?
Erik, any thoughts?
We are almost decided on switching to the Berkeley DB storage (the Minimal one) to get rid of the bloating, and we are testing with it, but it seems to have been discontinued for lack of demand.
Any light on it? Is it production grade?
Giovanni,

I experienced similar problems trying to catalog ~200,000 objects with ~500 MB of text. Using CatalogAware objects will indeed lead to a "really fat" database, and using "find objects" for a ZCatalog requires considerable resources.

A text index (more precisely: the class UnTextIndex) works, as far as I understand it, this way:

1. The method UnTextIndex.index_object splits the text into single words, using the method [Globbing]Lexicon.Splitter.
2. UnTextIndex.index_object looks up the wordID (an integer) of each word in the lexicon. If a word is not yet listed in the lexicon, it is added to the lexicon.
3. All wordIDs are inserted into self._index, which maps wordIDs to the list of documents containing that word.
4. The "unindex" BTree, which maps documentIds to the list of all words appearing in a document, is updated.

If you are adding only one CatalogAware object during a transaction, this is quite expensive: even if the indexed object contains only one new word, the entire lexicon needs to be updated. In my tests with the 200,000 objects (containing ordinary German texts) the lexicon contained ~1 million words. (BTW, I have not had a very close look at the contents of the lexicon, so I don't know exactly why it is so large. But I noticed quite a few entries like "38-jährige", "42-jährige" ("NN-year-old"). So a configurable splitter method might help quite a lot to reduce the size of the lexicon.) Hence, step 2 above can by itself result in a really bloated database.

A solution might be a kind of "lazy catalog awareness": instead of mangling a new object through one or more catalogs when it is created, the object could be added to a list of objects to be cataloged later. This way, the transaction that inserts a new object would become much "cheaper". I'm working on this, but right now it is quite messy. (I'm new to Python and Zope, and hence I'm stumbling over a few, hmmm, trip-wires...)
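The four indexing steps described above can be sketched in miniature, with plain dicts standing in for the persistent Lexicon and BTrees (TinyTextIndex and its naive whitespace splitter are illustrative inventions, not the real classes):

```python
# A simplified model of UnTextIndex's four steps: split, lexicon lookup,
# forward index update, unindex update. In Zope these structures are
# persistent BTrees, which is why touching them in every transaction is
# expensive; plain dicts are used here purely to show the data flow.

class TinyTextIndex:
    def __init__(self):
        self.lexicon = {}    # word -> wordId
        self._index = {}     # wordId -> {docId: score}
        self._unindex = {}   # docId -> [wordId, ...]

    def index_object(self, doc_id, text):
        words = text.lower().split()           # step 1: split into words
        word_ids = []
        for word in words:
            if word not in self.lexicon:       # step 2: lexicon lookup,
                self.lexicon[word] = len(self.lexicon)  # insert if new
            wid = self.lexicon[word]
            # step 3: forward index maps each word to the docs holding it
            self._index.setdefault(wid, {})
            self._index[wid][doc_id] = self._index[wid].get(doc_id, 0) + 1
            word_ids.append(wid)
        # step 4: the unindex records each doc's words, for later removal
        self._unindex[doc_id] = word_ids

    def unindex_object(self, doc_id):
        for wid in self._unindex.pop(doc_id, []):
            self._index[wid].pop(doc_id, None)

idx = TinyTextIndex()
idx.index_object(1, "zope catalog bloat")
idx.index_object(2, "zope zodb pack")
print(len(idx.lexicon))   # 5 distinct words
```

Note how indexing a single document touches one lexicon entry per word plus one forward-index mapping per word, which mirrors why a one-object transaction against persistent versions of these structures writes so much.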
But even using such a "lazy catalog awareness", you might get into trouble. Using the ZCatalog's "find objects" function, I hit the limits of my Linux box: 640 MB of RAM were not enough...

As I see it, the main problem is that UnTextIndex.index_object tries to do all the work at once: updating the lexicon _and_ self._index _and_ self._unindex. So I tried to separate these tasks by writing the data to be stored in self._index (wordId, documentId, score) into a pipe connected to sort(1). After all objects have been "scanned", the pipe is closed, the sorted results are read back, and self._index is updated. This way, Zope needed "only", uuhh, somewhere around 200 or 300 MB of RAM.

A few weeks ago I posted this (admittedly not fully cooked) patch to this list, but have not yet gotten any response.

Abel
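The separation Abel describes can be sketched like this, with Python's `sorted()` and `itertools.groupby` standing in for the external `sort(1)` pipe (the posting triples below are invented examples):

```python
# The batching idea: instead of updating self._index once per posting
# while scanning, first emit all (wordId, docId, score) triples, sort
# them by wordId, then rebuild each word's mapping exactly once. Abel's
# patch streams the triples through sort(1); sorted()/groupby play that
# role here.

from itertools import groupby
from operator import itemgetter

postings = [
    (7, 1, 2),   # (wordId, docId, score) -- hypothetical scan output
    (3, 1, 1),
    (7, 2, 5),
    (3, 9, 1),
]

index = {}  # wordId -> {docId: score}
for wid, group in groupby(sorted(postings, key=itemgetter(0)),
                          key=itemgetter(0)):
    # each word's mapping is touched once, not once per scanned document
    index[wid] = {doc: score for _, doc, score in group}

print(index)  # {3: {1: 1, 9: 1}, 7: {1: 2, 2: 5}}
```

The point of routing through an external sort rather than sorting in memory is that the working set never has to fit in RAM at once, which is exactly the limit Abel hit.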
A solution might be a kind of "lazy catalog awareness": Instead of mangling a new object through one or more catalogs when it is created, this object could be added to a list of objects to be cataloged later. This way, the transaction to insert a new object would become much "cheaper". I'm working on this, but right now it is quite messy. (I'm new to Python and Zope, and hence I'm stumbling over a few, hmmm, trip-wires...)
This aligns well with the goals of the ArmoredCatalog proposal too; see http://dev.zope.org/Wikis/DevSite/Proposals/ArmoredCatalog .
But even using such a "lazy catalog awareness", you might get into trouble. Using the ZCatalog's "find objects" function, I hit the limits of my Linux box: 640 MB RAM were not enough...
This should not happen. :-( I'm really disappointed that the bloat and memory consumption issues are still plaguing the ZCatalog. At one point, I really thought we had it pretty much licked. I suppose this was naive.
A few weeks ago I posted this (admittedly not fully cooked) patch to this list, but have not yet gotten any response.
I apologize for this. We have a fairly formalized process for handling feature-ish collector issues, and this hasn't come round on the guitar. I'm beyond disappointed that people are still having unacceptable bloat, enough that something like this patch needed to be submitted. It's disheartening. :-( - C
Hello Zopistas,

thank you all for your replies. Our doubts are still unresolved :-(

With a clever hack that Toby Dickenson made to the very useful tranalyzer, we were able to see what happens when we add or catalog an object. (BTW, we don't use CatalogAware.) We can send the tranalyzer2 output to anyone interested, but in short, this is what happens in an empty folder (and remember that as the folder gets populated, the size added by each transaction grows; a folder with one hundred objects adds some 100 kB):

if we add a normal DTML document (no catalog involved) in an empty folder we get a very small increase in size: the size of the DTML plus the size of the folder:

TID: 33D853C2CE6CDBB @ 77396692 obs 2 len 729 By ciao "/aacucu/addDTMLDocument" OID: 40817 len 270 [OFS.Folder.Folder] OID: 40818 len 309 [OFS.DTMLDocument.DTMLDocument]

if we add an "Articolo" that is cataloged on the fly in the same empty folder, we get the bloat:

TID: 33D853D722FA167 @ 77397437 obs 96 len 226568 By ciao "/aacucu/Articolo_add" OID: 40817 len 363 [OFS.Folder.Folder] OID: 40819 len 598 [*ennPsHQQKY5zjxlQs1ebmA==.Articolo] OID: 407b5 len 8074 [BTrees.IOBTree.IOBucket] OID: 37aa9 len 39 [BTrees.Length.Length] OID: 37b95 len 1483 [BTrees.OIBTree.OIBucket] OID: 407b7 len 1739 [BTrees.IOBTree.IOBucket] OID: 407b8 len 402 [BTrees.IIBTree.IISet] OID: 407b9 len 399 [BTrees.IOBTree.IOBucket] OID: 407ba len 402 [BTrees.IIBTree.IISet] OID: 407bb len 3497 [BTrees.IOBTree.IOBucket] OID: 407bc len 5871 [BTrees.OOBTree.OOBucket] OID: 37ab2 len 39 [BTrees.Length.Length] OID: 407c6 len 6279 [BTrees.IOBTree.IOBucket] OID: 3d7bf len 312 [BTrees.IIBTree.IISet] OID: 407c7 len 4507 [BTrees.IOBTree.IOBucket] OID: 3c992 len 837 [BTrees.OOBTree.OOBucket] OID: 37abe len 39 [BTrees.Length.Length] OID: 407d2 len 696 [BTrees.IOBTree.IOBucket] OID: 3cb4e len 572 [BTrees.IIBTree.IISet] OID: 407d3 len 537 [BTrees.IOBTree.IOBucket] OID: 40809 len 387 [BTrees.IIBTree.IISet] OID: 407d4 len 507
[BTrees.IOBTree.IOBucket] OID: 4080a len 387 [BTrees.IIBTree.IISet] OID: 407d5 len 507 [BTrees.IOBTree.IOBucket] OID: 4080b len 387 [BTrees.IIBTree.IISet] OID: 407d6 len 507 [BTrees.IOBTree.IOBucket] OID: 4080c len 387 [BTrees.IIBTree.IISet] OID: 407d7 len 339 [BTrees.IOBTree.IOBucket] OID: 4080d len 382 [BTrees.IIBTree.IISet] OID: 407d8 len 339 [BTrees.IOBTree.IOBucket] OID: 4080e len 382 [BTrees.IIBTree.IISet] OID: 407d9 len 339 [BTrees.IOBTree.IOBucket] OID: 3d064 len 597 [BTrees.IIBTree.IISet] OID: 407da len 347 [BTrees.IOBTree.IOBucket] OID: 4080f len 387 [BTrees.IIBTree.IISet] OID: 407db len 339 [BTrees.IOBTree.IOBucket] OID: 3d1ba len 642 [BTrees.IIBTree.IISet] OID: 407dc len 339 [BTrees.IOBTree.IOBucket] OID: 40810 len 372 [BTrees.IIBTree.IISet] OID: 407dd len 339 [BTrees.IOBTree.IOBucket] OID: 40811 len 372 [BTrees.IIBTree.IISet] OID: 407de len 339 [BTrees.IOBTree.IOBucket] OID: 37f11 len 977 [BTrees.IOBTree.IOBucket] OID: 380de len 830 [BTrees.OIBTree.OIBucket] OID: 37ac4 len 25537 [BTrees.IIBTree.IISet] OID: 37ac7 len 9892 [BTrees.IIBTree.IISet] OID: 37aca len 13947 [BTrees.IIBTree.IISet] OID: 38922 len 387 [BTrees.IIBTree.IISet] OID: 38643 len 827 [BTrees.IIBTree.IISet] OID: 3894c len 92 [BTrees.IIBTree.IISet] OID: 388ff len 24707 [BTrees.IIBTree.IISet] OID: 38581 len 277 [BTrees.IIBTree.IISet] OID: 3d7f7 len 319 [BTrees.IOBTree.IOBTree] OID: 3d7f8 len 356 [BTrees.IOBTree.IOBTree] OID: 40812 len 372 [BTrees.IIBTree.IISet] OID: 407e0 len 339 [BTrees.IOBTree.IOBucket] OID: 40813 len 387 [BTrees.IIBTree.IISet] OID: 407e1 len 339 [BTrees.IOBTree.IOBucket] OID: 40814 len 362 [BTrees.IIBTree.IISet] OID: 407e2 len 507 [BTrees.IOBTree.IOBucket] OID: 37eb9 len 981 [BTrees.IOBTree.IOBucket] OID: 38197 len 804 [BTrees.OIBTree.OIBucket] OID: 38ac7 len 7947 [BTrees.IIBTree.IISet] OID: 387f6 len 97 [BTrees.IIBTree.IISet] OID: 383f7 len 850 [BTrees.OOBTree.OOBucket] OID: 4081a len 47 [BTrees.IIBTree.IISet] OID: 38407 len 850 [BTrees.OOBTree.OOBucket] OID: 4081b len 47 
[BTrees.IIBTree.IISet] OID: 388ac len 92 [BTrees.IIBTree.IISet] OID: 387d4 len 152 [BTrees.IIBTree.IISet] OID: 3868c len 152 [BTrees.IIBTree.IISet] OID: 38681 len 142 [BTrees.IIBTree.IISet] OID: 388b0 len 72 [BTrees.IIBTree.IISet] OID: 384f1 len 52 [BTrees.IIBTree.IISet] OID: 37ca6 len 586 [BTrees.IOBTree.IOBucket] OID: 4081c len 686 [BTrees.IOBTree.IOBucket] OID: 37ab8 len 39336 [BTrees.IOBTree.IOBTree] OID: 381d8 len 594 [BTrees.OIBTree.OIBucket] OID: 38ac9 len 1252 [BTrees.IIBTree.IISet] OID: 38770 len 52 [BTrees.IIBTree.IISet] OID: 37d94 len 1234 [BTrees.IOBTree.IOBucket] OID: 3821d len 617 [BTrees.OIBTree.OIBucket] OID: 38acb len 557 [BTrees.IIBTree.IISet] OID: 38710 len 52 [BTrees.IIBTree.IISet] OID: 386ac len 52 [BTrees.IIBTree.IISet] OID: 38409 len 1019 [BTrees.OOBTree.OOBucket] OID: 4081d len 47 [BTrees.IIBTree.IISet] OID: 3870b len 52 [BTrees.IIBTree.IISet] OID: 38403 len 816 [BTrees.OOBTree.OOBucket] OID: 4081e len 47 [BTrees.IIBTree.IISet] OID: 387fe len 57 [BTrees.IIBTree.IISet] OID: 387cc len 67 [BTrees.IIBTree.IISet] OID: 38b29 len 1228 [BTrees.IOBTree.IOBucket] OID: 38c19 len 904 [BTrees.IOBTree.IOBucket] OID: 38d37 len 1007 [BTrees.IOBTree.IOBucket] OID: 3c610 len 33864 [BTrees.IOBTree.IOBucket] ----- Original Message ----- Sent: Monday, June 25, 2001 6:07 PM Subject: Re: [Zope-dev] Zcatalog bloat problem (berkeleydb is a solution?)
A solution might be a kind of "lazy catalog awareness": Instead of mangling a new object through one or more catalogs when it is created, this object could be added to a list of objects to be cataloged later. This way, the transaction to insert a new object would become much "cheaper". I'm working on this, but right now it is quite messy. (I'm new to Python and Zope, and hence I'm stumbling over a few, hmmm, trip-wires...)
This aligns well with the goals of the ArmoredCatalog proposal too; see http://dev.zope.org/Wikis/DevSite/Proposals/ArmoredCatalog .
But even using such a "lazy catalog awareness", you might get into trouble. Using the ZCatalog's "find objects" function, I hit the limits of my Linux box: 640 MB RAM were not enough...
This should not happen. :-(
I'm really disappointed that the bloat and memory consumption issues are still plaguing the ZCatalog. At one point, I really thought we had it pretty much licked. I suppose this was naive.
A few weeks ago I posted this (admittedly not fully cooked) patch to this list, but have not yet gotten any response.
I apologize for this. We have a fairly formalized process for handling feature-ish collector issues, and this hasn't come round on the guitar. I'm beyond disappointed that people are still having unacceptable bloat, enough that something like this patch needed to be submitted. It's disheartening. :-(
- C
Hi Giovanni,

How many indexes do you have, what are the index types, and what do they index? Likewise, what about metadata? In your last message, you said there are about 20. That's a heck of a lot of indexes. Do you need them all?

I can see a potential reason for the problem you describe as "and I remind you that as the folder get populated, the size that is added to each transaction grows, a folder with one hundred objects adds some 100K"...

It's true that "normal" folders (most ObjectManager-derived containers, actually) cause database bloat within undoing storages when an object is added to or removed from them. This is because a folder keeps the list of contained subobject names in an "_objects" attribute, which is a tuple. When an object is added, the tuple is rewritten in its entirety. So for instance, if you've got 100 items in your folder and you add one more, you rewrite all the instance data for the folder itself, which includes the (large) _objects tuple (and of course any other raw attributes, like properties). Over time, this can be problematic.

Shane's BTreeFolder Product attempts to ameliorate this problem a bit by keeping the data that is normally stored in the _objects tuple in its own persistent object (a BTree).

Are you breaking the content up into subfolders? This is recommended.

I'm tempted to postulate that your problem isn't so much ZCatalog as it is ObjectManager overhead.

- C

Giovanni Maruzzelli wrote:
Hello Zopistas,
thank you all for your replies.
Our doubts are still unresolved :-(
With a clever hack that Toby Dickenson made to the very useful tranalyzer, we were able to see what happens when we add or catalog an object. (BTW, we don't use CatalogAware.)
We can send the tranalyzer2 output to anyone interested, but in short, this is what happens in an empty folder (and remember that as the folder gets populated, the size added by each transaction grows; a folder with one hundred objects adds some 100 kB):
if we add a normal DTML document (no catalog involved) in an empty folder we get a very small increase in size: the size of the DTML plus the size of the folder:
TID: 33D853C2CE6CDBB @ 77396692 obs 2 len 729 By ciao "/aacucu/addDTMLDocument" OID: 40817 len 270 [OFS.Folder.Folder] OID: 40818 len 309 [OFS.DTMLDocument.DTMLDocument]
if we add an "Articolo" that is cataloged on the fly in the same empty folder, we get the bloat:
TID: 33D853D722FA167 @ 77397437 obs 96 len 226568 By ciao "/aacucu/Articolo_add" OID: 40817 len 363 [OFS.Folder.Folder] OID: 40819 len 598 [*ennPsHQQKY5zjxlQs1ebmA==.Articolo] OID: 407b5 len 8074 [BTrees.IOBTree.IOBucket] OID: 37aa9 len 39 [BTrees.Length.Length] OID: 37b95 len 1483 [BTrees.OIBTree.OIBucket] OID: 407b7 len 1739 [BTrees.IOBTree.IOBucket] OID: 407b8 len 402 [BTrees.IIBTree.IISet] OID: 407b9 len 399 [BTrees.IOBTree.IOBucket] OID: 407ba len 402 [BTrees.IIBTree.IISet] OID: 407bb len 3497 [BTrees.IOBTree.IOBucket] OID: 407bc len 5871 [BTrees.OOBTree.OOBucket] OID: 37ab2 len 39 [BTrees.Length.Length] OID: 407c6 len 6279 [BTrees.IOBTree.IOBucket] OID: 3d7bf len 312 [BTrees.IIBTree.IISet] OID: 407c7 len 4507 [BTrees.IOBTree.IOBucket] OID: 3c992 len 837 [BTrees.OOBTree.OOBucket] OID: 37abe len 39 [BTrees.Length.Length] OID: 407d2 len 696 [BTrees.IOBTree.IOBucket] OID: 3cb4e len 572 [BTrees.IIBTree.IISet] OID: 407d3 len 537 [BTrees.IOBTree.IOBucket] OID: 40809 len 387 [BTrees.IIBTree.IISet] OID: 407d4 len 507 [BTrees.IOBTree.IOBucket] OID: 4080a len 387 [BTrees.IIBTree.IISet] OID: 407d5 len 507 [BTrees.IOBTree.IOBucket] OID: 4080b len 387 [BTrees.IIBTree.IISet] OID: 407d6 len 507 [BTrees.IOBTree.IOBucket] OID: 4080c len 387 [BTrees.IIBTree.IISet] OID: 407d7 len 339 [BTrees.IOBTree.IOBucket] OID: 4080d len 382 [BTrees.IIBTree.IISet] OID: 407d8 len 339 [BTrees.IOBTree.IOBucket] OID: 4080e len 382 [BTrees.IIBTree.IISet] OID: 407d9 len 339 [BTrees.IOBTree.IOBucket] OID: 3d064 len 597 [BTrees.IIBTree.IISet] OID: 407da len 347 [BTrees.IOBTree.IOBucket] OID: 4080f len 387 [BTrees.IIBTree.IISet] OID: 407db len 339 [BTrees.IOBTree.IOBucket] OID: 3d1ba len 642 [BTrees.IIBTree.IISet] OID: 407dc len 339 [BTrees.IOBTree.IOBucket] OID: 40810 len 372 [BTrees.IIBTree.IISet] OID: 407dd len 339 [BTrees.IOBTree.IOBucket] OID: 40811 len 372 [BTrees.IIBTree.IISet] OID: 407de len 339 [BTrees.IOBTree.IOBucket] OID: 37f11 len 977 [BTrees.IOBTree.IOBucket] OID: 380de len 830 
[BTrees.OIBTree.OIBucket] OID: 37ac4 len 25537 [BTrees.IIBTree.IISet] OID: 37ac7 len 9892 [BTrees.IIBTree.IISet] OID: 37aca len 13947 [BTrees.IIBTree.IISet] OID: 38922 len 387 [BTrees.IIBTree.IISet] OID: 38643 len 827 [BTrees.IIBTree.IISet] OID: 3894c len 92 [BTrees.IIBTree.IISet] OID: 388ff len 24707 [BTrees.IIBTree.IISet] OID: 38581 len 277 [BTrees.IIBTree.IISet] OID: 3d7f7 len 319 [BTrees.IOBTree.IOBTree] OID: 3d7f8 len 356 [BTrees.IOBTree.IOBTree] OID: 40812 len 372 [BTrees.IIBTree.IISet] OID: 407e0 len 339 [BTrees.IOBTree.IOBucket] OID: 40813 len 387 [BTrees.IIBTree.IISet] OID: 407e1 len 339 [BTrees.IOBTree.IOBucket] OID: 40814 len 362 [BTrees.IIBTree.IISet] OID: 407e2 len 507 [BTrees.IOBTree.IOBucket] OID: 37eb9 len 981 [BTrees.IOBTree.IOBucket] OID: 38197 len 804 [BTrees.OIBTree.OIBucket] OID: 38ac7 len 7947 [BTrees.IIBTree.IISet] OID: 387f6 len 97 [BTrees.IIBTree.IISet] OID: 383f7 len 850 [BTrees.OOBTree.OOBucket] OID: 4081a len 47 [BTrees.IIBTree.IISet] OID: 38407 len 850 [BTrees.OOBTree.OOBucket] OID: 4081b len 47 [BTrees.IIBTree.IISet] OID: 388ac len 92 [BTrees.IIBTree.IISet] OID: 387d4 len 152 [BTrees.IIBTree.IISet] OID: 3868c len 152 [BTrees.IIBTree.IISet] OID: 38681 len 142 [BTrees.IIBTree.IISet] OID: 388b0 len 72 [BTrees.IIBTree.IISet] OID: 384f1 len 52 [BTrees.IIBTree.IISet] OID: 37ca6 len 586 [BTrees.IOBTree.IOBucket] OID: 4081c len 686 [BTrees.IOBTree.IOBucket] OID: 37ab8 len 39336 [BTrees.IOBTree.IOBTree] OID: 381d8 len 594 [BTrees.OIBTree.OIBucket] OID: 38ac9 len 1252 [BTrees.IIBTree.IISet] OID: 38770 len 52 [BTrees.IIBTree.IISet] OID: 37d94 len 1234 [BTrees.IOBTree.IOBucket] OID: 3821d len 617 [BTrees.OIBTree.OIBucket] OID: 38acb len 557 [BTrees.IIBTree.IISet] OID: 38710 len 52 [BTrees.IIBTree.IISet] OID: 386ac len 52 [BTrees.IIBTree.IISet] OID: 38409 len 1019 [BTrees.OOBTree.OOBucket] OID: 4081d len 47 [BTrees.IIBTree.IISet] OID: 3870b len 52 [BTrees.IIBTree.IISet] OID: 38403 len 816 [BTrees.OOBTree.OOBucket] OID: 4081e len 47 
[BTrees.IIBTree.IISet] OID: 387fe len 57 [BTrees.IIBTree.IISet] OID: 387cc len 67 [BTrees.IIBTree.IISet] OID: 38b29 len 1228 [BTrees.IOBTree.IOBucket] OID: 38c19 len 904 [BTrees.IOBTree.IOBucket] OID: 38d37 len 1007 [BTrees.IOBTree.IOBucket] OID: 3c610 len 33864 [BTrees.IOBTree.IOBucket]
Giovanni, which Zope version are you running? On Tue, 26 Jun 2001, Chris McDonough wrote:
How many indexes do you have, what are the index types, and what do they index? Likewise, what about metadata? In your last message, you said there's about 20. That's a heck of a lot of indexes. Do you need them all?
In my installation I have about 30 or 40 Position(Text)Index/KeywordIndex/FieldIndex. They don't bloat much, so I don't think that's the problem. (The problem might be that we have different views on what bloating is, though :)
On Tue, 26 Jun 2001 06:45:54 -0400, Chris McDonough <chrism@digicool.com> wrote:
I can see a potential reason for the problem you explain as "and I remind you that as the folder get populated, the size that is added to each transaction grows, a folder with one hundred objects adds some 100K"... It's true that "normal" folders (most ObjectManager-derived containers actually) cause database bloat within undoing storages when an object is added or removed from it.
What Chris describes would be a prudent change anyway; however, I don't think it is the root of this problem. The tranalyzer output shows the following line for the Folder. At a length of 363 I guess it is pretty empty. Even if this object grows to 100 kB (when adding the 100th item), it is not the single biggest contributor of bloat to the total transaction size. (Incidentally, it *was* the cause of the bloat problems that led me to develop this patched tranalyzer.)
OID: 40817 len 363 [OFS.Folder.Folder]
The following entries I do find interesting. They are all somewhat larger than I remember seeing before. Are you indexing *large* properties (or storing large metadata values)? It might be interesting to see the raw pickle data for these large objects... my patched tranalyzer can do that too.
OID: 37ac4 len 25537 [BTrees.IIBTree.IISet] OID: 37aca len 13947 [BTrees.IIBTree.IISet] OID: 388ff len 24707 [BTrees.IIBTree.IISet] OID: 37ab8 len 39336 [BTrees.IOBTree.IOBTree] OID: 3c610 len 33864 [BTrees.IOBTree.IOBucket]
Toby Dickenson tdickenson@geminidataloggers.com
Hi Giovanni, Chris and all others, Chris McDonough wrote:
Hi Giovanni,
How many indexes do you have, what are the index types, and what do they index? Likewise, what about metadata? In your last message, you said there's about 20. That's a heck of a lot of indexes. Do you need them all?
I can see a potential reason for the problem you describe as "and I remind you that as the folder get populated, the size that is added to each transaction grows, a folder with one hundred objects adds some 100K"... It's true that "normal" folders (most ObjectManager-derived containers, actually) cause database bloat within undoing storages when an object is added to or removed from them. This is because a folder keeps the list of contained subobject names in an "_objects" attribute, which is a tuple. When an object is added, the tuple is rewritten in its entirety. So for instance, if you've got 100 items in your folder and you add one more, you rewrite all the instance data for the folder itself, which includes the (large) _objects tuple (and of course any other raw attributes, like properties). Over time, this can be problematic.
Shane's BTreeFolder Product attempts to ameliorate this problem a bit by keeping the data that is normally stored in the _objects tuple in its own persistent object (a btree).
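The growth pattern described in the quoted paragraphs can be sketched with plain pickles (a rough model only: `pickled_size` and its dict entries are invented for illustration and are not real Folder pickles, but the trend is the point):

```python
# Why rewriting a folder's _objects tuple bloats an undoing storage:
# every add rewrites the folder's whole state record, so the bytes
# appended per transaction grow with the number of contained objects.

import pickle

def pickled_size(n):
    # a stand-in for a Folder whose _objects tuple holds n entries
    objects = tuple({'id': 'doc%04d' % i, 'meta_type': 'DTML Document'}
                    for i in range(n))
    return len(pickle.dumps(objects))

for n in (10, 100, 1000):
    # this many bytes are rewritten on *every* add or remove
    print(n, pickled_size(n))
```

A BTree-backed container avoids this because adding one entry dirties only one bucket record, not the whole listing, which is the design choice behind BTreeFolder.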
Are you breaking the content up into subfolders? This is recommended.
I'm tempted to postulate that your problem isn't so much ZCatalog as it is ObjectManager overhead.
Well, I'm not very familiar with the details of the sub-object management of ObjectManager and friends. Moreover, so far I have had a closer look only into UnTextIndex, not into UnIndex or UnKeywordIndex. So take my comments with a grain of salt.

A text index (class SearchIndex.UnTextIndex) is definitely a cause of bloating if you use CatalogAware objects. An UnTextIndex maintains for each word a list of the documents where that word appears. So, if a document to be indexed contains, say, 100 words, 100 IIBTrees (containing mappings documentId -> word score) will be updated (see UnTextIndex.insertForwardIndexEntry).

If you have a larger number of documents, these mappings may be quite large: assume 10,000 documents, and assume that you have 10 words which appear in 30% of all documents. Then each of the IIBTrees for these words contains 3,000 entries. (OK, one can try to keep the number of frequent words low by using a "good" stop word list, but at least for German such a list is quite difficult to build. And one can argue that many "not too really frequent" words should be indexed in order to allow more precise phrase searches.)

I don't know the details of how data is stored inside the BTrees, so I can give only a rough estimate of the memory requirements: with 32-bit integers, we have at least 8 bytes per IIBTree entry (documentId and score), so each of the 10 BTrees for the "frequent words" has a minimum size of 3000*8 = 24,000 bytes. If you now add a new document containing 5 of these frequent words, 5 larger BTrees will be updated. [Chris, let me know if I'm about to talk nonsense...] I assume that the entire updated BTrees = 120,000 bytes will be appended to the ZODB (ignoring the less frequent words) -- even if the document contains only 1 kB of text.

This is the reason why I'm working on some kind of "lazy cataloging". My approach is to use a Python class (or Base class, if ZClasses are involved) which has a method manage_afterAdd. This method looks for superValues of a type like "lazyCatalog" (derived from ZCatalog) and inserts self.getPhysicalPath() into the update list of each "lazyCatalog" found. Later, a "lazyCatalog" can index all objects in this list. The bloating then happens either in RAM (without subtransactions) or in a temporary file, if you use subtransactions.

OK, another approach which fits your (Giovanni's) needs better might be to use another database than the ZODB, but I'm afraid that even then "instant indexing" will be an expensive process if you have a large number of documents.

Abel
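A minimal sketch of this queue-then-index-later scheme (apart from manage_afterAdd, all class and method names here are invented for illustration; the real patch works against ZCatalog and physical paths):

```python
# "Lazy catalog awareness": object creation only records a path in a
# pending queue, so the creating transaction stays cheap; a separate
# batch pass later pays the indexing cost for the whole queue at once.

class LazyCatalog:
    def __init__(self):
        self.pending = []   # (path, object) pairs waiting to be indexed
        self.indexed = {}   # path -> object, filled by the batch pass

    def queue(self, path, obj):
        self.pending.append((path, obj))

    def process_pending(self):
        # one expensive pass updates the indexes for the whole batch,
        # instead of one heavy transaction per object creation
        while self.pending:
            path, obj = self.pending.pop(0)
            self.indexed[path] = obj

class LazyCatalogAware:
    def __init__(self, catalog, path):
        self.catalog, self.path = catalog, path

    def manage_afterAdd(self):
        # cheap: just remember the path; no index BTrees are touched yet
        self.catalog.queue(self.path, self)

cat = LazyCatalog()
doc = LazyCatalogAware(cat, '/portal/articolo1')
doc.manage_afterAdd()
print(len(cat.pending), len(cat.indexed))   # 1 0
cat.process_pending()
print(len(cat.pending), len(cat.indexed))   # 0 1
```

The trade-off is staleness: until process_pending runs, catalog searches won't find the new objects.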
abel deuring wrote:
A text index (class SearchIndex.UnTextIndex) is definitely a cause of bloating if you use CatalogAware objects. An UnTextIndex maintains for
Right... if you don't use CatalogAware, however, and don't unindex before reindexing an object, you should see huge bloat savings, because the only things that are supposed to be updated then are the indexes and metadata whose data has changed.
each word a list of the documents where that word appears. So, if a document to be indexed contains, say, 100 words, 100 IIBTrees (containing mappings documentId -> word score) will be updated (see UnTextIndex.insertForwardIndexEntry). If you have a larger number of documents, these mappings may be quite large: assume 10,000 documents, and assume that you have 10 words which appear in 30% of all documents. Then each of the IIBTrees for these words contains 3,000 entries. (OK, one can try to keep the number of frequent words low by using a "good" stop word list, but at least for German such a list is quite difficult to build. And one can argue that many "not too really frequent" words should be indexed in order to allow more precise phrase searches.) I don't know the details of how data is stored inside the BTrees, so I can give only a rough estimate of the memory requirements: with 32-bit integers, we have at least 8 bytes per IIBTree entry (documentId and score), so each of the 10 BTrees for the "frequent words" has a minimum size of 3000*8 = 24,000 bytes.
If you now add a new document containing 5 of these frequent words, 5 larger BTrees will be updated. [Chris, let me know if I'm about to talk nonsense...] I assume that the entire updated BTrees = 120,000 bytes will be appended to the ZODB (ignoring the less frequent words) -- even if the document contains only 1 kB of text.
Nah... I don't think so. At least I hope not! Each bucket in a BTree is a separate persistent object. So only the sum of the data in the updated buckets will be appended to the ZODB. So if you add an item to a BTree, you don't add 24000+ bytes for each update. You just add the amount of space taken up by the bucket... unfortunately I don't know exactly how much this is, but I'd imagine it's pretty close to the data size with only a little overhead.
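Chris's point about per-bucket writes can be sketched with a toy model. This is an illustration only, not the real BTrees implementation: the bucket size and the 8-bytes-per-entry estimate are assumptions carried over from Abel's arithmetic above.

```python
# Toy model of why a BTree insert rewrites only one bucket, not the
# whole tree: keys live in fixed-size buckets, and only the bucket
# that receives the new key is written to the ZODB on commit.

BUCKET_SIZE = 30  # hypothetical maximum entries per bucket


class ToyBTree:
    def __init__(self):
        self.buckets = [{}]   # list of docId -> score mappings
        self.dirty = set()    # indices of buckets changed this transaction

    def insert(self, doc_id, score):
        # Toy placement rule: append to the last bucket, splitting
        # off a new one when it is full.
        i = len(self.buckets) - 1
        if len(self.buckets[i]) >= BUCKET_SIZE:
            self.buckets.append({})
            i += 1
        self.buckets[i][doc_id] = score
        self.dirty.add(i)     # only this bucket needs rewriting

    def bytes_written_on_commit(self):
        # ~8 bytes per (docId, score) pair, per dirty bucket.
        return sum(8 * len(self.buckets[i]) for i in self.dirty)


tree = ToyBTree()
for doc in range(3000):       # a "frequent word" seen in 3000 documents
    tree.insert(doc, 1)
tree.dirty.clear()            # pretend everything is committed and packed

tree.insert(3000, 1)          # now index one more document
print(tree.bytes_written_on_commit())  # one small bucket, not 3000*8 bytes
```

Under this model, adding one document to a word seen in 3000 others rewrites a few dozen bytes, not 24 kB, which matches Chris's expectation.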
This is the reason why I'm working on some kind of "lazy cataloging". My approach is to use a Python class (or base class, if ZClasses are involved), which has a method manage_afterAdd. This method looks for superValues of a type like "lazyCatalog" (derived from ZCatalog), and inserts self.getPhysicalPath() into the update list of each found "lazyCatalog".
Later, a "lazyCatalog" can index all objects in this list. Then the bloating happens either in RAM (without subtransactions), or in a temporary file, if you use subtransactions.
OK, another approach which fits better to your (Giovanni) needs might be to use another database than ZODB, but I'm afraid that even then "instant indexing" will be an expensive process, if you have a large number of documents.
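Abel's lazy-cataloging idea can be sketched in plain Python. The names below (LazyCatalog, update_queue, process_queue) are illustrative assumptions, not the actual product's API; the real version would subclass ZCatalog and hook manage_afterAdd.

```python
# Minimal sketch of "lazy cataloging": manage_afterAdd only queues the
# object's physical path (cheap, no index writes); a batch job indexes
# the whole queue later.

class LazyCatalog:
    def __init__(self):
        self.update_queue = []   # physical paths waiting to be indexed
        self.indexed = {}        # path -> object, standing in for the indexes

    def queue(self, path):
        # Called from the object's manage_afterAdd hook.
        if path not in self.update_queue:
            self.update_queue.append(path)

    def process_queue(self, resolve):
        # Index everything queued, e.g. from a nightly batch job.
        # In Zope this loop would commit subtransactions periodically,
        # so the bloat goes to a temporary file instead of RAM.
        while self.update_queue:
            path = self.update_queue.pop(0)
            self.indexed[path] = resolve(path)


site = {'/docs/a': 'text of a', '/docs/b': 'text of b'}
catalog = LazyCatalog()
for p in site:
    catalog.queue(p)             # manage_afterAdd would do this per object
catalog.process_queue(site.get)  # resolve paths to objects and index them
print(len(catalog.indexed))      # -> 2
```

The design trade-off is latency: queued objects are not searchable until the batch run, in exchange for one large indexing transaction instead of many small bloating ones.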
Another option is to use a session manager, and update the catalog at session-end. - C
Chris McDonough wrote:
abel deuring wrote:
A text index (class SearchIndex.UnTextIndex) is definitely a cause of bloating, if you use CatalogAware objects. An UnTextIndex maintains for
Right.. if you don't use CatalogAware, however, and don't unindex before reindexing an object, you should see a huge bloat savings, because the only things which are supposed to be updated then are indexes and metadata which have data that has changed.
[snip] What, if any, disadvantages are there to not calling unindex_object first? If there aren't any good ones, I think I'll be rewriting some of my own "CatalogAware" code... -- Casey Duncan, Kaivo, Inc. cduncan@kaivo.com
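The change-detection behavior Chris describes can be sketched with a toy catalog. This is an illustration only, not the actual Catalog code; the index names and the writes counter are made up for the example.

```python
# Toy catalog showing why skipping the unindex step saves writes:
# an in-place reindex only touches indexes whose value actually
# changed, while unindex-then-index rewrites every entry.

class ToyCatalog:
    def __init__(self, index_names):
        self.indexes = {name: {} for name in index_names}
        self.writes = 0                 # index entries rewritten (i.e. bloat)

    def update_object(self, uid, values, unindex_first=False):
        for name, index in self.indexes.items():
            new = values.get(name)
            if unindex_first and uid in index:
                del index[uid]          # forces a rewrite even if unchanged
            if index.get(uid) != new:
                index[uid] = new
                self.writes += 1


cat = ToyCatalog(['title', 'date', 'author'])
cat.update_object('doc1', {'title': 'T', 'date': 'D', 'author': 'A'})
cat.writes = 0                          # reset after the initial cataloging

# Only the title changed: an in-place reindex touches one index...
cat.update_object('doc1', {'title': 'T2', 'date': 'D', 'author': 'A'})
print(cat.writes)                       # -> 1

# ...whereas unindexing first rewrites all three.
cat.writes = 0
cat.update_object('doc1', {'title': 'T3', 'date': 'D', 'author': 'A'},
                  unindex_first=True)
print(cat.writes)                       # -> 3
```

With twenty indexed properties per object, as in Giovanni's setup, that difference compounds on every reindex.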
On Tue, 26 Jun 2001 09:31:02 -0400, Chris McDonough <chrism@digicool.com> wrote:
Right.. if you don't use CatalogAware, however, and don't unindex before reindexing an object, you should see a huge bloat savings, because the only things which are supposed to be updated then are indexes and metadata which have data that has changed.
CatalogAware has been blamed for a lot of problems. Its three weaknesses I am aware of are:
a. Unindexing before reindexing causes bloat by defeating the catalog's change-detection tricks.
b. It uses URLs not paths, and so doesn't play right with virtual hosting.
c. It uses the same hooks as ObjectManager to detect that it has been added to or removed from a container ObjectManager, and therefore the two can't be easily mixed together as base classes.
All of these are fixable, and I feel a patch coming on. Are there some deeper problems I am not aware of? Toby Dickenson tdickenson@geminidataloggers.com
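Weakness (b) can be illustrated with a toy example; the hostnames and paths below are made up.

```python
# Toy illustration of why keying catalog entries on the URL breaks
# under virtual hosting, while the physical path stays stable: the
# same object gets a different URL depending on which host served it.

catalog_by_url = {}
catalog_by_path = {}

def index(host, path, data):
    # A URL-keyed catalog (what CatalogAware effectively does) versus
    # a path-keyed one (what getPhysicalPath()-based cataloging does).
    catalog_by_url['http://%s%s' % (host, path)] = data
    catalog_by_path[path] = data

# Indexed while accessed directly, behind the virtual host monster:
index('internal:8080', '/site/docs/page1', 'some metadata')

# Later, the same object is looked up through the public virtual host:
public_url = 'http://www.example.com/site/docs/page1'
print(public_url in catalog_by_url)           # -> False: entry is orphaned
print('/site/docs/page1' in catalog_by_path)  # -> True: still found
```

The orphaned URL entry is exactly the kind of stale record that then has to be unindexed and recataloged, feeding the bloat problem from the start of the thread.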
I actually think this about sums it up. If you have time to look at it Toby, it would be much appreciated. I don't think it's a very complicated set of fixes; it's just not on the radar at the moment, and might require some thought about backwards compatibility. - C Toby Dickenson wrote:
On Tue, 26 Jun 2001 09:31:02 -0400, Chris McDonough <chrism@digicool.com> wrote:
Right.. if you don't use CatalogAware, however, and don't unindex before reindexing an object, you should see a huge bloat savings, because the only things which are supposed to be updated then are indexes and metadata which have data that has changed.
CatalogAware has been blamed for a lot of problems. Its three weaknesses I am aware of are:
a. Unindexing before reindexing causes bloat by defeating the catalog's change-detection tricks.
b. It uses URLs not paths, and so doesn't play right with virtual hosting.
c. It uses the same hooks as ObjectManager to detect that it has been added to or removed from a container ObjectManager, and therefore the two can't be easily mixed together as base classes.
All of these are fixable, and I feel a patch coming on.
Are there some deeper problems I am not aware of?
Toby Dickenson tdickenson@geminidataloggers.com
_______________________________________________ Zope maillist - Zope@zope.org http://lists.zope.org/mailman/listinfo/zope ** No cross posts or HTML encoding! ** (Related lists - http://lists.zope.org/mailman/listinfo/zope-announce http://lists.zope.org/mailman/listinfo/zope-dev )
Chris McDonough <chrism@digicool.com> wrote:
I actually think this about sums it up. If you have time to look at it Toby, it would be much appreciated. I don't think it's a very complicated set of fixes; it's just not on the radar at the moment, and might require some thought about backwards compatibility.
Not a patch, but I've fixed all three known CatalogAware problems in a separate product: a new base class that derives from CatalogAware: http://www.zope.org/Members/htrd/BetterCatalogAware/ The techniques used in this product have been thoroughly stressed in several other production systems, but this is the first time they have been collected together in one place, so bugs are possible.
That makes CatalogAware much saner and will produce less bloat. Actually, maybe I should just go make that change in the trunk and the 2.4 branch, although I'm a little afraid of what (if anything) it will break for everybody. To be honest, I really don't have much time to spend thinking about this, and my fears are probably just FUD.
I'm not sure how many people are using CatalogAware; I think many serious users have been scared off by the problem reports in the list archives. IMO fixing this may be worth a little breakage. Toby Dickenson tdickenson@geminidataloggers.com
Excellent, thanks so much Toby. Maybe some feedback will come in... - C Toby Dickenson wrote:
Chris McDonough <chrism@digicool.com> wrote:
I actually think this about sums it up. If you have time to look at it Toby, it would be much appreciated. I don't think it's a very complicated set of fixes; it's just not on the radar at the moment, and might require some thought about backwards compatibility.
Not a patch, but I've fixed all three known CatalogAware problems in a separate product: a new base class that derives from CatalogAware:
http://www.zope.org/Members/htrd/BetterCatalogAware/
The techniques used in this product have been thoroughly stressed in several other production systems, but this is the first time they have been collected together in one place, so bugs are possible.
That makes CatalogAware much saner and will produce less bloat. Actually, maybe I should just go make that change in the trunk and the 2.4 branch, although I'm a little afraid of what (if anything) it will break for everybody. To be honest, I really don't have much time to spend thinking about this, and my fears are probably just FUD.
I'm not sure how many people are using CatalogAware; I think many serious users have been scared off by the problem reports in the list archives.
IMO fixing this may be worth a little breakage.
Toby Dickenson tdickenson@geminidataloggers.com
**************************************************************** Subject: [Zope] CatalogAware
CatalogAware has been blamed for a lot of problems. Its three weaknesses I am aware of are: <snip>
b. It uses URLs not paths, and so doesn't play right with virtual hosting.
***************************************************************** I ran into this problem using VHMonster with my EventFolder product and found a work-around, just for anyone who might be struggling with this. See http://www.netkook.com/Members/jeff/ef/faq/document_view#vhost This article discusses how to use _vh_ with VHM. (Boy, does that sound cryptic...) Jeff Sasmor jeff@sasmor.com www.netkook.com
participants (8)
- abel deuring
- barry@digicool.com
- Casey Duncan
- Chris McDonough
- Erik Enge
- Giovanni Maruzzelli
- Jeff Sasmor
- Toby Dickenson