Hello Zopistas,

We are developing a Zope 2.3.3 (py 1.5.2) application that will add, index, and reindex some tens of thousands of objects (ZClasses that are DTMLDocuments on steroids) on some twenty properties each day, while the absolute number of cataloged objects keeps growing (think of content management for a big portal, where each day lots of content is added and modified, and all the old content remains as a searchable archive and as material to recycle in the future).

This seems in some respects a task similar to what Erik Enge ran into a couple of weeks ago.

We first derived from CatalogAware, then switched to managing the cataloging / uncataloging / recataloging ourselves. The ZODB still bloats at much too fast a pace.

***Maybe there's something obvious we missed***, but when you have some 4,000 objects in the catalog, if you add and catalog one more object the ZODB grows by circa a couple of megabytes (while the object is some 1 K of text plus some twelve boolean, datetime, and string properties). If we pack the ZODB, Data.fs returns to an almost normal size (so the bloat is made by the transactions, as tranalyzer.py confirms).

Any hints on how to manage something like this? We use text indexes, field indexes, and keyword indexes (text indexes on string properties, field indexes on booleans and datetimes, keyword indexes on strings). Maybe one kind of index is to be avoided?

Erik, any thoughts?

We have almost decided to switch to the BerkeleyDB storage (the Minimal one) to get rid of the bloating. We are testing with it, but it seems to be discontinued because of a lack of requests. Any light on it? Is it production grade?

-giovanni
"GM" == Giovanni Maruzzelli <maruzz@open4.it> writes:
GM> We are almost decided to switch to berkeleydb storage (the
GM> Minimal one) to get rid of the bloating, we are testing with
GM> it, but it seems to be discontinued because of a lack of
GM> requests.
GM> Any light on it? Is it production grade?

There are currently two versions of non-undo, non-versioning Berkeley-based storages. One is called Packless.py and one is called Minimal.py. Packless was written first, and is truly packless; it uses reference counting to get rid of unused objects. Minimal isn't quite packless yet, since the reference counting hasn't been added. OTOH, Minimal shares as much implementation as possible with Full, so it shares e.g. the more robust backing commit log file stuff.

A goal, although not likely for 1.0, is to merge the best features of Minimal and Packless into a single storage. I don't think it's a lot of work to do this, but Jim hasn't pressed this as having a very high priority.

-Barry
Giovanni Maruzzelli wrote:
Hello Zopistas,
we are developing a Zope 2.3.3 (py 1.5.2) application that will add, index, and reindex some tens of thousands of objects (ZClasses that are DTMLDocuments on steroids) on some twenty properties each day, while the absolute number of cataloged objects keeps growing (think of content management for a big portal, where each day lots of content is added and modified, and all the old content remains as a searchable archive and as material to recycle in the future).
This seems in some respects a task similar to what Erik Enge ran into a couple of weeks ago.
We first derived from CatalogAware, then switched to managing the cataloging / uncataloging / recataloging ourselves.
The ZODB still bloats at much too fast a pace.
***Maybe there's something obvious we missed***, but when you have some 4,000 objects in the catalog, if you add and catalog one more object the ZODB grows by circa a couple of megabytes (while the object is some 1 K of text plus some twelve boolean, datetime, and string properties). If we pack the ZODB, Data.fs returns to an almost normal size (so the bloat is made by the transactions, as tranalyzer.py confirms).
Any hints on how to manage something like this? We use text indexes, field indexes, and keyword indexes (text indexes on string properties, field indexes on booleans and datetimes, keyword indexes on strings). Maybe one kind of index is to be avoided?
Erik, any thoughts?
We have almost decided to switch to the BerkeleyDB storage (the Minimal one) to get rid of the bloating. We are testing with it, but it seems to be discontinued because of a lack of requests.
Any light on it? Is it production grade?
Giovanni,

I experienced similar problems trying to catalog ~200,000 objects with ~500 MB of text. Using CatalogAware objects will indeed lead to a "really fat" database, and using the "find objects" function for a ZCatalog requires considerable resources.

A text index (more precisely: the class UnTextIndex) works, as far as I understand it, this way:

1. The method UnTextIndex.index_object splits the text into single words, using the method [Globbing]Lexicon.Splitter.
2. UnTextIndex.index_object looks up the wordID (an integer) of each word in the lexicon. If a word is not yet listed in the lexicon, it is added to the lexicon.
3. All wordIDs are inserted into self._index, which maps wordIDs to the list of documents containing the word.
4. The "unindex" BTree, which maps documentIds to the list of all words appearing in a document, is updated.

If you are adding only one CatalogAware object during a transaction, this is quite expensive: even if the indexed object contains only one new word, the entire lexicon needs to be updated. In my tests with the 200,000 objects (containing ordinary German texts) the lexicon contained ~1 million words. (BTW, I haven't had a very close look into the contents of the lexicon, so I don't yet know exactly why it is so large. But I noticed quite a few entries like "38-jährige", "42-jährige" ("NN-year-old"). So a configurable splitter method might help quite a lot to reduce the size of the lexicon.) Hence, step 2 above alone can result in a really bloated database.

A solution might be a kind of "lazy catalog awareness": instead of mangling a new object through one or more catalogs when it is created, the object could be added to a list of objects to be cataloged later. This way, the transaction that inserts a new object would become much "cheaper". I'm working on this, but right now it is quite messy. (I'm new to Python and Zope, and hence I'm stumbling over a few, hmmm, trip-wires...)
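The four steps above might be sketched, very roughly, like this (a toy Python illustration of the data flow, not Zope's actual UnTextIndex/Lexicon classes; it shows why indexing one new document touches structures shared by all documents):

```python
# Toy sketch of the four UnTextIndex steps -- illustrative only, not Zope code.

class ToyTextIndex:
    def __init__(self):
        self.lexicon = {}    # word -> wordId (shared across ALL documents)
        self._index = {}     # wordId -> {docId: occurrence count}
        self._unindex = {}   # docId -> list of wordIds (used for unindexing)

    def splitter(self, text):
        # Step 1: split the text into single words.
        return text.lower().split()

    def index_object(self, doc_id, text):
        word_ids = []
        for word in self.splitter(text):
            # Step 2: look up (or add) the word in the shared lexicon.
            wid = self.lexicon.setdefault(word, len(self.lexicon))
            word_ids.append(wid)
            # Step 3: update the posting map for this word.
            self._index.setdefault(wid, {})
            self._index[wid][doc_id] = self._index[wid].get(doc_id, 0) + 1
        # Step 4: record doc -> words so the object can be unindexed later.
        self._unindex[doc_id] = word_ids

idx = ToyTextIndex()
idx.index_object(1, "zope catalog bloat")
idx.index_object(2, "catalog bloat again")
```

Even in this toy version, indexing document 2 writes to the same lexicon and posting structures that document 1 already populated; in a transactional storage each such write is a new revision.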
But even using such a "lazy catalog awareness", you might get into trouble. Using the ZCatalog's "find objects" function, I hit the limits of my Linux box: 640 MB of RAM was not enough...

As I see it, the main problem is that UnTextIndex.index_object tries to do all the work at once: updating the lexicon _and_ self._index _and_ self._unindex. So I tried to separate these tasks by writing the data to be stored in self._index (wordId, documentId, score) into a pipe. This pipe is connected to sort(1). After all objects have been "scanned", the pipe is closed, the sorted results are read back, and self._index is updated. This way, Zope needed "only", uuhh, somewhere around 200 or 300 MB of RAM.

A few weeks ago I posted this (admittedly not fully cooked) patch to this list, but haven't yet received any response.

Abel
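The two-phase separation described above could be sketched like so (plain Python, with sorted() standing in for the sort(1) pipe; the function names are illustrative and not taken from the actual patch):

```python
# Sketch: phase 1 emits (wordId, docId, score) triples and only touches the
# lexicon; phase 2 consumes the sorted triples and builds the posting map
# in one grouped pass, instead of updating it per word while scanning.
import itertools

def scan(documents, lexicon):
    """Phase 1: scan documents, emitting postings instead of storing them."""
    for doc_id, text in documents.items():
        for word in text.lower().split():
            wid = lexicon.setdefault(word, len(lexicon))
            yield (wid, doc_id, 1)  # score of 1 per occurrence

def build_index(triples):
    """Phase 2: consume sorted triples, build wordId -> {docId: score}."""
    index = {}
    for wid, group in itertools.groupby(triples, key=lambda t: t[0]):
        postings = {}
        for _, doc_id, score in group:
            postings[doc_id] = postings.get(doc_id, 0) + score
        index[wid] = postings
    return index

lexicon = {}
docs = {1: "zope catalog bloat", 2: "catalog bloat again"}
triples = sorted(scan(docs, lexicon))   # stand-in for the sort(1) pipe
index = build_index(triples)
```

The point of sorting externally is that each wordId's postings arrive contiguously, so the index structure for a word is written once rather than being revisited for every document.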
A solution might be a kind of "lazy catalog awareness": Instead of mangling a new object through one or more catalogs when it is created, this object could be added to a list of objects to be cataloged later. This way, the transaction to insert a new object would become much "cheaper". I'm working on this, but right now it is quite messy. (I'm new to Python and Zope, and hence I'm stumbling over a few, hmmm, trip-wires...)
This purpose aligns well with those of the ArmoredCatalog proposal as well.. see http://dev.zope.org/Wikis/DevSite/Proposals/ArmoredCatalog .
But even using such a "lazy catalog awareness", you might get into trouble. Using the ZCatalog's "find objects" function, I hit the limits of my Linux box: 640 MB of RAM was not enough...
This should not happen. :-( I'm really disappointed that the bloat and memory consumption issues are still plaguing the ZCatalog. At one point, I really thought we had it pretty much licked. I suppose this was naive.
A few weeks ago I posted this (admittedly not fully cooked) patch to this list, but haven't yet received any response.
I apologize for this. We have a fairly formalized process for handling feature-ish collector issues, and this hasn't come round on the guitar. I'm beyond disappointed that people are still having unacceptable bloat, enough that something like this patch needed to be submitted. It's disheartening. :-( - C
Chris McDonough wrote:
This purpose aligns well with those of the ArmoredCatalog proposal as well.. see http://dev.zope.org/Wikis/DevSite/Proposals/ArmoredCatalog .
But even using such a "lazy catalog awareness", you might get into trouble. Using the ZCatalog's "find objects" function, I hit the limits of my Linux box: 640 MB of RAM was not enough...
This should not happen. :-(
Just to add another data point: we're still having issues if we catalog as we go when trying to recreate our mailing list archives in Zope. As I understand it, the guys managed to get it to work by importing 28,000-odd messages and then indexing, rather than indexing as each one was added. This was using Zope 2.3.2; should this be expected? cheers, Chris PS: Andy D was going to post this but he went home ill; I don't think that was ZCatalog related ;-)
Chris McDonough wrote:
This purpose aligns well with those of the ArmoredCatalog proposal as well.. see http://dev.zope.org/Wikis/DevSite/Proposals/ArmoredCatalog .
But even using such a "lazy catalog awareness", you might get into trouble. Using the ZCatalog's "find objects" function, I hit the limits of my Linux box: 640 MB of RAM was not enough...
This should not happen. :-(
Just to add another data point: we're still having issues if we catalog as we go when trying to recreate our mailing list archives in Zope. As I understand it, the guys managed to get it to work by importing 28,000-odd messages and then indexing, rather than indexing as each one was added. This was using Zope 2.3.2; should this be expected?
No.
Chris McDonough wrote:
A solution might be a kind of "lazy catalog awareness": Instead of mangling a new object through one or more catalogs when it is created, this object could be added to a list of objects to be cataloged later. This way, the transaction to insert a new object would become much "cheaper". I'm working on this, but right now it is quite messy. (I'm new to Python and Zope, and hence I'm stumbling over a few, hmmm, trip-wires...)
This purpose aligns well with those of the ArmoredCatalog proposal as well.. see http://dev.zope.org/Wikis/DevSite/Proposals/ArmoredCatalog .
I'm afraid I did not understand much of the discussion on this page. (But I don't intend to become a ZODB developer, so I'll simply ignore it...) But if I'm right, this "lazy catalog awareness" would mainly mean that ArmoredCatalog gets "official API calls" (1) to add objects to an update or delete list and (2) to index/unindex the objects in these lists. I think that this would be really useful.
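Such an API might look, very roughly, like this (a hypothetical sketch of the "queue now, index later" idea; none of these names come from ArmoredCatalog or Zope):

```python
# Minimal sketch of "lazy catalog awareness": (1) cheap calls that queue
# objects on update/delete lists, and (2) a flush call that does all the
# expensive index work later, in one batch. Hypothetical interface only.

class LazyCatalog:
    def __init__(self, catalog_one, uncatalog_one):
        self._catalog_one = catalog_one      # real per-object index call
        self._uncatalog_one = uncatalog_one  # real per-object unindex call
        self._pending_updates = []
        self._pending_deletes = []

    def queue_update(self, path):
        # Cheap: the insert transaction only appends to a list.
        self._pending_updates.append(path)

    def queue_delete(self, path):
        self._pending_deletes.append(path)

    def flush(self):
        # Expensive index work happens here, in one later transaction.
        for path in self._pending_deletes:
            self._uncatalog_one(path)
        for path in self._pending_updates:
            self._catalog_one(path)
        n = len(self._pending_updates) + len(self._pending_deletes)
        self._pending_updates = []
        self._pending_deletes = []
        return n

indexed = []
lazy = LazyCatalog(indexed.append, lambda path: None)
lazy.queue_update("/articles/a1")
lazy.queue_update("/articles/a2")
count = lazy.flush()
```

The win is that many queued objects share one batch of index-structure rewrites instead of each object rewriting the lexicon and BTrees in its own transaction.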
But even using such a "lazy catalog awareness", you might get into trouble. Using the ZCatalog's "find objects" function, I hit the limits of my Linux box: 640 MB of RAM was not enough...
This should not happen. :-(
I'm really disappointed that the bloat and memory consumption issues are still plaguing the ZCatalog. At one point, I really thought we had it pretty much licked. I suppose this was naive.
A few weeks ago I posted this (admittedly not fully cooked) patch to this list, but haven't yet received any response.
I apologize for this. We have a fairly formalized process for handling feature-ish collector issues, and this hasn't come round on the guitar. I'm beyond disappointed that people are still having unacceptable bloat, enough that something like this patch needed to be submitted. It's disheartening.
Chris, never mind :) It's also my fault: I'm content with reading a mailing list (and sometimes making more or less clever comments), and I know that Zope has far more elaborate discussion systems, like Wikis and fishbowls. It's only that I'm too lazy to scan through all this stuff to find a better place for comments... And I know that it's fairly easy to miss a mail :) Abel
Hello Zopistas,

Thank you all for your replies. Our doubts are still unresolved :-(

With a clever hack that Toby Dickenson made on the very useful tranalyzer, we were able to see what happens when we add or catalog an object. (BTW, we don't use CatalogAware.)

We can send the output of tranalyzer2 to anyone interested, but in short this is what happens in an empty folder (and remember that as the folder gets populated, the size added by each transaction grows; a folder with one hundred objects adds some 100K):

If we add a normal DTML document (no catalog involved) in an empty folder, we get a very small increase in size: the size of the DTML plus the size of the folder:

TID: 33D853C2CE6CDBB @ 77396692 obs 2 len 729 By ciao "/aacucu/addDTMLDocument" OID: 40817 len 270 [OFS.Folder.Folder] OID: 40818 len 309 [OFS.DTMLDocument.DTMLDocument]

If we add an "Articolo" that is cataloged on the fly in the same empty directory, we get bloat:

TID: 33D853D722FA167 @ 77397437 obs 96 len 226568 By ciao "/aacucu/Articolo_add" OID: 40817 len 363 [OFS.Folder.Folder] OID: 40819 len 598 [*ennPsHQQKY5zjxlQs1ebmA==.Articolo] OID: 407b5 len 8074 [BTrees.IOBTree.IOBucket] OID: 37aa9 len 39 [BTrees.Length.Length] OID: 37b95 len 1483 [BTrees.OIBTree.OIBucket] OID: 407b7 len 1739 [BTrees.IOBTree.IOBucket] OID: 407b8 len 402 [BTrees.IIBTree.IISet] OID: 407b9 len 399 [BTrees.IOBTree.IOBucket] OID: 407ba len 402 [BTrees.IIBTree.IISet] OID: 407bb len 3497 [BTrees.IOBTree.IOBucket] OID: 407bc len 5871 [BTrees.OOBTree.OOBucket] OID: 37ab2 len 39 [BTrees.Length.Length] OID: 407c6 len 6279 [BTrees.IOBTree.IOBucket] OID: 3d7bf len 312 [BTrees.IIBTree.IISet] OID: 407c7 len 4507 [BTrees.IOBTree.IOBucket] OID: 3c992 len 837 [BTrees.OOBTree.OOBucket] OID: 37abe len 39 [BTrees.Length.Length] OID: 407d2 len 696 [BTrees.IOBTree.IOBucket] OID: 3cb4e len 572 [BTrees.IIBTree.IISet] OID: 407d3 len 537 [BTrees.IOBTree.IOBucket] OID: 40809 len 387 [BTrees.IIBTree.IISet] OID: 407d4 len 507 
[BTrees.IOBTree.IOBucket] OID: 4080a len 387 [BTrees.IIBTree.IISet] OID: 407d5 len 507 [BTrees.IOBTree.IOBucket] OID: 4080b len 387 [BTrees.IIBTree.IISet] OID: 407d6 len 507 [BTrees.IOBTree.IOBucket] OID: 4080c len 387 [BTrees.IIBTree.IISet] OID: 407d7 len 339 [BTrees.IOBTree.IOBucket] OID: 4080d len 382 [BTrees.IIBTree.IISet] OID: 407d8 len 339 [BTrees.IOBTree.IOBucket] OID: 4080e len 382 [BTrees.IIBTree.IISet] OID: 407d9 len 339 [BTrees.IOBTree.IOBucket] OID: 3d064 len 597 [BTrees.IIBTree.IISet] OID: 407da len 347 [BTrees.IOBTree.IOBucket] OID: 4080f len 387 [BTrees.IIBTree.IISet] OID: 407db len 339 [BTrees.IOBTree.IOBucket] OID: 3d1ba len 642 [BTrees.IIBTree.IISet] OID: 407dc len 339 [BTrees.IOBTree.IOBucket] OID: 40810 len 372 [BTrees.IIBTree.IISet] OID: 407dd len 339 [BTrees.IOBTree.IOBucket] OID: 40811 len 372 [BTrees.IIBTree.IISet] OID: 407de len 339 [BTrees.IOBTree.IOBucket] OID: 37f11 len 977 [BTrees.IOBTree.IOBucket] OID: 380de len 830 [BTrees.OIBTree.OIBucket] OID: 37ac4 len 25537 [BTrees.IIBTree.IISet] OID: 37ac7 len 9892 [BTrees.IIBTree.IISet] OID: 37aca len 13947 [BTrees.IIBTree.IISet] OID: 38922 len 387 [BTrees.IIBTree.IISet] OID: 38643 len 827 [BTrees.IIBTree.IISet] OID: 3894c len 92 [BTrees.IIBTree.IISet] OID: 388ff len 24707 [BTrees.IIBTree.IISet] OID: 38581 len 277 [BTrees.IIBTree.IISet] OID: 3d7f7 len 319 [BTrees.IOBTree.IOBTree] OID: 3d7f8 len 356 [BTrees.IOBTree.IOBTree] OID: 40812 len 372 [BTrees.IIBTree.IISet] OID: 407e0 len 339 [BTrees.IOBTree.IOBucket] OID: 40813 len 387 [BTrees.IIBTree.IISet] OID: 407e1 len 339 [BTrees.IOBTree.IOBucket] OID: 40814 len 362 [BTrees.IIBTree.IISet] OID: 407e2 len 507 [BTrees.IOBTree.IOBucket] OID: 37eb9 len 981 [BTrees.IOBTree.IOBucket] OID: 38197 len 804 [BTrees.OIBTree.OIBucket] OID: 38ac7 len 7947 [BTrees.IIBTree.IISet] OID: 387f6 len 97 [BTrees.IIBTree.IISet] OID: 383f7 len 850 [BTrees.OOBTree.OOBucket] OID: 4081a len 47 [BTrees.IIBTree.IISet] OID: 38407 len 850 [BTrees.OOBTree.OOBucket] OID: 4081b len 47 
[BTrees.IIBTree.IISet] OID: 388ac len 92 [BTrees.IIBTree.IISet] OID: 387d4 len 152 [BTrees.IIBTree.IISet] OID: 3868c len 152 [BTrees.IIBTree.IISet] OID: 38681 len 142 [BTrees.IIBTree.IISet] OID: 388b0 len 72 [BTrees.IIBTree.IISet] OID: 384f1 len 52 [BTrees.IIBTree.IISet] OID: 37ca6 len 586 [BTrees.IOBTree.IOBucket] OID: 4081c len 686 [BTrees.IOBTree.IOBucket] OID: 37ab8 len 39336 [BTrees.IOBTree.IOBTree] OID: 381d8 len 594 [BTrees.OIBTree.OIBucket] OID: 38ac9 len 1252 [BTrees.IIBTree.IISet] OID: 38770 len 52 [BTrees.IIBTree.IISet] OID: 37d94 len 1234 [BTrees.IOBTree.IOBucket] OID: 3821d len 617 [BTrees.OIBTree.OIBucket] OID: 38acb len 557 [BTrees.IIBTree.IISet] OID: 38710 len 52 [BTrees.IIBTree.IISet] OID: 386ac len 52 [BTrees.IIBTree.IISet] OID: 38409 len 1019 [BTrees.OOBTree.OOBucket] OID: 4081d len 47 [BTrees.IIBTree.IISet] OID: 3870b len 52 [BTrees.IIBTree.IISet] OID: 38403 len 816 [BTrees.OOBTree.OOBucket] OID: 4081e len 47 [BTrees.IIBTree.IISet] OID: 387fe len 57 [BTrees.IIBTree.IISet] OID: 387cc len 67 [BTrees.IIBTree.IISet] OID: 38b29 len 1228 [BTrees.IOBTree.IOBucket] OID: 38c19 len 904 [BTrees.IOBTree.IOBucket] OID: 38d37 len 1007 [BTrees.IOBTree.IOBucket] OID: 3c610 len 33864 [BTrees.IOBTree.IOBucket] ----- Original Message ----- Sent: Monday, June 25, 2001 6:07 PM Subject: Re: [Zope-dev] Zcatalog bloat problem (berkeleydb is a solution?)
A solution might be a kind of "lazy catalog awareness": Instead of mangling a new object through one or more catalogs when it is created, this object could be added to a list of objects to be cataloged later. This way, the transaction to insert a new object would become much "cheaper". I'm working on this, but right now it is quite messy. (I'm new to Python and Zope, and hence I'm stumbling over a few, hmmm, trip-wires...)
This purpose aligns well with those of the ArmoredCatalog proposal as well.. see http://dev.zope.org/Wikis/DevSite/Proposals/ArmoredCatalog .
But even using such a "lazy catalog awareness", you might get into trouble. Using the ZCatalog's "find objects" function, I hit the limits of my Linux box: 640 MB of RAM was not enough...
This should not happen. :-(
I'm really disappointed that the bloat and memory consumption issues are still plaguing the ZCatalog. At one point, I really thought we had it pretty much licked. I suppose this was naive.
A few weeks ago I posted this (admittedly not fully cooked) patch to this list, but haven't yet received any response.
I apologize for this. We have a fairly formalized process for handling feature-ish collector issues, and this hasn't come round on the guitar. I'm beyond disappointed that people are still having unacceptable bloat, enough that something like this patch needed to be submitted. It's disheartening. :-(
- C
Hi Giovanni,

How many indexes do you have, what are the index types, and what do they index? Likewise, what about metadata? In your last message, you said there are about 20. That's a heck of a lot of indexes. Do you need them all?

I can see a potential reason for the problem you describe as "and I remind you that as the folder gets populated, the size that is added to each transaction grows; a folder with one hundred objects adds some 100K"...

It's true that "normal" folders (most ObjectManager-derived containers, actually) cause database bloat within undoing storages when an object is added to or removed from them. This is because a folder keeps a list of contained subobject names in an "_objects" attribute, which is a tuple. When an object is added, the tuple is rewritten in its entirety. So, for instance, if you've got 100 items in your folder and you add one more, you rewrite all the instance data for the folder itself, which includes the (large) _objects tuple (and, of course, any other raw attributes, like properties). Over time, this can be problematic.

Shane's BTreeFolder product attempts to ameliorate this problem a bit by keeping the data that is normally stored in the _objects tuple in its own persistent object (a BTree).

Are you breaking the content up into subfolders? This is recommended.

I'm tempted to postulate that perhaps your problem isn't so much ZCatalog as ObjectManager overhead.

- C

Giovanni Maruzzelli wrote:
Hello Zopistas,
thank you all for your replies.
Our doubts are still unresolved :-(
With a clever hack that Toby Dickenson made on the very useful tranalyzer, we were able to see what happens when we add or catalog an object. (BTW, we don't use CatalogAware.)
We can send the output of tranalyzer2 to anyone interested, but in short this is what happens in an empty folder (and remember that as the folder gets populated, the size added by each transaction grows; a folder with one hundred objects adds some 100K):
if we add a normal DTML document (no catalog involved) in an empty folder we have a very small increase in size: the size of the dtml and the size of the folder:
TID: 33D853C2CE6CDBB @ 77396692 obs 2 len 729 By ciao "/aacucu/addDTMLDocument" OID: 40817 len 270 [OFS.Folder.Folder] OID: 40818 len 309 [OFS.DTMLDocument.DTMLDocument]
if we add an "Articolo" that is cataloged on the fly in the same empty directory, we get bloat:
TID: 33D853D722FA167 @ 77397437 obs 96 len 226568 By ciao "/aacucu/Articolo_add" OID: 40817 len 363 [OFS.Folder.Folder] OID: 40819 len 598 [*ennPsHQQKY5zjxlQs1ebmA==.Articolo] OID: 407b5 len 8074 [BTrees.IOBTree.IOBucket] OID: 37aa9 len 39 [BTrees.Length.Length] OID: 37b95 len 1483 [BTrees.OIBTree.OIBucket] OID: 407b7 len 1739 [BTrees.IOBTree.IOBucket] OID: 407b8 len 402 [BTrees.IIBTree.IISet] OID: 407b9 len 399 [BTrees.IOBTree.IOBucket] OID: 407ba len 402 [BTrees.IIBTree.IISet] OID: 407bb len 3497 [BTrees.IOBTree.IOBucket] OID: 407bc len 5871 [BTrees.OOBTree.OOBucket] OID: 37ab2 len 39 [BTrees.Length.Length] OID: 407c6 len 6279 [BTrees.IOBTree.IOBucket] OID: 3d7bf len 312 [BTrees.IIBTree.IISet] OID: 407c7 len 4507 [BTrees.IOBTree.IOBucket] OID: 3c992 len 837 [BTrees.OOBTree.OOBucket] OID: 37abe len 39 [BTrees.Length.Length] OID: 407d2 len 696 [BTrees.IOBTree.IOBucket] OID: 3cb4e len 572 [BTrees.IIBTree.IISet] OID: 407d3 len 537 [BTrees.IOBTree.IOBucket] OID: 40809 len 387 [BTrees.IIBTree.IISet] OID: 407d4 len 507 [BTrees.IOBTree.IOBucket] OID: 4080a len 387 [BTrees.IIBTree.IISet] OID: 407d5 len 507 [BTrees.IOBTree.IOBucket] OID: 4080b len 387 [BTrees.IIBTree.IISet] OID: 407d6 len 507 [BTrees.IOBTree.IOBucket] OID: 4080c len 387 [BTrees.IIBTree.IISet] OID: 407d7 len 339 [BTrees.IOBTree.IOBucket] OID: 4080d len 382 [BTrees.IIBTree.IISet] OID: 407d8 len 339 [BTrees.IOBTree.IOBucket] OID: 4080e len 382 [BTrees.IIBTree.IISet] OID: 407d9 len 339 [BTrees.IOBTree.IOBucket] OID: 3d064 len 597 [BTrees.IIBTree.IISet] OID: 407da len 347 [BTrees.IOBTree.IOBucket] OID: 4080f len 387 [BTrees.IIBTree.IISet] OID: 407db len 339 [BTrees.IOBTree.IOBucket] OID: 3d1ba len 642 [BTrees.IIBTree.IISet] OID: 407dc len 339 [BTrees.IOBTree.IOBucket] OID: 40810 len 372 [BTrees.IIBTree.IISet] OID: 407dd len 339 [BTrees.IOBTree.IOBucket] OID: 40811 len 372 [BTrees.IIBTree.IISet] OID: 407de len 339 [BTrees.IOBTree.IOBucket] OID: 37f11 len 977 [BTrees.IOBTree.IOBucket] OID: 380de len 830 
[BTrees.OIBTree.OIBucket] OID: 37ac4 len 25537 [BTrees.IIBTree.IISet] OID: 37ac7 len 9892 [BTrees.IIBTree.IISet] OID: 37aca len 13947 [BTrees.IIBTree.IISet] OID: 38922 len 387 [BTrees.IIBTree.IISet] OID: 38643 len 827 [BTrees.IIBTree.IISet] OID: 3894c len 92 [BTrees.IIBTree.IISet] OID: 388ff len 24707 [BTrees.IIBTree.IISet] OID: 38581 len 277 [BTrees.IIBTree.IISet] OID: 3d7f7 len 319 [BTrees.IOBTree.IOBTree] OID: 3d7f8 len 356 [BTrees.IOBTree.IOBTree] OID: 40812 len 372 [BTrees.IIBTree.IISet] OID: 407e0 len 339 [BTrees.IOBTree.IOBucket] OID: 40813 len 387 [BTrees.IIBTree.IISet] OID: 407e1 len 339 [BTrees.IOBTree.IOBucket] OID: 40814 len 362 [BTrees.IIBTree.IISet] OID: 407e2 len 507 [BTrees.IOBTree.IOBucket] OID: 37eb9 len 981 [BTrees.IOBTree.IOBucket] OID: 38197 len 804 [BTrees.OIBTree.OIBucket] OID: 38ac7 len 7947 [BTrees.IIBTree.IISet] OID: 387f6 len 97 [BTrees.IIBTree.IISet] OID: 383f7 len 850 [BTrees.OOBTree.OOBucket] OID: 4081a len 47 [BTrees.IIBTree.IISet] OID: 38407 len 850 [BTrees.OOBTree.OOBucket] OID: 4081b len 47 [BTrees.IIBTree.IISet] OID: 388ac len 92 [BTrees.IIBTree.IISet] OID: 387d4 len 152 [BTrees.IIBTree.IISet] OID: 3868c len 152 [BTrees.IIBTree.IISet] OID: 38681 len 142 [BTrees.IIBTree.IISet] OID: 388b0 len 72 [BTrees.IIBTree.IISet] OID: 384f1 len 52 [BTrees.IIBTree.IISet] OID: 37ca6 len 586 [BTrees.IOBTree.IOBucket] OID: 4081c len 686 [BTrees.IOBTree.IOBucket] OID: 37ab8 len 39336 [BTrees.IOBTree.IOBTree] OID: 381d8 len 594 [BTrees.OIBTree.OIBucket] OID: 38ac9 len 1252 [BTrees.IIBTree.IISet] OID: 38770 len 52 [BTrees.IIBTree.IISet] OID: 37d94 len 1234 [BTrees.IOBTree.IOBucket] OID: 3821d len 617 [BTrees.OIBTree.OIBucket] OID: 38acb len 557 [BTrees.IIBTree.IISet] OID: 38710 len 52 [BTrees.IIBTree.IISet] OID: 386ac len 52 [BTrees.IIBTree.IISet] OID: 38409 len 1019 [BTrees.OOBTree.OOBucket] OID: 4081d len 47 [BTrees.IIBTree.IISet] OID: 3870b len 52 [BTrees.IIBTree.IISet] OID: 38403 len 816 [BTrees.OOBTree.OOBucket] OID: 4081e len 47 
[BTrees.IIBTree.IISet] OID: 387fe len 57 [BTrees.IIBTree.IISet] OID: 387cc len 67 [BTrees.IIBTree.IISet] OID: 38b29 len 1228 [BTrees.IOBTree.IOBucket] OID: 38c19 len 904 [BTrees.IOBTree.IOBucket] OID: 38d37 len 1007 [BTrees.IOBTree.IOBucket] OID: 3c610 len 33864 [BTrees.IOBTree.IOBucket]
----- Original Message ----- Sent: Monday, June 25, 2001 6:07 PM Subject: Re: [Zope-dev] Zcatalog bloat problem (berkeleydb is a solution?)
A solution might be a kind of "lazy catalog awareness": Instead of mangling a new object through one or more catalogs when it is created, this object could be added to a list of objects to be cataloged later. This way, the transaction to insert a new object would become much "cheaper". I'm working on this, but right now it is quite messy. (I'm new to Python and Zope, and hence I'm stumbling over a few, hmmm, trip-wires...)
This purpose aligns well with those of the ArmoredCatalog proposal as well.. see http://dev.zope.org/Wikis/DevSite/Proposals/ArmoredCatalog .
But even using such a "lazy catalog awareness", you might get into trouble. Using the ZCatalog's "find objects" function, I hit the limits of my Linux box: 640 MB of RAM was not enough...
This should not happen. :-(
I'm really disappointed that the bloat and memory consumption issues are still plaguing the ZCatalog. At one point, I really thought we had it pretty much licked. I suppose this was naive.
A few weeks ago I posted this (admittedly not fully cooked) patch to this list, but haven't yet received any response.
I apologize for this. We have a fairly formalized process for handling feature-ish collector issues, and this hasn't come round on the guitar. I'm beyond disappointed that people are still having unacceptable bloat, enough that something like this patch needed to be submitted. It's disheartening. :-(
- C
Chris McDonough wrote:
Shane's BTreeFolder Product attempts to ameliorate this problem a bit by keeping the data that is normally stored in the _objects tuple in its own persistent object (a btree).
Are you breaking the content up into subfolders? This is recommended.
Do you still need to do this if you're using a BTreeFolder? cheers, Chris
Chris Withers wrote:
Chris McDonough wrote:
Shane's BTreeFolder Product attempts to ameliorate this problem a bit by keeping the data that is normally stored in the _objects tuple in its own persistent object (a btree).
Are you breaking the content up into subfolders? This is recommended.
Do you still need to do this if you're using a BTreeFolder?
It doesn't hurt, but likely no. If at all, you'd want to do it so management interface views would be sane. Then again, I've never actually used BTreeFolder. ;-) - C
Chris McDonough wrote:
It doesn't hurt, but likely no. If at all, you'd want to do it so management interface views would be sane.
Then again, I've never actually used BTreeFolder. ;-)
Ah, you should, it's great :-) The management interface is different, so you don't have problems with lots of objects. Both of the FreeZope servers have BTreeFolders storing the accounts, and they've got between 500 and 1000 users on each server... I think they need to be updated to use the new BTrees, but I don't know if that's a problem... Thanks to Shane, again :-) cheers, Chris
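The _objects tuple rewrite described earlier in this thread can be made concrete with a rough measurement sketch (plain Python; pickle size is used as a crude stand-in for what an undoing storage appends per transaction, and the numbers are illustrative, not Zope's actual record sizes):

```python
# Rough illustration: a tuple-based folder re-pickles its whole subobject
# list on every add, so the bytes written per add grow with folder size.
# A BTree-style container would rewrite only the touched bucket instead.
import pickle

def tuple_folder_cost(n_existing):
    """Bytes to re-pickle the whole _objects-style tuple after one add."""
    objects = tuple({"id": "item%04d" % i, "meta_type": "Folder"}
                    for i in range(n_existing + 1))
    return len(pickle.dumps(objects))

cost_at_10 = tuple_folder_cost(10)
cost_at_100 = tuple_folder_cost(100)
# The per-add cost grows roughly linearly with the number of siblings,
# which matches "a folder with one hundred objects adds some 100K".
```

This is why a BTree-backed folder helps: the per-add write stays near-constant instead of scaling with the number of existing siblings.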
Hi Chris,

I don't think this is a problem of ObjectManager, though it contributes to the bloating. We do break the content into subfolders, but our subfolders easily grow to contain some hundreds of objects.

Do you think that the number of indexes contributes to the bloating? If this is important, we can try to compact them into a smaller number (e.g. the boolean indexes can become a sort of bitmask, we can eliminate the meta_type, etc.).

These are our indexes (cut and paste from the ZMI), followed by our metadata:

INDEXES:
PrincipiaSearchSource        Text Index      2,524
autore                       Keyword Index   4,055
bflow0                       Field Index     4,055
bflow1                       Field Index     4,055
bflow2                       Field Index     4,055
bflow3                       Field Index     4,055
bflow4                       Field Index     4,055
bflow5                       Field Index     4,055
bflow6                       Field Index     4,055
bflow7                       Field Index     4,055
bflow8                       Field Index     4,055
bflow9                       Field Index     4,055
bobobase_modification_time   Field Index     4,300
dflow0                       Field Index     4,055
dflow1                       Field Index     4,055
id                           Field Index     4,300
m_sflow0                     Keyword Index   3,960
m_sflow1                     Keyword Index   3,960
m_sflow2                     Keyword Index   3,960
meta_type                    Field Index     4,300
pseudoId                     Text Index      4,054
revisore                     Keyword Index   4,055
title                        Text Index      3,844

METADATA: bobobase_modification_time, id, meta_type, pseudoId, title

----- Original Message ----- Sent: Tuesday, June 26, 2001 12:45 PM Subject: Re: Zcatalog bloat problem (berkeleydb is a solution?)
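The bitmask idea mentioned above could be sketched like this (plain Python; only the bflow0..bflow9 attribute names come from the index listing, while the packing and query helpers are hypothetical):

```python
# Sketch: fold the ten boolean bflow0..bflow9 FieldIndexes into a single
# integer attribute, so one FieldIndex can replace ten. Helpers below are
# illustrative, not part of ZCatalog.

BFLOW_FIELDS = ["bflow%d" % i for i in range(10)]

def pack_bflow(obj_attrs):
    """Pack the ten booleans into one int, with bflow0 as bit 0."""
    mask = 0
    for bit, name in enumerate(BFLOW_FIELDS):
        if obj_attrs.get(name):
            mask |= 1 << bit
    return mask

def matches(mask, **wanted):
    """Check e.g. matches(mask, bflow3=True) against a packed value."""
    for name, value in wanted.items():
        bit = BFLOW_FIELDS.index(name)
        if bool(mask & (1 << bit)) != bool(value):
            return False
    return True

mask = pack_bflow({"bflow0": True, "bflow3": True})
```

The trade-off: one index instead of ten means far fewer BTree structures touched per catalog operation, but querying a single flag then requires matching against sets of mask values rather than a simple True/False lookup.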
Hi Giovanni,
How many indexes do you have, what are the index types, and what do they index? Likewise, what about metadata? In your last message, you said there's about 20. That's a heck of a lot of indexes. Do you need them all?
I can see a potential reason for the problem you explain as "and I remind you that as the folder get populated, the size that is added to each transaction grows, a folder with one hundred objects adds some 100K"... It's true that "normal" folders (most ObjectManager-derived containers actually) cause database bloat within undoing storages when an object is added or removed from it. This is because it keeps a list of contained subobject names in an "_objects" attribute, which is a tuple. When an object is added, the tuple is rewritten in entirety. So for instance, if you've got 100 items in your folder, and you add one more, you rewrite all the instance data for the folder itself, which includes the (large) _objects tuple (and of course, any other raw attributes, like properties). Over time, this can be problematic.
Shane's BTreeFolder Product attempts to ameliorate this problem a bit by keeping the data that is normally stored in the _objects tuple in its own persistent object (a btree).
Are you breaking the content up into subfolders? This is recommended.
I'm temped to postulate that perhaps your problem isn't as much ZCatalog as it is ObjectManager overhead.
- C
Giovanni Maruzzelli wrote:
Hello Zopistas,
thank you all for your replies.
Our doubts are still unresolved :-(
With a clever hack that Toby Dickenson made on the very useful tranalyzer, we were able to see what happens when we add or catalog an object. (BTW, we don't use CatalogAware.)
We can send the output of tranalyzer2 to anyone interested, but in short this is what happens in an empty folder (and remember that as the folder gets populated, the size added by each transaction grows; a folder with one hundred objects adds some 100K):
if we add a normal DTML document (no catalog involved) in an empty folder we have a very small increase in size: the size of the dtml and the size of the folder:
TID: 33D853C2CE6CDBB @ 77396692 obs 2 len 729 By ciao "/aacucu/addDTMLDocument"
OID: 40817 len 270 [OFS.Folder.Folder]
OID: 40818 len 309 [OFS.DTMLDocument.DTMLDocument]
If we add an "Articolo" that's cataloged on the fly in the same empty directory, we see the bloat:
TID: 33D853D722FA167 @ 77397437 obs 96 len 226568 By ciao "/aacucu/Articolo_add" OID: 40817 len 363 [OFS.Folder.Folder] OID: 40819 len 598 [*ennPsHQQKY5zjxlQs1ebmA==.Articolo] OID: 407b5 len 8074 [BTrees.IOBTree.IOBucket] OID: 37aa9 len 39 [BTrees.Length.Length] OID: 37b95 len 1483 [BTrees.OIBTree.OIBucket] OID: 407b7 len 1739 [BTrees.IOBTree.IOBucket] OID: 407b8 len 402 [BTrees.IIBTree.IISet] OID: 407b9 len 399 [BTrees.IOBTree.IOBucket] OID: 407ba len 402 [BTrees.IIBTree.IISet] OID: 407bb len 3497 [BTrees.IOBTree.IOBucket] OID: 407bc len 5871 [BTrees.OOBTree.OOBucket] OID: 37ab2 len 39 [BTrees.Length.Length] OID: 407c6 len 6279 [BTrees.IOBTree.IOBucket] OID: 3d7bf len 312 [BTrees.IIBTree.IISet] OID: 407c7 len 4507 [BTrees.IOBTree.IOBucket] OID: 3c992 len 837 [BTrees.OOBTree.OOBucket] OID: 37abe len 39 [BTrees.Length.Length] OID: 407d2 len 696 [BTrees.IOBTree.IOBucket] OID: 3cb4e len 572 [BTrees.IIBTree.IISet] OID: 407d3 len 537 [BTrees.IOBTree.IOBucket] OID: 40809 len 387 [BTrees.IIBTree.IISet] OID: 407d4 len 507 [BTrees.IOBTree.IOBucket] OID: 4080a len 387 [BTrees.IIBTree.IISet] OID: 407d5 len 507 [BTrees.IOBTree.IOBucket] OID: 4080b len 387 [BTrees.IIBTree.IISet] OID: 407d6 len 507 [BTrees.IOBTree.IOBucket] OID: 4080c len 387 [BTrees.IIBTree.IISet] OID: 407d7 len 339 [BTrees.IOBTree.IOBucket] OID: 4080d len 382 [BTrees.IIBTree.IISet] OID: 407d8 len 339 [BTrees.IOBTree.IOBucket] OID: 4080e len 382 [BTrees.IIBTree.IISet] OID: 407d9 len 339 [BTrees.IOBTree.IOBucket] OID: 3d064 len 597 [BTrees.IIBTree.IISet] OID: 407da len 347 [BTrees.IOBTree.IOBucket] OID: 4080f len 387 [BTrees.IIBTree.IISet] OID: 407db len 339 [BTrees.IOBTree.IOBucket] OID: 3d1ba len 642 [BTrees.IIBTree.IISet] OID: 407dc len 339 [BTrees.IOBTree.IOBucket] OID: 40810 len 372 [BTrees.IIBTree.IISet] OID: 407dd len 339 [BTrees.IOBTree.IOBucket] OID: 40811 len 372 [BTrees.IIBTree.IISet] OID: 407de len 339 [BTrees.IOBTree.IOBucket] OID: 37f11 len 977 [BTrees.IOBTree.IOBucket] OID: 380de len 830 
[BTrees.OIBTree.OIBucket] OID: 37ac4 len 25537 [BTrees.IIBTree.IISet] OID: 37ac7 len 9892 [BTrees.IIBTree.IISet] OID: 37aca len 13947 [BTrees.IIBTree.IISet] OID: 38922 len 387 [BTrees.IIBTree.IISet] OID: 38643 len 827 [BTrees.IIBTree.IISet] OID: 3894c len 92 [BTrees.IIBTree.IISet] OID: 388ff len 24707 [BTrees.IIBTree.IISet] OID: 38581 len 277 [BTrees.IIBTree.IISet] OID: 3d7f7 len 319 [BTrees.IOBTree.IOBTree] OID: 3d7f8 len 356 [BTrees.IOBTree.IOBTree] OID: 40812 len 372 [BTrees.IIBTree.IISet] OID: 407e0 len 339 [BTrees.IOBTree.IOBucket] OID: 40813 len 387 [BTrees.IIBTree.IISet] OID: 407e1 len 339 [BTrees.IOBTree.IOBucket] OID: 40814 len 362 [BTrees.IIBTree.IISet] OID: 407e2 len 507 [BTrees.IOBTree.IOBucket] OID: 37eb9 len 981 [BTrees.IOBTree.IOBucket] OID: 38197 len 804 [BTrees.OIBTree.OIBucket] OID: 38ac7 len 7947 [BTrees.IIBTree.IISet] OID: 387f6 len 97 [BTrees.IIBTree.IISet] OID: 383f7 len 850 [BTrees.OOBTree.OOBucket] OID: 4081a len 47 [BTrees.IIBTree.IISet] OID: 38407 len 850 [BTrees.OOBTree.OOBucket] OID: 4081b len 47 [BTrees.IIBTree.IISet] OID: 388ac len 92 [BTrees.IIBTree.IISet] OID: 387d4 len 152 [BTrees.IIBTree.IISet] OID: 3868c len 152 [BTrees.IIBTree.IISet] OID: 38681 len 142 [BTrees.IIBTree.IISet] OID: 388b0 len 72 [BTrees.IIBTree.IISet] OID: 384f1 len 52 [BTrees.IIBTree.IISet] OID: 37ca6 len 586 [BTrees.IOBTree.IOBucket] OID: 4081c len 686 [BTrees.IOBTree.IOBucket] OID: 37ab8 len 39336 [BTrees.IOBTree.IOBTree] OID: 381d8 len 594 [BTrees.OIBTree.OIBucket] OID: 38ac9 len 1252 [BTrees.IIBTree.IISet] OID: 38770 len 52 [BTrees.IIBTree.IISet] OID: 37d94 len 1234 [BTrees.IOBTree.IOBucket] OID: 3821d len 617 [BTrees.OIBTree.OIBucket] OID: 38acb len 557 [BTrees.IIBTree.IISet] OID: 38710 len 52 [BTrees.IIBTree.IISet] OID: 386ac len 52 [BTrees.IIBTree.IISet] OID: 38409 len 1019 [BTrees.OOBTree.OOBucket] OID: 4081d len 47 [BTrees.IIBTree.IISet] OID: 3870b len 52 [BTrees.IIBTree.IISet] OID: 38403 len 816 [BTrees.OOBTree.OOBucket] OID: 4081e len 47 
[BTrees.IIBTree.IISet] OID: 387fe len 57 [BTrees.IIBTree.IISet] OID: 387cc len 67 [BTrees.IIBTree.IISet] OID: 38b29 len 1228 [BTrees.IOBTree.IOBucket] OID: 38c19 len 904 [BTrees.IOBTree.IOBucket] OID: 38d37 len 1007 [BTrees.IOBTree.IOBucket] OID: 3c610 len 33864 [BTrees.IOBTree.IOBucket]
----- Original Message ----- Sent: Monday, June 25, 2001 6:07 PM Subject: Re: [Zope-dev] Zcatalog bloat problem (berkeleydb is a solution?)
A solution might be a kind of "lazy catalog awareness": Instead of mangling a new object through one or more catalogs when it is created, this object could be added to a list of objects to be cataloged later. This way, the transaction to insert a new object would become much "cheaper". I'm working on this, but right now it is quite messy. (I'm new to Python and Zope, and hence I'm stumbling over a few, hmmm, trip-wires...)
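A minimal sketch of such a deferred-cataloging queue (all names here are hypothetical, not Abel's actual code): object creation only records a path, and a later batch pass does the expensive indexing, so the creating transaction stays cheap.

```python
# Minimal sketch of "lazy catalog awareness" (hypothetical names):
# creation is O(1) -- just remember the path; the real BTree updates
# happen later, in one batch, outside the creating transaction.

class LazyCatalogQueue:
    def __init__(self):
        self._pending = []              # paths of objects awaiting cataloging

    def defer(self, path):
        # Would be called from something like manage_afterAdd: cheap, O(1).
        self._pending.append(path)

    def flush(self, catalog_one):
        # catalog_one(path) performs the real (expensive) indexing.
        batch, self._pending = self._pending, []
        for path in batch:
            catalog_one(path)
        return len(batch)

queue = LazyCatalogQueue()
queue.defer('/portal/articles/a1')
queue.defer('/portal/articles/a2')
indexed = []
queue.flush(indexed.append)
print(indexed)   # ['/portal/articles/a1', '/portal/articles/a2']
```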
This aligns well with the goals of the ArmoredCatalog proposal, too... see http://dev.zope.org/Wikis/DevSite/Proposals/ArmoredCatalog .
But even using such a "lazy catalog awareness", you might get into trouble. Using the ZCatalog's "find objects" function, I hit the limits of my Linux box: 640 MB RAM were not enough...
This should not happen. :-(
I'm really disappointed that the bloat and memory consumption issues are still plaguing the ZCatalog. At one point, I really thought we had it pretty much licked. I suppose this was naive.
A few weeks ago, I posted this (admittedly not fully cooked) patch to this list, but have not yet received any response.
I apologize for this. We have a fairly formalized process for handling feature-ish collector issues, and this hasn't come round on the guitar. I'm beyond disappointed that people are still having unacceptable bloat, enough that something like this patch needed to be submitted. It's disheartening. :-(
- C
Well, I'm not sure, unfortunately. I just wanted to get an idea of what kinds of indexes you had. The tranalyzer output doesn't mean too much to me, because it shows BTree buckets and such getting updated, which is completely understandable... there are at least two data structures in the Catalog itself that use a BTree, and each index uses at least two BTrees. So it's not all that surprising to see that output. What is surprising is to hear the amount of growth a transaction causes. The only things I can think of are that: a) you're committing inappropriately (at times where it would be OK not to commit); b) the data fields you're indexing or getting metadata from are large; c) something awful happened between 2.3.2 and 2.3.3 that I don't understand; d) the problem is unrelated to the Catalog. I'm afraid I can't be any more precise than that. -C Giovanni Maruzzelli wrote:
Hi Chris,
I don't think this is a problem of ObjectManager, even if it contributes to the bloating.
We do break the content into subfolders, but our subfolders easily grow to contain some hundreds of objects.
Do you think that the number of indexes contributes to the bloating? If this is important, we can try to compact them into a smaller number (e.g.: the boolean indexes can become a sort of bitmask, we can eliminate the meta_type, etc.).
These are our indexes (cut and pasted from the ZMI), followed by our metadata:
INDEXES:
PrincipiaSearchSource       Text Index     2,524
autore                      Keyword Index  4,055
bflow0                      Field Index    4,055
bflow1                      Field Index    4,055
bflow2                      Field Index    4,055
bflow3                      Field Index    4,055
bflow4                      Field Index    4,055
bflow5                      Field Index    4,055
bflow6                      Field Index    4,055
bflow7                      Field Index    4,055
bflow8                      Field Index    4,055
bflow9                      Field Index    4,055
bobobase_modification_time  Field Index    4,300
dflow0                      Field Index    4,055
dflow1                      Field Index    4,055
id                          Field Index    4,300
m_sflow0                    Keyword Index  3,960
m_sflow1                    Keyword Index  3,960
m_sflow2                    Keyword Index  3,960
meta_type                   Field Index    4,300
pseudoId                    Text Index     4,054
revisore                    Keyword Index  4,055
title                       Text Index     3,844
METADATA:
bobobase_modification_time id meta_type pseudoId title
INDEXES: PrincipiaSearchSource Text Index 2,524 autore Keyword Index 4,055 bflow0 Field Index 4,055 bflow1 Field Index 4,055 bflow2 Field Index 4,055
Aha! A clue. If that is the output of the 'Indexes' tab, then I don't think you are using the newest ZCatalog. A recent release (I'm not sure which, 2.3.2?) has a new BTree implementation that reduces bloat by modifying fewer buckets (it also doesn't have the column showing index size). Toby Dickenson tdickenson@geminidataloggers.com
Toby Dickenson wrote:
INDEXES: PrincipiaSearchSource Text Index 2,524 autore Keyword Index 4,055 bflow0 Field Index 4,055 bflow1 Field Index 4,055 bflow2 Field Index 4,055
Aha! a clue.
If that is the output of the 'Indexes' tab, then I don't think you are using the newest ZCatalog. A recent release (I'm not sure which, 2.3.2?) has a new BTree implementation that reduces bloat by modifying fewer buckets (it also doesn't have the column showing index size)
Has the person concerned run the catalog update tool when they upgraded their Zope version? cheers, Chris
The catalog is a pristine 2.3.3b1 catalog. We have recreated the catalog from scratch because we tried manage_convertBTrees, but it doesn't work for us; it returns an error (and the same happens with 2.3.3 final):

Error Type: TypeError
Error Value: second argument must be a class
Traceback (innermost last):
  File /fs1root/zope/Zope-2.3.3b1-src/lib/python/ZPublisher/Publish.py, line 223, in publish_module
  File /fs1root/zope/Zope-2.3.3b1-src/lib/python/ZPublisher/Publish.py, line 187, in publish
  File /fs1root/zope/Zope-2.3.3b1-src/lib/python/Zope/__init__.py, line 221, in zpublisher_exception_hook
    (Object: Traversable)
  File /fs1root/zope/Zope-2.3.3b1-src/lib/python/ZPublisher/Publish.py, line 171, in publish
  File /fs1root/zope/Zope-2.3.3b1-src/lib/python/ZPublisher/mapply.py, line 160, in mapply
    (Object: manage_convertBTrees)
  File /fs1root/zope/Zope-2.3.3b1-src/lib/python/ZPublisher/Publish.py, line 112, in call_object
    (Object: manage_convertBTrees)
  File /fs1root/zope/Zope-2.3.3b1-src/lib/python/Products/ZCatalog/ZCatalog.py, line 736, in manage_convertBTrees
    (Object: Traversable)
  File /fs1root/zope/Zope-2.3.3b1-src/lib/python/Products/ZCatalog/Catalog.py, line 204, in _convertBTrees
  File /fs1root/zope/Zope-2.3.3b1-src/lib/python/SearchIndex/UnTextIndex.py, line 211, in _convertBTrees
TypeError: (see above)

----- Original Message ----- From: "Chris Withers" <chrisw@nipltd.com> To: <tdickenson@geminidataloggers.com> Cc: "Giovanni Maruzzelli" <maruzz@open4.it>; "Chris McDonough" <chrism@digicool.com>; <a.deuring@satzbau-gmbh.de>; <zope-dev@zope.org>; <erik@thingamy.net>; <barry@digicool.com>; <tsarna@endicor.com> Sent: Tuesday, June 26, 2001 5:59 PM Subject: Re: [Zope-dev] Re: Zcatalog bloat problem (berkeleydb is a solution?)
Toby Dickenson wrote:
INDEXES: PrincipiaSearchSource Text Index 2,524 autore Keyword Index 4,055 bflow0 Field Index 4,055 bflow1 Field Index 4,055 bflow2 Field Index 4,055
Aha! a clue.
If that is the output of the 'Indexes' tab, then I don't think you are using the newest ZCatalog. A recent release (I'm not sure which, 2.3.2?) has a new BTree implementation that reduces bloat by modifying fewer buckets (it also doesn't have the column showing index size)
Has the person concerned run the catalog update tool when they upgraded their Zope version?
cheers,
Chris
Giovanni Maruzzelli wrote:
The catalog is a pristine 2.3.3b1 catalog.
I'm sure that'll need upgrading then...
We have recreated the catalog from scratch because we tried manage_convertBTrees, but it doesn't work for us; it returns an error (and the same happens with 2.3.3 final):
Error Type: TypeError
Error Value: second argument must be a class
Weird... from your earlier posting it looked like you _had_ successfully upgraded and updated (BTrees.IOBTree in your traceback rather than IOBTree.IOBTree) cheers, Chris
The Zope version we use contains the new btree catalog by default. So, when we recreated the catalog from scratch, it was created as a btree catalog. The traces that you saw come from the new catalog (the btree one). -giovanni ----- Original Message ----- From: "Chris Withers" <chrisw@nipltd.com> To: "Giovanni Maruzzelli" <maruzz@open4.it> Cc: <tdickenson@geminidataloggers.com>; "Chris McDonough" <chrism@digicool.com>; <a.deuring@satzbau-gmbh.de>; <zope-dev@zope.org>; <erik@thingamy.net>; <barry@digicool.com>; <tsarna@endicor.com> Sent: Thursday, June 28, 2001 6:27 PM Subject: Re: [Zope-dev] Re: Zcatalog bloat problem (berkeleydb is a solution?)
Giovanni Maruzzelli wrote:
The catalog is a pristine 2.3.3b1 catalog.
I'm sure that'll need upgrading then...
We have recreated the catalog from scratch because we tried manage_convertBTrees, but it doesn't work for us; it returns an error (and the same happens with 2.3.3 final):
Error Type: TypeError
Error Value: second argument must be a class
Weird... from your earlier posting it looked like you _had_ successfully upgraded and updated (BTrees.IOBTree in your traceback rather than IOBTree.IOBTree)
cheers,
Chris
I'm sorry to say that Toby is right in pointing at the version from which I cut and pasted the following, but we are also using a newer version and the problem is the same. We're working our way with the "dump the first bytes of the raw dump" feature of the new, magnificent tranalyzer from Toby (it really ought to be a standard tool in the Zope distro), and we now have some hints of what happens when you catalog something. So, we are starting to optimize indexes and metadata, but the problem does not seem to fade away. -giovanni ----- Original Message ----- From: "Toby Dickenson" <tdickenson@devmail.geminidataloggers.co.uk> To: "Giovanni Maruzzelli" <maruzz@open4.it> Cc: "Chris McDonough" <chrism@digicool.com>; <a.deuring@satzbau-gmbh.de>; <zope-dev@zope.org>; <erik@thingamy.net>; <barry@digicool.com>; <tdickenson@geminidataloggers.com>; <tsarna@endicor.com> Sent: Tuesday, June 26, 2001 5:49 PM Subject: Re: [Zope-dev] Re: Zcatalog bloat problem (berkeleydb is a solution?)
INDEXES: PrincipiaSearchSource Text Index 2,524 autore Keyword Index 4,055 bflow0 Field Index 4,055 bflow1 Field Index 4,055 bflow2 Field Index 4,055
Aha! a clue.
If that is the output of the 'Indexes' tab, then I don't think you are using the newest ZCatalog. A recent release (I'm not sure which, 2.3.2?) has a new BTree implementation that reduces bloat by modifying fewer buckets (it also doesn't have the column showing index size)
Toby Dickenson tdickenson@geminidataloggers.com
Giovanni, which Zope version are you running? On Tue, 26 Jun 2001, Chris McDonough wrote:
How many indexes do you have, what are the index types, and what do they index? Likewise, what about metadata? In your last message, you said there's about 20. That's a heck of a lot of indexes. Do you need them all?
In my installation I have about 30 or 40 Position(Text)Index/KeywordIndex/FieldIndex. They don't bloat much, so I don't think that's the problem. (The problem might be that we have different views on what bloating is, though :)
I use 2.3.3 with Python 1.5.2 on FreeBSD 3. I'm not so picky about bloating, but adding a document of 1K adds some 400K, and it keeps growing. How much does it eat for you (I know you cataloged some 50K documents)? -giovanni ----- Original Message ----- Sent: Tuesday, June 26, 2001 1:48 PM Subject: Re: [Zope-dev] Re: Zcatalog bloat problem (berkeleydb is a solution?)
Giovanni, which Zope version are you running?
On Tue, 26 Jun 2001, Chris McDonough wrote:
How many indexes do you have, what are the index types, and what do they index? Likewise, what about metadata? In your last message, you said there's about 20. That's a heck of a lot of indexes. Do you need them all?
In my installation I have about 30 or 40 Position(Text)Index/KeywordIndex/FieldIndex. They don't bloat much, so I don't think that's the problem. (The problem might be that we have different views on what bloating is, though :)
On Tue, 26 Jun 2001, Giovanni Maruzzelli wrote:
I'm not so picky about bloating, but adding a document of 1K adds some 400K, and it keeps growing. How much does it eat for you (I know you cataloged some 50K documents)?
I can't remember, but surely not that much. I had some 30.000 documents that were about 30-60Kb on average (although some were several megabytes), in addition to around 50.000 other objects (documents, if you like) indexed. My Data.fs would've been around 2.5GB if my memory serves me correctly. As I said, I had loads of Indexes too.
On Tue, 26 Jun 2001 06:45:54 -0400, Chris McDonough <chrism@digicool.com> wrote:
I can see a potential reason for the problem you explain as "and I remind you that as the folder get populated, the size that is added to each transaction grows, a folder with one hundred objects adds some 100K"... It's true that "normal" folders (most ObjectManager-derived containers actually) cause database bloat within undoing storages when an object is added or removed from it.
What Chris describes would be a prudent change anyway; however, I don't think it is the root of this problem. The tranalyzer output shows the following line for the Folder. At a length of 363, I guess it is pretty empty. Even if this object grows to 100k (when adding the 100th item), it is not the single biggest cause of bloat in the total transaction size. (Incidentally, it *was* the cause of the bloat problems that led me to develop this patched tranalyzer.)
OID: 40817 len 363 [OFS.Folder.Folder]
The following entries I do find interesting. They are all somewhat larger than I remember seeing before. Are you indexing *large* properties (or storing large metadata values)? It may be interesting to see the raw pickle data for these large objects... my patched tranalyzer can do that too.
OID: 37ac4 len 25537 [BTrees.IIBTree.IISet]
OID: 37aca len 13947 [BTrees.IIBTree.IISet]
OID: 388ff len 24707 [BTrees.IIBTree.IISet]
OID: 37ab8 len 39336 [BTrees.IOBTree.IOBTree]
OID: 3c610 len 33864 [BTrees.IOBTree.IOBucket]
Toby Dickenson tdickenson@geminidataloggers.com
Hi Giovanni, Chris and all others, Chris McDonough wrote:
Hi Giovanni,
How many indexes do you have, what are the index types, and what do they index? Likewise, what about metadata? In your last message, you said there's about 20. That's a heck of a lot of indexes. Do you need them all?
I can see a potential reason for the problem you explain as "and I remind you that as the folder get populated, the size that is added to each transaction grows, a folder with one hundred objects adds some 100K"... It's true that "normal" folders (most ObjectManager-derived containers actually) cause database bloat within undoing storages when an object is added or removed from it. This is because it keeps a list of contained subobject names in an "_objects" attribute, which is a tuple. When an object is added, the tuple is rewritten in entirety. So for instance, if you've got 100 items in your folder, and you add one more, you rewrite all the instance data for the folder itself, which includes the (large) _objects tuple (and of course, any other raw attributes, like properties). Over time, this can be problematic.
Shane's BTreeFolder Product attempts to ameliorate this problem a bit by keeping the data that is normally stored in the _objects tuple in its own persistent object (a btree).
Are you breaking the content up into subfolders? This is recommended.
I'm temped to postulate that perhaps your problem isn't as much ZCatalog as it is ObjectManager overhead.
Well, I'm not very familiar with the details of the sub-object management of ObjectManager and friends. Moreover, I have as yet had a closer look only into UnTextIndex, not into UnIndex or UnKeywordIndex. So take my comments with a grain of salt.

A text index (class SearchIndex.UnTextIndex) is definitely a cause of bloating if you use CatalogAware objects. An UnTextIndex maintains for each word a list of the documents where this word appears. So, if a document to be indexed contains, say, 100 words, 100 IIBTrees (containing mappings documentId -> word score) will be updated (see UnTextIndex.insertForwardIndexEntry). If you have a larger number of documents, these mappings may be quite large: assume 10,000 documents, and assume that you have 10 words which appear in 30% of all documents. Hence, each of the IIBTrees for these words contains 3000 entries. (OK, one can try to keep this number of frequent words low by using a "good" stop word list, but at least for German, such a list is quite difficult to build. And one can argue that many "not really too frequent" words should be indexed in order to allow more precise phrase searches.)

I don't know the details of how data is stored inside the BTrees, so I can give only a rough estimate of the memory requirements: with 32-bit integers, we have at least 8 bytes per IIBTree entry (documentId and score), so each of the 10 BTrees for the "frequent words" has a minimum length of 3000*8 = 24000 bytes. If you now add a new document containing 5 of these frequent words, 5 larger BTrees will be updated. [Chris, let me know if I'm now going to tell nonsense...] I assume that the entire updated BTrees = 120000 bytes will be appended to the ZODB (ignoring the less frequent words) -- even if the document contains only 1 kB of text.

This is the reason why I'm working on some kind of "lazy cataloging". My approach is to use a Python class (or Base class, if ZClasses are involved) which has a method manage_afterAdd. This method looks for superValues of a type like "lazyCatalog" (derived from ZCatalog), and inserts self.getPhysicalPath() into the update list of each found "lazyCatalog". Later, a "lazyCatalog" can index all objects in this list. Then the bloating happens either in RAM (without subtransactions), or in a temporary file, if you use subtransactions.

OK, another approach which fits better to your (Giovanni) needs might be to use another database than ZODB, but I'm afraid that even then "instant indexing" will be an expensive process if you have a large number of documents.

Abel
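Abel's estimate can be checked with a little arithmetic. Every figure below is an assumption taken from his message, not a measurement:

```python
# Back-of-the-envelope check of the estimate above; all figures are
# Abel's assumptions from the text, not measurements.
BYTES_PER_ENTRY = 8                     # 32-bit documentId + 32-bit score
entries_per_word = 10000 * 30 // 100    # a word appearing in 30% of 10,000 docs
tree_size = entries_per_word * BYTES_PER_ENTRY
print(tree_size)        # 24000 bytes per frequent-word mapping

# If adding one 1 kB document touches 5 such words and each whole mapping
# were rewritten, the transaction would append about:
print(5 * tree_size)    # 120000 bytes
```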
abel deuring wrote:
A text index (class SearchIndex.UnTextIndex) is definitely a cause of bloating if you use CatalogAware objects. An UnTextIndex maintains for
Right.. if you don't use CatalogAware, however, and don't unindex before reindexing an object, you should see a huge bloat savings, because the only things which are supposed to be updated then are indexes and metadata which have data that has changed.
each word a list of the documents where this word appears. So, if a document to be indexed contains, say, 100 words, 100 IIBTrees (containing mappings documentId -> word score) will be updated (see UnTextIndex.insertForwardIndexEntry). If you have a larger number of documents, these mappings may be quite large: assume 10,000 documents, and assume that you have 10 words which appear in 30% of all documents. Hence, each of the IIBTrees for these words contains 3000 entries. (OK, one can try to keep this number of frequent words low by using a "good" stop word list, but at least for German, such a list is quite difficult to build. And one can argue that many "not really too frequent" words should be indexed in order to allow more precise phrase searches.) I don't know the details of how data is stored inside the BTrees, so I can give only a rough estimate of the memory requirements: with 32-bit integers, we have at least 8 bytes per IIBTree entry (documentId and score), so each of the 10 BTrees for the "frequent words" has a minimum length of 3000*8 = 24000 bytes.
If you now add a new document containing 5 of these frequent words, 5 larger BTrees will be updated. [Chris, let me know if I'm now going to tell nonsense...] I assume that the entire updated BTrees = 120000 bytes will be appended to the ZODB (ignoring the less frequent words) -- even if the document contains only 1 kB of text.
Nah... I don't think so. At least I hope not! Each bucket in a BTree is a separate persistent object. So only the sum of the data in the updated buckets will be appended to the ZODB. So if you add an item to a BTree, you don't add 24000+ bytes for each update. You just add the amount of space taken up by the bucket... unfortunately I don't know exactly how much this is, but I'd imagine it's pretty close to the datasize with only a little overhead.
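The difference Chris describes can be put into rough numbers. The bucket size below is an assumed figure purely for illustration, not the real BTrees bucket capacity:

```python
# Why per-bucket storage bloats less: only the bucket that receives the
# new entry is re-stored, not the whole 3000-entry mapping. The bucket
# size here is an assumed illustrative figure, not the real capacity.
BYTES_PER_ENTRY = 8
ENTRIES_PER_BUCKET = 60     # assumption for illustration only

whole_mapping = 3000 * BYTES_PER_ENTRY            # pessimistic estimate above
one_bucket = ENTRIES_PER_BUCKET * BYTES_PER_ENTRY # what a bucketed tree re-stores
print(whole_mapping, one_bucket)    # 24000 vs 480 bytes appended
```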
This is the reason why I'm working on some kind of "lazy cataloging". My approach is to use a Python class (or Base class, if ZClasses are involved) which has a method manage_afterAdd. This method looks for superValues of a type like "lazyCatalog" (derived from ZCatalog), and inserts self.getPhysicalPath() into the update list of each found "lazyCatalog".
Later, a "lazyCatalog" can index all objects in this list. Then the bloating happens either in RAM (without subtransactions), or in a temporary file, if you use subtransactions.
OK, another approach which fits better to your (Giovanni) needs might be to use another database than ZODB, but I'm afraid that even then "instant indexing" will be an expensive process if you have a large number of documents.
Another option is to use a session manager, and update the catalog at session-end. - C
Chris McDonough wrote:
abel deuring wrote:
A text index (class SearchIndex.UnTextIndex) is definitely a cause of bloating if you use CatalogAware objects. An UnTextIndex maintains for
Right.. if you don't use CatalogAware, however, and don't unindex before reindexing an object, you should see a huge bloat savings, because the only things which are supposed to be updated then are indexes and metadata which have data that has changed.
[snip] What, if any, disadvantages are there to not calling unindex_object first? If there aren't any good ones, I think I'll be rewriting some of my own "CatalogAware" code... -- | Casey Duncan | Kaivo, Inc. | cduncan@kaivo.com `------------------>
Off the top of my head, I don't think there are any. But this is why I haven't fixed it yet, because I'd need to think about it past "off the top of my head". ;-) - C Casey Duncan wrote:
What, if any, disadvantages are there to not calling unindex_object first? If there aren't any good ones, I think I'll be rewriting some of my own "CatalogAware" code... -- | Casey Duncan | Kaivo, Inc. | cduncan@kaivo.com `------------------>
_______________________________________________ Zope maillist - Zope@zope.org http://lists.zope.org/mailman/listinfo/zope ** No cross posts or HTML encoding! ** (Related lists - http://lists.zope.org/mailman/listinfo/zope-announce http://lists.zope.org/mailman/listinfo/zope-dev )
Chris: I am working on getting a decent query language for ZCatalog/Catalog and I have been able to make good progress, however I am running into a bit of an issue that I thought you might know something about: In order to implement a "!=" query operator, I am trying to do the following:
1. From the index, return the result set that matches the value (easy)
2. Subtract that from the set of all items in the index (not so easy)
I see that there is the difference method available from IIBTree; however, I seem to be unable to use it on the entire index (which is an OOBTree and not really a set, I guess). Here is a snippet of my code which doesn't work:

if op == '!=' or op[:3] == 'not':
    w, rs = difference(index._index, rs) # XXX Not a warm fuzzy...

(where rs is the index result set that matches the value and index is the Catalog index OOBTree) What can I supply for the first argument to get a set of all items in the index, or is there an easier and better approach to this whole issue? BTW: I realize I could step through _index.items() and create an IISet, but that seems awfully inefficient... Thanks in advance for any ideas you might have... -- | Casey Duncan | Kaivo, Inc. | cduncan@kaivo.com `------------------>
Chris:
I am working on getting a decent query language for ZCatalog/Catalog and
Very cool...
I have been able to make good progress, however I am running into a bit of an issue that I thought you might know something about:
In order to implement a "!=" query operator, I am trying to do the following:
Tricky.
1. From the index, return the result set that matches the value (easy)
2. Subtract that from the set of all items in the index (not so easy)
I see that there is the difference method available from IIBTree; however, I seem to be unable to use it on the entire index (which is an OOBTree and not really a set, I guess). Here is a snippet of my code which doesn't work:
if op == '!=' or op[:3] == 'not':
    w, rs = difference(index._index, rs) # XXX Not a warm fuzzy...
(where rs is the index result set that matches the value and index is the Catalog index OOBTree)
What can I supply for the first argument to get a set of all items in the index, or is there any easier and better approach to this whole issue?
Well... I assume that _index is the forward data structure of a FieldIndex. In that case, you could get the info you want (a list of all document ids in the index) from _unindex.keys(), as _index and _unindex are mirror images of each other that need to be kept in sync... I think what comes back is a BTreeItems object. I think this is usable in conjunction with the result set IISet (also a list of document ids) via the difference function... I haven't tried it, though...
BTW: I realize I could step through _index.items() and create an IISet, but that seems awfully inefficient...
Yeah, that'd be terrible. This is a tricky operator. I can't really wrap my head around using it in conjunction with parens. Then again, maybe you wouldn't... HTH, - C
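The "!=" strategy discussed above can be sketched with plain Python sets standing in for IISets (the real code would build an IISet from _unindex.keys() and call BTrees.IIBTree.difference on it):

```python
# Sketch of the "!=" operator: everything in the index minus everything
# that matched. Plain sets stand in for IISets here; this is an
# illustration, not the actual Catalog code.

def not_equal(all_doc_ids, matching_doc_ids):
    # all_doc_ids: every document id present in the index (_unindex.keys())
    # matching_doc_ids: the result set whose value matched the query
    return set(all_doc_ids) - set(matching_doc_ids)

all_rids = [1, 2, 3, 4, 5]   # one entry per *indexed* document
rs = [2, 4]                  # documents where value == 'foo'
print(sorted(not_equal(all_rids, rs)))   # [1, 3, 5]
```

Note that documents with an empty value never make it into the index at all, so they are absent from all_rids as well.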
Chris McDonough wrote:
Chris:
I am working on getting a decent query language for ZCatalog/Catalog and
Very cool...
I have been able to make good progress, however I am running into a bit of an issue that I thought you might know something about:
In order to implement a "!=" query operator, I am trying to do the following:
Tricky.
OK, I was able to get it to work by instantiating an IISet around _unindex.keys() and passing that to difference (thanks!). However, I notice an interesting side effect. Let's say you have a TextIndex on title and you do the following query: title != 'foo*'. Which to me means: "all cataloged objects whose titles do not match the substring 'foo*'". However, this is not exactly what you get; instead you get: "all cataloged objects that have a non-empty title that does not match the substring 'foo*'". Because from what I am seeing, objects with empty (or no) titles are not included in the index *at all*. So the set of "all objects" does not include ones without titles. I could fix this by making "all objects" mean "all objects in the catalog" (via catalog.data.keys()) instead of "all objects in the index", but I wanted to see if anyone had additional thoughts about this. -- | Casey Duncan | Kaivo, Inc. | cduncan@kaivo.com `------------------>
On Tue, 26 Jun 2001, Casey Duncan wrote:
OK, I was able to get it to work by instantiating an IISet around _unindex.keys() and passing that to difference (thanks!). However, I notice an interesting side effect. Let's say you have a TextIndex on title and you do the following query:
title != 'foo*'
Which to me means: "all cataloged objects whose title do not match the substring 'foo*'"
However, this is not what you get exactly, instead you get:
"all cataloged objects that have a non-empty title that does not match the substring 'foo*'"
Because from what I am seeing, objects with empty (or no) titles are not included in the index *at all*. So the set of "all objects" does not include ones without titles. I could fix this by making all objects be instead "All objects in the catalog" (via catalog.data.keys()) instead of "all objects in the index", but I wanted to see if anyone had additional thoughts about this.
Hmm, the reason for the current behavior was optimization: saving space by not indexing empty values. The problem with your latter approach is that "all objects in the catalog" may include objects that don't have a title attribute at all. I'm not against indexing empty values, though. -Michel
Hi Casey, changes were recently made to Field/Keyword Indexes so that they will store empty items. An equivalent change could be made to TextIndexes... we'd need to think about that a bit. But for your purposes, you might want to start out attempting to write your operator implementation using Field and Keyword indexes... - C
_______________________________________________ Zope-Dev maillist - Zope-Dev@zope.org http://lists.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://lists.zope.org/mailman/listinfo/zope-announce http://lists.zope.org/mailman/listinfo/zope )
Chris McDonough wrote:
Hi casey,
Changes were recently made to Field/Keyword Indexes so that they will store empty items. An equivalent change could be made to TextIndexes... we'd need to think about that a bit.
But for your purposes, you might want to start out attempting to write your operator implementation using Field and Keyword indexes...
- C
Michel Pelletier wrote:
Hmm, the reason for the current behavior was an optimization: saving space by not indexing empty values. The problem with your latter approach is that "all objects in the catalog" may include objects that don't have a title attribute at all.
I'm not against indexing empty values though.
-Michel
My implementation does not modify the behavior of the indexes in any way, and I would like to keep it that way if possible. I have been able (thus far) to pull this off without compromises, which was my hope from the beginning.

I guess the question here is: given the query spam != 'eggs', should objects be returned that do not have an attribute "spam" at all? For the behavior to be intuitive, I would say yes, but that is just my opinion.

I also thought of an optimization that could eventually be included if this behavior is adopted. For example, take the following query expression:

title == 'foo' and spam != 'eggs'

As implemented, my query engine does the following:

1. Find items where title matches 'foo' (exact behavior depends on index type)
2. Find items where spam matches 'eggs'
3. Take the difference of all items in the spam index and the result of #2
4. Return the intersection of #3 and #1

To be "intuitive" (I use that term loosely) I think it should be:

1. Find items where title matches 'foo'
2. Find items where spam matches 'eggs'
3. Take the difference of all items in the catalog and the result of #2
4. Return the intersection of #3 and #1

which can be optimized as:

1. Find items where title matches 'foo'
2. Find items where spam matches 'eggs'
3. Return the difference of #1 and #2

If an "or" is used in place of the "and", the optimization doesn't apply, though.

One other thing: I noticed (with a colleague) that passing a list of values to a FieldIndex and a TextIndex results in nearly opposite behavior. The FieldIndex does a union of the results of querying against each item in the list, whereas the TextIndex does an intersection. This seemed highly inconsistent to me; another thread, perhaps...

-- Casey Duncan, Kaivo, Inc. <cduncan@kaivo.com>
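The equivalence behind the optimization can be checked with plain sets standing in for index result sets (record ids below are hypothetical):

```python
all_ids   = set([1, 2, 3, 4, 5])   # every record in the catalog
title_foo = set([1, 2, 4])         # rids where title matches 'foo'
spam_eggs = set([2, 3])            # rids where spam matches 'eggs'

# "Intuitive" evaluation: difference against the whole catalog,
# then intersect with the title result.
step3 = all_ids - spam_eggs        # rids where spam != 'eggs'
intuitive = title_foo & step3

# Optimized evaluation: one plain difference, no "all items" set needed.
optimized = title_foo - spam_eggs

# They agree because A & (U - B) == A - B whenever A is a subset of U.
assert intuitive == optimized
```

With "or" the shortcut fails because title_foo | (all_ids - spam_eggs) genuinely depends on the full catalog set, which is why the optimization only applies to the "and" form.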
On Tue, 26 Jun 2001 15:42:40 -0700 (PDT), Michel Pelletier <michel@digicool.com> wrote:
Hmm, the reason for the current behavior was an optimization: saving space by not indexing empty values.
I was always very pleased with that characteristic, but I had not realised it was a design goal. I thought I observed that that characteristic had changed in a recent Zope release... hmmm, I'll take a look. Toby Dickenson tdickenson@geminidataloggers.com
I think it has changed for FieldIndexes. You can now make the distinction between "doesn't have that attribute" and "attribute is one of [None, '', [], ()]" within a FieldIndex. You do this in an almost natural way, the major exception being that you need to wrap a blank string ('') in a sequence in the query (e.g. title=['']) due to hysterical behavior. I'm not sure about TextIndexes.
On Tue, 26 Jun 2001 09:31:02 -0400, Chris McDonough <chrism@digicool.com> wrote:
Right.. if you don't use CatalogAware, however, and don't unindex before reindexing an object, you should see a huge bloat savings, because the only things which are supposed to be updated then are indexes and metadata which have data that has changed.
CatalogAware has been blamed for a lot of problems. The three weaknesses I am aware of are:

a. Unindexing before reindexing causes bloat by defeating the catalog's change-detection tricks.
b. It uses URLs, not paths, and so doesn't play right with virtual hosting.
c. It uses the same hooks as ObjectManager to detect that it has been added to or removed from a containing ObjectManager, and therefore the two can't easily be mixed together as base classes.

All of these are fixable, and I feel a patch coming on. Are there some deeper problems I am not aware of?

Toby Dickenson tdickenson@geminidataloggers.com
I actually think this about sums it up. If you have time to look at it, Toby, it would be much appreciated. I don't think it's a very complicated set of fixes; it's just not on the radar at the moment, and might require some thought about backwards compatibility. - C
_______________________________________________ Zope maillist - Zope@zope.org http://lists.zope.org/mailman/listinfo/zope ** No cross posts or HTML encoding! ** (Related lists - http://lists.zope.org/mailman/listinfo/zope-announce http://lists.zope.org/mailman/listinfo/zope-dev )
Chris McDonough <chrism@digicool.com> wrote:
I actually think this about sums it up. If you have time to look at it Toby, it would be much appreciated. I don't think it's a very complicated set of fixes, its just not on the radar at the moment, and might require some thought about backwards-compatibility.
Not a patch, but I've fixed all three known CatalogAware problems in a separate product: a new base class that derives from CatalogAware: http://www.zope.org/Members/htrd/BetterCatalogAware/ The techniques used in this product have been thoroughly stressed in several other production systems, but this is the first time they have been collected together in one place, so bugs are possible.
That makes CatalogAware much saner and will produce less bloat. Actually, maybe I should just go make that change in the trunk and the 2.4 branch, although I'm a little afraid of what (if anything) it will break for everybody. To be honest, I really don't have much time to spend thinking about this, and my fears are probably just FUD.
I'm not sure how many people are using CatalogAware; I think many serious users have been scared off by the problem reports in the list archives. IMO fixing this may be worth a little breakage. Toby Dickenson tdickenson@geminidataloggers.com
Excellent, thanks so much Toby. Maybe some feedback will come in... - C
Subject: [Zope] CatalogAware
CatalogAware has been blamed for alot of problems. Its three weaknesses I am aware of are: <snip>
b. It uses URLs not paths, and so doesnt play right with virtual hosting
I ran into this problem using VHMonster with my EventFolder product and found a work-around, for anyone who might be struggling with this. See http://www.netkook.com/Members/jeff/ef/faq/document_view#vhost for an article that discusses how to use _vh_ with VHM. (Boy, does that sound cryptic...) Jeff Sasmor jeff@sasmor.com www.netkook.com
We think that Abel is absolutely right: if, in the same almost empty folder, we add and catalog an object containing one word (we have now optimized and reduced the number of indexes to 11), it makes a transaction of 73K; if the object contains 300 words, with the same other indexes and properties, the transaction is 224K; and if all is the same but the object contains 535 words, the transaction is 331K. And we are now using a catalog with only some 3000 documents indexed, with an average length of around 1K per document. -giovanni
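Giovanni's three data points imply a rough per-word cost, which can be estimated with a little arithmetic (a hypothetical linear model: transaction size grows with the number of indexed words):

```python
# Transaction sizes reported for documents of 1, 300 and 535 words.
words = [1, 300, 535]
sizes = [73000, 224000, 331000]   # bytes per transaction

rate_low  = (sizes[1] - sizes[0]) / float(words[1] - words[0])  # ~505 B/word
rate_high = (sizes[2] - sizes[1]) / float(words[2] - words[1])  # ~455 B/word
```

So each indexed word costs on the order of 450-500 bytes of transaction data, on top of a roughly 73K fixed cost from the other indexes, which is consistent with Abel's per-word BTree-update explanation below.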
Well, I'm not very familiar with the details about the sub-object management of ObjectManager and friends. Moreover, I had yet a closer look only into UnTextIndex, but not into UnIndex or UnKeywordIndex. So take my comments with a grain of salt.
A text index (class SearchIndex.UnTextIndex) is definitely a cause of bloating if you use CatalogAware objects. An UnTextIndex maintains, for each word, a list of documents where this word appears. So if a document to be indexed contains, say, 100 words, 100 IIBTrees (containing mappings documentId -> word score) will be updated (see UnTextIndex.insertForwardIndexEntry).

If you have a larger number of documents, these mappings may be quite large. Assume 10,000 documents, and assume that you have 10 words which appear in 30% of all documents. Then each of the IIBTrees for these words contains 3000 entries. (Ok, one can try to keep the number of frequent words low by using a "good" stop word list, but at least for German such a list is quite difficult to build. And one can argue that many "not too frequent" words should be indexed in order to allow more precise phrase searches.)

I don't know the details of how data is stored inside the BTrees, so I can give only a rough estimate of the memory requirements: with 32-bit integers, we have at least 8 bytes per IIBTree entry (documentId and score), so each of the 10 BTrees for the "frequent words" has a minimum length of 3000*8 = 24000 bytes.
If you now add a new document containing 5 of these frequent words, 5 larger BTrees will be updated. [Chris, let me know, if I'm now going to tell nonsense...] I assume that the entire updated BTrees = 120000 bytes will be appended to the ZODB (ignoring the less frequent words) -- even if the document contains only 1 kB text.
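Abel's back-of-the-envelope numbers, spelled out (note that the "whole trees rewritten" assumption is the pessimistic one that Chris questions further down; only the changed buckets should actually be written):

```python
docs        = 10000
hit_rate    = 0.30    # each "frequent" word appears in 30% of documents
entry_bytes = 8       # documentId + score, two 32-bit integers

entries_per_tree = int(docs * hit_rate)            # 3000 entries
tree_bytes       = entries_per_tree * entry_bytes  # 24000 bytes minimum

# Worst case for a new document containing 5 frequent words,
# if each whole IIBTree were appended to the ZODB:
worst_case_write = 5 * tree_bytes                  # 120000 bytes
```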
This is the reason why I'm working on some kind of "lazy cataloging". My approach is to use a Python class (or base class, if ZClasses are involved) which has a method manage_afterAdd. This method looks for superValues of a type like "lazyCatalog" (derived from ZCatalog) and inserts self.getPhysicalPath() into the update list of each "lazyCatalog" found.
Later, a "lazyCatalog" can index all objects in this list. The bloating then happens either in RAM (without subtransactions) or in a temporary file, if you use subtransactions.
OK, another approach which fits your (Giovanni's) needs better might be to use a database other than the ZODB, but I'm afraid that even then "instant indexing" will be an expensive process if you have a large number of documents.
Abel
Hi all, Giovanni Maruzzelli wrote:
We think that Abel is absolutely right: if, in the same almost empty folder, we add and catalog an object containing one word (we have now optimized and reduced the number of indexes to 11), it makes a transaction of 73K; if the object contains 300 words, with the same other indexes and properties, the transaction is 224K; and if all is the same but the object contains 535 words, the transaction is 331K.

And we are now using a catalog with only some 3000 documents indexed, with an average length of around 1K per document.
Well, Chris certainly knows more about the internals of ZCatalog than I do, so we should not ignore his comments to my mail :) Chris McDonough wrote:
If you now add a new document containing 5 of these frequent words, 5 larger BTrees will be updated. [Chris, let me know, if I'm now going to tell nonsense...] I assume that the entire updated BTrees = 120000 bytes will be appended to the ZODB (ignoring the less frequent words) -- even if the document contains only 1 kB text.
Nah... I don't think so. At least I hope not! Each bucket in a BTree is a separate persistent object. So only the sum of the data in the updated buckets will be appended to the ZODB. So if you add an item to a BTree, you don't add 24000+ bytes for each update. You just add the amount of space taken up by the bucket... unfortunately I don't know exactly how much this is, but I'd imagine it's pretty close to the datasize with only a little overhead.
OK, this made me curious, so I made a test similar to Giovanni's. I started with a ZCatalog containing 21616 records; the catalog contains only one text index: no keyword index, no field index. I copied one of the indexed documents; the text is 2645 bytes long, and wc tells me it has 313 words. Next, I packed the database in order to have a clean starting point. After packing, Data.fs has a size of 233661963 bytes. Then I cataloged the new object using my "lazy catalog". Since I have only one new document, this is basically the same as using CatalogAwareness. After indexing, the database had grown to 233851090 bytes, an increase of 189127 bytes. Then I packed the database again, resulting in a size of 233666237 bytes. So the "net increase" is indeed 233666237-233661963 = 4274 bytes, as you expected, but obviously a few more database records need to be updated. Abel
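Recomputing Abel's figures makes the gap explicit: the transient transaction bloat is about 44 times the data that actually survives a pack.

```python
packed_before = 233661963   # Data.fs after the initial pack
after_update  = 233851090   # after cataloging one ~2.6K document
packed_after  = 233666237   # after packing again

transaction_growth = after_update - packed_before   # 189127 bytes
net_growth         = packed_after - packed_before   # 4274 bytes

ratio = transaction_growth / float(net_growth)      # ~44x
```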
Yikes. I wonder if this overhead comes from Vocabulary updates... thanks very much for doing this test. Clearly we need to pin it down. This is very disappointing. :-( Any further info you dig up is appreciated. You didn't have any metadata stuff set up, did you? I imagine even if you did, it couldn't possibly account for 200K worth of extra stuff. - C
Chris McDonough wrote:
Yikes. I wonder if this overhead comes from Vocabulary updates... thanks very much for doing this test.
No, this should definitely _not_ be related to the vocabulary: I simply copied an already-indexed document and let ZCatalog.catalog_object munge the copy, so all words appearing in this copy already have an entry in the Vocabulary. I also checked it during a test without metadata: the vocabulary does not increase.
Clearly we need to pin it down. This is very disappointing. :-( Any further info you dig up is appreciated.
Well, I don't have any at present. But allow me to make a guess :) If a new record is added to a BTree, it can be necessary to move a few other records around in order to keep the tree balanced. And some of the BTrees affected by my test are definitely somewhat larger, because I did not use German stop words during the test, so words like "und", "der", "die" are indexed, which appear in _every_ document (well, at least in _nearly_ every document).
You didn't have any metadata stuff set up, did you? I imagine even if you did, that they couldn't possibly account for 200K worth of extra stuff.
Ouch, I forgot about the metadata. So here is the result of another test, with all metadata thrown away:

Packed database size, one document (same as during the last test) to be cataloged: 229170221 bytes.
Database size after the catalog update run: 229310316 bytes.
Size after packing: 229172566 bytes.

So, same as before :( Abel
Chris McDonough wrote:
Yikes. I wonder if this overhead comes from Vocabulary updates... thanks very much for doing this test.
No, this should definitely _not_ be related to the vocabulary: I simply copied an already-indexed document and let ZCatalog.catalog_object munge the copy, so all words appearing in this copy already have an entry in the Vocabulary. I also checked it during a test without metadata: the vocabulary does not increase.
OK, that's good to know...
Clearly we need to pin it down. This is very disappointing. :-( Any further info you dig up is appreciated.
Well, I don't have any at present. But allow me to make a guess :) If a new record is added to a BTree, it can be necessary to move a few other records around in order to keep the tree balanced. And some of the BTrees affected by my test are definitely somewhat larger, because I did not use German stop words during the test, so words like "und", "der", "die" are indexed, which appear in _every_ document (well, at least in _nearly_ every document).
You didn't have any metadata stuff set up, did you? I imagine even if you did, that they couldn't possibly account for 200K worth of extra stuff.
Ouch, I forgot about the metadata. So here is the result of another test, with all metadata thrown away:
Packed data base size, one document (same during the last test) to be cataloged: 229170221 bytes.
data base size after updating the catalog run: 229310316 bytes size after packing: 229172566 bytes
So, same as before :(
Well, I'm sort of stumped without doing it myself, and I can't at the moment. I'm going to add this to the Collector so I don't forget, and hopefully it will be looked into and fixed by the time that 2.4.0 goes out. Thanks so much, - C
(I removed <zope@zope.org>.) On Mon, 25 Jun 2001, Giovanni Maruzzelli wrote:
Any hints on how to manage something like? We use both textindexes, fieldindexes, and keywordsindexes (textindex on string properties, fieldindexes on boolean and datetime, keywordindex on strings). Maybe one kind of indexes is to be avoided?
Erik, any toughts?
Well, after ChrisM told me about the behaviour of CatalogAwareness and I removed it from my classes, my ZCatalog bloat has evaporated. I really can't see any major bloat problem in either memory consumption or disk space. (That was with Zope 2.3.2b2.) Which is very good for me, but doesn't necessarily help you much. :)
Part of the problem here is that if, in particular, you use the reindex_object method of CatalogAware, the database will grow unnecessarily even if the object hasn't changed. CatalogAware is arguably broken and should really not be used. I'd like to have the time to fix it, but fixing it implies taking time out that I don't have at the moment to test the changes, and *may* imply breaking it in other, slightly backwards-incompatible ways that will cause people to scream. For instance, unfortunately, CatalogAware also uses the *url* of the object to index it, which is contrary to the way that newer Catalogs work (they index the physical path of the object).

In the meantime, if you care at all about cataloging, do not use CatalogAware. Instead, manage the recataloging yourself, and don't uncatalog a changed object before recataloging it during this manual operation, because that defeats all of the carefully set up change-detection code (which may or may not still be working since I last worked on it ;-)

- C
Chris McDonough Wrote:
CatalogAware is arguably broken and should really not be used.
In the meantime, if you care at all about cataloging, do not use CatalogAware. Instead, manage the recataloging yourself and don't uncatalog a changed object before recataloging it during this manual operation, because this defeats all of the carefully set up change detection code (which may or may not still be working since I last worked on it ;-)
Chris, thank you for your candor here. I wish this minor detail had been disclosed in the Zope book. Chapter 9 was my holy grail when I started down this trail (creating these new ZClasses that would auto-catalog themselves). It looked good in print... I have banked a good deal of my project on this very service and... well, it is a bit frustrating to find out that I need to go back and redo my work.

Along this same vein, I would suggest that (possibly) ZClasses don't really work either, "and should not be used". There was a comment from another developer (on zope-dev a month or so ago) that essentially (in his own words) made this very claim. At the time, I chalked it up as a "Real Zope Developers Don't Use ZClasses" kind of comment. There certainly are enough Zope products out there that (at least) leverage some of the ZClass plumbing.

Another claim in the Zope book (chapter 8) says that I can leverage my 6+ years of Perl experience to create Zope scripts. Well, I would suggest that this doesn't really work either...

The bottom line to all this venting (and I am not trying to shoot the messenger here) is that I need to understand where my efforts should be focused. If I need to abandon ZClasses in lieu of pure Python, then I need to know that now so I don't waste any more time on these false starts. The Perl thing is just a matter of principle (I think Perl's implementation of OO stinks). The way it is presented in the book, I would expect it to be a core Zope thing and not some appendage that requires a particular compiler and Andy sitting next to you.

I don't intend to abandon Zope, I just need a reality check...

Eric
Eric Roby wrote:
Chris McDonough Wrote:
CatalogAware is arguably broken and should really not be used.
In the meantime, if you care at all about cataloging, do not use CatalogAware. Instead, manage the recataloging yourself and don't uncatalog a changed object before recataloging it during this manual operation, because this defeats all of the carefully set up change detection code (which may or may not still be working since I last worked on it ;-)
Chris,
Thank you for your candor here. I wish this minor detail had been disclosed in the Zope book. Chapter 9 was my holy grail when I started down this trail (creating these new ZClasses that would auto catalog themselves). It looked good in print... I have banked a good deal of my project on this very service and ... well it is a bit frustrating to find out that I need to go back and re-do my work.
Well... actually, it's pretty simple to change CatalogAware to work better for you. With a little thought, CatalogAware could be hacked at your end to be sane for your application; you needn't rewrite all your code. It's just hard for DC to release a perfect CatalogAware that works better and is completely backwards-compatible. It's much harder to change it to work perfectly for everybody (which is our job here ;-) than to change it to work perfectly for a particular application.

Basically, change the reindex_object method to:

    self.index_object()

instead of:

    self.unindex_object()
    self.index_object()

That makes CatalogAware much saner and will produce less bloat. Actually, maybe I should just go make that change in the trunk and the 2.4 branch, although I'm a little afraid of what (if anything) it will break for everybody. To be honest, I really don't have much time to spend thinking about this, and my fears are probably just FUD.
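The change can be sketched as a subclass override, with a stub base standing in for CatalogAware so the pattern is visible outside Zope (the stub class and its call log are hypothetical; the real base class lives in the ZCatalog product):

```python
class StubCatalogAware:
    """Stand-in that records calls the way CatalogAware would make them."""
    def __init__(self):
        self.calls = []
    def index_object(self):
        self.calls.append('index')
    def unindex_object(self):
        self.calls.append('unindex')
    def reindex_object(self):
        # Stock behavior: unindex, then index. This defeats the
        # catalog's change detection and bloats the ZODB.
        self.unindex_object()
        self.index_object()

class SaneCatalogAware(StubCatalogAware):
    def reindex_object(self):
        # Chris's suggested change: re-index in place only, letting
        # the catalog decide which indexes actually changed.
        self.index_object()

obj = SaneCatalogAware()
obj.reindex_object()
```

Overriding reindex_object in your own base class is also how the fix can be applied locally without waiting for a change in the Zope trunk.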
Along this same vein, I would suggest that (possibly) ZClasses don't really work, either, "and should not be used". There was a comment from another developer (on zope-dev a month or so ago) that essentially (in his own words) made this very claim. At the time, I chalked it up to this "Real Zope Developers Don't Use ZCLasses" kinda comment. There certainly are enough Zope products out there that (at least) leverage some of the ZClass plumbing.
Well, I don't use ZClasses much. But that's because I like to use Emacs.
Another claim in the Zope book (chapter 8) says that I can leverage my 6+ years of Perl experience to create Zope scripts. Well, I would suggest that this doesn't really work, either...
Not sure what you mean by "doesn't work", but I assume you've had an unpleasant experience with zope-perl?
The bottom line to all this venting (and I am not trying to shoot the messenger here) is that I need to understand where my efforts should be focused. If I need to abandon ZClasses in lieu of pure Python, then I need to know that now so I don't waste any more time on these false starts. The
I'll go out on a limb here. You should learn how to write Python Products if you're serious about creating reusable Zope applications. There.
Perl thing is just a matter of principle (I think Perl's implementation of OO stinks). The way it is presented in the book, I would expect it to be a core Zope thing and not some appendage that requires a particular compiler and Andy sitting next to you.
I've sort of enjoyed myself on all the times when Andy has been sitting near me, but I understand. ;-) Jim had a bad experience installing zope-perl lately. I wish I could help. Strangely, myself, I had few problems getting it installed and working fine. Maybe I'm just lucky. I actually think zope-perl is sort of an engineering marvel myself.
I don't intend to abandon Zope, I just need a reality check...
HTH, - C
On Tue, 26 Jun 2001, Eric Roby wrote:
The bottom line to all this venting (and I am not trying to shoot the messenger here) is that I need to understand where my efforts should be focused. If I need to abandon ZClasses in favor of pure Python, then I need to know that now so I don't waste any more time on these false starts.
If your application can't be written in five minutes and you expect to use it more than once, you shouldn't use ZClasses - IMO. The only argument for ZClasses (that I had at the time) was that it was very easy and fast to set up a couple of classes and some instances. After I wrote mk-zprod, making Python Products is even faster than ZClasses, and certainly scales better. If you ask me, it would be better to streamline the Zope API a bit and focus the effort on making it easier to start developing Python Products from the first go, instead of stopping off at ZClasses. I can't see the rationale for ZClasses, but I'm sure there is one. Right? I seem to recall some buzz about Python Products starting to be "alive" in the Zope instance (ie. behaving much like ZClasses) in a future release. I don't know if that's a good thing or not, but if it means ditching ZClasses I'm all for it.
On Tue, 26 Jun 2001, Erik Enge wrote:
If your application can't be written in five minutes and you expect to use it more than once, you shouldn't use ZClasses - IMO. The only argument for ZClasses (that I had at the time) was that it was very easy and fast to set up a couple of classes and some instances. After I wrote mk-zprod, making Python Products is even faster than ZClasses, and certainly scales better.
Another thing is transparency and control. With source files, it's easier to 'see'; not to mention that code can be factored out into generic python modules in a less cumbersome way. How about meta-programming (designing) via the Zope interface, with UML or somesuch; automatically generating Python code, then enable designers to use a ZFormulator-ish product to edit the interface while a programmer can work on the 'backend' (emacs on a terminal)? One thing I'd really like to implement is DTMLFile transparency via the web, so that a designer could enter into Control_Panel/Products/MyProduct and edit webinterface files there, reflecting it on the filesystem. ZClasses (as they are today, 'half-baked') should be tossed out, and focus brought on making one approach easier. -Morten
On Tue, 26 Jun 2001, Morten W. Petersen wrote:
How about meta-programming (designing) via the Zope interface, with UML or somesuch; automatically generating Python code, then enable designers to use a ZFormulator-ish product to edit the interface while a programmer can work on the 'backend' (emacs on a terminal)?
What are you on, my friend? ;-) How about writing the whole shebang in <put in favourite editor here>, as is done today? I don't see the need for any other change, but I do know that DC hopes (or at least I suspect this is why they made ZClasses) to bring "Zope programming" to a wider audience with ZClasses, and thus it might have a purpose in life. I don't know. Python is a very easy language to learn.
One thing I'd been musing about for a while was a ZClass-to-Python-Product script that took your ZClass and set up your basic Python Product for you. It would only work for simple things like permissions, properties, basic methods... Then ZClasses could be an easier springboard into Python Products for those new to them. Cheers. -- Andy McKay.
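A toy sketch of what such a converter might emit (the template, class names, and fields are hypothetical, and it is written in modern Python; a real script would read the ZClass's property sheets and permissions out of the ZODB rather than take them as arguments):

```python
# Hypothetical sketch of a ZClass -> Python Product skeleton generator.
# It handles only the simple cases mentioned above: a meta_type and
# a list of properties.

PRODUCT_TEMPLATE = '''\
from OFS.SimpleItem import SimpleItem

class %(klass)s(SimpleItem):
    """Generated from the ZClass %(klass)s."""
    meta_type = '%(meta_type)s'
    _properties = (
%(props)s    )
'''

def make_product_source(klass, meta_type, properties):
    """properties: list of (name, type) pairs, e.g. [('title', 'string')]."""
    props = ''.join("        {'id': '%s', 'type': '%s', 'mode': 'w'},\n"
                    % (name, ptype) for name, ptype in properties)
    return PRODUCT_TEMPLATE % {'klass': klass,
                               'meta_type': meta_type,
                               'props': props}

# Example: generate a skeleton for a hypothetical "NewsItem" ZClass.
src = make_product_source('NewsItem', 'News Item',
                          [('title', 'string'), ('date', 'date')])
print(src)
```

The generated text is only a starting point; methods, permissions, and the product's __init__.py registration would still have to be filled in by hand.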
On Tue, 26 Jun 2001, Andy McKay wrote:
One thing I'd been musing about for a while was a ZClass-to-Python-Product script that took your ZClass and set up your basic Python Product for you. It would only work for simple things like permissions, properties, basic methods... Then ZClasses could be an easier springboard into Python Products for those new to them.
*ponder-wonder* I think this is functionality I could easily add to mk-zprod. I might just do that.
One thing I'd been musing about for a while was a ZClass-to-Python-Product script that took your ZClass and set up your basic Python Product for you. It would only work for simple things like permissions, properties, basic methods... Then ZClasses could be an easier springboard into Python Products for those new to them.
*ponder-wonder* I think this is functionality I could easily add to mk-zprod. I might just do that.
Yeah! One more thing I don't have to feel guilty about not doing... Just noticed mk-zprod, cool. Cheers. -- Andy McKay.
Hello everyone, I've got to join this discussion. iuveno was also thinking about a tool that would replace ZClasses, since their performance is far too bad. We had a not so good experience with the ZClass-based Kontentor, and now that the first part is rewritten in Python we can see the speed-ups (the performance increase can be measured in multiples - real tests still need to be done). The reason, in my opinion, that ZClasses are so slow is the huge amount of Acquisition lookups and the safe rendering. You can often code things smarter in Python using much less of the "safe" Zope environment, while still providing safety through specific commands. I think Formulator is a great example of how safe Python programming can be. There are two thoughts here:

1. We are building a wizard that asks you all the necessary questions to generate a basic class framework. This wizard (which can be used in many other fields as well, such as installers) is currently being built. We use Formulator a lot and I support its development as much as I can (it is a cool product with many cool features). If anyone is interested in helping to develop that tool (which will be released under the GPL, as all of the iuveno products are), then I can make an electronic copy of my personal notes and set up a CVS. Formulator at Zope: http://www.zope.org/Members/faassen/Formulator Formulator at Sourceforge: http://sourceforge.net/projects/formulator/

2. Phillip Auersperg from bluedynamics.com uses ObjectDomain quite heavily, since it has a nice JPython API that comes with it. He has already built a reverse-engineering tool for ZClasses and is now going to write another tool to automatically generate DBObjects from a UML diagram in ObjectDomain. I am very excited about this tool, since it will make the already fast DBObject development even faster. As soon as Formulator goes 1.0, I am going to think about binding Formulator to DBObjects, so you can quickly generate forms for each object.
In Berlin in two weeks, we are going to discuss this integration in more detail... Bluedynamics URL: http://www.bluedynamics.com DBObjects/SmartObjects URL: http://demo.iuveno-net.de I am very glad to see that we all have the same vision. We all just need to work more together (especially me included). This way we can have some strikes against the big ones... Regards, Stephan -- Stephan Richter CBU - Physics and Chemistry Student Web2k - Web Design/Development & Technical Project Management
On Tue, 26 Jun 2001, Stephan Richter wrote:
1. We are building a wizard that asks you all the necessary questions to generate a basic class framwework.
Sounds exactly like my mk-zprod.
If anyone is interested in helping developing that tool (which will be released under the GPL as all of the iuveno products), then I can make an electronic copy of my personal notes and I setup a CVS.
If we are working to solve the same problem I'd love to lend a hand. If you can, it would be great to see your personal notes. mk-zprod is just an easy way to set up classes that are compliant with the Zope API, but I have some thoughts on how one could make it into a pluggable API/language-compliant code generator, although that might be overkill. It's definitely overkill :).
On Tue, 26 Jun 2001, Erik Enge wrote:
On Tue, 26 Jun 2001, Morten W. Petersen wrote:
How about meta-programming (designing) via the Zope interface, with UML or somesuch; automatically generating Python code, then enable designers to use a ZFormulator-ish product to edit the interface while a programmer can work on the 'backend' (emacs on a terminal)?
What are you on my friend? ;-)
Well, it's quite logical: UML can be used to map out both software and business development (they are, after all, two sides of the same story), the designer can twiddle-n-polish the interface and the programmer can take care of 'exceptional tasks' that can't easily be taken care of via the UML interface without adding too much complexity. -Morten
On Tue, 26 Jun 2001, Morten W. Petersen wrote:
Well, it's quite logical: UML can be used to map out both software and business development (they are, after all, two sides of the same story), the designer can twiddle-n-polish the interface and the programmer can take care of 'exceptional tasks' that can't easily be taken care of via the UML interface without adding too much complexity.
The UML interface may be a bit far fetched, but that's because nobody has done it yet. ;-) -Morten
Erik Enge wrote on 26 June:
If your application can't be written in five minutes and you expect to use it more than once, you shouldn't use ZClasses - IMO. The only argument for ZClasses (that I had at the time) was that it was very easy and fast to set up a couple of classes and some instances. After I wrote mk-zprod, making Python Products is even faster than ZClasses, and certainly scales better.
Thank you for your thoughts. The comments throughout this thread have been very insightful. I see where I need to go from here. I will be checking out mk-zprod. I have a nagging question that you might be able to help me with (in light of mk-zprod). I understand that the "Class-Id" a ZClass gets assigned upon creation is vital to Zope's ability to manage class instances. Is it possible to make a Python class to replace the ZClasses I have already created and still support the ZClass instances I have already created? If so, how do you get a Python class to answer to a specific "Class-Id"?
On Tue, 26 Jun 2001, Eric Roby wrote:
I will be checking out mk-zprod.
If you find it useful I can upgrade it to the next release, which I've been thinking about for some time now (I was actually going to do it some months ago, but someone let all my time out... I'm tracking it down now... Muhaha).
Is it possible to make a Python class to replace the ZClasses I have already created and be able to support the ZClass instances I have already created?
Don't know, you could probably write a script to convert the instances...
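One common shape for such a conversion script, sketched here in plain Python with hypothetical stand-in classes (a real Zope script would walk a folder's objects, would have to deal with the Class-Id question above, and would need to re-catalog everything afterwards), is to create a fresh instance of the new Product class and copy the old instance's attributes across:

```python
# Hypothetical stand-ins: OldZClassStandIn plays the ZClass instance,
# NewsItem plays the new Python Product class.

class OldZClassStandIn:
    def __init__(self, id, title):
        self.id, self.title = id, title

class NewsItem:
    meta_type = 'News Item'

def convert(old):
    """Build a new-class instance carrying the old instance's state."""
    new = NewsItem()
    new.__dict__.update(old.__dict__)   # carry over id, properties, etc.
    return new

old = OldZClassStandIn('doc1', 'Hello')
new = convert(old)
print(new.id, new.title, new.meta_type)   # prints: doc1 Hello News Item
```

In a real migration the old instance would then be deleted and the new one stored under the same id, so links and catalog entries keep resolving.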
participants (14)
- abel deuring
- Andy McKay
- barry@digicool.com
- Casey Duncan
- Chris McDonough
- Chris Withers
- Eric Roby
- Erik Enge
- Giovanni Maruzzelli
- Jeff Sasmor
- Michel Pelletier
- Morten W. Petersen
- Stephan Richter
- Toby Dickenson