Re: Zope Mailing Lists and ZCatalog
Andy Dawkins wrote:
Michel
In case you are not aware, we at NIP currently host a complete, publicly available archive of the Zope mailing lists.
Yep.
We are using ZCatalog to index all the messages from the mailing list archives. To give you an idea of the numbers, the Zope mailing list alone contains over 30,000 messages.
The problem we have is getting that many objects into the Catalog. If we load the objects into the ZODB and then catalog them, the machine either runs out of memory or, if we lower the subtransaction threshold, it runs out of hard drive space.
This is because you are indexing more content than you have virtual memory plus tmp space to store the transaction in. Zope is transactional, as I'm sure you know, so it has to store the transaction somewhere so it can roll it back if necessary, and memory plus tmp storage is where that goes (subtransactions are swapped out to tmp).
If we use CatalogAware to catalog the objects as they are imported, the Catalog explodes to stupid sizes because CatalogAware doesn't support subtransactions.
Subtransactions are a storage mechanism and really don't have anything to do with CatalogAware. If you have a subtransaction threshold set, then subtransactions will be used for any cataloging operation, CatalogAware or not.
We could solve these issues by regularly packing the database during the import, but it isn't a perfect solution.
I'm not sure what you mean by these last two paragraphs; it seems like you have two problems:
1) you are mass indexing and running out of memory
2) you are indexing lots of content quickly and your database is growing
The answer to 1 is to not mass index but to index incrementally over time. The answer to 2 is to use a storage that does not store old revisions, like Berkeley storage.
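Michel's first answer, indexing incrementally instead of all at once, amounts to cataloging in bounded batches and committing between them so the pending transaction stays small. A minimal sketch of that pattern in plain Python (the dict "catalog" and the `commit` callback are stand-ins for ZCatalog's `catalog_object` and Zope's transaction machinery, not the real API):

```python
# Sketch of incremental (batched) indexing. The dict catalog and the
# commit callback are stand-ins for ZCatalog and the ZODB transaction
# machinery -- this is the pattern, not the Zope API.

def index_in_batches(catalog, messages, batch_size=500, commit=None):
    """Catalog messages in small batches, committing after each batch
    so no transaction ever holds more than batch_size updates."""
    pending = 0
    for uid, message in messages:
        catalog[uid] = message          # stand-in for catalog_object(message, uid)
        pending += 1
        if pending >= batch_size:
            if commit:
                commit()                # stand-in for committing the transaction
            pending = 0
    if pending and commit:
        commit()                        # flush the final partial batch

# Usage with a plain dict as the "catalog":
catalog = {}
commits = []
msgs = [("msg%05d" % i, "body %d" % i) for i in range(1200)]
index_in_batches(catalog, msgs, batch_size=500,
                 commit=lambda: commits.append(len(catalog)))
print(len(catalog), len(commits))  # 1200 messages cataloged in 3 commits
```

The point is simply that the batch size, not the total message count, bounds how much pending transaction state exists at any moment.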
Also, as messages arrived over time, the Catalog would once again explode dramatically,
Basically, we (NIP) would like to know if you (Michel/DC) are planning to improve ZCatalog/CatalogAware, whether you are planning a successor to ZCatalog, or basically any information that could be useful to us regarding the current development status and priority of ZCatalog/CatalogAware.
There isn't anything wrong with the Catalog (for this particular problem); or at least, there isn't anything in the Catalog to fix that would solve your problem. We've had customers index well over 50,000 objects; you just have to understand the resource constraints and work within them: for example, don't mass index, use storages that scale to high-write environments, etc.
Thanks in advance for your assistance.
NP. -Michel
On Fri, 4 Aug 2000, Michel Pelletier wrote:
Andy Dawkins wrote:
The problem we have is getting that many objects into the Catalog. If we load the objects into the ZODB and then catalog them, the machine either runs out of memory or, if we lower the subtransaction threshold, it runs out of hard drive space.
Don't lower the subtransaction threshold too much; because of the way the BTree works, you wind up generating a *lot* more disk writes than you would think. I can catalog 61K records (with a small amount of data in each record, though) on a machine with 256MB of memory. More memory is the easiest solution...
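The tradeoff behind "don't lower it too much" is easy to see with rough arithmetic: each (sub)transaction commit writes out the BTree buckets touched since the last commit, so the commit count, and with it the write overhead, scales inversely with the threshold. A back-of-the-envelope sketch (the 61K figure is from the message above; the thresholds are purely illustrative):

```python
# Back-of-the-envelope: how many (sub)transaction commits a full
# catalog run needs at a given subtransaction threshold.
RECORDS = 61_000  # roughly the 61K records mentioned above

def commits_needed(records, threshold):
    """Ceiling division: one commit per full batch, plus a final partial one."""
    return -(-records // threshold)

for threshold in (10_000, 1_000, 100):
    print("threshold %6d -> %4d commits" % (threshold, commits_needed(RECORDS, threshold)))
```

Dropping the threshold from 10,000 to 100 multiplies the number of commits (and the bucket rewrites that come with each) by nearly a hundred, which is why a too-low threshold trades a memory problem for a disk-space problem.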
If we use CatalogAware to catalog the objects as they are imported, the Catalog explodes to stupid sizes because CatalogAware doesn't support subtransactions.
Subtransactions are a storage mechanism and really don't have anything to do with CatalogAware. If you have a subtransaction threshold set, then subtransactions will be used for any cataloging operation, CatalogAware or not.
I've imported my whole 61K-object folder tree, and the resulting Data.fs file was about twice the size of the zexp file. That hardly sounds like "exploded", so maybe there's something odd in the way you are doing the import? You definitely don't want to be committing transactions or subtransactions too often.
Also, as messages arrived over time, the Catalog would once again explode dramatically,
This is definitely an issue for something like archiving a mailing list. It sounds like, in the current state of things, you really want to move to a non-transactional storage for the catalog.
There isn't anything wrong with the Catalog (for this particular problem); or at least, there isn't anything in the Catalog to fix that would solve your problem. We've had customers index well over 50,000 objects; you just have to understand the resource constraints and work within them: for example, don't mass index, use storages that scale to high-write environments, etc.
There has, however, been at least one posting from DC about the technology that underlies the Catalog: the BTree. Apparently there *is* some tuning that can be done to make the BTree generate fewer object updates when modifications take place (something about parent objects getting updated unnecessarily, my hazy memory says). Is any active work being done on the BTree? --RDM
I've been working on a Mailman archive/search interface in Zope. I chose not to do the search mechanisms in Zope because I was under the impression that ZCatalog is great for object indexing but would not be ideal for mass text indexing with 100K+ objects and 100MB+ of text.
The comments below seem to indicate that the only problems are with mass indexing and transactional storage, both of which would be mitigated by moving to an incremental indexing scheme. But wouldn't you run into performance problems on searches, and on getting enough available memory to power up the catalog search? I guess what I'm looking for is a maxim on catalog usage in terms of number of objects/indexes and a machine's specs?
Curious,
Kapil
BTW, a demo of my Mailman search interface is at http://sindev.dyndns.org/TGrounds/archive_search
Michel Pelletier wrote:
Andy Dawkins wrote:
Michel
In case you are not aware, we at NIP currently host a complete, publicly available archive of the Zope mailing lists.
Yep.
We are using ZCatalog to index all the messages from the mailing list archives. To give you an idea of the numbers, the Zope mailing list alone contains over 30,000 messages.
The problem we have is getting that many objects into the Catalog. If we load the objects into the ZODB and then catalog them, the machine either runs out of memory or, if we lower the subtransaction threshold, it runs out of hard drive space.
This is because you are indexing more content than you have virtual memory plus tmp space to store the transaction in. Zope is transactional, as I'm sure you know, so it has to store the transaction somewhere so it can roll it back if necessary, and memory plus tmp storage is where that goes (subtransactions are swapped out to tmp).
If we use CatalogAware to catalog the objects as they are imported, the Catalog explodes to stupid sizes because CatalogAware doesn't support subtransactions.
Subtransactions are a storage mechanism and really don't have anything to do with CatalogAware. If you have a subtransaction threshold set, then subtransactions will be used for any cataloging operation, CatalogAware or not.
We could solve these issues by regularly packing the database during the import, but it isn't a perfect solution.
I'm not sure what you mean by these last two paragraphs; it seems like you have two problems:
1) you are mass indexing and running out of memory
2) you are indexing lots of content quickly and your database is growing
The answer to 1 is to not mass index but to index incrementally over time. The answer to 2 is to use a storage that does not store old revisions, like Berkeley storage.
Also, as messages arrived over time, the Catalog would once again explode dramatically,
Basically, we (NIP) would like to know if you (Michel/DC) are planning to improve ZCatalog/CatalogAware, whether you are planning a successor to ZCatalog, or basically any information that could be useful to us regarding the current development status and priority of ZCatalog/CatalogAware.
There isn't anything wrong with the Catalog (for this particular problem); or at least, there isn't anything in the Catalog to fix that would solve your problem. We've had customers index well over 50,000 objects; you just have to understand the resource constraints and work within them: for example, don't mass index, use storages that scale to high-write environments, etc.
Thanks in advance for your assistance.
NP.
-Michel
_______________________________________________
Zope-Dev maillist - Zope-Dev@zope.org
http://lists.zope.org/mailman/listinfo/zope-dev
** No cross posts or HTML encoding! **
(Related lists - http://lists.zope.org/mailman/listinfo/zope-announce http://lists.zope.org/mailman/listinfo/zope )
participants (3)
- Kapil Thangavelu
- Michel Pelletier
- R. David Murray