Hi guys,

I've got a problem with ZCatalog. I've got plenty of large objects, ranging from 100KB to 100MB in size. Needless to say, these take up a lot of processor time when indexed by the ZCatalog.

Now, these objects have to be moved from time to time, only moved, so that only one or two of the related columns and indexes are affected (the path and the parent container id). Is there a way to work around this so that just that one column and index are updated, and not, say, the indexed body (which is 100KB - 100MB in size)?

Thanks,

Morten
"Morten W. Petersen" wrote:
There is no built-in way. However, I see no reason that the Catalog could not be extended to do this with a bit of Python coding.

A catalog contains a collection called indexes that contains the indexes themselves (what else?). An External Method could be written like so (not tested):

    def partialReindexObject(catalog, indexes, uid, object):
        rid = catalog.uids.get(uid)
        for idx_key in indexes:
            idx = catalog.indexes[idx_key].__of__(catalog)
            idx.unindex_object(rid)
            idx.index_object(rid, object, None)

I think that will do it. Pass the names of the indexes as a sequence in indexes, and data_record_id_ as the uid.

The above does not update any meta-data in the catalog.

--
Casey Duncan
Kaivo, Inc.
cduncan@kaivo.com
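For anyone wanting to try the idea outside Zope first, here is a self-contained sketch of the partial-reindex approach using minimal stand-in classes. The real Catalog/ZCatalog APIs are richer (BTrees, acquisition wrapping via `__of__`); every class name here is a stand-in, not the real API.

```python
# Stand-ins illustrating the partial-reindex idea; not the real ZCatalog API.

class StubIndex:
    def __init__(self, attr):
        self.attr = attr
        self.data = {}                      # rid -> indexed value

    def unindex_object(self, rid):
        self.data.pop(rid, None)

    def index_object(self, rid, obj, threshold=None):
        self.data[rid] = getattr(obj, self.attr)

class StubCatalog:
    def __init__(self):
        self.uids = {}                      # uid (path) -> rid
        self.indexes = {}                   # index name -> index object

def partialReindexObject(catalog, index_names, uid, obj):
    """Reindex only the named indexes for one already-cataloged object."""
    rid = catalog.uids.get(uid)
    for name in index_names:
        idx = catalog.indexes[name]         # real code would wrap: .__of__(catalog)
        idx.unindex_object(rid)
        idx.index_object(rid, obj, None)
```

The point of the exercise: reindexing only 'path' leaves the expensive body index completely untouched.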
Casey Duncan wrote:
Actually what I wrote assumes you are passing a Catalog, not a ZCatalog. So you will need to change it for a ZCatalog to:

    def partialReindexObject(zcatalog, indexes, uid, object):
        catalog = zcatalog._catalog
        rid = catalog.uids.get(uid)
        for idx_key in indexes:
            idx = catalog.indexes[idx_key].__of__(catalog)
            idx.unindex_object(rid)
            idx.index_object(rid, object, None)

Sorry about that.

--
Casey Duncan
Kaivo, Inc.
cduncan@kaivo.com
[Casey Duncan]
| Actually what I wrote assumes you are passing a Catalog, not a ZCatalog.
| So you will need to change it for a ZCatalog to:

I figured that out. :-)

There is one problem: the uids stored in the Catalog are based on the path of the object, so I guess I'll have to make a copy of the records, and then paste it in under the new uid (path).

-Morten
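The record-copy idea Morten describes might look like the sketch below. Plain dicts stand in for the Catalog's internal mappings (really BTrees such as `catalog.uids` and `catalog.paths`); `move_catalog_record` is a hypothetical helper, not a Catalog method.

```python
# Sketch of re-keying a catalog record when only the path (uid) changes,
# so that no index has to be touched at all. Names are illustrative.

def move_catalog_record(uids, paths, old_uid, new_uid):
    """Move one record from old_uid to new_uid; returns its rid."""
    rid = uids.pop(old_uid)     # drop the old path -> rid entry
    uids[new_uid] = rid         # re-register under the new path
    paths[rid] = new_uid        # keep the rid -> path reverse map in sync
    return rid
```

The record id (rid) never changes, so every index entry keyed on it stays valid.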
I've got an analogous but different problem with ZCatalog updates. I'd like to ask for ideas about the best way to handle this.

I've got a ZPatterns-based app that is pulling data from a PostgreSQL database. But associated with the objects created out of the database are significant chunks of HTML (author bios, book descriptions). I've stored these HTML chunks in the database as well, though I don't think that matters to the problem or possible solution strategies.

The issue is that parts of the database get updated periodically from an external source. That is, the author and book tables get replaced wholesale. But the Book and Author objects are cataloged using ZCatalog, doing a full-text index on a combination of fields from the external database and the HTML chunks. As far as I can see at the moment, this means that every object has to get uncataloged and recataloged, meaning every time the database update happens the ZODB inflates considerably, even though very little data has actually changed.

So far I can think of two simple solutions to this problem: (1) pack often; (2) put the Catalog into a mounted storage backed by a non-undoable storage. I'd rather not do either one of these (the first for obvious reasons, the second simply because I don't want to take the time to learn how to set up a non-undoable storage). Am I missing some other obvious options? It seems like there *ought* to be a way to avoid the overhead of updating the catalog for objects that haven't really changed.

--RDM
Has the physical path of the object changed? If not, the newer (2.3.0+) catalog stuff should be smart enough to figure out whether anything inside the object has changed during catalog_object. If nothing has changed, none of the indexes or metadata columns should be updated. We're scrambling right now to write tests for this kind of thing. :-(

Note that the algorithm is simple - for each index, compare what exists in the index to what is to be put in. If they're the same, do nothing. If they're different, reindex.

I wasn't able to understand completely from your description whether the object method you're attempting to index via a TextIndex actually returns different data or not when you recatalog it. Does it?

There's a huge possibility that the merge stuff isn't working completely because until now there have been no deterministic tests.

----- Original Message -----
From: "R. David Murray" <bitz@bitdance.com>
To: <zope-dev@zope.org>
Sent: Saturday, March 03, 2001 2:29 PM
Subject: Re: [Zope-dev] ZCatalog hackery
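The compare-before-reindex algorithm described above can be shown in miniature against a plain dict. `update_if_changed` is a hypothetical helper for illustration, not a real Catalog method; the 2.3 Catalog does this per index type over BTrees.

```python
# Compare the newly computed value to what the index already holds,
# and write only on change. Purely illustrative stand-in code.

def update_if_changed(index_data, rid, new_value):
    """Return True only if the index actually had to be rewritten."""
    if index_data.get(rid) == new_value:
        return False            # identical: no write, no ZODB growth
    index_data[rid] = new_value
    return True
```

Skipping the write is what avoids the transaction bloat RDM describes, since an unchanged record touches no persistent object.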
On Sat, 3 Mar 2001, Chris McDonough wrote:
Has the physical path of the object changed? If not, the newer (2.3.0 + )
Nope.
catalog stuff should be smart enough to figure out whether anything inside the object has changed during catalog_object. If nothing has changed, none of the indexes or metadata columns should be updated. We're scrambling right now to write tests for this kind of thing. :-(
Cool. Now, in the examples I've seen for interfacing ZCatalog and ZPatterns, the 'object updated' code does an unindex of the object and then an index of the object. I copied that pattern for my "the tables have been updated, reindex everything" code. So what I should do instead is just do an index of the objects?

This leaves me with a different problem, though. Sometimes when the tables are updated, objects disappear (i.e., they were deleted from the external database). I have to figure out how to delete those from the catalog. A pain, but it shouldn't be too hard.
Note that the algorithm is simple - for each index, compare what exists in the index to what is to be put in. If they're the same, do nothing. If they're different, reindex. I wasn't able to understand completely from your description whether the object method you're attempting to index via a TextIndex actually returns different data or not when you recatalog it. Does it?
Yeah, it should be returning exactly the same data. I can stand the update hit when the data actually changes (even though the change will typically be only one or two words out of hundreds).
There's a huge possibility that the merge stuff isn't working completely because until now there have been no deterministic tests.
Well, I don't know yet, since I was doing an unindex/reindex cycle. --RDM
Cool. Now, in the examples I've seen for interfacing ZCatalog and ZPatterns, the 'object updated' code does an unindex of the object and then an index of the object. I copied that pattern for my "the tables have been updated, reindex everything" code. So what I should do instead is just do an index of the objects?
Yup. Don't call uncatalog_object on the object before calling catalog_object.
This leaves me with a different problem, though. Sometimes when the tables are updated, objects disappear (i.e., they were deleted from the external database). I have to figure out how to delete those from the catalog. A pain, but it shouldn't be too hard.
This is a problem that ZCatalog doesn't address anyway (unless you're using "CatalogAware", blech).
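One way to handle it outside the catalog is to set-difference the cataloged uids against the ids still present in the external database, then remove the leftovers. The helper below is hypothetical; in real code the actual removal would go through `zcatalog.uncatalog_object(uid)`.

```python
# Find catalog entries whose backing database row has vanished.
# stale_uids and uid_for are illustrative names, not ZCatalog API.

def stale_uids(cataloged_uids, current_ids, uid_for):
    """Return cataloged uids whose backing database row is gone."""
    live = set(uid_for(i) for i in current_ids)
    return sorted(u for u in cataloged_uids if u not in live)
```

Each uid this returns would then be passed to uncatalog_object after the table refresh.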
Note that the algorithm is simple - for each index, compare what exists in the index to what is to be put in. If they're the same, do nothing. If they're different, reindex. I wasn't able to understand completely from your description whether the object method you're attempting to index via a TextIndex actually returns different data or not when you recatalog it. Does it?
Yeah, it should be returning exactly the same data. I can stand the update hit when the data actually changes (even though the change will typically be only one or two words out of hundreds).
We're working on integrating new BTree data structures into the indexes which will make the time/space hit very low when updating documents in which nothing has changed. I have no hard numbers, however, for the old or the new stuff.
There's a huge possibility that the merge stuff isn't working completely because until now there have been no deterministic tests.
Well, I don't know yet, since I was doing an unindex/reindex cycle.
Well, give it a shot and tell me what happens! ;-)
[Chris McDonough]
| Note that the algorithm is simple - for each index, compare what exists
| in the index to what is to be put in. If they're the same, do nothing. If
| they're different, reindex. I wasn't able to understand completely from
| your description whether the object method you're attempting to index via
| a TextIndex actually returns different data or not when you recatalog it.
| Does it?

Will the new data be 'made ready for indexing' before it is compared to the existing data? That is, will ZCatalog have to compute the data in some way before it compares it to what is already stored?

I'm wondering because it would be significant overhead to 'make a data field of 100MB into an index-like value' and then compare it to what already exists in the ZCatalog.

Thanks,

Morten
Will the new data be 'made ready for indexing' before it is compared to the existing data? That is, will ZCatalog have to compute the data in some way before it compares it to what is already stored?
Yes.
I'm wondering because it would be significant overhead to 'make a data field of 100MB into an index-like value' and then compare it to what already exists in the ZCatalog.
Yes, it's potentially slower on update but causes less database bloat. Most importantly, the update scheme causes fewer ConflictErrors on heavy-write sites, because we're writing to the ZODB less often. That said, some of the comparison operations in the indexing code are moving from Python to C, which should offset this penalty a little.
participants (4)
- Casey Duncan
- Chris McDonough
- morten@esol.no
- R. David Murray