updating catalog takes forever (never really completes)
I have a ZCatalog with about 7,000 cataloged objects. When I try to update the catalog (the advanced tab), Zope seems to just spin on it forever. After a while it becomes unusable. Zope's been updating the catalog for about 5 hours now and hasn't completed. What's going on and why is it taking so long? I set the threshold to 100 objects. The last time I tried it with the default of 10,000 it did the same thing. Thanks, -Chris
Hi peoples. :-) Well, the catalog update finally completed. I don't know exactly how long it took, but I know it was more than 6 or 7 hours. There are a fair number of indexes and meta-data fields. Could this be the reason why it's sooooo, sssllllooowww? I'd also like to note that I had to update the catalog because after removing two meta-data fields, a lot of the other meta-data had the wrong values in it. Say I had foo, bar and baz as meta-data fields. If I remove bar, foo would have baz's data and baz foo's. Ideas on that one? Known bug? Side-effect of too many meta-data fields? Thanks, -Chris On Sat, 21 Sep 2002 23:14:11 -0500, Christopher N. Deckard spoke forth:
I have a ZCatalog with about 7,000 cataloged objects. When I try to update the catalog (the advanced tab), Zope seems to just spin on it forever. After a while it becomes unusable. Zope's been updating the catalog for about 5 hours now and hasn't completed. What's going on and why is it taking so long?
I set the threshold to 100 objects. The last time I tried it with the default of 10,000 it did the same thing.
Thanks, -Chris
Christopher N. Deckard writes:
Well, the catalog update finally completed. I don't know exactly how long it took, but I know it was more than 6 or 7 hours. Fine.
There are a fair number of indexes and meta-data fields. Could this be the reason why it's sooooo, sssllllooowww? For each entry in the MetaData table and each index, the correspondingly named method is called on each object in your catalog.
If you have lots of metadata or lots of indexes or expensive methods or lots of objects, this may take some time...
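A hypothetical sketch of the cost Dieter describes (illustrative names only, not the actual ZCatalog implementation): every metadata column and every index triggers one attribute/method lookup per object, so the total work grows as objects x (indexes + columns).

```python
# Schematic of why "Update Catalog" scales with
# objects x (indexes + metadata columns). Hypothetical names;
# not the real ZCatalog code.

calls = {'count': 0}

class Doc:
    """Toy content object that counts how often it is queried."""
    def __init__(self, title, author):
        self._title, self._author = title, author
    @property
    def title(self):
        calls['count'] += 1
        return self._title
    @property
    def author(self):
        calls['count'] += 1
        return self._author

def update_catalog(objects, index_names, metadata_columns):
    records = {}
    for uid, obj in enumerate(objects):
        # one lookup per metadata column per object...
        records[uid] = tuple(getattr(obj, c) for c in metadata_columns)
        # ...plus one lookup per index per object (a real index
        # would then also update its BTree with the value)
        for name in index_names:
            getattr(obj, name)
    return records

docs = [Doc('t%d' % i, 'a%d' % i) for i in range(3)]
update_catalog(docs, ['title'], ['title', 'author'])
print(calls['count'])  # 3 objects x (1 index + 2 columns) = 9
```

Scale those 9 lookups up to 7,000 objects with many indexes and columns, plus expensive methods, and multi-hour updates become plausible.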
I'd also like to note that I had to update the catalog because after removing two meta-data fields, a lot of the other meta-data had the wrong values in it. Say I had foo, bar and baz as meta-data fields. If I remove bar, foo would have baz's data and baz foo's. Ideas on that one? Known bug? Side-effect of too many meta-data fields? It's an implementation side effect: the metadata for an object is stored as a tuple; it does not contain the names. Therefore, when you change the metadata scheme, the tuples are no longer in sync and you must rebuild the metadata tuples for each object.
Dieter
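Dieter's tuple explanation is easy to reproduce in plain Python (a schematic, not ZCatalog's actual storage code): the stored row has no column names, so removing a column shifts every later value.

```python
# Metadata rows are stored per object as plain tuples, matched
# positionally against the catalog's current column list. Remove
# a column without rebuilding the rows and later values shift.

columns = ['foo', 'bar', 'baz']
row = ('foo-value', 'bar-value', 'baz-value')  # stored tuple, no names

columns.remove('bar')             # the schema changes...
record = dict(zip(columns, row))  # ...but the stale tuple is reused

print(record)  # {'foo': 'foo-value', 'baz': 'bar-value'}
```

This is exactly the symptom Chris saw: after deleting a column, `baz` reports `bar`'s old data until the rows are rebuilt with an "Update Catalog".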
Dieter Maurer <dieter@handshake.de> wrote:
Christopher N. Deckard writes:
I'd also like to note that I had to update the catalog because after removing two meta-data fields, a lot of the other meta-data had the wrong values in it. Say I had foo, bar and baz as meta-data fields. If I remove bar, foo would have baz's data and baz foo's. Ideas on that one? Known bug? Side-effect of too many meta-data fields?
It's an implementation side effect: the metadata for an object is stored as a tuple; it does not contain the names. Therefore, when you change the metadata scheme, the tuples are no longer in sync and you must rebuild the metadata tuples for each object.
Well, let's say it's a bug, period. :-) Florent -- Florent Guillaume, Nuxeo (Paris, France) +33 1 40 33 79 87 http://nuxeo.com mailto:fg@nuxeo.com
Florent Guillaume writes:
Dieter Maurer <dieter@handshake.de> wrote:
Christopher N. Deckard writes: ... deleting catalog meta data columns makes meta data inconsistent ...
It's an implementation side effect: the metadata for an object is stored as a tuple; it does not contain the names. Therefore, when you change the metadata scheme, the tuples are no longer in sync and you must rebuild the metadata tuples for each object.
Well, let's say it's a bug, period. :-) I am too pragmatic to call it a bug...
Dieter
Dieter Maurer wrote:
It's an implementation side effect: the metadata for an object is stored as a tuple; it does not contain the names. Therefore, when you change the metadata scheme, the tuples are no longer in sync and you must rebuild the metadata tuples for each object.
This is something I don't like about Zope: that the index isn't just an index. If it wasn't so expensive to get the original objects there would be no need to keep around this (duplicated, redundant) data. Bye -- Luca Olivetti Wetron Automatización S.A. http://www.wetron.es/ Tel. +34 93 5883004 Fax +34 93 5883007
On Monday 23 Sep 2002 8:45 am, Luca Olivetti wrote:
This is something I don't like about Zope: that the index isn't just an index. If it wasn't so expensive to get the original objects there would be no need to keep around this (duplicated, redundant) data.
I'm not sure what you are implicitly comparing Zope with... Relational databases keep redundant copies of their data in indexes too.
Toby Dickenson wrote:
This is something I don't like about Zope: that the index isn't just an index. If it wasn't so expensive to get the original objects there would be no need to keep around this (duplicated, redundant) data.
I'm not sure what you are implicitly comparing Zope with... Relational databases keep redundant copies of their data in indexes too.
In indexes, yes, that's not really redundant, it's an index after all. But metadata? It seems it is just a workaround for a limitation of the ZODB. -- Luca Olivetti Wetron Automatización S.A. http://www.wetron.es/ Tel. +34 93 5883004 Fax +34 93 5883007
Luca Olivetti wrote:
In indexes, yes, that's not really redundant, it's an index after all. But metadata? It seems it is just a workaround for a limitation of the ZODB.
Urm, no, it is a form of caching. Have you heard of apps like Squid, Mozilla, Internet Explorer? ;-) cheers, Chris
Chris Withers wrote:
Luca Olivetti wrote:
In indexes, yes, that's not really redundant, it's an index after all. But metadata? It seems it is just a workaround for a limitation of the ZODB.
Urm, no, it is a form of caching.
Have you heard of apps like Squid, Mozilla, Internet Explorer? ;-)
Yes, caching is a workaround for dealing with slow backends, with the risk of getting stale data. If the backend is fast enough there should be no need for caching (and for duplicating the same data again and again). -- Luca Olivetti Wetron Automatización S.A. http://www.wetron.es/ Tel. +34 93 5883004 Fax +34 93 5883007
On Mon, 2002-09-23 at 15:09, Luca Olivetti wrote:
Chris Withers wrote:
Luca Olivetti wrote:
In indexes, yes, that's not really redundant, it's an index after all. But metadata? It seems it is just a workaround for a limitation of the ZODB.
Urm, no, it is a form of caching.
Have you heard of apps like Squid, Mozilla, Internet Explorer? ;-)
Yes, caching is a workaround for dealing with slow backends, with the risk of getting stale data.
Well, nothing magic here. If you use CatalogAware objects, your Catalog gets reindexed automatically, the same way an index in a database system is.
If the backend is fast enough there should be no need for caching (and for duplicating the same data again and again).
Well, every backend hits a ceiling somewhere. Zope is (depending upon your methods, in my test with some basic navigation stuff) not really that much slower than a static Apache fileserver. (I get about 60-70% of the requests compared to a static file on Apache.)
Now every website that offers "Search the website" functionality has some kind of search engine. And all these have index files that can become stale. Zope's solution offers only benefits over the competition:
-) it's more flexible (because Zope is dynamic). Nobody considers using a search engine on their website to provide navigation.
-) with the right kind of objects (and plain OFS objects are not really usable for content) Catalogs are never stale, because they are auto-updated.
-) it's builtin and it's fast.
-) it can do quite expressive searches, because it's more than just a full-text database.
Basically: yes, it's redundant data. And yes, we all learned at university that redundant data is bad. Well, some of us learned in the real world that sometimes redundant data is not bad at all :) Especially in this case, as it is just a cache that can be automatically validated. Well, OTOH, perhaps you are that pure in your thinking... But then you should turn the CPU caches off too; that is redundant data too. (And with write-back caches it's even redundant data that is NOT IN SYNC!) Seems to me you've missed some courses on system design ;) Andreas
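The "never stale, because auto-updated" point can be sketched as a minimal toy (hypothetical class names, not the real CatalogAware mixin): every edit re-feeds the object to the catalog, so the cached index/metadata is revalidated instead of going stale.

```python
# Schematic of the CatalogAware pattern: the object itself
# re-catalogs on every change, keeping the cache in sync.
# Hypothetical minimal classes, not Zope's actual code.

class Catalog:
    def __init__(self):
        self.rows = {}
    def catalog_object(self, obj):
        # store (or refresh) the cached metadata for this object
        self.rows[obj.id] = {'title': obj.title}

class CatalogAwareDoc:
    def __init__(self, catalog, id, title):
        self._catalog = catalog
        self.id = id
        self.title = title
        catalog.catalog_object(self)        # index on creation
    def edit(self, title):
        self.title = title
        self._catalog.catalog_object(self)  # reindex on every change

cat = Catalog()
doc = CatalogAwareDoc(cat, 'a', 'Old title')
doc.edit('New title')
print(cat.rows['a']['title'])  # cached metadata tracks the edit
```

The cost of this design is the one Chris hit: change the catalog's schema instead of an object, and every row must be rebuilt by hand.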
Andreas Kostyrka wrote:
Zope's solution offers only benefits over the competition:
-) it's more flexible (because Zope is dynamic). Nobody considers using a search engine on their website to provide navigation.
-) with the right kind of objects (and plain OFS objects are not really usable for content) Catalogs are never stale, because they are auto-updated.
-) it's builtin and it's fast.
-) it can do quite expressive searches, because it's more than just a full-text database.
-) It opens the possibility for the objects (documents) to offer the data to be cataloged independent of the presentation, as opposed to search engines like htdig etc. Filtering out html/wml/xml/pdf/... markup doesn't belong in the indexing engine IMHO; the object itself knows best how its content should be indexed. cheers, oliver
Andreas Kostyrka wrote:
Zope's solution offers only benefits over the competition:
-) it's more flexible (because Zope is dynamic). Nobody considers using a search engine on their website to provide navigation.
-) with the right kind of objects (and plain OFS objects are not really usable for content) Catalogs are never stale, because they are auto-updated.
-) it's builtin and it's fast.
-) it can do quite expressive searches, because it's more than just a full-text database.
You don't have to sell me Zope, I already like it ;-) Now, all the points you have listed are related to the indexes, *not* the metadata.
Basically: yes, it's redundant data. And yes, we all learned at university that redundant data is bad. Well, some of us learned in the real world that sometimes redundant data is not bad at all :) Especially in this case, as it is just a cache that can be automatically validated.
Automatically validated, yes; automatically generated/maintained, no: it's an additional burden on the programmer.
Well, OTOH, perhaps you are that pure in your thinking... But then you should turn the CPU caches off too; that is redundant data too. (And with write-back caches it's even redundant data that is NOT IN SYNC!)
In an ideal world these would be considered ugly workarounds for something that's either too slow or too expensive to achieve (and, BTW, most of these are transparent to the programmer, while the addition and use of metadata to the index is not). OTOH I know we don't live in an ideal world, though I like to dream ;-) -- Luca Olivetti Wetron Automatización S.A. http://www.wetron.es/ Tel. +34 93 5883004 Fax +34 93 5883007
Luca Olivetti writes:
In an ideal world these would be considered ugly workarounds for something that's either too slow or too expensive to achieve (and, BTW, most of these are transparent to the programmer, while the addition and use of metadata to the index is not). Someone already told you this. Thus, this is a repetition:
Delete all columns in the catalog's MetaData table if you do not like them. Dieter
Dieter Maurer wrote:
Luca Olivetti writes:
In an ideal world these would be considered ugly workarounds for something that's either too slow or too expensive to achieve (and, BTW, most of these are transparent to the programmer, while the addition and use of metadata to the index is not). Someone already told you this. Thus, this is a repetition:
Delete all columns in the catalog's MetaData table if you do not like them.
Are you suggesting that accessing the objects' attributes will be no more expensive than accessing the metadata? -- Luca Olivetti Wetron Automatización S.A. http://www.wetron.es/ Tel. +34 93 5883004 Fax +34 93 5883007
Luca Olivetti wrote:
Are you suggesting that accessing the objects' attributes will be no more expensive than accessing the metadata?
No, but since you've decided that every form of caching is bad, you should do this, and then optimise ZCatalog (and all attributes you want to index that are methods) so that no metadata is required. That then only leaves you the problem of solving the situation where you want to do batching of results without having to load every object you want to use into memory. Oh, and the situation where, for example, you only want to know the URL and name of the object in the search result and don't want to have to drag the whole of, say, a 10Mb file object into memory just to do so. But still, since you're happy to criticise the design of microprocessor architecture, I'm guessing this should be pretty easy for you ;-) cheers, Chris
Chris Withers wrote:
Oh, and the situation where, for example, you only want to know the URL and name of the object in the search result and don't want to have to drag the whole of, say, a 10Mb file object into memory just to do so.
This is what I was objecting to in the first place (not criticizing microprocessor design, nor deciding that every form of caching is bad): it shouldn't be necessary to load a 10Mb object into memory just to access its name. Bye -- Luca Olivetti Wetron Automatización S.A. http://www.wetron.es/ Tel. +34 93 5883004 Fax +34 93 5883007
On Wednesday 25 Sep 2002 2:32 pm, Luca Olivetti wrote:
Chris Withers wrote:
Oh, and the situation where, for example, you only want to know the URL and name of the object in the search result and don't want to have to drag the whole of, say, a 10Mb file object into memory just to do so.
This is what I was objecting to in the first place (not criticizing microprocessor design, nor deciding that every form of caching is bad): it shouldn't be necessary to load a 10Mb object into memory just to access its name.
Yes, that would be a badly designed Zope object. Zope's standard File objects store file data in a number of persistent objects separate from the one persistent object which represents its ZMI aspects. To get the title you need to load an object which is probably no more than 1k, which contains the title, id, permissions, last modified time, cached content type and size. If your objects are in a ZEO server, it might cost a network round trip. If they are in a FileStorage, it might cost a disk head seek. Even so, it's fast. Possibly fast enough for you and many other uses. But ZCatalog metadata is faster.
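Toby's description of File objects can be illustrated schematically (hypothetical classes, not Zope's actual OFS code): the bulky payload lives in separate sub-objects, so reading the small "outer" record never drags the 10Mb into memory.

```python
# Schematic of splitting a large object into a small metadata
# record plus separately-stored data chunks. In real Zope the
# chunks would be separate persistent objects loaded on demand.

class DataChunk:
    """Stands in for a separately-stored persistent chunk."""
    def __init__(self, data):
        self.data = data

class File:
    def __init__(self, id, title, data, chunk_size=64 * 1024):
        self.id = id
        self.title = title   # small: loaded with the outer record
        self.size = len(data)
        # bulky payload split into chunks, fetched only when read
        self._chunks = [DataChunk(data[i:i + chunk_size])
                        for i in range(0, len(data), chunk_size)]

f = File('report', 'Annual report', b'x' * (10 * 1024 * 1024))
print(f.title, f.size)  # touches only the small attributes
```

With this layout, listing titles in a folder costs one small record load per file; only a download needs to walk the chunks.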
Luca Olivetti wrote:
This is what I was objecting to in the first place (not criticizing microprocessor design neither deciding that every form of caching is bad): it shouldn't be necessary to load a 10Mb object into memory just to access its name.
It isn't, that's why ZCatalog stores metadata fields ;-) cheers, Chris
On Wednesday 25 Sep 2002 8:15 am, Luca Olivetti wrote:
Are you suggesting that accessing the objects' attributes will be no more expensive than accessing the metadata?
If it would help, I can suggest a patch that would slow down metadata access enough to make both operations run at the same speed. ;-)
On Tue, 2002-09-24 at 16:05, Luca Olivetti wrote:
Basically: yes, it's redundant data. And yes, we all learned at university that redundant data is bad. Well, some of us learned in the real world that sometimes redundant data is not bad at all :) Especially in this case, as it is just a cache that can be automatically validated.
Automatically validated, yes; automatically generated/maintained, no: it's an additional burden on the programmer.
That depends upon your objects. If you store your data in the right data container, it's maintained automatically. If you intend to catalog your data, I'd consider storing it in something CatalogPathAware, like my FlexData (http://www.zope.org/Members/yacc/FlexData/)
Well, OTOH, perhaps you are that pure in your thinking... But then you should turn the CPU caches off too; that is redundant data too. (And with write-back caches it's even redundant data that is NOT IN SYNC!)
In an ideal world these would be considered ugly workarounds for something that's either too slow or too expensive to achieve (and, BTW, most of these are transparent to the programmer, while the addition and use of metadata to the index is not). OTOH I know we don't live in an ideal world, though I like to dream ;-)
Not really. That's quite inexact. Take a class on numerics and you will discover that caches are not transparent at all. Or take a look at the Linux kernel, which has to deal with different kinds of caches... So talking about caches as being transparent is an oversimplification that is OK for an introduction, but it can bite you quite badly if you forget about it.
Well, that's the nice thing about university. :) And well, it's the not-so-nice thing about the real world :(
Andreas -- Andreas Kostyrka <andreas@kostyrka.priv.at>
On Monday 23 Sep 2002 9:28 am, Luca Olivetti wrote:
Toby Dickenson wrote:
This is something I don't like about Zope: that the index isn't just an index. If it wasn't so expensive to get the original objects there would be no need to keep around this (duplicated, redundant) data.
In indexes, yes, that's not really redundant, it's an index after all. But metadata? It seems it is just a workaround for a limitation of the ZODB.
The limitations are not in the ZODB, but in the objects which are being indexed. Having the ZCatalog cache methods is useful if the methods are slow, or the objects large. Folders do the same thing in a less flexible way; they always cache the metatype and id of their contained objects. If you are indexing lightweight objects with fast methods then you don't need to use ZCatalog metadata. Unlike Folders, ZCatalog gives you the choice.
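That choice shows up at query time: with metadata columns a search can return lightweight records, and the full object is loaded only on request. A schematic (hypothetical names; real ZCatalog results are "brain" records built from the metadata table):

```python
# Schematic of a catalog result record: cheap attributes come
# from stored metadata; the expensive full-object load happens
# only if explicitly requested. Hypothetical classes.

class Brain:
    """Stands in for a ZCatalog result built from metadata."""
    def __init__(self, path, title, get_object):
        self.path = path
        self.title = title          # read from stored metadata: cheap
        self._get_object = get_object
        self.loaded = False
    def getObject(self):
        self.loaded = True
        return self._get_object()   # loads the real object: expensive

def expensive_load():
    # in Zope this would traverse to and unpickle the real object
    return object()

brain = Brain('/docs/report', 'Annual report', expensive_load)
print(brain.title)  # listing search results needs no object load
```

Skip the metadata columns and every result listing degenerates into a `getObject()` call per hit, which is exactly the cost Toby and Chris are arguing about.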
participants (8)
- Andreas Kostyrka
- Chris Withers
- Christopher N. Deckard
- Dieter Maurer
- Florent Guillaume
- Luca Olivetti
- Oliver Bleutgen
- Toby Dickenson