I'm curious, has anybody played around with the idea of caching ZCatalog results, and if I submitted a patch to do this, would it be accepted?

I quickly coded some basic caching of results on a volatile attribute and I was really surprised at the number of cache hits I got (especially with a Plone site that is a heavy user of the catalog).

--
Roché Compaan
Upfront Systems
http://www.upfrontsystems.co.za
Roché Compaan wrote:
I'm curious, has anybody played around with the idea of caching ZCatalog results, and if I submitted a patch to do this, would it be accepted?
I quickly coded some basic caching of results on a volatile attribute and I was really surprised at the number of cache hits I got (especially with a Plone site that is a heavy user of the catalog)
+1. I think using the 'ZCacheable' stuff (e.g., adding a RAMCacheManager and associating a catalog with it) would be the sanest path here.

Tres,
- --
===================================================================
Tres Seaver          +1 540-429-0999          tseaver@palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com
On Fri, 2007-02-23 at 06:55 -0500, Tres Seaver wrote:
Roché Compaan wrote:
I'm curious, has anybody played around with the idea of caching ZCatalog results, and if I submitted a patch to do this, would it be accepted?
I quickly coded some basic caching of results on a volatile attribute and I was really surprised at the number of cache hits I got (especially with a Plone site that is a heavy user of the catalog)
+1. I think using the 'ZCacheable' stuff (e.g., adding a RAMCacheManager and associating a catalog with it) would be the sanest path here.
Cool idea. I haven't done any coding involving OFS.Cache though. Looking at it briefly, it looks like one can modify the catalog to subclass OFS.Cache.Cacheable and then use the ZCacheable_get, ZCacheable_set and ZCacheable_invalidate methods to interact with a cache manager. This needs to be pretty explicit though. Are there any side effects that I should guard against if the catalog subclasses Cacheable?
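[Editorial sketch] The get/miss/set flow Roché describes can be sketched without a Zope installation. Everything below is an invented stand-in for illustration only: RAMCacheStub stands in for a RAMCacheManager, FakeCatalog for a Cacheable ZCatalog, and the method names here are not the real OFS.Cache API (which uses ZCacheable_get/ZCacheable_set/ZCacheable_invalidate on the Cacheable mixin).

```python
class RAMCacheStub:
    """Tiny dictionary-backed cache, standing in for a RAMCacheManager."""
    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value

    def invalidate(self):
        self._data.clear()


class FakeCatalog:
    """Illustrates the get -> miss -> compute -> set flow a cache-aware
    searchResults might use."""
    def __init__(self, cache):
        self._cache = cache
        self.searches_run = 0  # counts real (uncached) searches

    def _search(self, **query):
        self.searches_run += 1
        # Pretend the query result is a tuple of document ids.
        return tuple(range(len(query)))

    def searchResults(self, **query):
        # The keyword arguments must be part of the cache key.
        key = tuple(sorted(query.items()))
        cached = self._cache.get(key)
        if cached is not None:
            return cached
        result = self._search(**query)
        self._cache.set(key, result)
        return result


catalog = FakeCatalog(RAMCacheStub())
first = catalog.searchResults(portal_type='Document', review_state='published')
second = catalog.searchResults(portal_type='Document', review_state='published')
```

The second call never reaches `_search`; only the cache-key discipline matters here, not the stubbed-out search itself.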
Roché Compaan wrote:
On Fri, 2007-02-23 at 06:55 -0500, Tres Seaver wrote:
Roché Compaan wrote:
I'm curious, has anybody played around with the idea of caching ZCatalog results, and if I submitted a patch to do this, would it be accepted?
I quickly coded some basic caching of results on a volatile attribute and I was really surprised at the number of cache hits I got (especially with a Plone site that is a heavy user of the catalog)

+1. I think using the 'ZCacheable' stuff (e.g., adding a RAMCacheManager and associating a catalog with it) would be the sanest path here.
Cool idea. I haven't done any coding involving OFS.Cache though. Looking at it briefly, it looks like one can modify the catalog to subclass OFS.Cache.Cacheable and then use the ZCacheable_get, ZCacheable_set and ZCacheable_invalidate methods to interact with a cache manager. This needs to be pretty explicit though. Are there any side effects that I should guard against if the catalog subclasses Cacheable?
I don't think so. Here are some random thoughts on the idea:

- The 'searchResults' method must pass its keyword arguments as part of the cache key.

- I don't know if there is a reasonable way to do 'mtime' for the catalog: we would like to be able to get an mtime cheaply for the BTrees (indexes, the 'data' container), but I don't know if that is possible.

- The "right" place to do this feels like the 'searchResults' of ZCatalog, just before it calls 'self._catalog.searchResults'.

- The CMF's catalog overrides 'searchResults', but calls it at the end, so everything there should work.

Tres.
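[Editorial sketch] Tres's first bullet, building the cache key from the keyword arguments, has one wrinkle: ZCatalog queries routinely contain lists and dicts (e.g. range queries), which are not hashable as-is. The helper names below (`_freeze`, `make_cache_key`) are hypothetical, not part of any Zope API.

```python
def _freeze(value):
    """Recursively convert lists/dicts/sets into hashable tuples."""
    if isinstance(value, dict):
        return tuple(sorted((k, _freeze(v)) for k, v in value.items()))
    if isinstance(value, (list, tuple, set)):
        # Sort by repr so equivalent collections yield the same key.
        return tuple(_freeze(v) for v in sorted(value, key=repr))
    return value


def make_cache_key(**query):
    # Sort pairs so keyword order does not change the key.
    return tuple(sorted((k, _freeze(v)) for k, v in query.items()))


key1 = make_cache_key(portal_type=['Document', 'News Item'],
                      created={'query': '2007-02-23', 'range': 'min'})
key2 = make_cache_key(created={'range': 'min', 'query': '2007-02-23'},
                      portal_type=['News Item', 'Document'])
```

Keyword order and collection order differ between the two calls, yet both normalize to the same hashable key, which is exactly what a cache lookup needs.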
Tres Seaver wrote:
Roché Compaan wrote:
On Fri, 2007-02-23 at 06:55 -0500, Tres Seaver wrote:
Roché Compaan wrote:
I'm curious, has anybody played around with the idea of caching ZCatalog results, and if I submitted a patch to do this, would it be accepted?
I quickly coded some basic caching of results on a volatile attribute and I was really surprised at the number of cache hits I got (especially with a Plone site that is a heavy user of the catalog)

+1. I think using the 'ZCacheable' stuff (e.g., adding a RAMCacheManager and associating a catalog with it) would be the sanest path here.

Cool idea. I haven't done any coding involving OFS.Cache though. Looking at it briefly, it looks like one can modify the catalog to subclass OFS.Cache.Cacheable and then use the ZCacheable_get, ZCacheable_set and ZCacheable_invalidate methods to interact with a cache manager. This needs to be pretty explicit though. Are there any side effects that I should guard against if the catalog subclasses Cacheable?
I don't think so. Here are some random thoughts on the idea:
- The 'searchResults' method must pass its keyword arguments as part of the cache key.
- I don't know if there is a reasonable way to do 'mtime' for the catalog: we would like to be able to get an mtime cheaply for the BTrees (indexes, the 'data' container), but I don't know if that is possible.
- The "right" place to do this feels like the 'searchResults' of ZCatalog, just before it calls 'self._catalog.searchResults'.
- The CMF's catalog overrides 'searchResults', but calls it at the end, so everything there should work.
Hmm, on further thought:

- It isn't safe to stash persistent objects in the RAM Cache manager, because they can't be used safely from another database connection.

- The result set you get back from a query is a "lazy", which will be consumed by each client: no two clients will see the same thing.

Maybe this won't work, after all.

Tres.
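[Editorial sketch] Tres's second objection, a shared lazy result being consumed, is the classic one-shot-iterator hazard. A plain Python generator stands in here for ZCatalog's Lazy/LazyMap classes (an assumption: the real classes are richer, but a cached iterator shared between clients has exactly this failure mode).

```python
def lazy_results():
    """Stand-in for a lazy catalog result: values are produced on demand."""
    for docid in (11, 22, 33):
        yield docid


shared = lazy_results()   # imagine this object sitting in a RAM cache
client_a = list(shared)   # the first client consumes the iterator...
client_b = list(shared)   # ...and the second client finds it exhausted
```

The first client sees all three ids; the second sees an empty result, which is why the raw id sequence, not the lazy wrapper, is the thing to cache.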
On Fri, 2007-02-23 at 12:09 -0500, Tres Seaver wrote:
Tres Seaver wrote:
Roché Compaan wrote:
On Fri, 2007-02-23 at 06:55 -0500, Tres Seaver wrote:
Roché Compaan wrote:
I'm curious, has anybody played around with the idea of caching ZCatalog results, and if I submitted a patch to do this, would it be accepted?
I quickly coded some basic caching of results on a volatile attribute and I was really surprised at the number of cache hits I got (especially with a Plone site that is a heavy user of the catalog)

+1. I think using the 'ZCacheable' stuff (e.g., adding a RAMCacheManager and associating a catalog with it) would be the sanest path here.

Cool idea. I haven't done any coding involving OFS.Cache though. Looking at it briefly, it looks like one can modify the catalog to subclass OFS.Cache.Cacheable and then use the ZCacheable_get, ZCacheable_set and ZCacheable_invalidate methods to interact with a cache manager. This needs to be pretty explicit though. Are there any side effects that I should guard against if the catalog subclasses Cacheable?
I don't think so. Here are some random thoughts on the idea:
- The 'searchResults' method must pass its keyword arguments as part of the cache key.
- I don't know if there is a reasonable way to do 'mtime' for the catalog: we would like to be able to get an mtime cheaply for the BTrees (indexes, the 'data' container), but I don't know if that is possible.
- The "right" place to do this feels like the 'searchResults' of ZCatalog, just before it calls 'self._catalog.searchResults'.
- The CMF's catalog overrides 'searchResults', but calls it at the end, so everything there should work.
In my prototype I also wired the caching into searchResults:

    def searchResults(self, REQUEST=None, used=None, _merge=1, **kw):
        ...
        cache_key = None
        if args:
            cache_key = self._makeCacheKey(args)
        result = self._getCachedResult(cache_key)
        if result:
            return result
        return self._cacheResult(
            cache_key,
            self.search(args, sort_index, reverse, sort_limit, _merge))
Hmm, on further thought:
- It isn't safe to stash persistent objects in the RAM Cache manager, because they can't be used safely from another database connection.
But the lazy map of brains isn't persistent?
- The result set you get back from a query is a "lazy", which will be consumed by each client: no two clients will see the same thing.
I don't follow. The Lazy will contain a set of document ids that will be the same for all clients, no?

I got satisfactory results by storing results in a volatile attribute (and they are not shared by clients). I'm still curious to see what can be achieved with ZCacheable to extend the lifetime of the cache.
Tres Seaver wrote:
Tres Seaver wrote:
Roché Compaan wrote:
On Fri, 2007-02-23 at 06:55 -0500, Tres Seaver wrote:
Roché Compaan wrote:
I'm curious, has anybody played around with the idea of caching ZCatalog results, and if I submitted a patch to do this, would it be accepted?
I quickly coded some basic caching of results on a volatile attribute and I was really surprised at the number of cache hits I got (especially with a Plone site that is a heavy user of the catalog)

+1. I think using the 'ZCacheable' stuff (e.g., adding a RAMCacheManager and associating a catalog with it) would be the sanest path here.

Cool idea. I haven't done any coding involving OFS.Cache though. Looking at it briefly, it looks like one can modify the catalog to subclass OFS.Cache.Cacheable and then use the ZCacheable_get, ZCacheable_set and ZCacheable_invalidate methods to interact with a cache manager. This needs to be pretty explicit though. Are there any side effects that I should guard against if the catalog subclasses Cacheable?

I don't think so. Here are some random thoughts on the idea:
- The 'searchResults' method must pass its keyword arguments as part of the cache key.
- I don't know if there is a reasonable way to do 'mtime' for the catalog: we would like to be able to get an mtime cheaply for the BTrees (indexes, the 'data' container), but I don't know if that is possible.
- The "right" place to do this feels like the 'searchResults' of ZCatalog, just before it calls 'self._catalog.searchResults'.
- The CMF's catalog overrides 'searchResults', but calls it at the end, so everything there should work.
Hmm, on further thought:
- It isn't safe to stash persistent objects in the RAM Cache manager, because they can't be used safely from another database connection.
- The result set you get back from a query is a "lazy", which will be consumed by each client: no two clients will see the same thing.
Maybe this won't work, after all.
I have had little exposure to the actual innards of the Zope 2 catalog, so this might be too wild for good old Zope 2, but here it goes:

In Zope 3, catalog indices actually only deal with integers (int ids) that only *represent* actual objects (this is for optimization reasons, and so they don't actually have to hang on to persistent objects themselves). This way a Zope 3 catalog search result is actually a list of integers, not a list of objects (or brains, or whatever). Of course, you can use that list of integers to look up the corresponding objects with an integer ID utility (and there is a convenience API for that, but that's not important).

A wild guess is that the Zope 2 index does the same or at least something similar. Perhaps not in a nice componentized manner like Zope 3 does (using a separate utility for the int id mapping), but I do recall the ZCatalog storing "uids". RAM caching those integers should be absolutely possible. Heck, they don't even need much space and can be kept efficiently in data structures...

It may require a bit of hacking the catalog, of course. Perhaps it's time to start thinking about componentizing the Zope 2 catalog to make such things easier in the future?

--
http://worldcookery.com -- Professional Zope documentation and training
Next Zope 3 training at Camp5: http://trizpug.org/boot-camp/camp5
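[Editorial sketch] Philipp's suggestion, cache only the integer ids and resolve them to objects per request, can be sketched as follows. The `documents` mapping and `run_query` helper are invented stand-ins for the catalog's uid-to-object machinery; only the shape of the idea is real.

```python
query_cache = {}   # cache key -> tuple of int ids (small, immutable, safe to share)
documents = {1: 'doc-one', 2: 'doc-two', 3: 'doc-three'}  # fake uid -> object map


def run_query(key, searcher):
    """Return resolved objects; only the int ids ever live in the cache."""
    ids = query_cache.get(key)
    if ids is None:
        ids = tuple(searcher())          # the expensive index search, ids only
        query_cache[key] = ids
    # Resolution happens per request, so each connection gets its own objects.
    return [documents[i] for i in ids]


hits = run_query(('portal_type', 'Document'), lambda: [1, 3])
# Cache hit: the searcher is never called, the cached ids are re-resolved.
again = run_query(('portal_type', 'Document'), lambda: [2])
```

Because only immutable integers are cached, the connection-affinity problem with persistent objects and the consumed-lazy problem both disappear.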
On 2/23/07, Philipp von Weitershausen <philipp@weitershausen.de> wrote:
It may require a bit of hacking the catalog, of course. Perhaps it's time to start thinking about componentizing the Zope 2 catalog to make such things easier in the future?
Yup. It would also be interesting to look into making it faster with huge datasets, something that is a problem now. I think it's because you search each index separately and intersect the results, and only then pick out the first 20 results.

Compare with Lucene, for example, which instead creates iterators that only return the "next match". This saves you a lot of index searching when you have big results. I don't know if it is feasible to do something like that, but it would be interesting to look into it.

--
Lennart Regebro: Python, Zope, CPS, Plone consulting.
+33 661 58 14 64
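[Editorial sketch] The "next match" style Lennart describes can be illustrated with a lazy intersection of two sorted id streams: each index is only advanced far enough to produce the hits actually requested. This is a generic merge-intersection sketch, not code from Lucene or from any Zope index.

```python
from itertools import count, islice


def lazy_intersection(a, b):
    """Yield ids present in both sorted iterables, advancing incrementally."""
    ia, ib = iter(a), iter(b)
    try:
        x, y = next(ia), next(ib)
        while True:
            if x == y:
                yield x
                x, y = next(ia), next(ib)
            elif x < y:
                x = next(ia)      # advance only the stream that is behind
            else:
                y = next(ib)
    except StopIteration:
        return


# Pretend each index can enumerate matching ids in sorted order without
# materializing its full result set (these streams are infinite!).
evens = (i for i in count(0) if i % 2 == 0)
threes = (i for i in count(0) if i % 3 == 0)
first_five = list(islice(lazy_intersection(evens, threes), 5))
```

Both input streams are infinite, yet taking the first five common ids terminates, which is precisely the saving over intersect-everything-then-slice.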
On Fri, 2007-02-23 at 21:25 +0100, Lennart Regebro wrote:
On 2/23/07, Philipp von Weitershausen <philipp@weitershausen.de> wrote:
It may require a bit of hacking the catalog, of course. Perhaps it's time to start thinking about componentizing the Zope 2 catalog to make such things easier in the future?
Yup. It would also be interesting to look into making it faster with huge datasets, something that is a problem now. I think it's because you search each index separately and intersect the results, and only then pick out the first 20 results.
It is a "making it faster" urge that led me to thinking about caching results. I'm curious about your use case, the size of your dataset, and how you think Lucene might help you.

We have an application that has about a million objects catalogued. With only a few objects in the catalog, a search takes about 1 millisecond. This increases logarithmically to 20 milliseconds for 500 000 objects and about 21 milliseconds for 1 million objects. 20 milliseconds is fast enough for most of our use cases, except for one use case where we add about 100 objects in a single transaction. These objects have Archetypes references that lead to a massive amount of catalog queries. To be fair, this is an Archetypes problem and not a catalog one, but it did prove to be an interesting optimisation exercise that led me to thinking about caching ZCatalog results. In this particular case, creating 100 objects leads to about 1000 catalog searches taking 20 milliseconds each. That is 20 seconds in total.

So given the above, an application with a million objects using the ZCatalog can basically do 50 catalog searches per second if it wants to remain responsive to the user. Maybe this is more than enough, I don't know, but with apps like Plone that rely heavily on the catalog, optimisation of catalog operations can surely help improve scalability.
Lennart Regebro wrote at 2007-2-23 21:25 +0100:
... Compared with Lucene for example, which instead will create iterators that will only return the "next match". This saves you from a lot of index searching when you have big results.
I don't know if it is feasible to do something like that, but it would be interesting to look into it.
It is done in "IncrementalSearch2". <http://www.dieter.handshake.de/pyprojects/zope> -- Dieter
On 3/6/07, Dieter Maurer <dieter@handshake.de> wrote:
Lennart Regebro wrote at 2007-2-23 21:25 +0100:
... Compared with Lucene for example, which instead will create iterators that will only return the "next match". This saves you from a lot of index searching when you have big results.
I don't know if it is feasible to do something like that, but it would be interesting to look into it.
It is done in "IncrementalSearch2".
Cool! This also needs to be in core, as does most of your stuff. ;-)

--
Lennart Regebro: Zope and Plone consulting.
http://www.colliberty.com/
+33 661 58 14 64
Roché Compaan wrote at 2007-2-23 18:44 +0200:
... Cool idea. I haven't done any coding involving OFS.Cache though. Looking at it briefly it looks like one can modify the catalog to subclass OFS.Cacheable and then use the ZCacheable_get, ZCacheable_set and ZCacheable_invalidate methods to interact with a cache manager. This needs to be pretty explicit though. Are there any side effects that I should guard against if the catalog subclasses OFS.Cache?
A RAMCache cannot cache the result after "LazyMap" has been applied to it. The result before "LazyMap" can be cached, and the cached value needs to be "LazyMap"ped before it is returned.

--
Dieter
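[Editorial sketch] Dieter's point, cache the pre-LazyMap value and wrap a fresh lazy view around it for every caller, looks roughly like this. `SimpleLazyMap` is a toy stand-in for Zope's LazyMap, and `cached_search` is a hypothetical helper; neither is real Zope code.

```python
class SimpleLazyMap:
    """Applies 'func' to items of 'seq' on access, memoizing per instance."""
    def __init__(self, func, seq):
        self._func, self._seq, self._memo = func, seq, {}

    def __len__(self):
        return len(self._seq)

    def __getitem__(self, i):
        if i not in self._memo:
            self._memo[i] = self._func(self._seq[i])
        return self._memo[i]


raw_cache = {}   # cache key -> raw id tuple (the pre-LazyMap value)


def cached_search(key, searcher, make_brain):
    ids = raw_cache.get(key)
    if ids is None:
        ids = tuple(searcher())            # only the plain ids are cached
        raw_cache[key] = ids
    # Each caller gets its own wrapper, so nobody consumes anyone else's view.
    return SimpleLazyMap(make_brain, ids)


r1 = cached_search('q', lambda: [7, 8], lambda i: 'brain-%d' % i)
r2 = cached_search('q', lambda: [], lambda i: 'brain-%d' % i)  # cache hit
```

The second call hits the cache (its searcher returning nothing is never consulted) yet still yields an independent, fully usable lazy view.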
On Fri, 2007-02-23 at 20:43 +0100, Dieter Maurer wrote:
Roché Compaan wrote at 2007-2-23 18:44 +0200:
... Cool idea. I haven't done any coding involving OFS.Cache though. Looking at it briefly it looks like one can modify the catalog to subclass OFS.Cacheable and then use the ZCacheable_get, ZCacheable_set and ZCacheable_invalidate methods to interact with a cache manager. This needs to be pretty explicit though. Are there any side effects that I should guard against if the catalog subclasses OFS.Cache?
A RAMCache cannot cache the result after "LazyMap" has been applied to it. The result before "LazyMap" can be cached, and the cached value needs to be "LazyMap"ped before it is returned.
Thanks for that pointer. It's good that way; it should make invalidation easier. It could be as simple as invalidating any cached result that contains the documentId being indexed. Do you see any problem with the following invalidation strategy?

If the 'documentId' exists (cataloging an existing object), invalidate all cached result sets that contain the documentId.

If the 'documentId' doesn't exist (cataloging a new object), invalidate all result sets where the ids of the indexes applied are contained in the cache key for that result set.
Roché Compaan wrote at 2007-2-23 22:00 +0200:
... Thanks for that pointer. It's good that way, it should make invalidation easier. It could be as simple as invalidating any cached result that contains the documentId being indexed. Do you see any problem with the following invalidation strategy:
If the 'documentId' exists (cataloging existing object), invalidate all cached result sets that contain the documentId.
If the 'documentId' doesn't exist (cataloging new object), invalidate all result sets where the ids of indexes applied, are contained in the cache key for that result set.
I see several problems:

* the RAMCacheManager does not provide an API to implement this policy.

* a cache manager would need a special data structure to efficiently implement the policy (given a documentId, find all cached results containing the documentId).

* Apparently, your cache key contains the indexes involved in producing the result. The problem with this is that these indexes are known only after the query has been performed: the catalog API allows indexes to respond to subqueries that do not contain their own name. I use this feature to allow a "Managable RangeIndex" to transparently replace "effective, expires" queries. But otherwise, the feature is probably not used intensively. Of course, you can add the information *after* the query has been performed and use it for invalidation -- in a specialized cache manager.

On the other hand, new objects are usually indexed with all available (and not only a few) indexes. While some of the indexes may not be able to determine a sensible value for the document, the standard indexes have problems handling this properly ("ManagableIndex"es can) and the API does not propagate the information.

--
Dieter
On Sat, 2007-02-24 at 09:48 +0100, Dieter Maurer wrote:
Roché Compaan wrote at 2007-2-23 22:00 +0200:
... Thanks for that pointer. It's good that way, it should make invalidation easier. It could be as simple as invalidating any cached result that contains the documentId being indexed. Do you see any problem with the following invalidation strategy:
If the 'documentId' exists (cataloging existing object), invalidate all cached result sets that contain the documentId.
If the 'documentId' doesn't exist (cataloging new object), invalidate all result sets where the ids of indexes applied, are contained in the cache key for that result set.
I see several problems:
* the RAMCacheManager does not provide an API to implement this policy
* a cache manager would need a special data structure to efficiently implement the policy (given a documentId, find all cached results containing the documentId).
Can you elaborate? Would an IISet be efficient?
* Apparently, your cache key contains the indexes involved in producing the result.
This is coincidental. I'm building a cache key from all arguments passed in as keyword arguments and on the REQUEST.
The problem with this is that these indexes are known only after the query has been performed:
The catalog API allows indexes to respond to subqueries, that do not contain their own name.
I use this feature to allow a "Managable RangeIndex" to transparently replace "effective, expires" queries.
But otherwise, the feature is probably not used intensively.
If these parameters are on the request or in keywords they will form part of the cache key.
Of course, you can add the information *after* the query has been performed and use it for invalidation -- in a specialized cache manager.
On the other hand, new objects are usually indexed with all available (and not only a few) indexes.
While some of the indexes may not be able to determine a sensible value for the document, the standard indexes have problems handling this properly ("ManagableIndex"es can) and the API does not propagate the information.
I think it will not be trivial to implement invalidation that doesn't bite you. I thought of checking for document ids because invalidating results when a whole index changes might cause too many invalidations. For example, querying for the same UID of an object should yield a cached result most of the time. Indexing a new object's UID shouldn't invalidate the cached results for existing UID queries.

Let's assume we have a specialised cache manager and a cache that copes with the subtleties of subqueries: do you think that invalidating the cache according to the logic I suggested would work? Can you think of cases where it can lead to stale results that one should guard against?
Roché Compaan wrote at 2007-2-25 11:48 +0200:
...
I see several problems:
* the RAMCacheManager does not provide an API to implement this policy
* a cache manager would need a special data structure to efficiently implement the policy (given a documentId, find all cached results containing the documentId).
Can you elaborate. Would and IISet be efficient?
You need a mapping "documentId --> cached results" and (maybe) the inverse map in order to update the first one when cached results are invalidated for different reasons (e.g. timed out).
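[Editorial sketch] The bookkeeping Dieter describes, a forward map from documentId to the cache keys whose results contain it, plus the inverse map so both stay consistent when an entry is dropped for another reason (e.g. a timeout), can be sketched with plain dicts and sets. All names here are illustrative; a real cache manager would likely use BTrees/IISets for these maps.

```python
results = {}    # cache key -> tuple of document ids (the cached result)
by_docid = {}   # documentId -> set of cache keys containing it
by_key = {}     # cache key -> set of documentIds (the inverse map)


def store(key, ids):
    results[key] = tuple(ids)
    by_key[key] = set(ids)
    for docid in ids:
        by_docid.setdefault(docid, set()).add(key)


def drop(key):
    """Remove one cached result and unlink it from the forward map."""
    results.pop(key, None)
    for docid in by_key.pop(key, ()):
        keys = by_docid.get(docid)
        if keys:
            keys.discard(key)
            if not keys:
                del by_docid[docid]


def invalidate_docid(docid):
    """Reindexing docid invalidates every cached result containing it."""
    for key in list(by_docid.get(docid, ())):
        drop(key)


store('q1', [1, 2])
store('q2', [2, 3])
invalidate_docid(2)   # both cached results contained docid 2
```

The inverse map is what makes `drop` cheap: without it, a timeout eviction would have to scan every docid's key set to stay consistent.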
* Apparently, your cache key contains the indexes involved in producing the result.
This is coincidental. I'm building a cache key from all arguments passed in as keyword arguments and on the REQUEST.
The problem with this is that these indexes are known only after the query has been performed:
The catalog API allows indexes to respond to subqueries, that do not contain their own name.
I use this feature to allow a "Managable RangeIndex" to transparently replace "effective, expires" queries.
But otherwise, the feature is probably not used intensively.
If these parameters are on the request or in keywords they will form part of the cache key.
They are on the request but do not identify the index directly. Let's make an example: my "ValidityRange" "Managable RangeIndex" can respond to queries containing subqueries "effective <= some_time <= expires". The request contains "effective" and "expires" but does not indicate that in fact the "ValidityRange" index is relevant.

If I correctly understood your policy, you want to invalidate any cached results depending on an index which is updated. Thus, how do you recognize (from the request alone) that you should invalidate all queries including "effective" and "expires" when the "ValidityRange" index is updated?
... Let's assume we have a specialised cache manager and a cache that copes with the subtleties of subqueries: do you think that invalidating the cache according to the logic I suggested would work? Can you think of cases where it can lead to stale results that one should guard against?
Your original policy seemed quite conservative to me. You weaken that conservatism when you want to handle more special cases of non-invalidation (such as "do not invalidate UID query results when new entries are added").

--
Dieter
participants (5)

- Dieter Maurer
- Lennart Regebro
- Philipp von Weitershausen
- Roché Compaan
- Tres Seaver