[Zope-dev] catalog performance: query plan

Mon Nov 10 17:17:49 EST 2008

Lennart Regebro wrote:
> On Sun, Nov 9, 2008 at 19:58, Roché Compaan <roche at upfrontsystems.co.za> wrote:
>> Since I'm in full agreement that we need to fix indexes that are
>> problematic, I started doing some benchmarks on the large data set that
>> gave us so many headaches. It is probably not surprising that the more
>> complex indexes are performing badly. DateRangeIndex, KeywordIndex and
>> Plone's ExtendedPathIndex performed the worst. Below are some stats
>> showing timings around the "apply_index" call in Catalog.py that was
>> done while testing the application with real data:
> 
> ExtendedPathIndex doesn't need fixing, but we need to stop using it.
> It's done to support navigation trees from the catalog, but navigation
> should not be done via the same catalog as you do other things, but a
> dedicated tool. That would simplify and speed things up a lot. But OK,
> that's off-topic.
> 

I wonder if this could be replaced by zc.relationship / plone.relations?

There is potential for removing the five.intid / zope.app.keyreference
layer of indirection if the actual oid were stored instead, with an index
into a list of database names packed into the first byte. There would even
be room to store a reference to the object's class (using the pickle
protocol 2 registry to convert this to an integer) in the next two or
three bytes if creating ghosts were useful. This would still leave at
least 32 bits of space (about 4 billion values) for the actual object id.
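
To make the byte layout concrete, here is a rough sketch of how such an
8-byte key might be packed. The names pack_ref and unpack_ref are
hypothetical, and the exact split (1 byte for the database index, 3 bytes
for the class id, 4 bytes for the object id) is only one assumed reading
of the scheme described above, not existing ZODB or intid API.

```python
import struct

# Assumed layout of the 8-byte key (big-endian):
#   byte 0      index into a list of database names
#   bytes 1-3   integer id for the object's class
#   bytes 4-7   the actual object id (32 bits)

def pack_ref(db_index, class_id, oid):
    """Pack the three fields into a single 8-byte string."""
    if not 0 <= db_index < 2 ** 8:
        raise ValueError("db_index must fit in one byte")
    if not 0 <= class_id < 2 ** 24:
        raise ValueError("class_id must fit in three bytes")
    if not 0 <= oid < 2 ** 32:
        raise ValueError("oid must fit in four bytes")
    value = (db_index << 56) | (class_id << 32) | oid
    return struct.pack(">Q", value)

def unpack_ref(key):
    """Split an 8-byte key back into (db_index, class_id, oid)."""
    (value,) = struct.unpack(">Q", key)
    return value >> 56, (value >> 32) & 0xFFFFFF, value & 0xFFFFFFFF

key = pack_ref(1, 2, 3)
fields = unpack_ref(key)
```

Because the key is big-endian, keys from the same database and class sort
next to each other, which may or may not be desirable for the indexes.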

Without storing the aq_chain explicitly we would need to ensure that
__parent__ pointers were pickled for all content objects. The objects
themselves could be used instead of metadata rows (without a security
check it would be as simple as loading the object by its oid from the
relevant db connection). So long as all the required metadata was stored
on the object itself, only one load would be required for each object.

If this same keyreference were used in the indexes of the catalog 
instead of rowids then result sets could be merged.
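
As an illustration of what merging would look like, here is a minimal
sketch in which every index returns a set of the same 64-bit keys. Plain
Python sets stand in for whatever set type the catalog would really use,
and intersect_results and the sample keys are made up for the example.

```python
def intersect_results(*result_sets):
    """Intersect result sets that all use the same key space."""
    if not result_sets:
        return set()
    # Start from the smallest set so intermediate results stay small.
    ordered = sorted(result_sets, key=len)
    merged = set(ordered[0])
    for other in ordered[1:]:
        merged &= other
        if not merged:
            break  # nothing left to match
    return merged

# Hypothetical hits from two catalog indexes and one relation index,
# all expressed as the same 64-bit keys.
path_hits = {0x0100000200000003, 0x0100000200000004, 0x0100000500000009}
keyword_hits = {0x0100000200000004, 0x0100000500000009, 0x0200000200000004}
relation_hits = {0x0100000500000009, 0x0200000200000004}

common = intersect_results(path_hits, keyword_hits, relation_hits)
```

With per-catalog rowids this would not work, since the same integer would
refer to different objects in different indexes.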

The downside is that the set intersections would require double the
memory of the current 32-bit ids, since each key would be 64 bits wide.

Laurence