[Zope-dev] catalog performance: query plan

Tres Seaver tseaver at palladion.com
Mon Nov 10 11:08:02 EST 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Roché Compaan wrote:
> On Mon, 2008-10-27 at 11:32 -0500, Alan Runyan wrote:
>> I agree with Tres.  A lot more can be done with Indexes and Catalog
>> without caching.
>>
>> The most exiciting development in Catalog optimizations comes out
>> Jarn.  Helge Tesdal (iirc) did a buncha work  at a RDBMS company when
>> he was in college.  He has a protoype of a query plan for ZCatalog.
>>
>> http://www.jarn.com/blog/catalog-query-plan
>>
>> I would like to ask Roche and others to look at the Query Plan.
> 
> We looked at query plan but it didn't help us in any way. Some catalog
> indexes are performing very badly and most of our content is in a
> published state which doesn't help the query plan much.
> 
>> Caching is a total PITA because invalidation machinery becomes
>> overwhelming complex and unwieldly quickly in production.
>>
> 
> I agree but this was the only thing that we could do to even go into
> production.
> 
> Since I'm in full agreement that we need to fix indexes that are
> problematic, I started doing some benchmarks on the large data set that
> gave us so many headaches. It is probably not surprising that the more
> complex indexes are performing badly. DateRangeIndex, KeywordIndex and
> Plone's ExtendedPathIndex performed the worst. Below are some stats
> showing timings around the "apply_index" call in Catalog.py that was
> done while testing the application with real data:
> 
> Index Name           |Type             |Avg Time |Calls/second
> ==============================================================
> object_implements    |KeywordIndex     |0.2172234|         4.6

This is clearly not the same issue as the other KeywordIndexes:  in
fact, I am astonished that anybody would be using a KeywordIndex for
this at all.  I would suspect that the real problem here is in the
appliation, rather than the index itself.

> getEffective_or_creat|DateIndex        |0.1941770|        5.15
> effectiveRange       |DateRangeIndex   |0.0086295|      115.88
> allowedRolesAndUsers |KeywordIndex     |0.0069754|      143.36

Hmm, I'm surprised there:  what query is being passed to 'apply_index'
for this call?

> path                 |ExtendedPathIndex|0.0040614|      246.22

I don't trust the EPI implementation at all.

> portal_type          |FieldIndex       |0.0025984|      384.84

This one is surprising:  its performance should be pretty similar to the
 other FieldIndexes (e.g., 'review_state') which map a controlled
vocabulary onto the entire corpus.  Was the query different than
'review_state' (e.g., multi-valued vs. single-valued)?

> SearchableText       |ZCTextIndex      |0.0007645|     1308.04
> sourceUID            |FieldIndex       |0.0004886|     2046.31

Probably bogus, but I don't know how it is used.

> UID                  |FieldIndex       |0.0003070|      3257.1

Note that this is the worst-case scenario for a FieldIndex:  there is
exactly one value for every key.  This shouldn't be "indexed" at all, in
fact, beyond a simple BTree (UID -> rid).

> targetUID            |FieldIndex       |0.0002287|     4372.12

I don't know what this one is used for, but it should probably be
scrapped as well.

> exact_getUserId      |FieldIndex       |0.0001931|     5177.79
> exact_getUserName    |FieldIndex       |0.0001816|     5504.39

I don't know how the application uses either of those indexes, but they
are almost certainly bogus in any normal catalog.

> relationship         |FieldIndex       |0.0000822|     12153.1
> id                   |FieldIndex       |0.0000822|    12161.81
> end                  |DateIndex        |0.0000623|    16027.48
> getGroups            |FieldIndex       |0.0000278|    35973.45

This is almost certainly bogus:  FieldIndex is not supposed to be used
with multi-valued terms.

> getArtistTitle       |FieldIndex       |0.0000259|    38495.53
> review_state         |FieldIndex       |0.0000259|    38582.22
> Subject              |KeywordIndex     |0.0000253|    39413.57

This is the use-case for which KeywordIndex is designed.  Was the query
just a single term, by chance?

> getDaysOfTheWeek     |KeywordIndex     |0.0000247|    40465.98
> meta_type            |FieldIndex       |0.0000199|    50116.64
> exact_getGroupId     |FieldIndex       |0.0000162|    61417.51
> getVideoURL          |FieldIndex       |0.0000155|     64447.5
> year                 |FieldIndex       |0.0000155|    64460.43
> Title                |FieldIndex       |0.0000136|    73381.01
> getId                |FieldIndex       |0.0000131|    76056.97
> Title                |ZCTextIndex      |0.0000128|    77809.46

This should be removed:  there is no valid use case for doing a
full-text search restricted only to the title.

> startendrange        |DateRangeIndex   |0.0000127|    78485.82
> expires              |DateIndex        |0.0000126|    79001.59
> getObjPositionInParen|FieldIndex       |0.0000124|     80675.9
> targetId             |FieldIndex       |0.0000122|    81418.68
> effective            |DateIndex        |0.0000121|     82651.7
> getProvince          |FieldIndex       |0.0000117|    85198.54
> month                |FieldIndex       |0.0000116|    85762.56
> Description          |ZCTextIndex      |0.0000116|    86241.39

Again, should be removed.

> Type                 |FieldIndex       |0.0000115|    86345.17
> getLast_login_time   |DateIndex        |0.0000115|    86698.98
> Creator              |FieldIndex       |0.0000113|    87840.03
> getEmail             |ZCTextIndex      |0.0000113|    87849.05

Should *definitely* be removed:  how can you do full-text search on an
e-mail address?

> cmf_uid              |FieldIndex       |0.0000113|    88352.13
> getDuration          |FieldIndex       |0.0000113|    88454.29
> SearchableText       |TextIndex        |0.0000113|    88466.69

Where did this one come from?  The 'SearchableText' above is a ZCTextIndex.

> sortable_title       |FieldIndex       |0.0000112|    88698.49
> getRating            |FieldIndex       |0.0000112|     88747.5
> getGenres            |KeywordIndex     |0.0000112|    88796.55
> object_provides      |KeywordIndex     |0.0000112|    88919.43
> getEventType         |KeywordIndex     |0.0000112|     88953.9
> in_reply_to          |FieldIndex       |0.0000112|    89057.46
> getReview_state      |FieldIndex       |0.0000112|    89124.63
> is_folderish         |FieldIndex       |0.0000112|    89240.51
> getRawRelatedItems   |KeywordIndex     |0.0000111|    89568.91
> getThumbSize         |FieldIndex       |0.0000111|    89653.89
> getStudioCamURL      |FieldIndex       |0.0000111|    89678.92
> Date                 |DateIndex        |0.0000111|    89799.23
> getHash              |FieldIndex       |0.0000111|    90111.54
> getNumberOfComments  |FieldIndex       |0.0000110|    90141.88
> start                |DateIndex        |0.0000110|    90400.59
> getPercentage        |FieldIndex       |0.0000110|    90420.94
> is_default_page      |FieldIndex       |0.0000110|     90446.4
> modified             |DateIndex        |0.0000106|    93506.29
> created              |DateIndex        |0.0000106|    93678.59
> getGroupId           |ZCTextIndex      |0.0000105|    94962.39
> getUserId            |ZCTextIndex      |0.0000105|    95165.88
> getFullname          |ZCTextIndex      |0.0000104|    95313.06

Scrap these.

> getRoles             |FieldIndex       |0.0000104|    95385.31
> getUserName          |ZCTextIndex      |0.0000103|    96692.46

Scrap this one.



Can you provide information on the corpus / configuration / test plan
you used to generate these results?


Tres.
- --
===================================================================
Tres Seaver          +1 540-429-0999          tseaver at palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFJGFxi+gerLs4ltQ4RAikKAJ93J3XyhecUq6kogTFsJnLXxgjh8QCgrZEw
xdycjo0+4WVdSGFKhVtenms=
=kYd+
-----END PGP SIGNATURE-----



More information about the Zope-Dev mailing list