[Zope-dev] catalog performance: query plan
Tres Seaver
tseaver at palladion.com
Mon Nov 10 11:08:02 EST 2008
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Roché Compaan wrote:
> On Mon, 2008-10-27 at 11:32 -0500, Alan Runyan wrote:
>> I agree with Tres. A lot more can be done with Indexes and Catalog
>> without caching.
>>
>> The most exiciting development in Catalog optimizations comes out
>> Jarn. Helge Tesdal (iirc) did a buncha work at a RDBMS company when
>> he was in college. He has a protoype of a query plan for ZCatalog.
>>
>> http://www.jarn.com/blog/catalog-query-plan
>>
>> I would like to ask Roche and others to look at the Query Plan.
>
> We looked at query plan but it didn't help us in any way. Some catalog
> indexes are performing very badly and most of our content is in a
> published state which doesn't help the query plan much.
>
>> Caching is a total PITA because invalidation machinery becomes
>> overwhelming complex and unwieldly quickly in production.
>>
>
> I agree but this was the only thing that we could do to even go into
> production.
>
> Since I'm in full agreement that we need to fix indexes that are
> problematic, I started doing some benchmarks on the large data set that
> gave us so many headaches. It is probably not surprising that the more
> complex indexes are performing badly. DateRangeIndex, KeywordIndex and
> Plone's ExtendedPathIndex performed the worst. Below are some stats
> showing timings around the "apply_index" call in Catalog.py that was
> done while testing the application with real data:
>
> Index Name |Type |Avg Time |Calls/second
> ==============================================================
> object_implements |KeywordIndex |0.2172234| 4.6
This is clearly not the same issue as the other KeywordIndexes: in
fact, I am astonished that anybody would be using a KeywordIndex for
this at all. I would suspect that the real problem here is in the
appliation, rather than the index itself.
> getEffective_or_creat|DateIndex |0.1941770| 5.15
> effectiveRange |DateRangeIndex |0.0086295| 115.88
> allowedRolesAndUsers |KeywordIndex |0.0069754| 143.36
Hmm, I'm surprised there: what query is being passed to 'apply_index'
for this call?
> path |ExtendedPathIndex|0.0040614| 246.22
I don't trust the EPI implementation at all.
> portal_type |FieldIndex |0.0025984| 384.84
This one is surprising: its performance should be pretty similar to the
other FieldIndexes (e.g., 'review_state') which map a controlled
vocabulary onto the entire corpus. Was the query different than
'review_state' (e.g., multi-valued vs. single-valued)?
> SearchableText |ZCTextIndex |0.0007645| 1308.04
> sourceUID |FieldIndex |0.0004886| 2046.31
Probably bogus, but I don't know how it is used.
> UID |FieldIndex |0.0003070| 3257.1
Note that this is the worst-case scenario for a FieldIndex: there is
exactly one value for every key. This shouldn't be "indexed" at all, in
fact, beyond a simple BTree (UID -> rid).
> targetUID |FieldIndex |0.0002287| 4372.12
I don't know what this one is used for, but it should probably be
scrapped as well.
> exact_getUserId |FieldIndex |0.0001931| 5177.79
> exact_getUserName |FieldIndex |0.0001816| 5504.39
I don't know how the application uses either of those indexes, but they
are almost certainly bogus in any normal catalog.
> relationship |FieldIndex |0.0000822| 12153.1
> id |FieldIndex |0.0000822| 12161.81
> end |DateIndex |0.0000623| 16027.48
> getGroups |FieldIndex |0.0000278| 35973.45
This is almost certainly bogus: FieldIndex is not supposed to be used
with multi-valued terms.
> getArtistTitle |FieldIndex |0.0000259| 38495.53
> review_state |FieldIndex |0.0000259| 38582.22
> Subject |KeywordIndex |0.0000253| 39413.57
This is the use-case for which KeywordIndex is designed. Was the query
just a single term, by chance?
> getDaysOfTheWeek |KeywordIndex |0.0000247| 40465.98
> meta_type |FieldIndex |0.0000199| 50116.64
> exact_getGroupId |FieldIndex |0.0000162| 61417.51
> getVideoURL |FieldIndex |0.0000155| 64447.5
> year |FieldIndex |0.0000155| 64460.43
> Title |FieldIndex |0.0000136| 73381.01
> getId |FieldIndex |0.0000131| 76056.97
> Title |ZCTextIndex |0.0000128| 77809.46
This should be removed: there is no valid use case for doing a
full-text search restricted only to the title.
> startendrange |DateRangeIndex |0.0000127| 78485.82
> expires |DateIndex |0.0000126| 79001.59
> getObjPositionInParen|FieldIndex |0.0000124| 80675.9
> targetId |FieldIndex |0.0000122| 81418.68
> effective |DateIndex |0.0000121| 82651.7
> getProvince |FieldIndex |0.0000117| 85198.54
> month |FieldIndex |0.0000116| 85762.56
> Description |ZCTextIndex |0.0000116| 86241.39
Again, should be removed.
> Type |FieldIndex |0.0000115| 86345.17
> getLast_login_time |DateIndex |0.0000115| 86698.98
> Creator |FieldIndex |0.0000113| 87840.03
> getEmail |ZCTextIndex |0.0000113| 87849.05
Should *definitely* be removed: how can you do full-text search on an
e-mail address?
> cmf_uid |FieldIndex |0.0000113| 88352.13
> getDuration |FieldIndex |0.0000113| 88454.29
> SearchableText |TextIndex |0.0000113| 88466.69
Where did this one come from? The 'SearchableText' above is a ZCTextIndex.
> sortable_title |FieldIndex |0.0000112| 88698.49
> getRating |FieldIndex |0.0000112| 88747.5
> getGenres |KeywordIndex |0.0000112| 88796.55
> object_provides |KeywordIndex |0.0000112| 88919.43
> getEventType |KeywordIndex |0.0000112| 88953.9
> in_reply_to |FieldIndex |0.0000112| 89057.46
> getReview_state |FieldIndex |0.0000112| 89124.63
> is_folderish |FieldIndex |0.0000112| 89240.51
> getRawRelatedItems |KeywordIndex |0.0000111| 89568.91
> getThumbSize |FieldIndex |0.0000111| 89653.89
> getStudioCamURL |FieldIndex |0.0000111| 89678.92
> Date |DateIndex |0.0000111| 89799.23
> getHash |FieldIndex |0.0000111| 90111.54
> getNumberOfComments |FieldIndex |0.0000110| 90141.88
> start |DateIndex |0.0000110| 90400.59
> getPercentage |FieldIndex |0.0000110| 90420.94
> is_default_page |FieldIndex |0.0000110| 90446.4
> modified |DateIndex |0.0000106| 93506.29
> created |DateIndex |0.0000106| 93678.59
> getGroupId |ZCTextIndex |0.0000105| 94962.39
> getUserId |ZCTextIndex |0.0000105| 95165.88
> getFullname |ZCTextIndex |0.0000104| 95313.06
Scrap these.
> getRoles |FieldIndex |0.0000104| 95385.31
> getUserName |ZCTextIndex |0.0000103| 96692.46
Scrap this one.
Can you provide information on the corpus / configuration / test plan
you used to generate these results?
Tres.
- --
===================================================================
Tres Seaver +1 540-429-0999 tseaver at palladion.com
Palladion Software "Excellence by Design" http://palladion.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFJGFxi+gerLs4ltQ4RAikKAJ93J3XyhecUq6kogTFsJnLXxgjh8QCgrZEw
xdycjo0+4WVdSGFKhVtenms=
=kYd+
-----END PGP SIGNATURE-----
More information about the Zope-Dev
mailing list