-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Roché Compaan wrote:
On Mon, 2008-10-27 at 11:32 -0500, Alan Runyan wrote:
I agree with Tres. A lot more can be done with Indexes and Catalog without caching.
The most exiciting development in Catalog optimizations comes out Jarn. Helge Tesdal (iirc) did a buncha work at a RDBMS company when he was in college. He has a protoype of a query plan for ZCatalog.
http://www.jarn.com/blog/catalog-query-plan
I would like to ask Roche and others to look at the Query Plan.
We looked at query plan but it didn't help us in any way. Some catalog indexes are performing very badly and most of our content is in a published state which doesn't help the query plan much.
Caching is a total PITA because invalidation machinery becomes overwhelming complex and unwieldly quickly in production.
I agree but this was the only thing that we could do to even go into production.
Since I'm in full agreement that we need to fix indexes that are problematic, I started doing some benchmarks on the large data set that gave us so many headaches. It is probably not surprising that the more complex indexes are performing badly. DateRangeIndex, KeywordIndex and Plone's ExtendedPathIndex performed the worst. Below are some stats showing timings around the "apply_index" call in Catalog.py that was done while testing the application with real data:
Index Name |Type |Avg Time |Calls/second ============================================================== object_implements |KeywordIndex |0.2172234| 4.6
This is clearly not the same issue as the other KeywordIndexes: in fact, I am astonished that anybody would be using a KeywordIndex for this at all. I would suspect that the real problem here is in the appliation, rather than the index itself.
getEffective_or_creat|DateIndex |0.1941770| 5.15 effectiveRange |DateRangeIndex |0.0086295| 115.88 allowedRolesAndUsers |KeywordIndex |0.0069754| 143.36
Hmm, I'm surprised there: what query is being passed to 'apply_index' for this call?
path |ExtendedPathIndex|0.0040614| 246.22
I don't trust the EPI implementation at all.
portal_type |FieldIndex |0.0025984| 384.84
This one is surprising: its performance should be pretty similar to the other FieldIndexes (e.g., 'review_state') which map a controlled vocabulary onto the entire corpus. Was the query different than 'review_state' (e.g., multi-valued vs. single-valued)?
SearchableText |ZCTextIndex |0.0007645| 1308.04 sourceUID |FieldIndex |0.0004886| 2046.31
Probably bogus, but I don't know how it is used.
UID |FieldIndex |0.0003070| 3257.1
Note that this is the worst-case scenario for a FieldIndex: there is exactly one value for every key. This shouldn't be "indexed" at all, in fact, beyond a simple BTree (UID -> rid).
targetUID |FieldIndex |0.0002287| 4372.12
I don't know what this one is used for, but it should probably be scrapped as well.
exact_getUserId |FieldIndex |0.0001931| 5177.79 exact_getUserName |FieldIndex |0.0001816| 5504.39
I don't know how the application uses either of those indexes, but they are almost certainly bogus in any normal catalog.
relationship |FieldIndex |0.0000822| 12153.1 id |FieldIndex |0.0000822| 12161.81 end |DateIndex |0.0000623| 16027.48 getGroups |FieldIndex |0.0000278| 35973.45
This is almost certainly bogus: FieldIndex is not supposed to be used with multi-valued terms.
getArtistTitle |FieldIndex |0.0000259| 38495.53 review_state |FieldIndex |0.0000259| 38582.22 Subject |KeywordIndex |0.0000253| 39413.57
This is the use-case for which KeywordIndex is designed. Was the query just a single term, by chance?
getDaysOfTheWeek |KeywordIndex |0.0000247| 40465.98 meta_type |FieldIndex |0.0000199| 50116.64 exact_getGroupId |FieldIndex |0.0000162| 61417.51 getVideoURL |FieldIndex |0.0000155| 64447.5 year |FieldIndex |0.0000155| 64460.43 Title |FieldIndex |0.0000136| 73381.01 getId |FieldIndex |0.0000131| 76056.97 Title |ZCTextIndex |0.0000128| 77809.46
This should be removed: there is no valid use case for doing a full-text search restricted only to the title.
startendrange |DateRangeIndex |0.0000127| 78485.82 expires |DateIndex |0.0000126| 79001.59 getObjPositionInParen|FieldIndex |0.0000124| 80675.9 targetId |FieldIndex |0.0000122| 81418.68 effective |DateIndex |0.0000121| 82651.7 getProvince |FieldIndex |0.0000117| 85198.54 month |FieldIndex |0.0000116| 85762.56 Description |ZCTextIndex |0.0000116| 86241.39
Again, should be removed.
Type |FieldIndex |0.0000115| 86345.17 getLast_login_time |DateIndex |0.0000115| 86698.98 Creator |FieldIndex |0.0000113| 87840.03 getEmail |ZCTextIndex |0.0000113| 87849.05
Should *definitely* be removed: how can you do full-text search on an e-mail address?
cmf_uid |FieldIndex |0.0000113| 88352.13 getDuration |FieldIndex |0.0000113| 88454.29 SearchableText |TextIndex |0.0000113| 88466.69
Where did this one come from? The 'SearchableText' above is a ZCTextIndex.
sortable_title |FieldIndex |0.0000112| 88698.49 getRating |FieldIndex |0.0000112| 88747.5 getGenres |KeywordIndex |0.0000112| 88796.55 object_provides |KeywordIndex |0.0000112| 88919.43 getEventType |KeywordIndex |0.0000112| 88953.9 in_reply_to |FieldIndex |0.0000112| 89057.46 getReview_state |FieldIndex |0.0000112| 89124.63 is_folderish |FieldIndex |0.0000112| 89240.51 getRawRelatedItems |KeywordIndex |0.0000111| 89568.91 getThumbSize |FieldIndex |0.0000111| 89653.89 getStudioCamURL |FieldIndex |0.0000111| 89678.92 Date |DateIndex |0.0000111| 89799.23 getHash |FieldIndex |0.0000111| 90111.54 getNumberOfComments |FieldIndex |0.0000110| 90141.88 start |DateIndex |0.0000110| 90400.59 getPercentage |FieldIndex |0.0000110| 90420.94 is_default_page |FieldIndex |0.0000110| 90446.4 modified |DateIndex |0.0000106| 93506.29 created |DateIndex |0.0000106| 93678.59 getGroupId |ZCTextIndex |0.0000105| 94962.39 getUserId |ZCTextIndex |0.0000105| 95165.88 getFullname |ZCTextIndex |0.0000104| 95313.06
Scrap these.
getRoles |FieldIndex |0.0000104| 95385.31 getUserName |ZCTextIndex |0.0000103| 96692.46
Scrap this one. Can you provide information on the corpus / configuration / test plan you used to generate these results? Tres. - -- =================================================================== Tres Seaver +1 540-429-0999 tseaver@palladion.com Palladion Software "Excellence by Design" http://palladion.com -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFJGFxi+gerLs4ltQ4RAikKAJ93J3XyhecUq6kogTFsJnLXxgjh8QCgrZEw xdycjo0+4WVdSGFKhVtenms= =kYd+ -----END PGP SIGNATURE-----