ZCatalog sort_on considered harmful
I have observed suprisingly poor ZCatalog performance when using the sort_on parameter in some specific (but fairly common) circumstances. The problem is due to the loop begining line 507 in Catalog.py, which traverses the whole index of the sort key. It is very efficient if the result set contains items with duplicate sort key values. For example sort_on='author', since many objects may have the same 'Author' property. However it is not efficient if the result set does not contain duplicate sort values, and it can be very inefficient if the result set only contains a small subset of the values in the whole sort key index (for example, sort_on='create_date'). The change below switches between two different schemes, based on the relative sizes of the index and result set. Any comments? # sort. If there are scores, then this block is not # reached, therefor 'sort-on' does not happen in the # context of text index query. This should probably # sort by relevance first, then the 'sort-on' attribute. if len(rs)>len(sort_index._index): # We have more hits than _unique_ values in the sort-key index. # This means that many of the hits have the same value sort-key. # We can use the index to efficiently partition all the hits into # those groups, then sort the groups. for k, intset in sort_index._index.items(): if type(rs) is IIBType: intset=rs.intersection(intset) else: intset=intset.intersection(rs) if intset: append((k,LazyMap(self.__getitem__, intset))) else: # We have more unique values in the sort-key index than # we do hits. I think the indexes are of no help here. for r in rs: append(sort_index._unindex[r],LazyMap(self.__getitem__,[r])) (also added to the collector) Toby Dickenson
participants (1)
-
Toby Dickenson