Hi folks,
I've been playing around with adding a uniqueValuesFor capability to
Casey Duncan's CatalogQuery product.
This is more a proof-of-concept, so it isn't straightforward to use just
yet. There are probably bugs that I've overlooked, especially due to
assumptions about the types of things that BTrees and Buckets return.
Patches and explanations gratefully accepted!
Interestingly, I came up with two algorithms for uniqueValuesFor. One of
them uses the fact that the difference method of IMerge returns items,
not just keys.
The other algorithm is less tricky and more laborious.
I decided to test the algorithms using a Catalog I have lying around. It
isn't all that full, maybe a few hundred records, and a few tens of
unique values for the index I'm interested in.
I found that the first algorithm runs about ten times as fast as the second.
I also tried it on a larger Catalog (one with 2000 records), and the
first algorithm ran 30 times as fast as the second.
Attached are the diffs and the test script I used. The test script is
specific to a particular application I have, but it should be easy to
run it against another catalog.
Oh, by the way, Casey's CatalogQuery is a MonkeyPatch (those run-time
patches formerly known as hotfixes) on the ZCatalog product. I did, for
just a second, consider releasing this as a MonkeyPatch product that
patches CatalogQuery. So, it would be a MonkeyPatch of a MonkeyPatch, or
perhaps a MonkeyMonkeyPatch.
--
Steve Alexander
Software Engineer
Cat-Box limited
import Zope, ZPublisher
#c=Zope.app().school.Instructors.Catalog
c=Zope.app().school.Pupils.Catalog
from Products.CatalogQuery import CatalogQuery
cq=CatalogQuery(c._catalog, 'current==1')
field="area_upper"
a1=cq(uniqueValuesFor=field)
a2=cq(uv2=field)
a3=cq(uv3=field)
from time import time
m1=time()
a1=cq(uniqueValuesFor=field)
m2=time()
a2=cq(uv2=field)
m3=time()
a3=cq(uv3=field)
m4=time()
print "a1 took", m2-m1
print "a2 took", m3-m2
print "a3 took", m4-m3
# ---- Catalog with a few hundred records
# a1 took 0.00448000431061
# a2 took 0.0652350187302
# a3 took 0.0657860040665
# ---- Catalog with 2000 records
# a1 took 0.0895169973373
# a2 took 2.71825802326
# a3 took 2.75240898132
*** original CatalogQuery.py
--- new CatalogQuery.py
***************
*** 43,48 ****
--- 43,51 ----
from Acquisition import Implicit
from Products.ZCatalog.Lazy import LazyMap
from BTrees.IIBTree import intersection, weightedIntersection, union, weightedUnion, difference, IISet
+ from BTrees import IOBTree
+ from BTrees.OIBTree import OISet
+ from types import IntType
CQL_re = re.compile(
('\s*([_\-a-zA-Z0-9]+)\s*'
***************
*** 167,173 ****
if hasattr(self, '_v_all_keys'):
del self._v_all_keys
! def __call__(self, md={}, used=None, **kw):
"""Execute the query on the catalog"""
# Combine any keywords into the namespace mapping
--- 170,176 ----
if hasattr(self, '_v_all_keys'):
del self._v_all_keys
! def __call__(self, md={}, used=None, uniqueValuesFor=None, uv2=None, uv3=None, **kw):
"""Execute the query on the catalog"""
# Combine any keywords into the namespace mapping
***************
*** 186,191 ****
--- 189,237 ----
w, rs = weightedIntersection(rs, r)
elif block.logic == 'or':
w, rs = weightedUnion(rs, r)
+
+ if uniqueValuesFor:
+ inverse_rs=IOBTree.IOSet(difference(self._getAllKeys(), rs))
+ index=self._catalog.indexes[uniqueValuesFor]
+ results=IOBTree.difference(index._unindex, inverse_rs)
+ resultset=OISet()
+ for l in results.values():
+ resultset.update(l)
+ self._cleanUp()
+ return resultset.keys()
+ if uv2:
+ index=self._catalog.indexes[uv2]
+ doc_ids=intersection(rs, IISet(index._unindex.keys()))
+ d={} # could use an OISet here for automatic sorted order
+ for k,v in index._index.items():
+ if isinstance(v, IntType):
+ if v in doc_ids:
+ d[k]=None
+ else:
+ for i in v.keys():
+ if i in doc_ids:
+ d[k]=None
+ continue
+ resultset=d.keys()
+ resultset.sort()
+ self._cleanUp()
+ return resultset
+ if uv3:
+ index=self._catalog.indexes[uv3]
+ doc_ids=intersection(rs, IISet(index._unindex.keys()))
+ resultset=OISet()
+ for k,v in index._index.items():
+ if isinstance(v, IntType):
+ if v in doc_ids:
+ resultset.insert(k)
+ else:
+ for i in v.keys():
+ if i in doc_ids:
+ resultset.insert(k)
+ continue
+ self._cleanUp()
+ return resultset.keys()
+
# Create a lazy result map of the results
if hasattr(rs, 'keys'): rs=rs.keys()