Hi folks, I've been playing around with adding a uniqueValuesFor capability to Casey Duncan's CatalogQuery product. This is more of a proof-of-concept, so it isn't straightforward to use just yet. There are probably bugs that I've overlooked, especially due to assumptions about the types of things that BTrees and Buckets return. Patches and explanations gratefully accepted!

Interestingly, I came up with two algorithms for uniqueValuesFor. One of them exploits the fact that the difference method of IMerge returns items, not just keys. The other algorithm is less tricky and more laborious.

I decided to test the algorithms using a Catalog I have lying around. It isn't all that full: maybe a few hundred records, and a few tens of unique values for the index I'm interested in. I found that the first algorithm runs about ten times as fast as the second. I also tried it on a larger Catalog (one with 2000 records), and there the first algorithm ran 30 times as fast as the second.

Attached are the diffs and the test script I used. The test script is specific to a particular application I have, but it should be easy to run it against another catalog.

Oh, by the way, Casey's CatalogQuery is a MonkeyPatch (those run-time patches formerly known as hotfixes) on the ZCatalog product. I did, for just a second, consider releasing this as a MonkeyPatch product that patches CatalogQuery. So, it would be a MonkeyPatch of a MonkeyPatch, or perhaps a MonkeyMonkeyPatch.
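To make the two strategies concrete, here is a rough sketch using plain dicts and sets in place of the IOBTree / IISet structures, with made-up data. "Algorithm 1" subtracts the non-matching document ids from the unindex while keeping the stored values (which is what a mapping-vs-set difference gives you for free); "algorithm 2" walks the whole forward index and tests each posting list against the result set.

```python
# Made-up stand-ins for the catalog structures:
unindex = {1: 'red', 2: 'blue', 3: 'red', 4: 'green', 5: 'blue'}  # docid -> value
index = {'red': {1, 3}, 'blue': {2, 5}, 'green': {4}}             # value -> docids
rs = {1, 2, 3}   # document ids matched by the query

# Algorithm 1: difference the unindex against the ids NOT in the result
# set, keeping the values of whatever survives, then dedupe.
inverse_rs = set(unindex) - rs
uniques1 = sorted({v for k, v in unindex.items() if k not in inverse_rs})

# Algorithm 2: scan every value in the forward index and keep it if its
# posting list intersects the result set.
uniques2 = sorted(value for value, docids in index.items() if docids & rs)

print(uniques1)   # ['blue', 'red']
print(uniques2)   # ['blue', 'red']
```

Algorithm 1 only touches the documents themselves, while algorithm 2 visits every distinct value in the index whether it matched or not, which is consistent with the timings below growing much faster for uv2/uv3 as the catalog gets bigger.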
--
Steve Alexander
Software Engineer
Cat-Box limited


import Zope, ZPublisher
#c=Zope.app().school.Instructors.Catalog
c=Zope.app().school.Pupils.Catalog
from Products.CatalogQuery import CatalogQuery
cq=CatalogQuery(c._catalog, 'current==1')
field="area_upper"
a1=cq(uniqueValuesFor=field)
a2=cq(uv2=field)
a3=cq(uv3=field)

from time import time
m1=time()
a1=cq(uniqueValuesFor=field)
m2=time()
a2=cq(uv2=field)
m3=time()
a3=cq(uv3=field)
m4=time()

print "a1 took", m2-m1
print "a2 took", m3-m2
print "a3 took", m4-m3

# ---- Catalog with a few hundred records
# a1 took 0.00448000431061
# a2 took 0.0652350187302
# a3 took 0.0657860040665

# ---- Catalog with 2000 records
# a1 took 0.0895169973373
# a2 took 2.71825802326
# a3 took 2.75240898132


*** original CatalogQuery.py
--- new CatalogQuery.py
***************
*** 43,48 ****
--- 43,51 ----
  from Acquisition import Implicit
  from Products.ZCatalog.Lazy import LazyMap
  from BTrees.IIBTree import intersection, weightedIntersection, union, weightedUnion, difference, IISet
+ from BTrees import IOBTree
+ from BTrees.OIBTree import OISet
+ from types import IntType
  
  CQL_re = re.compile(
      ('\s*([_\-a-zA-Z0-9]+)\s*'
***************
*** 167,173 ****
          if hasattr(self, '_v_all_keys'):
              del self._v_all_keys
  
!     def __call__(self, md={}, used=None, **kw):
          """Execute the query on the catalog"""
  
          # Combine any keywords into the namespace mapping
--- 170,176 ----
          if hasattr(self, '_v_all_keys'):
              del self._v_all_keys
  
!     def __call__(self, md={}, used=None, uniqueValuesFor=None, uv2=None, uv3=None, **kw):
          """Execute the query on the catalog"""
  
          # Combine any keywords into the namespace mapping
***************
*** 186,191 ****
--- 189,237 ----
                  w, rs = weightedIntersection(rs, r)
              elif block.logic == 'or':
                  w, rs = weightedUnion(rs, r)
+ 
+         if uniqueValuesFor:
+             inverse_rs=IOBTree.IOSet(difference(self._getAllKeys(), rs))
+             index=self._catalog.indexes[uniqueValuesFor]
+             results=IOBTree.difference(index._unindex, inverse_rs)
+             resultset=OISet()
+             for l in results.values():
+                 resultset.update(l)
+             self._cleanUp()
+             return resultset.keys()
+ 
+         if uv2:
+             index=self._catalog.indexes[uv2]
+             doc_ids=intersection(rs, IISet(index._unindex.keys()))
+             d={}  # could use an OISet here for automatic sorted order
+             for k,v in index._index.items():
+                 if isinstance(v, IntType):
+                     if v in doc_ids:
+                         d[k]=None
+                 else:
+                     for i in v.keys():
+                         if i in doc_ids:
+                             d[k]=None
+                             continue
+             resultset=d.keys()
+             resultset.sort()
+             self._cleanUp()
+             return resultset
+ 
+         if uv3:
+             index=self._catalog.indexes[uv3]
+             doc_ids=intersection(rs, IISet(index._unindex.keys()))
+             resultset=OISet()
+             for k,v in index._index.items():
+                 if isinstance(v, IntType):
+                     if v in doc_ids:
+                         resultset.insert(k)
+                 else:
+                     for i in v.keys():
+                         if i in doc_ids:
+                             resultset.insert(k)
+                             continue
+ 
+             self._cleanUp()
+             return resultset.keys()
+ 
          # Create a lazy result map of the results
          if hasattr(rs, 'keys'):
              rs=rs.keys()
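One wrinkle the uv2/uv3 loops have to deal with, and the reason for the isinstance(v, IntType) check in the patch: a ZCatalog field index's _index maps each value either to a bare integer docid (when a single document has that value) or to a set-like posting list of docids. A small sketch of that branch, with made-up data and an illustrative helper name:

```python
# Hypothetical stand-in for index._index: value -> docid (bare int)
# when one document carries the value, or a set of docids otherwise.
index = {
    'maths': 7,            # single posting, stored as a bare int
    'art': {3, 9, 12},     # several postings, stored as a set-like object
}

def docids_for(value):
    # Mirrors the patch's IntType check: normalise both shapes to a set.
    v = index[value]
    return {v} if isinstance(v, int) else set(v)

print(docids_for('maths'))        # {7}
print(sorted(docids_for('art')))  # [3, 9, 12]
```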