Re: [Zope] Intersection/Union of ZCatalog result sets

24 Sep 2004

...
Jonathan Hobbs wrote:
...
From: "Johan Carlsson" <johanc@easypublisher.com>
...
Why would it be smaller?
You still need to load the indexes when you do a search, right?
Or do you intend to index different objects in different catalogs?
In that case couldn't you use the idxs attribute
of ZCatalog::catalog_object(self, obj, uid=None, idxs=None,
update_metadata=1)?
Moving only the ZCTextIndex (and its Lexicon) into a separate ZCatalog
should result in a smaller ZCatalog, as the space required by the other 4
indexes (3 Field Indexes and another ZCTextIndex) will be located in a
different folder. I am going to load the ZCatalog containing the main
ZCTextIndex into a Temporary Folder (to hold it in memory).
You could also always create an external (to ZCatalog) Id Generator
Service that can be accessed from both zcatalogs/catalogs
to get a unique RID that can be used in both catalogs. That skips the
problem with longs and, potentially,
the problem of supporting a modified version of BTrees.
There's some code for making a transaction-aware counter that you might
want to have a look at.
I guess it needs some improvements though?
# This is borrowed from Zope 2.4.3 ZODB.tests.ConflictResolution
from types import IntType

from Persistence import Persistent
from ZODB.POSException import ConflictError

# This PCounter doesn't provide a unique ID.
# It does increment once per call (even if several threads collide):
# the stored value ends up at +2, but both threads get the same
# return value (+1).
class PCounter(Persistent):
    _value = 0

    def __init__(self, val=None):
        if val is not None:
            if type(val) == IntType:
                self._value = val
            elif hasattr(val, '_count'):
                self._value = getattr(val, '_count', 0)
            else:
                self._value = 0

    def __repr__(self):
        # __repr__ must return a string, not an int
        return repr(self._value)

    def getUniqueId(self):
        self._value = self._value + 1
        return self._value

    def _p_resolveConflict(self, oldState, savedState, newState):
        # Apply both concurrent increments on top of the common ancestor state
        savedDiff = savedState['_value'] - oldState['_value']
        newDiff = newState['_value'] - oldState['_value']
        oldState['_value'] = oldState['_value'] + savedDiff + newDiff
        return oldState


class PCounter2(PCounter):
    # Variant that refuses to resolve and lets the ConflictError propagate
    def _p_resolveConflict(self, oldState, savedState, newState):
        raise ConflictError
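
To make the duplicate-ID problem concrete, here is a small standalone
sketch that replays the _p_resolveConflict arithmetic with plain dicts
instead of real ZODB state. The helper name resolve_counter_conflict and
the starting value 10 are only illustrative assumptions; nothing here
touches ZODB itself.

```python
def resolve_counter_conflict(old_state, saved_state, new_state):
    """Same arithmetic as PCounter._p_resolveConflict, on plain dicts."""
    saved_diff = saved_state["_value"] - old_state["_value"]
    new_diff = new_state["_value"] - old_state["_value"]
    resolved = dict(old_state)  # copy, so the caller's dict is not mutated
    resolved["_value"] = old_state["_value"] + saved_diff + new_diff
    return resolved


# Two transactions start from the same committed state.
old = {"_value": 10}

# Each one calls getUniqueId() against its own view of the counter,
# so each is handed old value + 1.
id_seen_by_a = old["_value"] + 1
id_seen_by_b = old["_value"] + 1

# A commits first (saved state); B then conflicts and gets resolved.
resolved = resolve_counter_conflict(
    old, {"_value": id_seen_by_a}, {"_value": id_seen_by_b}
)

print("stored counter after resolution:", resolved["_value"])
print("ids handed out:", id_seen_by_a, id_seen_by_b)
```

The stored counter keeps both increments, but the two callers were already
given the same number before the conflict was detected. That is why
PCounter2 raises ConflictError instead: the losing transaction then has to
run again and pick up a fresh value.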
...
Thanks for the 'heads-up'.  I had hoped to use OIDs instead of RIDs, but
hadn't considered the 64/32 bit problem. I'll have to see if I can find a
64-bit BTrees package, and failing that, try to modify the existing
package to use long ints - this just keeps getting better and better :)
Cool!
I'd love to hear how this turns out, so please keep me posted :-)
After some more digging around this was the approach I was going to try:

1) Build and populate a standard ZCatalog, then get the RIDs from the
catalog for each entry.

2) Modify 'catalog_object' (and the underlying routines) to accept an
optional RID parameter (use the passed RID instead of generating one
internally).

3) Build the second ZCatalog, passing the RIDs from the first catalog

4) Modify the Lazy class to include a new routine LazyInt, which would be
similar to LazyCat, but would do an intersection instead of a join (this
would be the tricky bit).

5) Modify ZCatalog's 'searchResults' (and underlying 'search') routines to
accept an optional parameter 'resultSet'.  resultSet would be a lazy
sequence returned from a previous ZCatalog search (the initial ZCatalog
search would not pass a 'resultSet' parameter). This optional resultSet, if
present, would be LazyInt'd with the result set generated by the current
search.

In theory (ha!) this should allow us to do two separate searches on two
separate catalogs, then use the existing search machinery (aside from the
new LazyInt) to marshal the results and present us with a normal lazy
result set.
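
A rough sketch of what the LazyInt from step 4 might look like, in plain
Python so it runs without Zope. The class below is hypothetical: it is not
part of ZCatalog's Lazy module, and the constructor arguments (a loader
callable plus two RID sequences) are assumptions made for illustration.
It only works if both catalogs share RIDs, as in steps 1 to 3.

```python
class LazyInt(object):
    """Hypothetical lazy intersection of two RID result sets.

    Only the RIDs are intersected up front; the (possibly expensive)
    record for each RID is loaded on first access. Integer indexes only.
    """

    def __init__(self, load, first_rids, second_rids):
        self._load = load
        allowed = set(second_rids)
        # Keep the ordering of the first result set.
        self._rids = [rid for rid in first_rids if rid in allowed]
        self._cache = {}

    def __len__(self):
        return len(self._rids)

    def __getitem__(self, index):
        rid = self._rids[index]  # raises IndexError past the end
        if rid not in self._cache:
            self._cache[rid] = self._load(rid)
        return self._cache[rid]


loaded = []  # records which RIDs were actually fetched


def load(rid):
    loaded.append(rid)
    return "brain-%d" % rid


# RIDs from a search on catalog 1 and on catalog 2 (made-up numbers).
results = LazyInt(load, [5, 3, 9, 1], [1, 2, 3])
print(len(results), results[0], loaded)
```

Nothing is loaded until an item is requested, which is the property the
existing lazy result classes are there to preserve.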

But then we came up with a much MUCH simpler solution...

We are going to encode all of the index data from the 4 other index fields
and append them to the full-text field.  We are then going to eliminate the
other 4 indexes and only use the ZCTextIndex.  Just before calling
searchResults, we will programmatically (and transparently to the user)
append the encoded fields we want the search to include. The intermediate
result sets (created for each search term/word) are 'joined' by the existing
search machinery.  This will (in theory, yet again) give us a type of index
search within ZCTextIndex.
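
A minimal sketch of the encoding idea, under some assumptions: the helper
names (encode_field, build_indexed_text, build_query), the "zz" prefix and
the hex encoding are all made up for illustration, and it assumes the
Lexicon's splitter leaves purely alphanumeric tokens intact and that the
query parser accepts AND between terms.

```python
import binascii


def encode_field(name, value):
    """Turn a (field name, value) pair into one alphanumeric token."""
    hex_value = binascii.hexlify(str(value).encode("utf-8")).decode("ascii")
    return "zz{0}zz{1}".format(name.lower(), hex_value)


def build_indexed_text(text, fields):
    """Text to hand to the ZCTextIndex: body plus encoded field tokens."""
    tokens = [encode_field(n, v) for n, v in sorted(fields.items())]
    return " ".join([text] + tokens)


def build_query(user_query, filters):
    """Append the encoded filter tokens to the user's query string."""
    tokens = [encode_field(n, v) for n, v in sorted(filters.items())]
    return " AND ".join(["({0})".format(user_query)] + tokens)


indexed = build_indexed_text("Some body text", {"status": "open", "lang": "en"})
query = build_query("zope catalog", {"status": "open"})
print(indexed)
print(query)
```

Hex-encoding the value keeps each token a single word whatever characters
the original field value contains, at the cost of only supporting exact
matches (no range searches as with a Field Index).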

This allows us (hopefully) to maintain the functionality we need, reduce the
index size/overhead, and improve search performance without having to hack
ZCatalog (yeah!)

I'll let you know if it actually works :-)

Jonathan