Re: [Zope] Intersection/Union of ZCatalog result sets
From: "+lupa+" <lupa@zurven.com>
Hello Jonathan, I'm on the digest and not the regular zope list, so I'm sending you this off list (feel free to post it if you find it useful). My CalendarX product for Plone has some examples in it of making separate queries to the catalog and combining them and unique-ing them.
In short, you can add two catalog query results together with a + sign. Or if you have one query (q1) and want to add another (q2) to it, you can use:
q1 += q2
That's all it takes.
If there is overlap between the result sets, you'll have duplicates, so you should run them through a "unique" routine -- there was a good one in the CMFCalendar.CalendarTool.py routines, which I extracted and put into a Python script in CalendarX.
Here's the code, in its entirety: # Unique the results of a query (query) results = [] rids = [] for item in query: rid = item.getRID() if not rid in rids: results.append(item) rids.append(rid)
Finally, related to an intersection, is a little routine for subtracting one query from another. In CalendarX, this is my queriesSubtract.py routine for subtracting any events in q2 from those in q1 (so this is q1 not in q2):
#make a list of RIDs in q2, using a list comprehension: q2rids = [item.getRID() for item in q2] #make a new q1 that contains only items with RIDs NOT in q2rids q1new = [item for item in q1 if not item.getRID() in q2rids] return q1new
Both these routines use the RID (record ID) of objects in the ZCatalog, which is fairly key to this type of job. I think with these basic examples you should be able to implement just about anything you need. But I also agree with Andreas that Dieter's AdvancedQuery product is terrifically useful for creating truly sensible queries on the ZCatalog.
Thanks for the response! (all ideas, comments and suggestions greatfully received!) In my particular case I am trying to combine the result sets from two different search on two different zcatalogs (ie. not two different searches of the same zcatalog). In this case the RIDs do not match (each zcatalog maintains its own RIDs). I am doing this to try to squeeze out some performance improvements from a ZCTextIndex. We have a zcatalog with about 1 million documents that we are full-text indexing and it no longer fits into memory (therefore requiring many disk i/o's during retrieval which is seriously degrading performance). Our zcatalog currently has 5 indexes: 4 minor indexes and one major index (the main ZCTextIndex). I am attempting to split the zcatalog into two separate zcatalogs: one containing the 4 minor indexes and one containing the ZCTextIndex. The hope is that the zcatalog containing only the ZCTextIndex will be smaller and will again fit into memory. The only difficulty is in combining the results from searches of two separate zcatalogs in an efficient manner. My best guess at this point is that I will have to patch the 'search' routine in ZCTextIndex to stop it from 'Lazifying' the result sets, so that I can join/intersect the result sets based on OIDs (instead of RIDs - which should be doable as the result sets prior to 'lazifying' are xxBTrees and the BTrees product comes with methods for join/intersection). I can then 'Lazify' the final result set and return it. At least that's the theory! Thanks again for you reponse, Jonathan
Jonathan Hobbs wrote:
I am doing this to try to squeeze out some performance improvements from a ZCTextIndex. We have a zcatalog with about 1 million documents that we are full-text indexing and it no longer fits into memory (therefore requiring many disk i/o's during retrieval which is seriously degrading performance).
Our zcatalog currently has 5 indexes: 4 minor indexes and one major index (the main ZCTextIndex). I am attempting to split the zcatalog into two separate zcatalogs: one containing the 4 minor indexes and one containing the ZCTextIndex. The hope is that the zcatalog containing only the ZCTextIndex will be smaller and will again fit into memory.
Why would it be smaller? You still need to load the indexes when you do a search, right? Or do you intend to index different objects in different catalogs? In that case couldn't you use the idxs attribute of ZCatalog::catalog_object(self, obj, uid=None, idxs=None, update_metadata=1)?
The only difficulty is in combining the results from searches of two separate zcatalogs in an efficient manner. My best guess at this point is that I will have to patch the 'search' routine in ZCTextIndex to stop it from 'Lazifying' the result sets, so that I can join/intersect the result sets based on OIDs (instead of RIDs - which should be doable as the result sets prior to 'lazifying' are xxBTrees and the BTrees product comes with methods for join/intersection). I can then 'Lazify' the final result set and return it. At least that's the theory!
Maybe do a version of ZCatalog (or rather Catalog) that uses OIDs as RIDs? Only problem is that OIDs are int64 and BTrees.IISet et al. uses int32. So you would need a IISet that take long.
From: "Johan Carlsson" <johanc@easypublisher.com>
I am doing this to try to squeeze out some performance improvements from
a
ZCTextIndex. We have a zcatalog with about 1 million documents that we are full-text indexing and it no longer fits into memory (therefore requiring many disk i/o's during retrieval which is seriously degrading performance).
Our zcatalog currently has 5 indexes: 4 minor indexes and one major index (the main ZCTextIndex). I am attempting to split the zcatalog into two separate zcatalogs: one containing the 4 minor indexes and one containing the ZCTextIndex. The hope is that the zcatalog containing only the ZCTextIndex will be smaller and will again fit into memory.
Why would it be smaller? You still need to load the indexes when you do a search, right? Or do you intend to index different objects in different catalogs? In that case couldn't you use the idxs attribute of ZCatalog::catalog_object(self, obj, uid=None, idxs=None, update_metadata=1)?
Moving only the ZCTextIndex (and its Lexicon) into a separate ZCatalog should result in a smaller ZCatalog, as the space required by the other 4 indexes (3 Field Indexes and another ZCTextIndex) will be located in a different folder - I am going to load the ZCatalog containing the main ZCTextIndex into a Temporary Folder (to hold it in memory). Both ZCatalogs will index the same documents (stored in a separate BTreeFolder2).
The only difficulty is in combining the results from searches of two separate zcatalogs in an efficient manner. My best guess at this point is that I will have to patch the 'search' routine in ZCTextIndex to stop it from 'Lazifying' the result sets, so that I can join/intersect the result sets based on OIDs (instead of RIDs - which should be doable as the result sets prior to 'lazifying' are xxBTrees and the BTrees product comes with methods for join/intersection). I can then 'Lazify' the final result set and return it. At least that's the theory!
Maybe do a version of ZCatalog (or rather Catalog) that uses OIDs as RIDs? Only problem is that OIDs are int64 and BTrees.IISet et al. uses int32. So you would need a IISet that take long.
Thanks for the 'heads-up'. I had hoped to use OIDs instead of RIDs, but hadn't considered the 64/32 bit problem. I'll have to see if I can find a 64bit BTrees package, and failing that, try to modify the existing package to use long ints - this just keeps getting better and better :) Thanks for the help! Jonathan
Jonathan Hobbs wrote:
From: "Johan Carlsson" <johanc@easypublisher.com>
Why would it be smaller? You still need to load the indexes when you do a search, right? Or do you intend to index different objects in different catalogs? In that case couldn't you use the idxs attribute of ZCatalog::catalog_object(self, obj, uid=None, idxs=None, update_metadata=1)?
Moving only the ZCTextIndex (and its Lexicon) into a separate ZCatalog should result in a smaller ZCatalog, as the space required by the other 4 indexes (3 Field Indexes and another ZCTextIndex) will be located in a different folder - I am going to load the ZCatalog containing the main ZCTextIndex into a Temporary Folder (to hold it in memory).
You could also always create an external (to ZCatalog) Id Generator Service, that can be accessed from both zcatalogs/catalogs to get a unique RID that can be used in both catalogs. Skiping the problem with longs and potentially the problem of supporting a modified version of BTrees. There's some code for making transition-aware counter that you might want have a look at. I guess it needs some improvements though? #This is browed from Zope 2.4.3 ZODB.tests.ConflictResolution from Persistence import Persistent #This PCounter doesn't provide a unique ID. #It does increment ones per call (even if several threads collide) #but the value returned will be +2 for both threads. class PCounter(Persistent): _value = 0 def __init__(self, val=None): if val is not None: if type(val)==IntType: self._value=val elif hasattr(val, '_count'): self._value=getattr(val, '_count',0) else: self._value=0 def __repr__(self): return self._value def getUniqueId(self): self._value = self._value + 1 return self._value def _p_resolveConflict(self, oldState, savedState, newState): savedDiff = savedState['_value'] - oldState['_value'] newDiff = newState['_value'] - oldState['_value'] oldState['_value'] = oldState['_value'] + savedDiff + newDiff return oldState class PCounter2(PCounter): def _p_resolveConflict(self, oldState, savedState, newState): raise ConflictError
Thanks for the 'heads-up'. I had hoped to use OIDs instead of RIDs, but hadn't considered the 64/32 bit problem. I'll have to see if I can find a 64bit BTrees package, and failing that, try to modify the existing package to use long ints - this just keeps getting better and better :)
Cool! I love to hear how this turns out, so please keep me posted :-)
Thanks for the help!
My pleasure :-)
Jonathan Hobbs wrote:
From: "Johan Carlsson" <johanc@easypublisher.com>
Why would it be smaller? You still need to load the indexes when you do a search, right? Or do you intend to index different objects in different catalogs? In that case couldn't you use the idxs attribute of ZCatalog::catalog_object(self, obj, uid=None, idxs=None, update_metadata=1)?
Moving only the ZCTextIndex (and its Lexicon) into a separate ZCatalog should result in a smaller ZCatalog, as the space required by the other 4 indexes (3 Field Indexes and another ZCTextIndex) will be located in a different folder - I am going to load the ZCatalog containing the main ZCTextIndex into a Temporary Folder (to hold it in memory).
You could also always create an external (to ZCatalog) Id Generator Service, that can be accessed from both zcatalogs/catalogs to get a unique RID that can be used in both catalogs. Skiping the problem with longs and potentially the problem of supporting a modified version of BTrees.
There's some code for making transition-aware counter that you might want have a look at. I guess it needs some improvements though?
#This is browed from Zope 2.4.3 ZODB.tests.ConflictResolution from Persistence import Persistent #This PCounter doesn't provide a unique ID. #It does increment ones per call (even if several threads collide) #but the value returned will be +2 for both threads. class PCounter(Persistent): _value = 0 def __init__(self, val=None): if val is not None: if type(val)==IntType: self._value=val elif hasattr(val, '_count'): self._value=getattr(val, '_count',0) else: self._value=0 def __repr__(self): return self._value def getUniqueId(self): self._value = self._value + 1 return self._value def _p_resolveConflict(self, oldState, savedState, newState): savedDiff = savedState['_value'] - oldState['_value'] newDiff = newState['_value'] - oldState['_value'] oldState['_value'] = oldState['_value'] + savedDiff + newDiff return oldState
class PCounter2(PCounter): def _p_resolveConflict(self, oldState, savedState, newState): raise ConflictError
Thanks for the 'heads-up'. I had hoped to use OIDs instead of RIDs, but hadn't considered the 64/32 bit problem. I'll have to see if I can find a 64bit BTrees package, and failing that, try to modify the existing package to use long ints - this just keeps getting better and better :)
Cool! I love to hear how this turns out, so please keep me posted :-)
After some more digging around this was the approach I was going to try: 1) Build and populate a standard ZCatalog, then get the RIDs from the catalog for each entry. 2) Modify 'catalog_object' (and the underlying routines) to accept an optional RID parameter (use the passed RID instead of generating one internally). 3) Build the second ZCatalog, passing the RIDs from the first catalog 4) Modify the Lazy class to include a new routine LazyInt, which would be similar to LazyCat, but would do an intersection instead of a join (this would be the tricky bit). 5) Modify ZCatalog's 'searchResults' (and underlying 'search') routines to accept an optional parameter 'resultSet'. resultSet would be a lazy sequence returned from a previous ZCatalog search (the initial ZCatalog search would not pass a 'resultSet' parameter). This optional resultSet, if present, would be LazyInt'd with the result set generated by the current search. In theory (ha!) this should allow us to do two separate search on two separate catalogs then use the existing search machinery (aside from the new LazyInt) to marshall the results and present us with a normal lazy result set. But then we came up with a much MUCH simpler solution... We are going to encode all of the index data from the 4 other index fields and append them to the full-text field. We are then going to eliminate the other 4 indexes and only use the ZCTextIndex. Just before calling searchResults, we will programmatically (and transparently to the user) append the encoded fields we want the search to include. The intermediate result sets (created for each search term/word) are 'joined' by the existing search machinery. This will (in theory, yet again) give us a type of index search within ZCTextIndex. This allows us (hopefully) to maintain the functionality we need, reduce the index size/overhead, and improve search performance without having to hack ZCatalog (yeah!) I'll let you know if it actually works :-) Jonathan
Jonathan Hobbs wrote:
should run them through a "unique" routine -- there was a good one in the CMFCalendar.CalendarTool.py routines, which I extracted and put into a Python script in CalendarX.
Here's the code, in its entirety: # Unique the results of a query (query) results = [] rids = [] for item in query: rid = item.getRID() if not rid in rids: results.append(item) rids.append(rid)
That is about the worst way to implement a unique routine. As "rids" grow, you get a O^2 time. Using a dict is much faster, at least for larger result sets. unique = {} for item in query: unique[item.getRID()] = item return unique.values() You could probably even do it as: unique = {} for item in query: unique[item] = None return unique.keys() But that depends on whether the objects are hashable. -- hilsen/regards Max M, Denmark http://www.mxm.dk/ IT's Mad Science
participants (3)
-
Johan Carlsson -
Jonathan Hobbs -
Max M