Combining hits from searching 2 ZCatalogs at once
Hi,

I'm writing a Python script that searches 2 ZCatalogs at once with a form-provided query. The first ZCat contains scanned images of text documents plus their metadata. The second ZCat contains the OCR text of the same documents. The searches are written like this:

    textresults = context.Text_Catalog( {'PrincipiaSearchSource': query} )
    ocrresults = context.OCR_Catalog( {'PrincipiaSearchSource': query} )

Text_Catalog is the primary catalog; it contains more metadata than OCR_Catalog, and hits on this catalog are preferred. OCR_Catalog is the secondary catalog; if I get a hit here (and assuming there's no hit on its matching document in Text_Catalog), then I want to find its matching document and add its metadata to textresults as a new hit. I'll then return the modified textresults to the calling form.

My question is: How do I add the new hit to textresults? I tried textresults.append( newhit ), but I found out that textresults isn't a sequence, it's a LazyCat class instance. How do I append new items to this instance?

Thanks,
Gordon Lai
On Wednesday 30 April 2003 08:38 pm, Gordon Lai wrote:
Hi,
I'm writing a Python script that searches 2 ZCatalogs at once with a form-provided query. The first ZCat contains scanned images of text documents plus their metadata. The second ZCat contains the OCR text of the same documents. The searches are written like this:
    textresults = context.Text_Catalog( {'PrincipiaSearchSource': query} )
    ocrresults = context.OCR_Catalog( {'PrincipiaSearchSource': query} )
Text_Catalog is the primary catalog; it contains more metadata than in OCR_Catalog and hits on this catalog are preferred. OCR_Catalog is the secondary catalog; if I get a hit here (and assuming there's no hit on its matching document in Text_Catalog), then I want to find its matching document and add its metadata to textresults as a new hit. I'll then return the modified textresults to the calling form.
My question is: How do I add the new hit to textresults? I tried textresults.append( newhit ), but I found out that textresults isn't a sequence, it's a LazyCat class instance. How do I append new items to this instance?
You don't, but you can get the Catalog to return "raw" sets, which can be manipulated or combined with raw sets from other catalogs. To do this, you must use an external method to call methods of the underlying Catalog object directly. Catalog.py has a function, mergeResults, which can be used to turn the raw sets into lazy catalog results as usual.

Here is a simple example (external method):

    from Products.ZCatalog.Catalog import mergeResults

    def queryMultipleCatalogs(request, *zcatalogs):
        results = []
        for zcat in zcatalogs:
            # _merge=0 makes each catalog return its raw result set
            results.append(zcat._catalog.searchResults(request, _merge=0))
        # True if the request asks for sorted results
        sorted = request.has_key('sort-on') or request.has_key('sort_on')
        reverse = ((request.get('sort-order', '') or
                    request.get('sort_order', '')).lower()
                   in ('reverse', 'descending'))
        return mergeResults(results, sorted, reverse)

The key is passing _merge=0 to searchResults, which then returns a raw result set. In the case of a sorted set this should be a standard Python list of 3-tuples of (sort_key, docid/rid, catalog.__getitem__). These can be manipulated however you like, and mergeResults can turn them back into standard catalog results.

To learn more, your best bet is to read the Catalog.py sources and use some of the methods in there as I have above.

hth,

-Casey
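As a rough illustration of the kind of manipulation the raw sets allow, here is a minimal sketch that combines two sorted raw sets while preferring hits from the primary catalog. It uses plain dictionaries as stand-ins for the catalogs so it runs without Zope, and it is written for current Python rather than the Python of the original thread. The function merge_preferring_primary and the key_of callback are hypothetical names for illustration, not part of Catalog.py; in real use the result would still need to go through mergeResults.

```python
def merge_preferring_primary(primary, secondary, key_of, reverse=False):
    """Combine two raw sorted sets of (sort_key, rid, getitem) 3-tuples.

    key_of(rid, getitem) is a caller-supplied (hypothetical) helper that
    returns a value identifying the same document in both catalogs.
    """
    seen = set()
    merged = []
    # Primary hits are visited first, so they win over secondary duplicates.
    for sort_key, rid, getitem in list(primary) + list(secondary):
        doc_key = key_of(rid, getitem)
        if doc_key in seen:
            continue
        seen.add(doc_key)
        merged.append((sort_key, rid, getitem))
    # Sort on the sort key only; the getitem callables are not comparable.
    merged.sort(key=lambda hit: hit[0], reverse=reverse)
    return merged


# Stand-in data: dictionaries play the role of the two catalogs.
text_docs = {1: {"url": "/docs/a"}, 2: {"url": "/docs/b"}}
ocr_docs = {7: {"url": "/docs/b"}, 8: {"url": "/docs/c"}}

primary = [("a", 1, text_docs.__getitem__), ("b", 2, text_docs.__getitem__)]
secondary = [("b", 7, ocr_docs.__getitem__), ("c", 8, ocr_docs.__getitem__)]

merged = merge_preferring_primary(
    primary, secondary, key_of=lambda rid, getitem: getitem(rid)["url"]
)
merged_rids = [rid for _, rid, _ in merged]
```

Note that this key_of resolves every record in order to compare URLs, which gives up the laziness of the raw sets; a cheaper key would be preferable if the catalogs can supply one.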
On Wed, Apr 30, 2003 at 10:58:40PM -0400, Casey Duncan wrote:
sorted = request.has_key('sort-on') or request.has_key('sort_on')
huh?

--
Paul Winkler
home: http://www.slinkp.com
"Muppet Labs, where the future is made - today!"
Hi Casey,

Casey Duncan wrote:
Here is a simple example (external method):
    from Products.ZCatalog.Catalog import mergeResults

    def queryMultipleCatalogs(request, *zcatalogs):
        results = []
        for zcat in zcatalogs:
            results.append(zcat._catalog.searchResults(request, _merge=0))
        sorted = request.has_key('sort-on') or request.has_key('sort_on')
        reverse = ((request.get('sort-order', '') or
                    request.get('sort_order', '')).lower()
                   in ('reverse', 'descending'))
        return mergeResults(results, sorted, reverse)
Could something similar be used to solve this problem?

http://mail.zope.org/pipermail/zope-dev/2003-April/019456.html
http://mail.zope.org/pipermail/zope-dev/2003-May/019485.html

cheers,

Chris
Thanks a lot, your suggestions worked. However, I have another question:

To get the metadata of a document that matches a hit in OCR_Catalog, I'm currently doing another search on absolute_url in Text_Catalog. This works fine, but if there are lots of OCR_Catalog hits, this really slows down the overall search.

My strategy to speed things up is to avoid the extra search, read the document directly from disk, extract its metadata, and add it to textresults. However, I'm not sure how to create the object that's returned from searchResults() so that I can assign the metadata to it and then add it to textresults.

I've looked through Catalog.py and found that searchResults() calls search(), which calls _apply_index() in TextIndex.py, which calls query(), which calls evaluate(). I then get lost, because evaluate() doesn't seem to be evaluating anything; it reduces operators in the query but doesn't seem to use the query to search an index.

How do I create a searchResults() object?

Thanks,
Gordon
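One possible way to avoid a separate Text_Catalog search for every OCR hit, sketched under assumptions: gather the Text_Catalog metadata into a lookup table keyed by URL once, then resolve each OCR hit with a dictionary lookup. The records below are plain dictionaries standing in for catalog metadata, and index_by_url and attach_metadata are hypothetical helpers, not ZCatalog APIs. Whether the metadata can be collected in a single pass depends on the catalogs involved. The sketch is written for current Python.

```python
def index_by_url(records):
    """Map each record's 'url' to the record (built once, reused per hit)."""
    return {record["url"]: record for record in records}


def attach_metadata(ocr_hits, text_index):
    """Return the metadata records for OCR hits that have a known document."""
    found = []
    for hit in ocr_hits:
        record = text_index.get(hit["url"])
        if record is not None:
            found.append(record)
    return found


# Stand-in data; real records would come from Text_Catalog and OCR_Catalog.
text_records = [
    {"url": "/docs/a", "title": "Deed"},
    {"url": "/docs/b", "title": "Letter"},
]
ocr_hits = [{"url": "/docs/b"}, {"url": "/docs/z"}]

text_index = index_by_url(text_records)
matched = attach_metadata(ocr_hits, text_index)
```

The trade-off is memory for speed: the table costs one pass over the metadata up front, after which each OCR hit is a constant-time lookup instead of a catalog query.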
participants (4)
- Casey Duncan
- Chris Withers
- Gordon Lai
- Paul Winkler