[Zope] Re: relevance ranking in ZCTextIndex or equivalent
Miles Waller
miles at jamkit.com
Fri Jun 2 09:43:04 EDT 2006
Hi,
Thanks for the help. From my investigations, it seems it's not possible
to meet the requirements in a super-straightforward way - a query that
uses several text indexes adds each individual score together, so the
only output available is the total score.
Trying to separate the scores out (for example so it's a tuple
(title_score, description_score, body_text_score) that I can sort on)
looks quite hard - it looks like it would mean changing the indexes to
return the scores in this different format.
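For illustration only, here is a minimal sketch of what sorting on such a tuple would look like if the per-index scores were available separately. Plain dicts stand in for the index results, and the name rank_by_score_tuple and the sample scores are hypothetical; nothing here is part of the catalog API.

```python
# Hypothetical per-index results: docid -> score. Plain dicts stand in
# for whatever each text index would return.
title_scores = {10: 3, 20: 5}
description_scores = {20: 1, 30: 7}
body_scores = {10: 2, 30: 9, 40: 6}

def rank_by_score_tuple(title, description, body):
    """Return docids ordered by (title, description, body) score, best first."""
    docids = set(title) | set(description) | set(body)

    def key(docid):
        # A document missing from an index counts as score 0 for that field.
        return (title.get(docid, 0),
                description.get(docid, 0),
                body.get(docid, 0))

    return sorted(docids, key=key, reverse=True)

ranked = rank_by_score_tuple(title_scores, description_scores, body_scores)
print(ranked)
```

Tuples compare field by field, so any title match outranks every document with no title match, whatever its other scores.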
My latest approach is to do something like the following (untested):
from BTrees.IIBTree import difference
from Products.ZCatalog.Lazy import LazyMap

def specialSearch(words):
    # i'm going to manipulate the indexes directly
    catalog = portal_catalog._catalog
    getIndex = catalog.getIndex
    r1, id1 = getIndex('Title')._apply_index( {'Title':words} )
    r2, id2 = getIndex('Description')._apply_index( {'Description':words} )
    r3, id3 = getIndex('SearchableText')._apply_index(
        {'SearchableText':words} )
    # de-dupe this set of results: drop title hits from the other two
    # sets, and description hits from the body text set
    r3 = difference(difference(r3, r1), r2)
    r2 = difference(r2, r1)
    # now i have 3 mappings of docid -> score
    # byValue(0) turns each into a list of (score, docid) tuples,
    # highest score first
    r1 = r1.byValue(0)
    r2 = r2.byValue(0)
    r3 = r3.byValue(0)
    # concatenate them, preserving the order, keeping only the docids
    rs = [docid for score, docid in r1 + r2 + r3]
    # return something catalog brain-like
    return LazyMap(catalog.__getitem__, rs, len(rs))
My tests at the debug prompt seem to indicate that this should work. Can
anyone who knows more about lists and BTrees comment on whether there's
a better way to do the sorting and concatenation of the different
result sets?
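One possible alternative for the sorting and concatenation step is a single pass that tracks the docids already emitted, instead of chaining set differences. The sketch below uses plain dicts (docid -> score) as stand-ins for the index results; the function name tiered_merge and the sample data are hypothetical, and it has not been tried against a real catalog.

```python
def tiered_merge(*tiers):
    """Concatenate result tiers in priority order.

    Each tier maps docid -> score. Within a tier, docids come out best
    score first; a docid already emitted by an earlier tier is skipped.
    """
    seen = set()
    ordered = []
    for tier in tiers:
        # Sort by descending score; break ties on docid for a stable order.
        by_score = sorted(tier.items(), key=lambda item: (-item[1], item[0]))
        for docid, score in by_score:
            if docid not in seen:
                seen.add(docid)
                ordered.append(docid)
    return ordered

# Hypothetical sample results for the three indexes.
title_hits = {10: 3, 20: 5}
description_hits = {20: 1, 30: 7}
body_hits = {10: 2, 30: 9, 40: 6, 50: 6}

docids = tiered_merge(title_hits, description_hits, body_hits)
print(docids)
```

The resulting list of docids could then be handed to LazyMap in the same way as in the function above.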
Thanks,
Miles
Jonathan wrote:
>
> ----- Original Message ----- From: "Miles Waller"
> <miles-HeBKeAamoVjQT0dZR+AlfA at public.gmane.org>
> To: <zope-CWUwpEBWKX0 at public.gmane.org>
> Sent: Wednesday, May 31, 2006 10:59 AM
> Subject: [Zope] relevance ranking in ZCTextIndex or equivalent
>
>
>> Hi,
>>
>> I'm planning to implement a text search where
>>
>> (match against the title)
>> ranks more highly than
>> (match in the description)
>> ranks more highly than
>> (matches against the body text).
>>
>> Titles and descriptions are short bits of text, so results in these
>> categories can be ranked just by the frequency that the word appears in
>> that part of the text. Matches against the body text should ideally be
>> ranked more like ZCTextIndex rather than plain frequency.
>>
>> My ideas are:
>>
>> - do three separate searches, and then concatenate the result sets
>> together.
>> problem: making sure there are no duplicates in the list without parsing
>> all the results in their entirety.
>>
>> - hijack the 'scoring' part of the index, so those results with matches
>> in the title can have their scores artificially heightened to achieve
>> the ordering i want
>> problem: it's completely opaque without a lot of study whether this
>> would achieve what i want. i'd also need to index the items so the
>> index knew what was in the title, which could be a problem.
>>
>> - index title, description and text separately, and then use dieter's
>> AdvancedQuery product to do the query and combine results
>> problem: is it possible to get at the scores when the documents are
>> returned from the index to be able to order them? are the scores
>> returned separately, or will each query overwrite the last one?
>>
>> Has anyone ever tried to do this - or got any pointers - at all?
>
>
> A definitely non-trivial task, but here are some ideas to get you
> pointed in the right (I hope) direction:
>
> Try googling, or looking in the zope source for:
>
> data_record_normalized_score_
> BaseIndex.py
> OkapiIndex.py
> SetOps.py
> okascore.c
>
>
> Good Luck!
>
> Jonathan
> _______________________________________________
> Zope maillist - Zope-CWUwpEBWKX0 at public.gmane.org
> http://mail.zope.org/mailman/listinfo/zope
> ** No cross posts or HTML encoding! **
> (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce
> http://mail.zope.org/mailman/listinfo/zope-dev )
>