relevance ranking in ZCTextIndex or equivalent
Hi,

I'm planning to implement a text search where (match against the title) ranks more highly than (match in the description) ranks more highly than (matches against the body text).

Titles and descriptions are short bits of text, so results in these categories can be ranked just by how often the word appears in that part of the text. Matches against the body text should ideally be ranked more like ZCTextIndex rather than by plain frequency.

My ideas are:

- do three separate searches, and then concatenate the result sets together.
  problem: making sure there are no duplicates in the list without parsing all the results in their entirety.
- hijack the 'scoring' part of the index, so those results with matches in the title can have their scores artificially heightened to achieve the ordering I want.
  problem: it's completely opaque without a lot of study whether this would achieve what I want. I'd also need to index the items so the index knew what was in the title, which could be a problem.
- index title, description and text separately, and then use Dieter's AdvancedQuery product to do the query and combine results.
  problem: is it possible to get at the scores when the documents are returned from the index to be able to order them? Are the scores returned separately, or will each query overwrite the last one?

Has anyone ever tried to do this - or got any pointers - at all?

Thanks in advance,

Miles
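The second idea (artificially raising the scores of title matches) can be sketched with plain dicts standing in for the per-field index results. This is only an illustration under stated assumptions: `boosted_scores` and `cap` are hypothetical names, the `{docid: score}` dicts are stand-ins for whatever the real indexes return, and `cap` is an assumed clamp on raw scores, not something ZCTextIndex is known to guarantee.

```python
def boosted_scores(title, description, body, cap=1000):
    """Merge per-field {docid: score} dicts into one {docid: score} dict
    in which any title match outranks any description match, and any
    description match outranks any body-only match."""
    combined = {}
    # Walk the fields from lowest to highest priority, so that a document
    # found in a higher-priority field overwrites its lower-tier entry.
    for tier, scores in enumerate((body, description, title)):
        for docid, score in scores.items():
            # Clamp the raw score (assumption) so tiers can never overlap.
            combined[docid] = tier * (cap + 1) + min(score, cap)
    return combined


# Made-up scores for three documents.
title = {1: 2}
description = {1: 1, 2: 3}
body = {1: 40, 2: 10, 3: 900}

combined = boosted_scores(title, description, body)
ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
ranked_ids = [docid for docid, score in ranked]
```

Because each document ends up with exactly one entry, de-duplication comes for free; the cost is that the clamp flattens any raw score above `cap`.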
----- Original Message -----
From: "Miles Waller" <miles@jamkit.com>
To: <zope@zope.org>
Sent: Wednesday, May 31, 2006 10:59 AM
Subject: [Zope] relevance ranking in ZCTextIndex or equivalent
A definitely non-trivial task, but here are some ideas to get you pointed in the right (I hope) direction:

Try googling, or looking in the Zope source for:

- data_record_normalized_score_
- BaseIndex.py
- OkapiIndex.py
- SetOps.py
- okascore.c

Good Luck!

Jonathan
Hi,

Thanks for the help.

From my investigations, it seems it's not possible to meet the requirements in a super-straightforward way - a query that uses several text indexes adds each individual score together, so the only output available is the total score. Trying to separate the scores out (for example so it's a tuple (title_score, description_score, body_text_score) that I can sort on) looks quite hard - it looks like it would mean changing the indexes to return the scores in this different format.

My latest approach is to do something like the following (untested):

    from BTrees.IIBTree import difference
    from Products.ZCatalog.Lazy import LazyMap

    def specialSearch(words):
        # i'm going to manipulate the indexes directly
        catalog = portal_catalog._catalog
        getIndex = catalog.getIndex
        # each call returns (docid -> score mapping, ids of the indexes used)
        r1, id1 = getIndex('Title')._apply_index({'Title': words})
        r2, id2 = getIndex('Description')._apply_index({'Description': words})
        r3, id3 = getIndex('SearchableText')._apply_index({'SearchableText': words})

        # de-dupe: drop from each set anything already in a higher-ranked set
        r3 = difference(difference(r3, r2), r1)
        r2 = difference(r2, r1)

        # byValue(0) gives a list of (score, docid) tuples, highest score first;
        # concatenate the three lists, preserving the order
        res = r1.byValue(0) + r2.byValue(0) + r3.byValue(0)
        rids = [docid for score, docid in res]

        # return something catalog brain-like
        return LazyMap(catalog.__getitem__, rids, len(rids))

My debug-prompt tests seem to indicate that this should work. I don't know if anyone who knows more about lists and BTrees can comment on whether there's a better way to do the sorting and concatenation of the different result sets.

Thanks,

Miles
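For comparison, the (title_score, description_score, body_text_score) tuple mentioned above can be tried out on plain dicts, without changing any index. This is a minimal sketch under assumptions: `rank_by_field_tuple` is a hypothetical helper, and the `{docid: score}` dicts stand in for the per-index results. Unlike the concatenation approach, ties within one field are broken by the scores of the lower-priority fields.

```python
def rank_by_field_tuple(title, description, body):
    """Return docids ordered by (title_score, description_score, body_score),
    highest first. A missing score counts as 0, so documents with no title
    match sort after every document that has one."""
    docids = set(title) | set(description) | set(body)

    def sort_key(docid):
        return (title.get(docid, 0),
                description.get(docid, 0),
                body.get(docid, 0))

    return sorted(docids, key=sort_key, reverse=True)


# Made-up scores for four documents.
title = {1: 2, 4: 2}
description = {2: 3, 4: 1}
body = {1: 40, 2: 10, 3: 900}

order = rank_by_field_tuple(title, description, body)
```

Documents 1 and 4 tie on title score here, so the description score decides between them.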
Miles Waller wrote at 2006-5-31 15:59 +0100:
I'm planning to implement a text search where
(match against the title) ranks more highly than (match in the description) ranks more highly than (matches against the body text).
If you are lucky, "AdvancedQuery" will soon support ranking:

For efficiency reasons, it will not use term frequencies -- thus, you will not have the form of ranking you know now from ZCTextIndex. Instead, the rank will be determined by evaluating which queries are fulfilled by the document.

It will look like:

    evalAdvancedQuery(q, rank=((v_1,q_1), (v_2,q_2), ... (v_n,q_n)))

The rank of document "d" will be the sum of the "v_i" for which "d" matches "q_i", divided by the sum of all "v_i".

--
Dieter
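The arithmetic of that rank can be illustrated in a few lines of plain Python. This is not AdvancedQuery code: `rank` and `contains` are hypothetical helpers, each "query" is modelled as a simple predicate over a dict, and the weights are made up.

```python
def contains(field, word):
    """Hypothetical stand-in for a query: true if `word` occurs in `field`."""
    def query(doc):
        return word in doc.get(field, "").split()
    return query


def rank(doc, weighted_queries):
    """Sum of the v_i whose q_i matches `doc`, divided by the sum of all v_i."""
    total = sum(v for v, q in weighted_queries)
    matched = sum(v for v, q in weighted_queries if q(doc))
    return float(matched) / total


# Weight title matches above description matches above body matches.
weighted = (
    (4, contains("title", "zope")),
    (2, contains("description", "zope")),
    (1, contains("body", "zope")),
)

doc = {
    "title": "zope ranking",
    "description": "text index",
    "body": "zope catalog",
}

r = rank(doc, weighted)  # title and body match: (4 + 1) / 7
```

With weights such as 4, 2 and 1, where each weight exceeds the sum of all lower ones, any title match outranks every combination of description and body matches, which is the ordering asked for at the start of the thread.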