RE: [Zope] Weighing catalog searches per index ?
Casey,

Thanks for pointing out this product, I'll have to give it a try, as I can foresee many useful applications for it!

I'm working on the next generation of our site. Currently we use a regular TextIndex, which is obviously oversimplistic and insufficient. So right now I've been using ZCTextIndex through development, and it seems to give decent results (hard to tell without getting some mass usage). The problem is that some co-workers using different technologies have "weighing", and it sounds like something interesting, at least from the user perspective. Notably, we'd like to experiment with giving the Title more priority than the rest, so that when someone views the search results with the titles, the results are perceived as relevant. Also, if we have weighing, content could be tweaked/adjusted to take that into account (notably with Keywords).

Your product seems to have a good base to start with. The problem now, and one that stopped me in my tracks, is how to define/calculate/configure this "weighing" concept. You suggest there's some underlying functionality for weighing already; maybe it'd just be a matter of taking advantage of it, and documenting how to use it? The big question is what a weight of "1" MEANS versus a weight of "2" or "5".

The other is how it gets implemented. Does the weight need to be known at indexing time, or can it be provided at search time? My hunch is the weighing should be applied at search time, so your product could be modified to take as input the weights to apply to each index being searched through. Something like:

result = catalog(dc_fields={"query": "Some search string", "fields": ["Title", "Description"]})

could become:

result = catalog(dc_fields={"query": "Some search string", "fields": ["Title", "Description"], "weights": [5, 1]})

meaning apply a weight of 5 to Title, and 1 to Description. Which I would in turn interpret as meaning Title is 5 times more important than Description (not knowing any better right now).

Personally I'm using the Okapi algorithm. When I started investigating this, I came to the (admittedly uneducated) conclusion that to do proper, fast weighing, the Okapi implementation would have to be modified to support this feature (maybe it does already?), which is over my head, especially with the okascore module being Python/C. Doing it in Python would mean doing a second pass over results that have already been scored once, which seems inefficient and computationally intensive (especially as I envision that really nice weighing algorithms would need all content in memory in order to do relational work between records).

Anyway, that's what I've been thinking about... But the benefits of having such a beast seem really tantalizing, so I thought I'd ask anyway... Besides, maybe I'm way out in left field on this and it's easier than I make it out to be?! :)

Thoughts?

Thanks,
J.F.

-----Original Message-----
From: Casey Duncan [mailto:casey@zope.com]
Sent: Thursday, January 08, 2004 2:54 PM
To: Jean-Francois.Doyon@CCRS.NRCan.gc.ca
Cc: zope@zope.org
Subject: Re: [Zope] Weighing catalog searches per index ?

On Thu, 8 Jan 2004 13:43:43 -0500 Jean-Francois.Doyon@CCRS.NRCan.gc.ca wrote:
Hello,
Does anybody know of a decent implementation of a scoring algorithm that does "weighing" of results, presumably based on the indexes used ?
Low-level support for this already exists via the weightedIntersection and weightedUnion set operations. ZCatalog currently gives all indexes a weight of 1 however.
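To illustrate the semantics Casey describes: the real weightedIntersection and weightedUnion are C-implemented set operations in the BTrees package; the pure-Python sketch below only models their scoring behavior (the real functions operate on IIBTrees/IIBuckets and have a slightly different return signature), but it shows what the per-set weight factors do.

```python
# Illustrative pure-Python model of the weighted set operations that
# ZCatalog builds on. The real weightedIntersection/weightedUnion live
# in the C-implemented BTrees package; plain dicts stand in for
# IIBTrees here, and only the scoring semantics are modeled.

def weighted_intersection(m1, m2, w1=1, w2=1):
    """Intersect two rid->score mappings, scaling scores by w1 and w2."""
    return {rid: w1 * score + w2 * m2[rid]
            for rid, score in m1.items() if rid in m2}

def weighted_union(m1, m2, w1=1, w2=1):
    """Union two rid->score mappings with per-mapping weight factors."""
    result = {rid: w1 * score for rid, score in m1.items()}
    for rid, score in m2.items():
        result[rid] = result.get(rid, 0) + w2 * score
    return result

title_scores = {1: 10, 2: 4}       # rid -> score from one index
description_scores = {1: 3, 3: 7}  # rid -> score from another index

# With all weights 1 (what ZCatalog currently does):
print(weighted_intersection(title_scores, description_scores))  # {1: 13}
# Weighting the first set 5x changes the combined ranking:
print(weighted_union(title_scores, description_scores, 5, 1))
```

With weights of 1 the operations just add scores; a weight of 5 on the first set multiplies its contribution before combining, which is all the "weighing" amounts to at this level.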
I'd like to explore the possibility of searching the catalog, but giving results from certain indexes priority over others.
It is possible to implement an index whose results are scored. This is used by TextIndexes to implement relevance ranking for instance. The index just needs to return a mapping (usually a BTree) of rid->score where rid is the record id of the catalog record. ZCatalog automatically adds these scores when intersecting results across indexes.
So in the case of the CMF, saying that if search terms are found in the Title or Description, they are more "important" than if they're found somewhere else and so on ...
This might be an interesting addition to my FieldedTextIndex product. Currently all indexed fields are weighted the same, but it would be straightforward to make this configurable per field.
I know this is a common concept in more advanced search engines (Such as Oracle's InterMedia), but I'm wondering if anyone has done something like this in Zope ...
Let me know what your specific use case is and maybe I'll add it to the FieldedTextIndex product if it fits its usage. -Casey
On Thu, 8 Jan 2004 16:24:58 -0500 Jean-Francois.Doyon@CCRS.NRCan.gc.ca wrote:
Casey,
Thanks for pointing out this product, I'll have to give it a try, as I can foresee many useful applications for it !
Cool. It's new and I'm eager to get feedback from the field on it (no pun intended). [...]
Your product seems to have a good base to start with. The problem now, and one that stopped me in my tracks, is how to define/calculate/configure this "weighing" concept. You suggest there's some underlying functionality for weighing already, maybe it'd just be a matter of taking advantage of it, and documenting how to use it ? The big question would be what does a weight of "1" MEAN versus a weight of "2" or "5" ?
ZCTextIndex calculates document and word scores. When queries are performed these scores are combined as intermediate results are combined (using unions and intersections). The weighted versions of these commands allow you to weight one set differently than another. The weight multiplies the score by some factor as the set operation is performed.
The other is how it gets implemented. Does the weight need to be known at indexing time, or can it be provided at search time ? My hunch is the weighing should be applied at search time, so your product could be modified to take as input the weights to apply to each index that is being searched through ?
Could be done either way. Weighing at index time might be more efficient, but would not allow different weights to be applied for different queries. I doubt that query-time weighting would slow things down at all since it is already being done (only the weight factors are always 1). All of the set operations are implemented in C.
Something like:
result = catalog(dc_fields={"query":"Some search string", "fields":["Title","Description"]})
could become:
result = catalog(dc_fields={"query":"Some search string", "fields":["Title","Description"], "weights":[5,1]})
Sure, or maybe: result = catalog(dc_fields={"query": "Some search string", "weighted_fields": {"title": 5, "description": 1}}) This might be slightly less error prone (otherwise you need to match up the two lists), if slightly less readable. :record marshalling for weighted_fields could also be supported for queries from web forms. Either spelling would work though, and I'm open to input.
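As a sketch of how that weighted_fields spelling might be unpacked: FieldedTextIndex's actual query API isn't shown in this thread, so search_field below is a hypothetical stand-in for "query one field, get back a rid -> score mapping".

```python
# Hypothetical sketch of the proposed weighted_fields spelling.
# search_field(field, query) is a stand-in for the per-field query
# machinery; FieldedTextIndex's real API is not shown in this thread.

def weighted_query(query, weighted_fields, search_field):
    """Run `query` against each field and merge scores using the weights.

    weighted_fields: e.g. {"title": 5, "description": 1}
    search_field(field, query) -> rid -> score mapping
    """
    merged = {}
    for field, weight in weighted_fields.items():
        for rid, score in search_field(field, query).items():
            merged[rid] = merged.get(rid, 0) + weight * score
    # Highest combined score first, like relevance-ranked results.
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)

# Stand-in per-field results for a query:
fake_results = {
    "title":       {1: 2, 2: 1},
    "description": {2: 3, 3: 1},
}
ranked = weighted_query("some search string",
                        {"title": 5, "description": 1},
                        lambda field, q: fake_results[field])
print(ranked)  # rid 1 ranks first: its title hits are weighted 5x
```

Note how rid 2 has more raw hits than rid 1, but rid 1 still ranks first because its matches are in the heavily weighted title field.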
Meaning apply a weight of 5 to Title, and 1 to Description. Which I would in turn interpret as meaning Title is 5 times more important than Description (Not knowing any better right now).
Yes, scores for words found in the title would get multiplied by 5. Scores for description would get multiplied by 1.
Personally I'm using the Okapi algorithm. When I started investigating this, I came to the (admittedly uneducated) conclusion that to do proper, fast weighing, the Okapi implementation would have to be modified to support this feature (maybe it does already ??), which is over my head, especially with the okascore module being Python/C. Doing it in Python would mean doing a second pass over the results that have already been scored once, which seems inefficient and computationally intensive (especially as I envision that really nice weighing algorithms would need to have all content in memory in order to do relational work between records).
I don't think the scoring algorithm would be affected by what you propose. I'd need to dig in a little deeper to be sure though.
Anyway, that's what I've been thinking about ... But the benefits of having such a beast seem really tantalizing, so I thought I'd ask anyway ... Besides, maybe I'm way out in left field on this and it's easier than I make it out to be ?! :)
I think this is a very compelling addition to the product. I'm going to look at implementing it this weekend. Thanks for the idea! -Casey
participants (2)
- Casey Duncan
- Jean-Francois.Doyon@CCRS.NRCan.gc.ca