On Thu, 8 Jan 2004 16:24:58 -0500 Jean-Francois.Doyon@CCRS.NRCan.gc.ca wrote:
Casey,
Thanks for pointing out this product, I'll have to give it a try, as I can foresee many useful applications for it!
Cool. It's new and I'm eager to get feedback from the field on it (no pun intended). [...]
Your product seems to have a good base to start with. The problem now, and one that stopped me in my tracks, is how to define/calculate/configure this "weighting" concept. You suggest there's some underlying functionality for weighting already, maybe it'd just be a matter of taking advantage of it, and documenting how to use it? The big question would be what does a weight of "1" MEAN versus a weight of "2" or "5"?
ZCTextIndex calculates document and word scores. When queries are performed these scores are combined as intermediate results are combined (using unions and intersections). The weighted versions of these commands allow you to weight one set differently than another. The weight multiplies the score by some factor as the set operation is performed.
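To make the multiply-then-combine idea concrete, here is a minimal sketch using plain Python dicts of {doc_id: score}. It is only an illustration: the function names are made up for this example, and summing the scaled scores of documents found in both sets is an assumption about how the combination behaves, not a copy of the C code.

```python
# Illustrative only: plain dicts stand in for the real score sets.
# Function names and the "sum the scaled scores" rule are assumptions.

def weighted_union(a, b, weight_a=1, weight_b=1):
    """Union of two {doc_id: score} maps, scaling each side by its weight."""
    result = {doc_id: score * weight_a for doc_id, score in a.items()}
    for doc_id, score in b.items():
        # Documents present in both sets get the sum of the scaled scores.
        result[doc_id] = result.get(doc_id, 0) + score * weight_b
    return result


def weighted_intersection(a, b, weight_a=1, weight_b=1):
    """Intersection of two {doc_id: score} maps, summing the scaled scores."""
    return {
        doc_id: a[doc_id] * weight_a + b[doc_id] * weight_b
        for doc_id in a.keys() & b.keys()
    }


title_hits = {1: 2, 2: 3}
description_hits = {2: 4, 3: 1}

either = weighted_union(title_hits, description_hits, 5, 1)
both = weighted_intersection(title_hits, description_hits, 5, 1)
```

With both weights left at 1 these reduce to ordinary score-summing union and intersection, which is why a weight only has meaning relative to the other weight in the same operation.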
The other is how it actually gets implemented. Does the weight need to be known at indexing time, or can it be provided at search time? My hunch is the weighting should be applied at search time, so your product could be modified to take as input the weights to apply to each index that is being searched through?
Could be done either way. Weighting at index time might be more efficient, but would not allow different weights to be applied for different queries. I doubt that query-time weighting would slow things down at all since it is already being done (only the weight factors are always 1). All of the set operations are implemented in C.
Something like:
result = catalog(dc_fields={"query":"Some search string", "fields":["Title","Description"]})
could become:
result = catalog(dc_fields={"query":"Some search string", "fields":["Title","Description"], "weights":[5,1]})
Sure, or maybe:
result = catalog(dc_fields={"query":"Some search string", "weighted_fields":{'title':5, 'description':1}})
This might be slightly less error prone (otherwise you need to match up the lists), if slightly less readable. :record marshalling for weighted_fields could also be supported for queries from web forms. Either spelling would work though and I'm open to input.
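For comparison, a rough sketch of how both spellings could be reduced to one field-to-weight mapping before searching. The helper name normalize_weights is hypothetical, as is the choice to default missing weights to 1; nothing like this exists in the product yet.

```python
# Hypothetical helper: neither the name nor the defaults come from the product.

def normalize_weights(query):
    """Return a {field: weight} dict from either query spelling."""
    if "weighted_fields" in query:
        # Mapping spelling: field and weight are already paired.
        return dict(query["weighted_fields"])
    fields = query["fields"]
    # List spelling: weights default to 1 when omitted (an assumption).
    weights = query.get("weights", [1] * len(fields))
    if len(weights) != len(fields):
        # This is the mismatch the mapping spelling cannot produce.
        raise ValueError("fields and weights must be the same length")
    return dict(zip(fields, weights))


from_lists = normalize_weights(
    {"query": "Some search string",
     "fields": ["Title", "Description"],
     "weights": [5, 1]}
)
from_mapping = normalize_weights(
    {"query": "Some search string",
     "weighted_fields": {"Title": 5, "Description": 1}}
)
```

The length check is the only extra validation the list spelling needs, which is the error-proneness trade-off in a nutshell.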
Meaning apply a weight of 5 to Title, and 1 to Description. Which I would in turn interpret as meaning Title is 5 times more important than Description (Not knowing any better right now).
Yes, scores for words found in the title would get multiplied by 5. Scores for description would get multiplied by 1.
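As a worked example of what that multiplication does to ranking, here is a small sketch with made-up scores. The rank function and the document ids are invented for illustration, and adding the per-field scores together is an assumption about how results from different fields would be merged.

```python
# Made-up scores and names, for illustration only.

def rank(per_field_scores, weights):
    """Sum weight * score per document and sort best-first."""
    totals = {}
    for field, scores in per_field_scores.items():
        weight = weights.get(field, 1)  # unlisted fields count once
        for doc_id, score in scores.items():
            totals[doc_id] = totals.get(doc_id, 0) + weight * score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


per_field_scores = {
    "Title": {"doc-a": 2},        # weak match, but in the title
    "Description": {"doc-b": 6},  # stronger match, in the description
}

unweighted = rank(per_field_scores, {})
weighted = rank(per_field_scores, {"Title": 5, "Description": 1})
```

Without weights doc-b (6) outranks doc-a (2); with the title weighted 5, doc-a scores 10 and moves ahead of doc-b. So "5" means a title hit counts five times as much as the same raw score in the description.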
Personally I'm using the Okapi algorithm. When I started investigating this, I came to the (admittedly uneducated) conclusion that to do proper, fast weighting, the Okapi implementation would have to be modified to support this feature (maybe it does already??), which is over my head, especially with the okascore module being Python/C. Doing it in Python would mean doing a second pass over the results that have already been scored once, which seems inefficient and computationally intensive (especially as I envision that really, really nice weighting algorithms would need to have all content in memory in order to do relational work between records).
I don't think the scoring algorithm would be affected by what you propose. I'd need to dig in a little deeper to be sure though.
Anyways, that's what I've been thinking about ... But the benefits of having such a beast seem really tantalizing, so I thought I'd ask anyways ... Besides, maybe I'm way out in left field on this and it's easier than I make it out to be?! :)
I think this is a very compelling addition to the product. I'm going to look at implementing it this weekend. Thanks for the idea! -Casey