RE: [Zope] Weighing catalog searches per index ?
Casey,

Ahhh, so it just multiplies the score. Which also means the scoring is applied to each field, instead of merging the fields and THEN scoring. But doesn't that mean that even if not restricting the search to specific fields, the score coming out of one of your indexes could be different than a pure ZCTextIndex which scores on just one big "blob" of textual content, instead of several small ones ... At least with Okapi that would presumably make a difference since part of the scoring is based on the total number of words in the document ?

But then that's not too big a deal, so long as whatever differences are explained/documented :) I can help with that if you'd like.

As for the syntax of the querying, I'm really indifferent, so long as it works :) I guess your suggestion does have advantages over mine indeed though !

Thanks for getting this done ! Let me know as soon as you've got it and I'll gladly try it out.

Since this can be made into a transparent extension of ZCTextIndex, I'd really suggest that if/when this is deemed mature enough, it replace the current ZCTextIndex. This searching functionality is kind of invaluable and extremely powerful, and I'm sure would be of great use to many once they find out about it !

Thanks for the great help!
J.F.

-----Original Message-----
From: Casey Duncan [mailto:casey@zope.com]
Sent: Thursday, January 08, 2004 4:54 PM
To: Jean-Francois.Doyon@CCRS.NRCan.gc.ca
Cc: zope@zope.org
Subject: Re: [Zope] Weighing catalog searches per index ?

On Thu, 8 Jan 2004 16:24:58 -0500 Jean-Francois.Doyon@CCRS.NRCan.gc.ca wrote:
Casey,
Thanks for pointing out this product, I'll have to give it a try, as I can foresee many useful applications for it !
Cool. It's new and I'm eager to get feedback from the field on it (no pun intended). [...]
Your product seems to have a good base to start with. The problem now, and one that stopped me in my tracks, is how to define/calculate/configure this "weighing" concept. You suggest there's some underlying functionality for weighing already, maybe it'd just be a matter of taking advantage of it, and documenting how to use it ? The big question would be what does a weight of "1" MEAN versus a weight of "2" or "5" ?
ZCTextIndex calculates document and word scores. When queries are performed these scores are combined as intermediate results are combined (using unions and intersections). The weighted versions of these commands allow you to weight one set differently than another. The weight multiplies the score by some factor as the set operation is performed.
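To make the weighted set operations concrete, here is a minimal pure-Python sketch. It assumes intermediate results are plain {docid: score} dicts and that documents present in both sets get the sum of their weighted scores. The function names weighted_union and weighted_intersection are illustrative only; the real operations are implemented in C and work on BTrees structures, not dicts.

```python
def weighted_union(a, b, weight_a=1, weight_b=1):
    """Union of two {docid: score} dicts; shared docids sum their weighted scores."""
    result = {doc: score * weight_a for doc, score in a.items()}
    for doc, score in b.items():
        result[doc] = result.get(doc, 0) + score * weight_b
    return result


def weighted_intersection(a, b, weight_a=1, weight_b=1):
    """Keep only docids found in both dicts, summing their weighted scores."""
    shared = a.keys() & b.keys()
    return {doc: a[doc] * weight_a + b[doc] * weight_b for doc in shared}


# Hypothetical intermediate results for two query terms.
first_hits = {1: 10, 2: 4}
second_hits = {2: 6, 3: 8}

print(weighted_union(first_hits, second_hits, 5, 1))
print(weighted_intersection(first_hits, second_hits, 5, 1))
```

With both weights left at 1 this reduces to an ordinary scored union or intersection, which matches the remark that the weighting is already being done with factors of 1.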
The other is how it gets purely implemented. Does the weight need to be known at indexing time, or can it be provided at search time ? My hunch is the weighing should be applied at search time, so your product could be modified to take as input the weights to apply to each index that is being searched ?
Could be done either way. Weighing at index time might be more efficient, but would not allow different weights to be applied for different queries. I doubt that query-time weighting would slow things down at all since it is already being done (only the weight factors are always 1). All of the set operations are implemented in C.
Something like:
result = catalog(dc_fields={"query":"Some search string", "fields":["Title","Description"]})
could become:
result = catalog(dc_fields={"query":"Some search string", "fields":["Title","Description"], "weights":[5,1]})
Sure, or maybe:
result = catalog(dc_fields={"query":"Some search string", "weighted_fields":{'title':5, 'description':1}})
This might be slightly less error prone (otherwise you need to match up the lists), if slightly less readable. :record marshalling for weighted_fields could also be supported for queries from web forms. Either spelling would work though and I'm open to input.
Meaning apply a weight of 5 to Title, and 1 to Description. Which I would in turn interpret as meaning Title is 5 times more important than Description (Not knowing any better right now).
Yes, scores for words found in the title would get multiplied by 5. Scores for description would get multiplied by 1.
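The two query spellings and the multiply-by-weight rule can be sketched as follows. This is an illustration under assumptions, not the product's code: to_weighted_fields and apply_field_weights are hypothetical helpers, and the per-field scores in field_scores are made-up numbers (the index itself may not store scores per field at all).

```python
def to_weighted_fields(fields, weights):
    """Turn parallel 'fields' and 'weights' lists into a single {field: weight} dict."""
    if len(fields) != len(weights):
        # This mismatch is the error the dict spelling rules out by construction.
        raise ValueError("fields and weights must be the same length")
    return dict(zip(fields, weights))


def apply_field_weights(field_scores, weighted_fields):
    """Combine {field: {docid: score}} into {docid: score}, multiplying by each field's weight."""
    combined = {}
    for field, weight in weighted_fields.items():
        for doc, score in field_scores.get(field, {}).items():
            combined[doc] = combined.get(doc, 0) + score * weight
    return combined


# Hypothetical per-field scores for one query.
field_scores = {"Title": {1: 2.0, 2: 1.0}, "Description": {2: 3.0}}
weights = to_weighted_fields(["Title", "Description"], [5, 1])

print(weights)
print(apply_field_weights(field_scores, weights))
```

In this toy data, document 1 matches only in the title and ends up ahead of document 2, even though document 2 has the higher unweighted total.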
Personally I'm using the Okapi algorithm. When I started investigating this, I came to the (admittedly uneducated) conclusion that to do proper, fast weighing, the Okapi implementation would have to be modified to support this feature (Maybe it does already ??), which is over my head, especially with the okascore module being Python/C. Doing it in Python would mean doing a second pass over the results that have already been scored once, which is inefficient it seems, and computationally intensive (especially as I envision the fact that really really nice weighing algorithms would need to have all content in memory in order to do relational work between records).
I don't think the scoring algorithm would be affected by what you propose. I'd need to dig in a little deeper to be sure though.
Anyways, that's what I've been thinking about ... But the benefits of having such a beast seem really tantalizing, so I thought I'd ask anyways ... Besides maybe I'm way out in left field on this and it's easier than I make it out to be ?! :)
I think this is a very compelling addition to the product. I'm going to look at implementing it this weekend. Thanks for the idea! -Casey
On Fri, 9 Jan 2004 18:00:11 -0500 Jean-Francois.Doyon@CCRS.NRCan.gc.ca wrote:
Casey,
Ahhh, so it just multiplies the score. Which also means the scoring is applied to each field, instead of merging the fields and THEN scoring. But doesn't that mean that even if not restricting the search to specific fields, the score coming out of one of your indexes could be different than a pure ZCTextIndex which scores on just one big "blob" of textual content, instead of several small ones ... At least with Okapi that would presumably make a difference since part of the scoring is based on the total number of words in the document ?
FieldedTextIndex actually still stores the word=>doc=>score mapping the same way ZCTextIndex does. It keeps a separate word=>field=>doc mapping (unscored). When you do a search without selecting any fields it only uses the first mapping, so the scores work out the same. In fact the amount of work is exactly the same. When you do specify fields it intersects the scored results from all documents with a union of documents found for each selected field. This intersection does not affect the scores presently however. This will be the place I add in the weighting per field (currently they all have a weight of 1).
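A toy model of the two mappings may help. It is a sketch only, assuming plain dicts and sets in place of the real persistent structures; the names word_doc_score, word_field_doc and search are hypothetical, and the scores are invented.

```python
# word => doc => score, stored the same way whether or not fields are used.
word_doc_score = {"zope": {1: 3.0, 2: 1.5, 3: 2.0}}

# word => field => docs containing the word in that field (unscored).
word_field_doc = {"zope": {"Title": {1}, "Description": {2, 3}}}


def search(word, fields=None):
    """Return {docid: score} for a word, optionally restricted to some fields."""
    scored = word_doc_score.get(word, {})
    if not fields:
        # No fields selected: only the scored mapping is consulted.
        return dict(scored)
    # Union of the documents found for each selected field.
    by_field = word_field_doc.get(word, {})
    allowed = set()
    for field in fields:
        allowed |= by_field.get(field, set())
    # Intersect with the scored results; the scores pass through unchanged.
    return {doc: score for doc, score in scored.items() if doc in allowed}


print(search("zope"))
print(search("zope", ["Title"]))
```

In this model the filtering step is where a per-field weight would be multiplied in; with every weight at 1 the scores come out as shown.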
As for the syntax of the querying, I'm really indifferent, so long as it works :) I guess your suggestion does have advantages over mine indeed though !
Yeah, it's just easier to see which scores go with which field.
Thanks for getting this done ! Let me know as soon as you've got it and I'll gladly try it out.
Sure. I'm glad to have victims^H^H^H^H^H^H^Husers to try it out on ;^)
Since this can be made into a transparent extension of ZCTextIndex, I'd really suggest that if/when this is deemed mature enough, it replace the current ZCTextIndex. This searching functionality is kind of invaluable and extremely powerful, and I'm sure would be of great use to many once they find out about it !
If it is generally deemed useful and enough people use it, I would definitely propose putting it in the Zope core. For now I'm happy to shake out the details as a separately distributed product. I don't think it will be able to fully replace ZCTextIndex though, mainly because the input data structure is different (a dict vs. a string or list of strings). Most applications (like CMF) define SearchableText to return a string. That is the way all TextIndexes have worked up until now. OTOH it would not be out of the question to make FieldedTextIndex understand a simple string input (and store it as a single field named "SearchableText" or "body"). Another, perhaps less compelling, argument against replacing ZCTextIndex wholesale is that FieldedTextIndex is a more expensive data structure when you only need a single text blob indexed. That quickly changes when you start replacing a bunch of ZCTextIndexes with a single FieldedTextIndex though. -Casey
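The idea of accepting a plain string as input could look roughly like this. It is a sketch under assumptions: normalize_fields is a hypothetical helper, and joining a list of strings with spaces is one possible choice, not something settled in this thread.

```python
def normalize_fields(value, default_field="SearchableText"):
    """Coerce index input to a {field: text} dict."""
    if isinstance(value, str):
        # A bare string becomes a single field, as ZCTextIndex-style callers provide.
        return {default_field: value}
    if isinstance(value, (list, tuple)):
        # A list of strings is joined into one blob under the same single field.
        return {default_field: " ".join(value)}
    # Anything else is assumed to already be a mapping of field name to text.
    return dict(value)


print(normalize_fields("Some page text"))
print(normalize_fields({"Title": "Zope", "Description": "An app server"}))
```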