ZCatalog/TextIndex: searching for the exact phrase "word1 word2"
Hiya, it basically says it in the subject. How can I search for "word1 word2" without ZCatalog/TextIndex interpreting it as "word1" or/and "word2"?
You should be able to do this with quotes around the words, but that feature is only half-wired-up at this point (it was *never* implemented... when I investigated this, I was amazed to see that there were hooks for it and everything within the query code, some marshalling code, some other stuff... and guess what it all led to... a method that didn't exist, and no actual way to do phrase searching!). Currently, quotes around words do the same thing as parens around words (word1 NEAR word2). Sigh.
- C
----- Original Message -----
From: "Erik Enge" <erik@thingamy.net>
To: <zope-dev@zope.org>
Sent: Friday, May 18, 2001 12:25 PM
Subject: [Zope-dev] ZCatalog/TextIndex: searching for the exact phrase "word1 word2"
Hiya,
it basically says it in the subject. How can I search for "word1 word2" without ZCatalog/TextIndex interpreting it as "word1" or/and "word2"?
On Fri, 18 May 2001, Chris McDonough wrote:
You should be able to do this with quotes around the words, but that feature is sort of only half-wired-up at this point. [snip] Currently, quotes around words do the same thing as parens around words (word1 NEAR word2). Sigh.
What does NEAR mean, then? How near is NEAR? Is it in line for 2.4?
Erik Enge wrote:
What does NEAR mean, then? How near is NEAR?
I believe it means within 8 words in the current implementation...
Is it in line for 2.4?
No, unfortunately. I'm not sure when it will be on the map. This is an area where someone outside of DC sufficiently motivated to make it work could help. - C
On Sat, 19 May 2001, Chris McDonough wrote:
I believe it means within 8 words in the current implementation...
So, "word1 NEAR wordlongerthan8characters" wouldn't come up with anything? Or is it the number of characters in between?
Is it in line for 2.4?
No, unfortunately. I'm not sure when it will be on the map. This is an area where someone outside of DC sufficiently motivated to make it work could help.
Hey, I'll be glad to help. If you could show me the ropes I'll hack away like, uhm, a hacker!
I believe it means within 8 words in the current implementation...
So, "word1 NEAR wordlongerthan8characters" wouldn't come up with anything? Or is it the number of characters in between?
8 words. Not characters. But actually I just looked at the source and it's not even that. It's treated essentially as an AND query, because the UnTextIndex code doesn't store any proximity information between words. (I knew this once, but I had forgotten.)

So basically, to re-answer the question, it's not possible with the current text index implementation (please forget the 8-word assertion, I was totally wrong). The UnTextIndex code would need to be modified to store proximity information about words in documents that are indexed. This could be done by keeping a data structure called a proximity mapping inside the text index, which is basically:

    {wordid: {docid: [position, position, ...]}}

When a document was indexed, for example "The band Tool rocks rocks", the words would be run through the splitter, and it would become (likely) ['band', 'tool', 'rocks', 'rocks']. The document (which is given an id) would then be stored normally in the text index, and as a part of that course, each word would be entered into the Vocabulary (lexicon), and would be assigned a word id. For this example, let's say the words ['band', 'tool', 'rocks'] turned into [42, 45, 78] and that the document id assigned to the indexed document "The band Tool rocks rocks" is 15. As part of the indexing process, the following stuff would be added to the proximity dictionary:

    {42: {15: [0]}}
    {45: {15: [1]}}
    {78: {15: [2, 3]}}

(...with 0, 1, 2, and 3 being the index positions within documentid 15).

Then later when a phrase was looked up, for example:

    "tool rocks rocks"

... the Catalog would treat this as (perhaps) "tool ADJOINEDBY rocks ADJOINEDBY rocks". Then perhaps the following happens:

1. The query would search in the normal textindex data structures for document ids containing wordids 45 and 78 (it asks the vocabulary to resolve the wordids). I forget exactly how this works currently, but it works right now for AND queries, obviously.

2. The query notices that there is an ADJOINEDBY in the query and subsequently asks the proximity mapping to return the document mappings for wordids 45 and 78 (the words "tool" and "rocks").

3. Each of these mappings is asked for the list of positions related to the document ID it got in step 1 (which in our case is 15 only).

4. For each of the returned lists:

       [1]
       [2,3]

   ... the code would (beginning at the end of the adjoinedby component of the query) compare the positions of the last words, starting at "rocks ADJOINEDBY rocks". It might be code like this:

       def adjoinedby(poslist1, poslist2, difference=1):
           # in our example, poslist1 = [2,3]
           #                 poslist2 = [2,3]
           for x in poslist2:
               if (x - difference) in poslist1:
                   return 1  # will return true
           return 0  # no two positions are exactly `difference` apart

... assuming we get a match, we move on to the prior word pair in the adjoinedby list: "tool ADJOINEDBY rocks", and so on. If all the comparisons return true for the adjoinedby chain, we return the documentid 15 to the Catalog (the query matched documentid 15). We can also do NEAR searching this way, by increasing the "difference".
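To make the outline above a bit more concrete, here is a minimal, self-contained sketch of the proximity mapping and a phrase lookup on top of it. Everything in it is an assumption for illustration: `ProximityIndex`, `split`, `STOP_WORDS`, `index_document` and `phrase_search` are hypothetical names, not part of ZCatalog, UnTextIndex or the Vocabulary, and the word ids it assigns are simple counters rather than the 42/45/78 used in the example. It also differs from the pairwise check in one respect: it anchors on each position of the first word and requires every following word at the next offset, because checking each pair independently could accept pairs that sit in different parts of the document.

```python
# Hypothetical sketch only: none of these names exist in ZCatalog/UnTextIndex.

STOP_WORDS = {"the"}  # assumed stop-word list, just enough for the example


def split(text):
    """Stand-in for the splitter: lowercase, split on whitespace, drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


class ProximityIndex:
    def __init__(self):
        self.lexicon = {}    # word -> wordid (stand-in for the Vocabulary)
        self.positions = {}  # wordid -> {docid: [position, position, ...]}

    def index_document(self, docid, text):
        """Record the position of every word of `text` under `docid`."""
        for pos, word in enumerate(split(text)):
            # Assign the next free integer as wordid the first time a word is seen.
            wordid = self.lexicon.setdefault(word, len(self.lexicon))
            self.positions.setdefault(wordid, {}).setdefault(docid, []).append(pos)

    def phrase_search(self, phrase):
        """Return sorted docids in which the words of `phrase` occur consecutively."""
        words = split(phrase)
        if not words or any(w not in self.lexicon for w in words):
            return []
        # One {docid: [positions]} mapping per word of the phrase.
        maps = [self.positions[self.lexicon[w]] for w in words]
        # AND step: only documents containing every word can match.
        candidates = set(maps[0])
        for m in maps[1:]:
            candidates &= set(m)
        hits = []
        for docid in sorted(candidates):
            # Try each occurrence of the first word as the start of the phrase.
            for start in maps[0][docid]:
                if all(start + i in m[docid] for i, m in enumerate(maps)):
                    hits.append(docid)
                    break
        return hits


index = ProximityIndex()
index.index_document(15, "The band Tool rocks rocks")
print(index.phrase_search("tool rocks rocks"))
```

A NEAR-style query is not covered here; it would need the position test relaxed to a window rather than an exact offset.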
Hey, I'll be glad to help. If you could show me the ropes I'll hack away like, uhm, a hacker!
Well, I know the above is a little hairy, but it's an outline. The first thing to do is to write a fishbowl proposal explaining why it should be done and perhaps how it can be done... - C
On Sat, 19 May 2001, Chris McDonough wrote:
8 words. Not characters. But actually I just looked at the source and it's not even that. It's treated essentially as an AND query, because the UnTextIndex code doesn't store any proximity information between words. (I knew this once, but I had forgotten.)
I've done a lot of practical testing with it now, and it seems that the exact phrase query might be overrated. So far, the already-available query types have sufficed. I'll let the client do even more testing, but if they don't really need it, I can't justify spending time on it. Obviously :)
Well, I know the above is a little hairy, but it's an outline. The first thing to do is to write a fishbowl proposal explaining why it should be done and perhaps how it can be done...
As I said, if the client blahblablbalbah, then I'll try a fishbowl proposal. Thanks, anyway, though :-).
I've done a lot of practical testing with it now, and it seems that the exact phrase query might be overrated. So far, the already-available query types have sufficed. I'll let the client do even more testing, but if they don't really need it, I can't justify spending time on it. Obviously :)
OK, no problem... This is our excuse too. ;-)
participants (2): Chris McDonough, Erik Enge