[Zope-dev] ZCatalog/TextIndex: searching for the exact phrase"word1 word2"
Chris McDonough
chrism@digicool.com
Sat, 19 May 2001 20:18:08 -0400
> > I believe it means within 8 words in the current implementation...
>
> So, "word1 NEAR wordlongerthan8characters" wouldn't come up with
> anything? Or is it number of characters inbetween?
8 words. Not characters. But actually I just looked at the source and it's
not even that. It's treated essentially as an AND query, because the
UnTextIndex code doesn't store any proximity information between words.
( I knew this once, but I had forgotten.
So basically, to re-answer the question, it's not possible with the current
text index implementation (please forget the 8-word assertion, I was totally
wrong). The UnTextIndex code would need to be modified to store proximity
information about words in documents that are indexed. This could be done
by keeping a data stucture called a proximity mapping inside the text index
which is basically:
{wordid: {docid: [position, position, ...]}}
When a document was indexed, for example "The band Tool rocks rocks", the
words would be run through the splitter, and it would become (likely)
['band', 'tool', 'rocks', 'rocks']. The document (which is given an id)
would then be stored normally in the text index, and as a part of that
course, each word would be entered into the Vocabulary (lexicon), and would
be assigned a word id. For this example, let's say the words ['band',
'tool', 'rocks'] turned into [42, 45, 78] and that the document id assigned
to the indexed document "The band tool rocks rocks" is 15. As part of the
indexing process, the following stuff would be added to the proximity
dictionary.
{42: {15: [0]}}
{45: {15: [1]}}
{78: {15: [2, 3]}}
(...with 0, 1, 2, and 3 being the index position within documentid 15).
Then later when a phrase was looked up, for example:
"tool rocks rocks"
... the Catalog would treat this as (perhaps) "tool ADJOINEDBY rocks
ADJOINEDBY rocks".
Then perhaps the following happens:
1. The query would search in the normal textindex data structures for
documents ids containing wordids 45 and 78 (it asks the vocabulary to
resolve the wordid). I forget exactly how this works currently, but it
works right now for AND queries, obviously.
2. the query notices that there is an ADJOINEDBY in the query and
subsequently asks the proximity mapping to return the document mappings for
wordids 45 and 78 (the words "tool" and "rocks").
3. Each of these mappings is asked for the list of positions related to the
document ID it got in step 1 (which in our case is 15 only).
4. For each of the returned lists:
[1]
[2,3]
... the code would (beginning at the end of the adjoinedby component of the
query) compare the positions of the last words: starting at "rocks
ADJOINEDBY rocks". It might be code like this:
def adjoinedby(poslist1, poslist2, difference=1):
# in our example, poslist1 = [2,3]
# poslist2 = [2,3]
for x in poslist2:
if (x - difference) in poslist1:
return 1 # will return true.
... assuming we get a match, we move on to the prior word pair in the
adjoinedby list: "tool ADJOINEDBY rocks", and so on. If all the comparisons
return true for the adjoinedby chain, we return the documentid 15 to the
Catalog (the query matched documentid 15). We can also do NEAR searching
this way, by increasing the "difference".
> Hey, I'll be glad to help. If you could show me the ropes I'll hack away
> like, uhm, a hacker!
Well, I know the above is a little hairy, but it's an outline. The first
thing to do is to write a fishbowl proposal explaining why it should be done
and perhaps how it can be done...
- C