[Zope-dev] Wildcards in TextIndex query. Do they work?

Sat, 23 Jun 2001 22:59:32 +0200

Erik Enge wrote:
> 
> On Wed, 30 May 2001, Erik Enge wrote:
> 
> > I'm going bug hunting...
> 
> I'm back :)
> 
> I think I found the bug.  In lib/python/SearchIndex/GlobbingLexicon.py in
> the query_hook() method.  It seems to say that: "if I can't find a '*' or
> a '?' in the word, then go to else-clause", where the else-clause says
> sodd off.
> 
> Since it iterates over the query, 'word' is actually a list if you use
> parens in your query, and you won't find any wildcards there.  I think.
> 
> Add a dash of recursiveness, and it seems to be solved (for me):
> 
>     def erik_hook(self, q):
>         "doc string"
>         words = []
>         for w in q:
>             if ( (self.multi_wc in w) or
>                  (self.single_wc in w) ):
>                 wids = self.get(w)
>                 for wid in wids:
>                     if words:
>                         words.append(Or)
>                     words.append(wid)
>             else:
>                 words.append(self.erik_hook(w))
>         return words or ['']
> 
>     def query_hook(self, q):
>         """expand wildcards"""
>         words = []
>         for w in q:
>             if ( (self.multi_wc in w) or
>                  (self.single_wc in w) ):
>                 wids = self.get(w)
>                 for wid in wids:
>                     if words:
>                         words.append(Or)
>                     words.append(wid)
>             else:
>                 words.append(self.erik_hook(w))
> 
> Not really tested, but it seems to work.  This might have been resolved in
> CVS, I don't know, should I post it as a bug?

Erik,

I'm afraid that your patch does not solve all the problems you mentioned
in an earlier mail.

You are right that the implementation of query_hook in Zope 2.3.2 and
2.4.0b1 cannot handle words with wildcards in nested lists, but your
patch leads to endless recursion for the simplest possible query: a
single word without wildcards. In this case, "if (
(self.multi_wc in w)..." evaluates to false, so self.erik_hook is
called for this word, where "if ( (self.multi_wc in w)..." is again
false, and erik_hook is called again...
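
The recursion can be reproduced outside Zope. The following is only a
sketch: PatchedLexicon is a hypothetical stand-in for GlobbingLexicon,
its get() is a dummy, and Or is assumed to be the string 'or'. The point
is that a plain string is itself iterable, so erik_hook keeps calling
itself on single characters until the interpreter gives up.

```python
Or = 'or'  # assumption: stand-in for the Or operator constant


class PatchedLexicon:
    """Hypothetical stand-in for GlobbingLexicon with the patched hook."""
    multi_wc = '*'
    single_wc = '?'

    def get(self, pattern):
        # Dummy lookup; it is never reached for a wildcard-free word.
        return []

    def erik_hook(self, q):
        words = []
        for w in q:
            if (self.multi_wc in w) or (self.single_wc in w):
                for wid in self.get(w):
                    if words:
                        words.append(Or)
                    words.append(wid)
            else:
                # 'zope' -> 'z' -> 'z' -> ... : iterating a one-character
                # string yields that same character again, so this
                # call never terminates.
                words.append(self.erik_hook(w))
        return words or ['']


def recursion_overflows(query):
    """Return True if erik_hook exhausts the recursion limit on query."""
    try:
        PatchedLexicon().erik_hook(query)
    except RecursionError:
        return True
    return False


overflowed = recursion_overflows(['zope'])
```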

The statement "q = parse(s)" in UnTextIndex.query (and
PositionIndex.query) before the call to query_hook can return nested
lists, so query_hook must be aware of this.

This can be done with:

        def query_hook(self, q):
            """expand wildcards"""
            words = []
            for w in q:
                if type(w) is type([]):
                    words.append(self.query_hook(w))
                else:
                    if ( (self.multi_wc in w) or
                         (self.single_wc in w) ):
                        wids = self.get(w)
                        for wid in wids:
                            if words:
                                words.append(Or)
                            words.append(wid)
                    else:
                        words.append(w)
            # if words is empty, return something that will make
            # textindex's __getitem__ return an empty result list
            return words or ['']

You also mentioned the strange results of queries like "eri* and enge".
These are caused by another bug in query_hook:

The results from the wildcard expansion are simply inserted into the
result list. Example: "ab* and xyz" may be expanded by query_hook into

	['aba', 'or', 'abb', 'or', 'abc', 'and', 'xyz']

Since UnTextIndex.evaluate looks first for 'and' operators, this is
equivalent to

	['aba', 'or', 'abb', 'or', ['abc', 'and', 'xyz']]

(The funny (or confusing) side effect is that "ab* and xyz" may return
different results compared with "xyz and ab*", because "aba and xyz"
probably gives results different from those for "abc and xyz".)

What we need instead is a result like

	[['aba', 'or', 'abb', 'or', 'abc'], 'and', 'xyz']
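
To make the precedence effect concrete, here is a toy evaluator. It is
not UnTextIndex.evaluate, only a sketch that assumes the same ordering
described above (nested lists first, then 'and', then 'or'). The index
contents and document ids are made up for illustration.

```python
def evaluate(query, index):
    """Toy evaluator: nested lists first, then 'and', then 'or'."""
    # Resolve operands: nested lists recursively, words via the index.
    items = []
    for item in query:
        if isinstance(item, list):
            items.append(evaluate(item, index))
        elif item in ('and', 'or'):
            items.append(item)
        else:
            items.append(set(index.get(item, ())))
    # Collapse every "left and right" triple before looking at 'or'.
    i = 0
    while i < len(items):
        if isinstance(items[i], str) and items[i] == 'and':
            items[i - 1:i + 2] = [items[i - 1] & items[i + 1]]
        else:
            i += 1
    # Whatever is left is joined by 'or': take the union.
    result = set()
    for item in items:
        if not isinstance(item, str):
            result |= item
    return result


# Made-up index: word -> set of document ids.
index = {'aba': {1, 2}, 'abb': {3}, 'abc': {4, 5}, 'xyz': {2, 5}}

flat = ['aba', 'or', 'abb', 'or', 'abc', 'and', 'xyz']
flat_reversed = ['xyz', 'and', 'aba', 'or', 'abb', 'or', 'abc']
grouped = [['aba', 'or', 'abb', 'or', 'abc'], 'and', 'xyz']

flat_result = evaluate(flat, index)
reversed_result = evaluate(flat_reversed, index)
grouped_result = evaluate(grouped, index)
```

With this data the two flat forms give different document sets, and
both differ from the grouped form, which is the only one that matches
what "ab* and xyz" is meant to express.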

The version of query_hook below fixes the problem:

    def query_hook(self, q):
        """expand wildcards"""
        words = []
        for w in q:
            if type(w) is type(''):
                if ( (self.multi_wc in w) or
                     (self.single_wc in w) ):
                    wids = self.get(w)
                    alternatives = []
                    for wid in wids:
                        if alternatives:
                            alternatives.append(Or)
                        alternatives.append(wid)
                    words.append(alternatives or [''])
                else:
                    words.append(w)
            else:
                words.append(self.query_hook(w))
        # if words is empty, return something that will make textindex's
        # __getitem__ return an empty result list
        return words or ['']
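
The version above can be tried outside Zope with a stand-in lexicon.
This is a sketch under some assumptions: StubLexicon is hypothetical,
its get() matches patterns with fnmatch and returns the words
themselves rather than word ids, and Or is assumed to be the string
'or'. isinstance() replaces the type() comparison, otherwise the logic
is the same.

```python
from fnmatch import fnmatchcase

Or = 'or'  # assumption: stand-in for the Or operator constant


class StubLexicon:
    """Hypothetical lexicon: a plain word list instead of a real index."""
    multi_wc = '*'
    single_wc = '?'

    def __init__(self, vocabulary):
        self.vocabulary = sorted(vocabulary)

    def get(self, pattern):
        # The real get() returns word ids; words are returned here so
        # that the expanded query stays readable.
        return [w for w in self.vocabulary if fnmatchcase(w, pattern)]

    def query_hook(self, q):
        """expand wildcards"""
        words = []
        for w in q:
            if isinstance(w, str):
                if (self.multi_wc in w) or (self.single_wc in w):
                    alternatives = []
                    for wid in self.get(w):
                        if alternatives:
                            alternatives.append(Or)
                        alternatives.append(wid)
                    # Keep the expansion grouped in its own sublist.
                    words.append(alternatives or [''])
                else:
                    words.append(w)
            else:
                # Nested list from the parser: expand it recursively.
                words.append(self.query_hook(w))
        return words or ['']


lexicon = StubLexicon(['aba', 'abb', 'abc', 'xyz'])
expanded = lexicon.query_hook(['ab*', 'and', 'xyz'])
nested = lexicon.query_hook([['ab?', 'or', 'xyz'], 'and', 'nomatch*'])
plain = lexicon.query_hook(['xyz'])
```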

You also mentioned the parse result

	['abc', '...', '...', '...', 'def']

which you could not reproduce. Playing with Catalogs, I accidentally
produced the corresponding query: '"abc ... def"', i.e., the double
quotation marks are part of the query string. UnTextIndex.quotes splits
the string between two quotation marks at word boundaries and inserts
'...' between all words:

         for i in range(1,len(splitted),2):
             # split the quoted region into words
             splitted[i] = filter(None, split(splitted[i]))

             # put the Proximity operator in between quoted words
             for j in range(1, len(splitted[i])):
                 splitted[i][j : j] = [ Near ]

I think that UnTextIndex.quotes should remove all query operators
contained in the quoted string before the '...' operator is inserted;
otherwise the stray operators end up in the parsed query and the search
results will be quite fancy.
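
A rough sketch of that idea, not a patch against UnTextIndex: the
helper names are hypothetical, and the operator spellings ('and', 'or',
'andnot', '...') are an assumption about what the parser recognises.

```python
NEAR = '...'
# Assumption: the operator tokens the query parser understands.
OPERATORS = {'and', 'or', 'andnot', NEAR}


def join_with_near(words):
    """Insert the proximity operator between consecutive words."""
    joined = []
    for word in words:
        if joined:
            joined.append(NEAR)
        joined.append(word)
    return joined


def quote_naive(phrase):
    # Mirrors the current behaviour: every token counts as a word.
    return join_with_near(phrase.split())


def quote_stripped(phrase):
    # Proposed behaviour: drop operator tokens before joining.
    words = [w for w in phrase.split() if w.lower() not in OPERATORS]
    return join_with_near(words)


naive = quote_naive('abc ... def')
stripped = quote_stripped('abc ... def')
```

Here quote_naive reproduces the odd parse result with three '...'
entries, while quote_stripped yields a plain proximity query for the
two words.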

Abel