Wildcards in TextIndex query. Do they work?

Erik Enge

24 May 2001

10:09 a.m.

Hi,

is it me, or is this just not working:

  (word1 or word*) and (wor?3)

ie. wildcards in TextIndex queries. I can't seem to make it work, and I'm not able to track down where it stops working. Should it work in the first place?

Zope 2.3.2

Thanks.

Casey Duncan

24 May

4:56 p.m.

New subject: [Zope-dev] Wildcards in TextIndex query. Do they work?

Erik Enge wrote:

...

Hi,

is it me, or is this just not working:

(word1 or word*) and (wor?3)

ie. wildcards in TextIndex queries. I can't seem to make it work, and I'm not able to track down where it stops working. Should it work in the first place?

Zope 2.3.2

Thanks.

Works great for me. Perhaps you are using a Vocabulary that has Globbing turned off?

--
| Casey Duncan
| Kaivo, Inc.
| cduncan@kaivo.com
`------------------>

Erik Enge

7:40 p.m.

New subject: [Zope-dev] Wildcards in TextIndex query. Do they work?

On Thu, 24 May 2001, Casey Duncan wrote:

...

Works great for me. Perhaps you are using a Vocabulary that has Globbing turned off?

I'm not sure, how do I check?

This query works:

  wil?car*

This doesn't:

  (wil?car* or something else) and (word1 and word2)

I can't see that the query-parsers in UnTextIndex.py transforms them differently, but I might be missing something obvious.

Casey Duncan

7:40 p.m.

New subject: [Zope-dev] Wildcards in TextIndex query. Do they work?

Erik Enge wrote:

...

On Thu, 24 May 2001, Casey Duncan wrote:

...
Works great for me. Perhaps you are using a Vocabulary that has Globbing turned off?

I'm not sure, how do I check?

This query works:

wil?car*

This doesn't:

(wil?car* or something else) and (word1 and word2)

I'm not sure how well grouping with parens is supported right now. I know phrase matching isn't supported very well.

...

I can't see that the query-parsers in UnTextIndex.py transforms them differently, but I might be missing something obvious.

--
| Casey Duncan
| Kaivo, Inc.
| cduncan@kaivo.com
`------------------>

Michel Pelletier

8:26 p.m.

New subject: [Zope-dev] Wildcards in TextIndex query. Do they work?

On Thu, 24 May 2001, Erik Enge wrote:

...

This query works:

wil?car*

This doesn't:

(wil?car* or something else) and (word1 and word2)

If the first works, then you are using a globbing vocabulary. The second one should work, but maybe there is a bug. Or perhaps your search criteria is so strict that you are getting no results.

...

I can't see that the query-parsers in UnTextIndex.py transforms them differently, but I might be missing something obvious.

There's _nothing_ obvious in that particular chunk of code.

-Michel

Erik Enge

9:18 p.m.

New subject: [Zope-dev] Wildcards in TextIndex query. Do they work?

On Thu, 24 May 2001, Michel Pelletier wrote:

...

If the first works, then you are using a globbing vocabulary. The second one should work, but maybe there is a bug. Or perhaps your search criteria is so strict that you are getting no results.

Hm. Something isn't right here.

This:

  eric

got me 70 hits.

This:

  eri?

got me 4 hits.

That's a bit strange, if you ask me :)

This:

  (erik) and (enge)

returned 1 hit

This:

  (erik) and (eng?)

gave me none.

The first one looked like this after the parsers had nibbled on it:

  [['erik'], 'and', ['enge']]

And the latter one:

  [['erik'], 'and', ['eng?']]

This:

  (erik ... enge)

turned in to:

  [['erik', '...', 'enge']]

and returned one (correct) result. Although, I recall seeing something like this:

  [['erik', '...', '...', '...', 'enge']]

earlier, which probably should never make sense.

Something is wrong here. Where should I look next to figure out what's going on?

...

...
I can't see that the query-parsers in UnTextIndex.py transforms them differently, but I might be missing something obvious.

There's _nothing_ obvious in that particular chunk of code.

Good, then it's just not me. Is the overall design philosophy for ZCatalog/Catalog/SearchIndex documented anywhere? (By the way, from lib/python/SearchIndex/TextIndex.py, what is sws and cv3?)

Christian Robottom Reis

9:20 p.m.

New subject: [Zope-dev] Wildcards in TextIndex query. Do they work?

On Thu, 24 May 2001, Erik Enge wrote:

...

Good, then it's just not me. Is the overall design philosophy for ZCatalog/Catalog/SearchIndex documented anywhere? (By the way, from lib/python/SearchIndex/TextIndex.py, what is sws and cv3?)

I'm trying to get a knot of knowledge into my head by studying the SearchIndex modules. I'm writing more or less what's happening out on

  http://wiki.async.com.br/index.php?SearchIndex

but I'm not moving so fast, as the code isn't too trivial. If you want to stop by and help, by all means. :-)

Take care,
--
/\/\ Christian Reis, Senior Engineer, Async Open Source, Brazil
~\/~ http://async.com.br/~kiko/ | [+55 16] 274 4311

Michel Pelletier

9:37 p.m.

New subject: [Zope-dev] Wildcards in TextIndex query. Do they work?

On Thu, 24 May 2001, Erik Enge wrote:

...

On Thu, 24 May 2001, Michel Pelletier wrote:

...
If the first works, then you are using a globbing vocabulary. The second one should work, but maybe there is a bug. Or perhaps your search criteria is so strict that you are getting no results.

Hm. Something isn't right here.

I don't think you are using a globbing vocabulary.

...

This:

eric

got me 70 hits.

This:

eri?

got me 4 hits.

If you are not using a glob vocab, I suspect it stripped out the ? and is hitting on 'eri'. Do you have that word anywhere?

...

That's a bit strange, if you ask me :)

This:

(erik) and (enge)

returned 1 hit

This:

(erik) and (eng?)

gave me none.

Which could make sense if you were not using a glob vocab.

...

The first one looked like this after the parsers had nibbled on it:

[['erik'], 'and', ['enge']]

And the latter one:

[['erik'], 'and', ['eng?']]

This one should look like [['erik'], 'and', ['enge', 'engs', 'engf', ...]] and match all the words that match the pattern eng?. If this isn't being expanded, then you are not using a globbing vocabulary. Then again, where did you get these objects? If you were looking at the wrong point in the code, the wildcards may not have been expanded yet.
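To illustrate the expansion described above, here is a minimal sketch in modern Python. It is not the GlobbingLexicon code: the vocabulary list and the function name `expand_pattern` are made up for the example, the standard-library `fnmatch` module stands in for the lexicon's own pattern matching, and joining the matches with 'or' is an assumption about how the alternatives are combined.

```python
from fnmatch import fnmatchcase

# Hypothetical vocabulary; a real lexicon holds every indexed word.
VOCABULARY = ["enge", "engs", "english", "erik", "eri"]


def expand_pattern(pattern, vocabulary=VOCABULARY):
    """Return the vocabulary words matching a glob pattern, joined by 'or'."""
    matches = [word for word in vocabulary if fnmatchcase(word, pattern)]
    expanded = []
    for word in matches:
        if expanded:
            expanded.append("or")
        expanded.append(word)
    return expanded


# The parsed query [['erik'], 'and', ['eng?']] after wildcard expansion:
query = [["erik"], "and", expand_pattern("eng?")]
print(query)  # [['erik'], 'and', ['enge', 'or', 'engs']]
```

If the pattern is still present as the literal string 'eng?' at the point where the index is consulted, no expansion of this kind has happened yet.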

...

[['erik', '...', '...', '...', 'enge']]

Where do you see this?

...

Where should I look next to figure out what's going on?

Make sure you are using a globbing vocab. Note that you can't change a catalog's vocabulary once the catalog is made, so you have to make a new catalog.

...

...
...
I can't see that the query-parsers in UnTextIndex.py transforms them differently, but I might be missing something obvious.

There's _nothing_ obvious in that particular chunk of code.

Good, then it's just not me. Is the overall design philosophy for ZCatalog/Catalog/SearchIndex documented anywhere?

The catalog has evolved over the past four years. Most of the text index query parser code was written by someone long gone from this company, and certainly way before my time. The catalog is, in fact, the evolution of a completely different product called ZTables, now long dead in the annals of Principia history. This person did not document their design, so the answer is no. I had some UML models once, but my modeling tool ate them.

...

(By the way, from lib/python/SearchIndex/TextIndex.py, what is sws and cv3?)

Very old consulting projects, looooong dead.

-Michel

Erik Enge

25 May

8:39 a.m.

New subject: [Zope-dev] Wildcards in TextIndex query. Do they work?

On Thu, 24 May 2001, Michel Pelletier wrote:

...

I don't think you are using a globbing vocabulary.

But globbing works for other queries. In the same catalog.

...

If you are not using a glob vocab, I suspect it stripped out the ? and is hitting on 'eri'. Do you have that word anywhere?

I tried searching for: eri and that gave me four results. No globbing, then?

...

Then again, where did you get these objects? If you were looking at the wrong point in the code, the wildcards may not have been expanded yet.

Could be it...

...

...
[['erik', '...', '...', '...', 'enge']]

Where do you see this?

I can't reproduce it right now, I'll let you know if I see it again.

...

Make sure you are using a globbing vocab. Note that you can't change a catalog's vocabulary once the catalog is made, so you have to make a new catalog.

How can I change it for a new one?

Erik Enge

29 May

1:44 p.m.

New subject: [Zope-dev] Wildcards in TextIndex query. Do they work?

On Thu, 24 May 2001, Michel Pelletier wrote:

...

I don't think you are using a globbing vocabulary.

I think I am:

...

...
...
    print_info(applic.Catalog(word='scripto*'))
    unsplitted ['scripto*']
    unl: ['scripto*']
    unq: [104623, 'or', 112198, 'or', 151568]
    Length: 6
    Content: [<mybrains instance at 1226d358>, <mybrains instance at 127bb540>,
              <mybrains instance at 12bd8138>, <mybrains instance at 127bb658>,
              <mybrains instance at 1226c620>, <mybrains instance at 12092eb0>]

    print_info(applic.Catalog(word='(scripto*)'))
    unsplitted []
    unsplitted ['scripto*']
    unl: ['scripto*']
    unsplitted []
    unl: [['scripto*']]
    unq: ['scripto*']
    unq: [['scripto*']]
    Length: 0
    Content: []

the unsplitted, unl and unq are my debug flags, but you can see what happens: without parens the '*' has its desired effect, with, it doesn't.

Got a clue? Is this my bug, or ZCatalog's?

Michel Pelletier

5:16 p.m.

New subject: [Zope-dev] Wildcards in TextIndex query. Do they work?

On Tue, 29 May 2001, Erik Enge wrote:

...

On Thu, 24 May 2001, Michel Pelletier wrote:

the unsplitted, unl and unq are my debug flags, but you can see what happens: without parens the '*' has its desired effect, with, it doesn't.

Got a clue? Is this my bug, or ZCatalog's?

Must be ZCatalog's. I'm guessing the paren matching takes a different code path that doesn't expand wildcards.

-Michel

Erik Enge

30 May

8:18 a.m.

New subject: [Zope-dev] Wildcards in TextIndex query. Do they work?

On Tue, 29 May 2001, Michel Pelletier wrote:

...

Must be ZCatalog's. I'm guessing the paren matching takes a different code path that doesn't expand wildcards.

I'm going bug hunting... If I'm not back in five minutes... just wait longer. (Ace Ventura)

Erik Enge

9 a.m.

New subject: [Zope-dev] Wildcards in TextIndex query. Do they work?

On Wed, 30 May 2001, Erik Enge wrote:

...

I'm going bug hunting...

I'm back :)

I think I found the bug. In lib/python/SearchIndex/GlobbingLexicon.py in the query_hook() method. It seems to say that: "if I can't find a '*' or a '?' in the word, then go to else-clause", where the else-clause says sod off.

Since it iterates over the query, 'word' is actually a list if you use parens in your query, and you won't find any wildcards there. I think.

Add a dash of recursiveness, and it seems to be solved (for me):

    def erik_hook(self, q):
        "doc string"
        words = []
        for w in q:
            if ( (self.multi_wc in w) or (self.single_wc in w) ):
                wids = self.get(w)
                for wid in wids:
                    if words:
                        words.append(Or)
                    words.append(wid)
            else:
                words.append(self.erik_hook(w))
        return words or ['']

    def query_hook(self, q):
        """expand wildcards"""
        words = []
        for w in q:
            if ( (self.multi_wc in w) or (self.single_wc in w) ):
                wids = self.get(w)
                for wid in wids:
                    if words:
                        words.append(Or)
                    words.append(wid)
            else:
                words.append(self.erik_hook(w))

Not really tested, but it seems to work. This might have been resolved in CVS, I don't know, should I post it as a bug?
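The diagnosis above rests on how Python's `in` operator behaves on strings versus lists. The following is a small stand-alone sketch in modern Python, not the Zope code itself; the constants `MULTI_WC` and `SINGLE_WC` and the helper `has_wildcard` are hypothetical names chosen to mirror the `multi_wc` and `single_wc` attributes used in the patch.

```python
MULTI_WC = "*"   # assumed to mirror the lexicon's multi_wc attribute
SINGLE_WC = "?"  # assumed to mirror the lexicon's single_wc attribute


def has_wildcard(item):
    """Apply the same plain `in` test the hook uses on each query element."""
    return MULTI_WC in item or SINGLE_WC in item


# On a string, `in` is a substring test, so the wildcard is found.
print(has_wildcard("scripto*"))    # True

# On a list, `in` is a membership test: no element equals "*" or "?",
# so a parenthesised sub-query never looks like it contains a wildcard.
print(has_wildcard(["scripto*"]))  # False
```

This matches the debug output earlier in the thread, where '(scripto*)' reaches the hook as a nested list and comes back unexpanded.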

Chris McDonough

9:16 a.m.

New subject: [Zope-dev] Wildcards in TextIndex query. Do they work?

Thanks for tracking this down...

If you're so inclined, please put this in the Collector (with a description of the problem, as well as a way to reproduce it, the patch alone isn't nearly as helpful) so it doesn't get dropped on the floor.

I doubt very much that it's fixed in CVS.

- C

Erik Enge wrote:

...

On Wed, 30 May 2001, Erik Enge wrote:

...
I'm going bug hunting...

I'm back :)

I think I found the bug. In lib/python/SearchIndex/GlobbingLexicon.py in the query_hook() method. It seems to say that: "if I can't find a '*' or a '?' in the word, then go to else-clause", where the else-clause says sod off.

Since it iterates over the query, 'word' is actually a list if you use parens in your query, and you won't find any wildcards there. I think.

Add a dash of recursiveness, and it seems to be solved (for me):

    def erik_hook(self, q):
        "doc string"
        words = []
        for w in q:
            if ( (self.multi_wc in w) or (self.single_wc in w) ):
                wids = self.get(w)
                for wid in wids:
                    if words:
                        words.append(Or)
                    words.append(wid)
            else:
                words.append(self.erik_hook(w))
        return words or ['']

    def query_hook(self, q):
        """expand wildcards"""
        words = []
        for w in q:
            if ( (self.multi_wc in w) or (self.single_wc in w) ):
                wids = self.get(w)
                for wid in wids:
                    if words:
                        words.append(Or)
                    words.append(wid)
            else:
                words.append(self.erik_hook(w))

Not really tested, but it seems to work. This might have been resolved in CVS, I don't know, should I post it as a bug?

_______________________________________________ Zope-Dev maillist - Zope-Dev@zope.org http://lists.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://lists.zope.org/mailman/listinfo/zope-announce http://lists.zope.org/mailman/listinfo/zope )

Erik Enge

10 a.m.

New subject: [Zope-dev] Wildcards in TextIndex query. Do they work?

On Wed, 30 May 2001, Chris McDonough wrote:

...

Thanks for tracking this down...

No worries :)

...

If you're so inclined, please put this in the Collector [...] so it doesn't get dropped on the floor.

Done: <URL:http://classic.zope.org:8080/Collector/2262/view/>

Andre Schubert

1:31 p.m.

New subject: [Zope-dev] Browser Timeout

Hi all,

I have a very important problem with Zope. I want to display an HTML page with diagrams; the diagrams are drawn from a SQL database with over 30 million entries. My problem is that when I call the page, a SQL query is executed for each diagram to display, which costs time. When 20 or 30 queries are executed, the browser says that the document contains no data.

Is it right that the browser sends a request and gets the response only when the page is completely rendered (all queries executed)? If yes, how can I write directly to the client? First all headers, and then the data after every query; if I can do this, there will be no timeout.

thanks
as

R. David Murray

5:35 p.m.

New subject: [Zope-dev] Browser Timeout

On Wed, 30 May 2001, Andre Schubert wrote:

...

Is it right that the browser sends a request and gets the response only when the page is completely rendered (all queries executed)? If yes, how can I write directly to the client? First all headers, and then the data after every query; if I can do this, there will be no timeout.

RESPONSE.write

--RDM

Andre Schubert

31 May

7:14 a.m.

New subject: [Zope-dev] Browser Timeout

Hi,

I have tested RESPONSE.write with the following function:

    def test(self, REQUEST=None, RESPONSE=None):
        """ Test RESPONSE.write"""
        RESPONSE.setStatus('200')
        RESPONSE.setHeader('Content-Type','text/html')
        RESPONSE.write('<html>')
        # **** Here is the body processing, which takes some time ****
        RESPONSE.write('</html>')

I tested with lynx. If I type http://somewhere.com/foo/test I get no response because of a timeout. This means that RESPONSE.setStatus and the first RESPONSE.write are only sent back to the client once the body processing is done, but I want every write to be sent to the client as it is processed. Or is it my Zope (2.2.4) on Immunix 6.2 RedHat?

as

"R. David Murray" wrote:

...

On Wed, 30 May 2001, Andre Schubert wrote:

...
Is it right that the browser sends a request and gets the response only when the page is completely rendered (all queries executed)? If yes, how can I write directly to the client? First all headers, and then the data after every query; if I can do this, there will be no timeout.

RESPONSE.write

--RDM

R. David Murray

1 Jun

3:34 a.m.

New subject: [Zope-dev] Browser Timeout

On Thu, 31 May 2001, Andre Schubert wrote:

...

I tested with lynx. If I type http://somewhere.com/foo/test I get no response because of a timeout. This means that RESPONSE.setStatus and the first RESPONSE.write are only sent back to the client once the body processing is done, but I want every write to be sent to the client as it is processed. Or is it my Zope (2.2.4) on Immunix 6.2 RedHat?

I presume you tested it first without the large processing to make sure the method was otherwise working.

I haven't used RESPONSE.write myself. I know that others on this list have, so hopefully someone will chime in with a working example or a debugging suggestion. Of course, it's always possible that streaming got broken at some point; I'm not sure that it gets used by very many people so breakage may take a while to get noticed...

--RDM
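For readers following the streaming idea, here is a minimal sketch of the write-as-you-go pattern under discussion. It runs outside Zope: `FakeResponse` is a stand-in written for this example (it only mimics the `setStatus`, `setHeader` and `write` calls used in the thread), and `run_query` and `render_diagrams` are hypothetical names. Whether a given Zope version actually pushes each `write` to the client immediately is exactly the open question in this thread, so the sketch shows the intended call order and nothing more.

```python
import io


class FakeResponse:
    """Stand-in for Zope's RESPONSE object, so the sketch runs on its own."""

    def __init__(self, stream):
        self.stream = stream
        self.headers = {}
        self.status = None

    def setStatus(self, status):
        self.status = status

    def setHeader(self, name, value):
        self.headers[name] = value

    def write(self, data):
        # In this stand-in, "sending" means writing to the stream and flushing.
        self.stream.write(data)
        self.stream.flush()


def run_query(n):
    """Hypothetical placeholder for one slow SQL query and its rendering."""
    return "<p>diagram %d</p>" % n


def render_diagrams(RESPONSE, query_count=3):
    """Set status and headers first, then write one chunk per finished query."""
    RESPONSE.setStatus(200)
    RESPONSE.setHeader("Content-Type", "text/html")
    RESPONSE.write("<html><body>")
    for n in range(query_count):
        RESPONSE.write(run_query(n))
    RESPONSE.write("</body></html>")


out = io.StringIO()
render_diagrams(FakeResponse(out))
print(out.getvalue())
```

The point of the ordering is that status and headers are set before the first `write`, and each query result is written as soon as it is available rather than collected and returned at the end.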

Andre Schubert

7:50 a.m.

New subject: [Zope-dev] Browser Timeout

"R. David Murray" wrote:

...

On Thu, 31 May 2001, Andre Schubert wrote:

...
I tested with lynx. If I type http://somewhere.com/foo/test I get no response because of a timeout. This means that RESPONSE.setStatus and the first RESPONSE.write are only sent back to the client once the body processing is done, but I want every write to be sent to the client as it is processed. Or is it my Zope (2.2.4) on Immunix 6.2 RedHat?

I presume you tested it first without the large processing to make sure the method was otherwise working.

Yes, I've tested it without the large processing and it works fine.

as

...

I haven't used RESPONSE.write myself. I know that others on this list have, so hopefully someone will chime in with a working example or a debugging suggestion. Of course, it's always possible that streaming got broken at some point; I'm not sure that it gets used by very many people so breakage may take a while to get noticed...

--RDM

abel deuring

23 Jun

8:59 p.m.

New subject: [Zope-dev] Wildcards in TextIndex query. Do they work?

Erik Enge wrote:

...

On Wed, 30 May 2001, Erik Enge wrote:

...
I'm going bug hunting...

I'm back :)

I think I found the bug. In lib/python/SearchIndex/GlobbingLexicon.py in the query_hook() method. It seems to say that: "if I can't find a '*' or a '?' in the word, then go to else-clause", where the else-clause says sod off.

Since it iterates over the query, 'word' is actually a list if you use parens in your query, and you won't find any wildcards there. I think.

Add a dash of recursiveness, and it seems to be solved (for me):

    def erik_hook(self, q):
        "doc string"
        words = []
        for w in q:
            if ( (self.multi_wc in w) or (self.single_wc in w) ):
                wids = self.get(w)
                for wid in wids:
                    if words:
                        words.append(Or)
                    words.append(wid)
            else:
                words.append(self.erik_hook(w))
        return words or ['']

    def query_hook(self, q):
        """expand wildcards"""
        words = []
        for w in q:
            if ( (self.multi_wc in w) or (self.single_wc in w) ):
                wids = self.get(w)
                for wid in wids:
                    if words:
                        words.append(Or)
                    words.append(wid)
            else:
                words.append(self.erik_hook(w))

Not really tested, but it seems to work. This might have been resolved in CVS, I don't know, should I post it as a bug?

Erik,

I'm afraid that your patch does not solve all the problems you mentioned in an earlier mail.

You are right that the implementation of query_hook in Zope 2.3.2 and 2.4.0b1 cannot handle words with wildcards in nested lists, but your patch will lead to endless recursion if you enter the most simple query: just one word without wildcards. In this case, "if ( (self.multi_wc in w)..." evaluates to false, hence self.erik_hook is called for this word, where "if ( (self.multi_wc in w)..." is again false, and erik_hook is called again...

The statement "q = parse(s)" in UnTextIndex.query (and PositionIndex.query) before the call to query_hook can return nested lists, so query_hook must be aware of this. This can be done with:

    def query_hook(self, q):
        """expand wildcards"""
        words = []
        for w in q:
            if type(w) is type([]):
                words.append(self.query_hook(w))
            else:
                if ( (self.multi_wc in w) or (self.single_wc in w) ):
                    wids = self.get(w)
                    for wid in wids:
                        if words:
                            words.append(Or)
                        words.append(wid)
                else:
                    words.append(w)
        # if words is empty, return something that will make textindex's
        # __getitem__ return an empty result list
        return words or ['']

You also mentioned the strange results of queries like "eri* and enge". These are caused by another bug in query_hook: the results from the wildcard expansion are simply inserted into the result list. Example: "ab* and xyz" may be expanded by query_hook into

    ['aba', 'or', 'abb', 'or', 'abc', 'and', 'xyz']

Since UnTextIndex.evaluate looks first for 'and' operators, this is equivalent to

    ['aba', 'or', 'abb', 'or', ['abc', 'and', 'xyz']]

(The funny (or confusing) side effect is that "ab* and xyz" may return different results compared with "xyz and ab*", because "aba and xyz" probably gives results different from those for "abc and xyz".)

But we need a result like

    [['aba', 'or', 'abb', 'or', 'abc'], 'and', 'xyz']

The version of query_hook below fixes the problem:

    def query_hook(self, q):
        """expand wildcards"""
        words = []
        for w in q:
            if type(w) is type(''):
                if ( (self.multi_wc in w) or (self.single_wc in w) ):
                    wids = self.get(w)
                    alternatives = []
                    for wid in wids:
                        if alternatives:
                            alternatives.append(Or)
                        alternatives.append(wid)
                    words.append(alternatives or [''])
                else:
                    words.append(w)
            else:
                words.append(self.query_hook(w))
        # if words is empty, return something that will make textindex's
        # __getitem__ return an empty result list
        return words or ['']

You also mentioned the parse result

    ['abc', '...', '...', '...', 'def']

which you could not reproduce. Playing with Catalogs, I accidentally produced the corresponding query: '"abc ... def"', i.e., the double quotation marks are part of the query string. UnTextIndex.quotes splits the string between two quotation marks at word boundaries and inserts '...' between all words:

    for i in range(1,len(splitted),2):
        # split the quoted region into words
        splitted[i] = filter(None, split(splitted[i]))
        # put the Proxmity operator in between quoted words
        for j in range(1, len(splitted[i])):
            splitted[i][j : j] = [ Near ]

I think that UnTextIndex.quotes should remove all query operators contained in the search string before the '...' operator is inserted, otherwise the search results will be quite fancy.

Abel
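The grouping behaviour described in this message can be tried outside Zope with a toy lexicon. This is only a sketch in modern Python: `ToyLexicon` is a hypothetical class, plain words stand in for the word ids the real lexicon returns, the string "or" stands in for the real Or operator object, and `fnmatch` replaces the lexicon's own globbing.

```python
from fnmatch import fnmatchcase

OR = "or"  # stand-in for the Or operator object used by the real code


class ToyLexicon:
    """Hypothetical miniature lexicon; words stand in for word ids."""

    multi_wc = "*"
    single_wc = "?"

    def __init__(self, words):
        self.words = sorted(words)

    def get(self, pattern):
        """Return every known word that matches the glob pattern."""
        return [w for w in self.words if fnmatchcase(w, pattern)]

    def query_hook(self, q):
        """Expand wildcards, keeping each expansion in its own sub-list."""
        words = []
        for w in q:
            if isinstance(w, list):
                # Parenthesised sub-query: recurse into it.
                words.append(self.query_hook(w))
            elif self.multi_wc in w or self.single_wc in w:
                alternatives = []
                for match in self.get(w):
                    if alternatives:
                        alternatives.append(OR)
                    alternatives.append(match)
                # An empty string marks "no match" for this pattern.
                words.append(alternatives or [""])
            else:
                # Plain words and operators pass through, with no recursion.
                words.append(w)
        return words or [""]


lexicon = ToyLexicon(["aba", "abb", "abc", "xyz"])
print(lexicon.query_hook(["ab*", "and", "xyz"]))
# [['aba', 'or', 'abb', 'or', 'abc'], 'and', 'xyz']
print(lexicon.query_hook([["ab*"], "and", ["xyz"]]))
# [[['aba', 'or', 'abb', 'or', 'abc']], 'and', ['xyz']]
print(lexicon.query_hook(["xyz"]))
# ['xyz']
```

Because the alternatives sit in their own sub-list, an evaluator that binds 'and' before 'or' can no longer attach 'xyz' to the last alternative only, and a single plain word is returned unchanged instead of triggering further recursion.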

Chris McDonough

24 Jun

7:51 p.m.

New subject: [Zope-dev] Wildcards in TextIndex query. Do they work?

Abel, many thanks for this analysis, I've put this into the Collector...

On Sat, 23 Jun 2001 22:59:32 +0200 abel deuring <adeuring@gmx.net> wrote:

...

Erik,

I'm afraid that your patch does not solve all the problems you mentioned in an earlier mail.

participants (8)

abel deuring
Andre Schubert
Casey Duncan
Chris McDonough
Christian Robottom Reis
Erik Enge
Michel Pelletier
R. David Murray