ZCTextIndex - prefix wildcards not supported? - Zope

Small Business Services

20 Nov 2003

5:38 p.m.

Why are wildcards '?' and '*' not supported at the beginning of search terms in ZCTextIndex? It would be very useful to search for terms using '*someterm'.

In the CVS for ZCTextIndex, Lexicon.py (http://cvs.zope.org/Products/ZCTextIndex/Lexicon.py?annotate=1.17.10.2), the code raises an exception for wildcards at the beginning of search terms (see line 113), and a related comment says:

111 # The pattern starts with a globbing character.
112 # This is too efficient, so we raise an exception.

Why is this "too efficient"?

Jonathan

Andreas Jung

20 Nov

5:37 p.m.

New subject: [Zope] ZCTextIndex - prefix wildcards not supported?

TextIndexNG2 supports this feature.

-aj

--On Thursday, 20 November 2003 12:38 -0500 Small Business Services <toolkit@magma.ca> wrote:

> Why are wildcards '?' and '*' not supported at the beginning of search terms in ZCTextIndex? It would be very useful to search for terms using '*someterm'.

Small Business Services

6:55 p.m.

Thanks for the alternative, Andreas. However, we currently have ZCTextIndex installed in a ZCatalog that has about 700,000 entries (about 3 GB of data) and would prefer to stay with ZCTextIndex (unless there are other advantages to moving to TextIndexNG2 in our situation?).

Is there any way to get, or any interest in having, this '*term' type of wildcard searching built into ZCTextIndex?

jh


Andreas Jung

7:15 p.m.

ZCTextIndex does not support left truncation (as far as I know). The reason is that an efficient implementation requires a second internal BTree structure, which means more memory consumption.

-aj

--On Thursday, 20 November 2003 13:55 -0500 Small Business Services <toolkit@magma.ca> wrote:

> Is there any way/interest in having this '*term' type of wildcard searching built into ZCTextIndex?

Dieter Maurer

21 Nov

8:58 p.m.

Small Business Services wrote at 2003-11-20 13:55 -0500:

> ... Is there any way/interest in having this '*term' type of wildcard searching built into ZCTextIndex?

You can just remove the statement that raises the exception and see what happens (with respect to runtime).

The standard way to support so-called "left truncation" is to build a mapping "R" with an entry "s.reverse() --> s" for each word "s" in your index. With this map, left truncation can be reduced to right truncation: you reverse your pattern and use it to scan through the keys of "R".

This still does not support truncation on both sides: to do that efficiently, you need sub-word indexes.

-- Dieter
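The reversed-word mapping described above could be sketched roughly as follows in Python. This is an illustration only, not code from ZCTextIndex: a sorted list searched with `bisect` stands in for the keys of a BTree, and the names `ReverseLexicon` and `left_truncation` are hypothetical.

```python
import bisect


class ReverseLexicon:
    """Hypothetical helper: the mapping "R" from reversed words to words."""

    def __init__(self, words):
        # A sorted list of reversed words stands in for the sorted keys of
        # a BTree (an assumption made to keep the sketch stdlib-only).
        self._reversed = sorted(w[::-1] for w in set(words))

    def left_truncation(self, suffix):
        """Return all words matching the pattern '*' + suffix."""
        # Reversing the suffix turns the query into a prefix search.
        prefix = suffix[::-1]
        start = bisect.bisect_left(self._reversed, prefix)
        matches = []
        for key in self._reversed[start:]:
            if not key.startswith(prefix):
                break  # keys are sorted, so no later key can match
            matches.append(key[::-1])  # undo the reversal
        return sorted(matches)


lexicon = ReverseLexicon(["index", "reindex", "catalog", "text", "context"])
print(lexicon.left_truncation("index"))
```

Only the keys sharing the reversed prefix are visited, so the cost depends on the number of matches rather than on the size of the lexicon. The price is the second structure, which is the extra memory mentioned earlier in the thread.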

Andreas Jung

22 Nov

8:07 a.m.

--On Friday, 21 November 2003 21:58 +0100 Dieter Maurer <dieter@handshake.de> wrote:

> The standard way to support so called "left truncation" is to build a mapping "R" "s.reverse() --> s" for each word "s" in your index. With this map, "left truncation" can be mapped to "right truncation" as you use your reversed pattern to scan through "R"s keys.
>
> This still does not support truncation on both sides: to make this efficiently, you need sub-word indexes.

Truncation can be achieved very easily by performing a range search on BTrees... very evil, but it works!

-aj

Dieter Maurer

5:58 p.m.

Andreas Jung wrote at 2003-11-22 09:07 +0100:

> Dieter Maurer wrote:
>> ... This still does not support truncation on both sides: to make this efficiently, you need sub-word indexes.
>
> Truncation can be achieved very easily by performing a range search on BTrees... very evil, but it works!

For patterns of the form "*subword*" (I spoke about truncation on both sides), there is no initial range to restrict the search to (that's why it is inefficient ;-)). Neither is there one for patterns of the form "*subword", unless you have the "reverse" mapping I mentioned above.

-- Dieter
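The thread does not say what a "sub-word index" would look like. One common interpretation is an n-gram index, sketched below in Python under that assumption. The names `SubwordIndex`, `ngrams`, and `both_sides` are hypothetical, and the trigram size is an arbitrary choice.

```python
from collections import defaultdict


def ngrams(word, n=3):
    """Return the set of all length-n substrings of word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


class SubwordIndex:
    """Hypothetical n-gram index: maps each n-gram to the words containing it."""

    def __init__(self, words, n=3):
        self.n = n
        self._index = defaultdict(set)
        for word in words:
            for gram in ngrams(word, n):
                self._index[gram].add(word)

    def both_sides(self, subword):
        """Return all words matching '*' + subword + '*'."""
        if len(subword) < self.n:
            raise ValueError("subword must be at least n characters long")
        # A matching word must contain every n-gram of the subword.
        candidate_sets = [self._index.get(g, set()) for g in ngrams(subword, self.n)]
        candidates = set.intersection(*candidate_sets)
        # Sharing the n-grams is necessary but not sufficient, so verify.
        return sorted(w for w in candidates if subword in w)


index = SubwordIndex(["index", "reindex", "catalog", "text", "context"])
print(index.both_sides("inde"))
```

The n-gram lookups narrow the search to a small candidate set, which plays the role of the "initial range" that a plain word index cannot provide for "*subword*" patterns.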

Dieter Maurer

21 Nov

7:53 p.m.

Andreas Jung wrote at 2003-11-20 18:37 +0100:

>> 111 # The pattern starts with a globbing character.
>> 112 # This is too efficient, so we raise an exception.
>>
>> Why is this "too efficient"?

Assuming a standard word index, the search has to scan the complete index (more precisely, its keys). This can take some time...

-- Dieter
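To illustrate the difference, here is a rough Python sketch that contrasts the two cases over a sorted word list, which stands in for the index keys. It is not the ZCTextIndex implementation, and both function names are hypothetical.

```python
import bisect
import fnmatch


def match_leading_wildcard(sorted_words, pattern):
    """Match a pattern such as '*index': every key has to be tested."""
    return [w for w in sorted_words if fnmatch.fnmatchcase(w, pattern)]


def match_trailing_wildcard(sorted_words, prefix):
    """Match prefix + '*': binary search, then walk only the matching range."""
    start = bisect.bisect_left(sorted_words, prefix)
    matches = []
    for word in sorted_words[start:]:
        if not word.startswith(prefix):
            break  # sorted order guarantees no further matches
        matches.append(word)
    return matches


words = sorted(["catalog", "context", "index", "reindex", "text"])
print(match_leading_wildcard(words, "*index"))
print(match_trailing_wildcard(words, "con"))
```

With a known prefix, the sorted order gives a starting point and a stopping point. With a leading wildcard there is neither, so the work grows with the total number of words in the lexicon.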

Casey Duncan

5:13 p.m.

On Thu, 20 Nov 2003 12:38:24 -0500 "Small Business Services" <toolkit@magma.ca> wrote:

> 111 # The pattern starts with a globbing character.
> 112 # This is too efficient, so we raise an exception.
>
> Why is this "too efficient"?

I think it should say "too inefficient". The data structures in the lexicon as it is currently implemented cannot efficiently return all of the matching words for *foo. It would require iterating over all of the words in the lexicon.

As Andreas said, it would be possible to implement this efficiently if the lexicon kept a separate head-globbing index, but this would greatly increase the size of the lexicon and would make updates somewhat more expensive (although probably not too much in steady state).

I'm curious: you said you had 700,000-odd documents in your catalog. How many words are in the lexicon(s) you have?

-Casey

Andreas Jung

6:02 p.m.

--On Friday, 21 November 2003 12:13 -0500 Casey Duncan <casey@zope.com> wrote:

> As Andreas said, it would be possible to implement this efficiently if the lexicon kept a separate head globbing index, but this would greatly increase the size of the lexicon and would make updates somewhat more expensive (although probably not too much in steady-state).
>
> I'm curious, you said you had 700,000 some-odd documents in your catalog. How many words are in the lexicon(s) you have?

I think it is not too inefficient, since left globbing requires only one additional BTree in the lexicon (mapping words to word IDs). Also, even for a large number of words (say 50,000), this data structure is small compared to the storage needed for the mapping word ID -> doc IDs.

-aj
