ZCTextIndex - prefix wildcards not supported?
Why are the wildcards '?' and '*' not supported at the beginning of search terms in ZCTextIndex? It would be very useful to search for terms using '*someterm'.
In the CVS for ZCTextIndex, Lexicon.py (http://cvs.zope.org/Products/ZCTextIndex/Lexicon.py?annotate=1.17.10.2), the code raises an exception for wildcards at the beginning of search terms (see line 113), and a related comment says:
111 # The pattern starts with a globbing character.
112 # This is too efficient, so we raise an exception.
Why is this "too efficient"?
Jonathan
TextIndexNG2 supports this feature.
-aj
Thanks for the alternative, Andreas. However, we currently have ZCTextIndex installed in a ZCatalog that has about 700,000 entries (about 3 GB of data) and would prefer to stay with ZCTextIndex (unless there are other advantages to moving to TextIndexNG2 in our situation?).
Is there any way to get, or any interest in having, this '*term' type of wildcard searching built into ZCTextIndex?
jh
ZCTextIndex does not support left truncation (as far as I know). The reason is that an efficient implementation requires a second internal BTree structure, which means more memory consumption.
-aj
_______________________________________________ Zope maillist - Zope@zope.org http://mail.zope.org/mailman/listinfo/zope ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope-dev )
Small Business Services wrote at 2003-11-20 13:55 -0500:
... Is there any way/interest in having this '*term' type of wildcard searching built into ZCTextIndex?
You can just remove the statement that raises the exception and see what happens (with respect to runtime).
The standard way to support so-called "left truncation" is to build a mapping "R": "s.reverse() --> s" for each word "s" in your index. With this map, "left truncation" can be mapped to "right truncation", as you use your reversed pattern to scan through the keys of "R".
This still does not support truncation on both sides: to do this efficiently, you need sub-word indexes.
-- Dieter
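The reversed-word mapping described above can be sketched roughly as follows. This is only an illustration, not ZCTextIndex code: a sorted Python list searched with `bisect` stands in for the second BTree, and the names `build_reverse_index` and `left_truncation` are made up for this example.

```python
import bisect


def build_reverse_index(words):
    """Return a sorted list of (reversed_word, word) pairs.

    The sorted list is a stand-in (an assumption of this sketch) for the
    extra BTree that a real lexicon would keep.
    """
    return sorted((w[::-1], w) for w in set(words))


def left_truncation(reverse_index, suffix):
    """Return all words ending in `suffix`, i.e. matching '*suffix'."""
    prefix = suffix[::-1]
    # A 1-tuple sorts before any 2-tuple with the same first element,
    # so this finds the first reversed key >= prefix.
    i = bisect.bisect_left(reverse_index, (prefix,))
    matches = []
    # Right truncation on the reversed keys: walk forward while the
    # reversed word still starts with the reversed pattern.
    while i < len(reverse_index) and reverse_index[i][0].startswith(prefix):
        matches.append(reverse_index[i][1])
        i += 1
    return matches


if __name__ == "__main__":
    index = build_reverse_index(
        ["walking", "talking", "talk", "king", "kingdom", "ring"]
    )
    print(left_truncation(index, "king"))
```

Only the keys that share the reversed prefix are visited, which is why this avoids the full scan, at the cost of storing every word a second time.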
--On Friday, 21 November 2003 21:58 +0100 Dieter Maurer <dieter@handshake.de> wrote:
This still does not support truncation on both sides: to do this efficiently, you need sub-word indexes.
Truncation can be achieved very easily by performing a range search on BTrees... very evil, but it works!
-aj
Andreas Jung wrote at 2003-11-22 09:07 +0100:
Truncation can be achieved very easily by performing a range search on BTrees... very evil, but it works!
For patterns of the form "*subword*" (I spoke about truncation on both sides) there is no initial range to restrict the search to (that's why it is inefficient ;-)). Nor is there one for patterns of the form "*subword", unless you have the "reverse" mapping I mentioned above.
-- Dieter
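The thread does not say what a "sub-word index" would look like. One common form is an n-gram index, so the sketch below assumes that technique rather than anything ZCTextIndex or TextIndexNG2 actually does. The function names and the choice of trigrams (`n=3`) are hypothetical.

```python
from collections import defaultdict


def build_ngram_index(words, n=3):
    """Map every n-character fragment to the set of words containing it."""
    index = defaultdict(set)
    for word in set(words):
        for i in range(len(word) - n + 1):
            index[word[i:i + n]].add(word)
    return index


def substring_search(index, words, fragment, n=3):
    """Return the words containing `fragment`, i.e. matching '*fragment*'."""
    if len(fragment) < n:
        # Too short to look up in the index: fall back to scanning
        # every word, which is the slow path the thread complains about.
        return sorted(w for w in set(words) if fragment in w)
    grams = [fragment[i:i + n] for i in range(len(fragment) - n + 1)]
    # Only words that contain every n-gram of the fragment can match.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # The n-grams may occur in a different order or position, so verify.
    return sorted(w for w in candidates if fragment in w)


if __name__ == "__main__":
    words = ["walking", "talking", "talk", "king", "kingdom", "ring"]
    index = build_ngram_index(words)
    print(substring_search(index, words, "alki"))
```

The index gives a small candidate set to verify instead of the whole lexicon, but it stores each word once per n-gram, so it is considerably larger than the reversed-word mapping.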
Andreas Jung wrote at 2003-11-20 18:37 +0100:
...
111 # The pattern starts with a globbing character.
112 # This is too efficient, so we raise an exception.
Why is this "too efficient"?
Assuming a standard word index, the search has to scan the complete index (more precisely, its keys). This can take some time... -- Dieter
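To make the cost difference concrete, here is a small sketch that counts how many keys have to be examined for a pattern with and without a literal prefix. It is not the Lexicon.py implementation: a sorted list with `bisect` stands in for the word BTree, `fnmatch` stands in for the lexicon's own pattern matching, and `glob_lexicon` is a hypothetical name.

```python
import bisect
import fnmatch


def glob_lexicon(sorted_words, pattern):
    """Match a '*'/'?' pattern against a sorted word list.

    Returns (matches, number_of_keys_examined).
    Note: fnmatch also treats '[' specially; this sketch ignores that.
    """
    # Literal text in front of the first wildcard, if any.
    cut = min(
        (pattern.find(c) for c in "*?" if c in pattern),
        default=len(pattern),
    )
    prefix = pattern[:cut]
    if prefix:
        # Range search: only keys starting with the literal prefix.
        # "\uffff" is used as a simple upper bound for the range.
        lo = bisect.bisect_left(sorted_words, prefix)
        hi = bisect.bisect_left(sorted_words, prefix + "\uffff")
        candidates = sorted_words[lo:hi]
    else:
        # Leading wildcard: no range to restrict to, scan every key.
        candidates = sorted_words
    matches = [w for w in candidates if fnmatch.fnmatchcase(w, pattern)]
    return matches, len(candidates)


if __name__ == "__main__":
    words = sorted(["walking", "talking", "talk", "king", "kingdom", "ring"])
    print(glob_lexicon(words, "talk*"))
    print(glob_lexicon(words, "*king"))
```

With a literal prefix the number of keys examined depends on how many words share that prefix; with a leading wildcard it is always the size of the whole lexicon.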
On Thu, 20 Nov 2003 12:38:24 -0500 "Small Business Services" <toolkit@magma.ca> wrote:
111 # The pattern starts with a globbing character.
112 # This is too efficient, so we raise an exception.
Why is this "too efficient"?
I think it should say "too inefficient". The data structures in the lexicon as it is currently implemented cannot efficiently return all of the matching words for *foo. It would require iterating over all of the words in the lexicon.
As Andreas said, it would be possible to implement this efficiently if the lexicon kept a separate head globbing index, but this would greatly increase the size of the lexicon and would make updates somewhat more expensive (although probably not too much in steady-state).
I'm curious, you said you had 700,000 some-odd documents in your catalog. How many words are in the lexicon(s) you have?
-Casey
--On Friday, 21 November 2003 12:13 -0500 Casey Duncan <casey@zope.com> wrote:
As Andreas said, it would be possible to implement this efficiently if the lexicon kept a separate head globbing index, but this would greatly increase the size of the lexicon and would make updates somewhat more expensive (although probably not too much in steady-state).
I'm curious, you said you had 700,000 some-odd documents in your catalog. How many words are in the lexicon(s) you have?
I think it is not too inefficient, since left globbing requires only an additional BTree in the lexicon (mapping word to word ids). Also, even for a large number of words (let's say 50,000), this data structure is small compared to the storage of the mapping word id -> doc ids.
-aj
participants (4)
- Andreas Jung
- Casey Duncan
- Dieter Maurer
- Small Business Services