Dear all,

In the Zope application I'm developing, I'm using unicode strings that contain non-latin1 characters. I've solved the "standard" problems with conversion exceptions from/to unicode in the way described in previous messages on this list (I've also used setdefaultencoding('utf-8'), since all of the non-unicode strings are in this format).

However, regarding full text search, the situation looks more complicated:

(1) "Old" Zope Text Index. This index works properly when UnicodeSplitter is used. Unicode properties are indexed, and it is possible to query with a star. For instance, searchResults({'ind': _.unicode('<alpha>*', 'utf-8')}) works properly.

(2) ZCTextIndex. As far as I have played with it, it doesn't have a unicode splitter. The two splitters I've tried don't work with non-latin1 characters. The lexicon will not even show the correct number of words: non-latin1 words are ignored.

(3) TextIndexNG (1.09). This index properly divides text into words. However, querying the index with * doesn't produce results if a non-latin1 character is used. Querying full words works:

- searchResults({'ind': _.unicode('a*', 'utf-8')}) - works
- searchResults({'ind': _.unicode('<alpha><alpha>', 'utf-8')}) - works
- searchResults({'ind': _.unicode('<alpha>*', 'utf-8')}) - doesn't work

I want to use one of the newer text indexes because of the capabilities they provide. However, I can only make the old index work with unicode text. Does anybody have an idea how I can make (2) or (3) work with arbitrary unicode text?

Thanks in advance,
Vladimir
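The symptom described in (2), a lexicon that silently drops non-latin1 words, is what one would expect from a splitter whose word pattern only covers ASCII or Latin-1 letters. The following is a minimal sketch in modern Python, not Zope code: the names `ascii_split` and `unicode_split` are hypothetical, and whether the ZCTextIndex splitters actually behave this way is an assumption.

```python
import re

# Hypothetical splitters for illustration only; these are NOT the
# ZCTextIndex splitter classes.
ASCII_WORD = re.compile(r"[A-Za-z0-9_]+")  # matches ASCII word characters only
UNICODE_WORD = re.compile(r"\w+")          # \w is Unicode-aware for str patterns


def ascii_split(text):
    """Split text into words, seeing only ASCII letters, digits and '_'."""
    return ASCII_WORD.findall(text)


def unicode_split(text):
    """Split text into words using Unicode-aware word characters."""
    return UNICODE_WORD.findall(text)


# One Greek word between two ASCII words.
text = "alpha \u03b1\u03b2\u03b3 beta"

ascii_words = ascii_split(text)      # the Greek word is dropped
unicode_words = unicode_split(text)  # all three words are kept
```

With the ASCII-only pattern the Greek word never reaches the lexicon, so the word count comes out too low, which matches the behaviour reported above.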
--On Thursday, 12 June 2003 19:38 +0300 Vladimir Petrovic <vladap@criticalpublics.com> wrote:
(3) TextIndexNG (1.09). This index properly divides text into words. However, querying the index with * doesn't produce results if a non-latin1 character is used. Querying full words works:

- searchResults({'ind': _.unicode('a*', 'utf-8')}) - works
- searchResults({'ind': _.unicode('<alpha><alpha>', 'utf-8')}) - works
- searchResults({'ind': _.unicode('<alpha>*', 'utf-8')}) - doesn't work
I really can't believe that, because the complete internal processing is done with unicode strings. Please check the unit tests of TextIndexNG and try to build a unit test that reproduces your problem.

-aj
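A reproducing unit test could take roughly the following shape. This is only a sketch: `ToyIndex` is a hypothetical in-memory stand-in written for this example and is not the TextIndexNG API. A real reproduction would replace it with the actual index class and its own indexing and query calls. The three test cases mirror the three queries from the report.

```python
import fnmatch
import unittest


class ToyIndex:
    """Hypothetical stand-in for a text index; NOT the TextIndexNG API."""

    def __init__(self):
        self._words = {}  # word -> set of document ids

    def index(self, docid, text):
        # Split on whitespace and record which documents contain each word.
        for word in text.lower().split():
            self._words.setdefault(word, set()).add(docid)

    def search(self, pattern):
        # Glob-style matching of the pattern against every indexed word.
        pattern = pattern.lower()
        hits = set()
        for word, docids in self._words.items():
            if fnmatch.fnmatchcase(word, pattern):
                hits |= docids
        return hits


class WildcardUnicodeTest(unittest.TestCase):
    def setUp(self):
        self.idx = ToyIndex()
        self.idx.index(1, "abc")
        self.idx.index(2, "\u03b1\u03b1")        # two Greek alphas
        self.idx.index(3, "\u03b1\u03b2\u03b3")  # alpha beta gamma

    def test_ascii_prefix(self):
        self.assertEqual(self.idx.search("a*"), {1})

    def test_full_non_latin1_word(self):
        self.assertEqual(self.idx.search("\u03b1\u03b1"), {2})

    def test_non_latin1_prefix(self):
        # The case reported as failing: wildcard after a non-latin1 character.
        self.assertEqual(self.idx.search("\u03b1*"), {2, 3})
```

Saved as a module, the tests can be run with `python -m unittest`. If the third test fails against the real index while the first two pass, that isolates the problem to wildcard expansion rather than to splitting or storage.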
Participants (2): Andreas Jung, Vladimir Petrovic