[Zope] Unicode strings and search
Vladimir Petrovic
vladap@criticalpublics.com
Thu, 12 Jun 2003 19:38:00 +0300
Dear all,
In the Zope application I'm developing, I use unicode strings that contain
non-latin1 characters. I've solved the "standard" problems with conversion
exceptions from/to unicode in the way described in previous messages on the
list (I've also used setdefaultencoding ('utf-8'), since all of the
non-unicode strings are in this format).
However, regarding full text search, the situation looks more complicated:
(1) "Old" Zope Text Index. This index works properly when UnicodeSplitter is
used. Unicode properties are indexed, and it is possible to query with a star
(for instance, searchResults ({'ind':_.unicode ('<alpha>*', 'utf-8')}) works
properly).
(2) ZCTextIndex. As far as I can tell from playing with it, it doesn't have a
unicode splitter. The two splitters I've tried don't work with non-latin1
characters. The Lexicon will not even show the correct number of words:
non-latin1 words are ignored.
(3) TextIndexNG (1.09). This index properly splits text into words. However,
querying the index with * doesn't produce results if a non-latin1 character
is used. Querying full words works:
- searchResults ({'ind':_.unicode ('a*', 'utf-8')}) - works
- searchResults ({'ind':_.unicode ('<alpha><alpha>', 'utf-8')}) - works
- searchResults ({'ind':_.unicode ('<alpha>*', 'utf-8')}) - doesn't work.
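
For illustration, the wildcard behaviour expected in (3) can be sketched in
plain Python, independent of any index. This is only a sketch of trailing-star
matching over a word list, not TextIndexNG's implementation: the LEXICON
contents and the match_prefix helper are hypothetical, and the escape \u03b1
stands in for the <alpha> placeholder used above.

```python
# Hypothetical lexicon; the words are illustrative, not taken from a real index.
# \u03b1, \u03b2, \u03b3 are the Greek letters alpha, beta, gamma.
LEXICON = [u"apple", u"\u03b1\u03b1", u"\u03b1\u03b2\u03b3", u"beta"]


def match_prefix(pattern, words):
    """Return the words matching a pattern with an optional trailing star."""
    if pattern.endswith(u"*"):
        prefix = pattern[:-1]
        # Prefix comparison works per character, so the script does not matter.
        return [w for w in words if w.startswith(prefix)]
    # Without a star, only an exact match counts.
    return [w for w in words if w == pattern]


latin_hits = match_prefix(u"a*", LEXICON)
greek_hits = match_prefix(u"\u03b1*", LEXICON)
exact_hits = match_prefix(u"\u03b1\u03b1", LEXICON)
```

With unicode strings throughout, the non-latin1 prefix query behaves exactly
like the latin1 one, which is the result the third query above should give.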
I want to use one of the newer text indexes because of the capabilities they
provide. However, I can only make the old index work with unicode text.
Does anybody have an idea how I can make (2) or (3) work with arbitrary
unicode text?
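
To make the symptom in (2) concrete, here is a minimal sketch of how a
latin1-only word pattern drops non-latin1 words while a unicode-aware one
keeps them. It uses only the standard re module. Both patterns and the
split_words helper are assumptions made for illustration; they are not the
actual ZCTextIndex splitter code.

```python
import re

# Assumed latin1-only word pattern: ASCII letters and digits plus the
# range U+00C0..U+00FF (accented latin1 letters).
LATIN1_WORD = re.compile(u"[A-Za-z0-9\u00c0-\u00ff]+")

# Unicode-aware pattern: with re.UNICODE, \w matches word characters
# from any script.
UNICODE_WORD = re.compile(r"\w+", re.UNICODE)


def split_words(text, pattern):
    """Split text into lowercased words using the given compiled pattern."""
    return [w.lower() for w in pattern.findall(text)]


# "zope", a Greek word (alpha beta gamma), and "cafe" with an accented e.
text = u"zope \u03b1\u03b2\u03b3 caf\u00e9"

latin1_words = split_words(text, LATIN1_WORD)
unicode_words = split_words(text, UNICODE_WORD)
```

The latin1 pattern yields two words and silently skips the Greek one, which
matches the wrong word count seen in the Lexicon; the unicode pattern yields
all three.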
Thanks in advance,
Vladimir