[Zope] Unicode strings and search

Thu, 12 Jun 2003 19:38:00 +0300

Dear all,

In the Zope application I'm developing I use unicode strings that contain 
non-latin1 characters. I've solved the "standard" problems with conversion 
exceptions from/to unicode in the way described in previous messages on the 
list (I've also used sys.setdefaultencoding ('utf-8'), since all of the 
non-unicode strings are in this encoding).
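
To make the conversion step concrete, here is a minimal sketch of doing the 
decoding and encoding explicitly at the boundaries, rather than relying on the 
default encoding. The helper names to_unicode and to_bytes are made up for 
this example, and it assumes, as above, that all byte strings are utf-8:

```python
def to_unicode(value, encoding='utf-8'):
    """Return value as a unicode string, decoding byte strings explicitly."""
    # Assumption: every byte string in the application is utf-8 encoded.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding='utf-8'):
    """Return value as a byte string, encoding unicode strings explicitly."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


# GREEK SMALL LETTER ALPHA as utf-8 bytes and as a unicode string
alpha_bytes = b'\xce\xb1'
alpha_text = to_unicode(alpha_bytes)
round_trip = to_bytes(alpha_text)
```

With helpers like these, mixing byte strings and unicode strings no longer 
depends on the interpreter-wide default encoding.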

However, regarding full text search, the situation looks more complicated:

(1) "Old" Zope Text Index. This index works properly when UnicodeSplitter is 
used. Unicode properties are indexed, and it is possible to query with a 
trailing star (for instance, 
searchResults ({'ind':_.unicode ('<alpha>*', 'utf-8')}) works properly).

(2) ZCTextIndex. As far as I can tell from playing with it, it doesn't have a 
unicode splitter. The two splitters I've tried don't work with non-latin1 
characters. The lexicon doesn't even show the correct number of words: 
non-latin1 words are ignored.

(3) TextIndexNG (1.09). This index properly splits text into words. However, 
querying the index with * doesn't produce results if a non-latin1 character 
is used. Querying for full words works:

- searchResults ({'ind':_.unicode ('a*', 'utf-8')}) - works
- searchResults ({'ind':_.unicode ('<alpha><alpha>', 'utf-8')}) - works
- searchResults ({'ind':_.unicode ('<alpha>*', 'utf-8')}) - doesn't work.
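
To illustrate the behaviour I would expect from a unicode-aware index, here 
is a minimal sketch in plain Python. It is not the splitter or lexicon 
interface of ZCTextIndex or TextIndexNG; the names split_words and 
match_prefix are made up for this example, and a plain set stands in for the 
lexicon:

```python
import re

# \w with re.UNICODE matches letters and digits of any script,
# not only the latin1 range.
_WORD = re.compile(r'\w+', re.UNICODE)


def split_words(text):
    """Split unicode text into lower-cased words."""
    return [word.lower() for word in _WORD.findall(text)]


def match_prefix(lexicon, pattern):
    """Return the sorted lexicon words matching pattern.

    A trailing '*' means "any word starting with this prefix";
    without it, only the exact word matches.
    """
    if pattern.endswith(u'*'):
        prefix = pattern[:-1].lower()
        return sorted(w for w in lexicon if w.startswith(prefix))
    return sorted(w for w in lexicon if w == pattern.lower())


# \u03b1, \u03b2, \u03b3 are the Greek letters alpha, beta, gamma.
text = u'\u03b1\u03b2\u03b3 alpha \u03b1\u03b1 beta'
lexicon = set(split_words(text))

latin_hits = match_prefix(lexicon, u'a*')
greek_hits = match_prefix(lexicon, u'\u03b1*')
```

Here the star query behaves the same for the latin1 prefix and for the Greek 
one, which is what I would like to get from (2) or (3).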

I want to use one of the newer text indexes because of the capabilities they 
provide. However, I can only make the old index work with unicode text. 
Does anybody have an idea how I can make (2) or (3) work with arbitrary 
unicode text?

Thanks in advance,
Vladimir