[Zope] UnicodeError with TextIndexNG2 and stemming

Thu Jun 17 10:02:11 EDT 2004

Hello,

I'm getting the very same error as the one described in
http://www.dzug.org/mailinglisten/zope/archive/2004/2004-02/1077812031382
(in German, but I couldn't find other references to this problem)

    * Module Products.TextIndexNG2.TextIndexNG, line 220, in index_object
    * Module Products.TextIndexNG2.TextIndexNG, line 370, in _index_object
    * Module Products.TextIndexNG2.lexicons.StandardLexicon, line 60, in getWordIdList
    * Module ZODB.Connection, line 562, in setstate
    * Module ZODB.Connection, line 601, in _set_ghost_state

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 40-45:
unsupported Unicode code range

Looking at ZODB.Connection, I see it is related to cPickle, which makes
it hard to debug!

I am not fluent in German, but could guess the main points of the
discussion. The production machine runs Red Hat 9. The setup is quite
complicated (CMF/CPS, LDAP connections and such), so it is hard to
reproduce. The error doesn't seem to happen when LDAP is disabled, but
I'm not sure of this.

Andreas Jung, whom I contacted directly, doesn't think this is caused by
TextIndexNG2 itself, but maybe by something at the ZODB level. He has
never encountered this problem himself.

I noticed this only happens on indexes where the stemmer is enabled.
What I then see in the catalog is index content like this:

      SearchableText1 [(u'de\x10m', 1), (u'metier\x10bc256', 1),
      (u'unpublish\x10567b6500', 1), (u'machin\x102567b', 1),
      (u'truc\x10n\x102', 1),
      (u'000000fbc2567b6500c000a800fb001439444adf\x10439444adf
      \u0149000000fbc2567b6500c000a800fb', 1), (u'zobi\x100fb', 1),
      (u'bidule\x10bc256', 1), (u'root\x10e\x10b', 1), (u'all\x10\x10e',
      1), (u'chambre\x10c2567b', 1), (u'm\xe9tiers\x10c2567b', 1),
      (u'zobi\x10lis', 1)]

      SearchableText   [(u'de', 1), (u'chambre', 1), (u'metiers', 1),
      (u'unpublish', 1), (u'all', 1), (u'root', 1), (u'metier', 1),
      (u'000000fbc2567b6500c000a800fb001439444adf', 1), (u'machin', 1),
      (u'zobi', 1), (u'truc', 1), (u'bidule', 1)]

Here, SearchableText1 has stemming enabled, SearchableText does not.
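
To make the difference between the two dumps easier to spot, here is a
minimal sketch (written for Python 3, not for the Python version Zope
runs on) that flags index terms containing control characters such as
\x10. The function name suspicious_terms is hypothetical, and the sample
lists are just a few entries copied from the dumps above.

```python
import unicodedata


def suspicious_terms(entries):
    """Return the terms that contain a control character (category Cc)."""
    flagged = []
    for term, _count in entries:
        if any(unicodedata.category(ch) == "Cc" for ch in term):
            flagged.append(term)
    return flagged


# A few entries copied from the two index dumps above.
stemmed = [("de\x10m", 1), ("metier\x10bc256", 1), ("m\xe9tiers\x10c2567b", 1)]
unstemmed = [("de", 1), ("chambre", 1), ("metiers", 1)]

print(suspicious_terms(stemmed))    # every stemmed sample entry is flagged
print(suspicious_terms(unstemmed))  # nothing is flagged here

# In these samples, the part before the first \x10 looks like a plain word.
print([term.split("\x10")[0] for term in suspicious_terms(stemmed)])
```

This only shows where the odd entries are; it says nothing about which
layer produces them.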

Sometimes reindexing the index raises the same error; at other times the
error appears when I create a document or a folder and it gets indexed.

Grepping for UnicodeError in the log gives lines like these:

2004-06-17T15:06:13 PROBLEM(100) textindexng UnicodeDecodeError raised
from cps_grc/portal_repository/315764013__0001 - ignoring unknown
unicode characters
2004-06-17T15:06:36 PROBLEM(100) textindexng UnicodeDecodeError raised
from cps_grc/portal_repository/1594125709__0001 - ignoring unknown
unicode characters
2004-06-17T15:06:50 PROBLEM(100) textindexng UnicodeDecodeError raised
from cps_grc/portal_repository/1594125709__0001 - ignoring unknown
unicode characters

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-7:
illegal encoding
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-7:
illegal encoding
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5-10:
unsupported Unicode code range
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 14:
unexpected code byte
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5-10:
unsupported Unicode code range
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 14:
unexpected code byte
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-7:
illegal encoding
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 24-27:
illegal encoding
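
The messages above only give byte positions. A small helper like the
following could show which bytes are actually rejected. This is a sketch
for Python 3 (the exact error wording differs from the messages in my
log), the name describe_decode_failure is hypothetical, and the sample
value is made up so that the bad byte sits at position 14.

```python
def describe_decode_failure(raw, encoding="utf-8"):
    """Return None if raw decodes cleanly, else (start, end, bad_bytes)."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start and exc.end delimit the slice the codec rejected.
        return exc.start, exc.end, raw[exc.start:exc.end]
    return None


# Made-up sample: 14 ASCII bytes followed by 0xff, which is never valid UTF-8.
sample = b"abcdefghijklmn\xff"
print(describe_decode_failure(sample))

# Correctly encoded UTF-8 passes without a report.
print(describe_decode_failure("m\xe9tiers".encode("utf-8")))
```

Running something like this on the raw value just before it is decoded
would at least tell whether the bytes are stale data or plain garbage.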

As I use CPS, I use CPSSchemas, whose field machinery has functions
converting to and from UTF-8 (the kind of errors I get makes me suspect
they might come from there). The code looks like this:

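from types import UnicodeType

# default_encoding is defined elsewhere in the CPS code (not shown here).

def toUTF8(s):
    # Accepts a Unicode string or a plain string in default_encoding.
    if not isinstance(s, UnicodeType):
        s = unicode(s, default_encoding)
    return s.encode('utf-8')

def fromUTF8(s):
    # Decodes UTF-8, then re-encodes to a plain string in default_encoding.
    return unicode(s, 'utf-8').encode(default_encoding)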

It is not fully symmetric: toUTF8 accepts both Unicode and non-Unicode
strings, whereas fromUTF8 always returns a plain string encoded in
default_encoding, never a Unicode string. Could this be a problem in our
case? After all, TextIndexNG without stemming works fine, so maybe it
has nothing to do with this.
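
To check the asymmetry in isolation, here is a rough Python 3 port of
the two helpers above. It is only a sketch: DEFAULT_ENCODING is an
assumption standing in for the real default_encoding, and to_utf8,
from_utf8 and round_trips are hypothetical names, not the CPSSchemas
functions themselves.

```python
DEFAULT_ENCODING = "iso-8859-15"  # assumption: stand-in for default_encoding


def to_utf8(s):
    # Accepts either text (str) or bytes in the default encoding.
    if not isinstance(s, str):
        s = str(s, DEFAULT_ENCODING)
    return s.encode("utf-8")


def from_utf8(s):
    # Always returns bytes in the default encoding, never text.
    return str(s, "utf-8").encode(DEFAULT_ENCODING)


def round_trips(text):
    """True if text survives to_utf8 followed by from_utf8."""
    try:
        return from_utf8(to_utf8(text)) == text.encode(DEFAULT_ENCODING)
    except UnicodeError:
        return False


print(round_trips("m\xe9tiers"))  # representable in the default encoding
print(round_trips("\u0149"))      # \u0149 appears in the stemmed dump above
```

With this assumed default encoding, a character like \u0149 cannot make
the round trip. That would give an encode error rather than the decode
error I see, so it does not explain the traceback by itself.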

Does anyone have ideas on how to debug this further?

Thanks,

-- 
Damien