According to Damien, this problem is resolved by installing TextIndexNG 2.0.8 with a special fix for Python installations with wide Unicode support.
-aj
--On Thursday, 17 June 2004 16:02 +0200 Damien Wyart <dw@nuxeo.com> wrote:
Hello,
I'm getting the very same error as the one described in http://www.dzug.org/mailinglisten/zope/archive/2004/2004-02/1077812031382 (in German, but I couldn't find other references to this problem):
* Module Products.TextIndexNG2.TextIndexNG, line 220, in index_object
* Module Products.TextIndexNG2.TextIndexNG, line 370, in _index_object
* Module Products.TextIndexNG2.lexicons.StandardLexicon, line 60, in getWordIdList
* Module ZODB.Connection, line 562, in setstate
* Module ZODB.Connection, line 601, in _set_ghost_state
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 40-45: unsupported Unicode code range
Looking at ZODB.Connection, I see it is related to cPickle... Not easy to debug!
I am not fluent in German, but I could guess the main points of the discussion. The production machine runs Red Hat 9. The setup is quite complicated (with CMF/CPS, LDAP connections and such), so it is hard to reproduce. This doesn't seem to happen when LDAP is disabled, but I'm not sure of this.
Andreas Jung, contacted directly, doesn't think this is directly related to TextIndexNG2, but perhaps to the ZODB level. He has never encountered this problem himself.
I noticed this only happens on indexes where the stemmer is enabled. What I then see in the catalog are index contents like this:
SearchableText1 [(u'de\x10m', 1), (u'metier\x10bc256', 1), (u'unpublish\x10567b6500', 1), (u'machin\x102567b', 1), (u'truc\x10n\x102', 1), (u'000000fbc2567b6500c000a800fb001439444adf\x10439444adf \u0149000000fbc2567b6500c000a800fb', 1), (u'zobi\x100fb', 1), (u'bidule\x10bc256', 1), (u'root\x10e\x10b', 1), (u'all\x10\x10e', 1), (u'chambre\x10c2567b', 1), (u'm\xe9tiers\x10c2567b', 1), (u'zobi\x10lis', 1)]
SearchableText [(u'de', 1), (u'chambre', 1), (u'metiers', 1), (u'unpublish', 1), (u'all', 1), (u'root', 1), (u'metier', 1), (u'000000fbc2567b6500c000a800fb001439444adf', 1), (u'machin', 1), (u'zobi', 1), (u'truc', 1), (u'bidule', 1)]
Here, SearchableText1 has stemming enabled, SearchableText has not.
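One way to spot damaged lexicon entries such as those in SearchableText1 is to look for control characters (the stray \x10 above) inside the indexed words. The following is only a minimal sketch in Python 3, not part of TextIndexNG: the helper name find_suspicious_words is hypothetical, and the sample entries are a small subset copied from the two listings above.

```python
import unicodedata


def find_suspicious_words(entries):
    """Return the words containing control characters (Unicode category Cc)."""
    suspicious = []
    for word, _count in entries:
        if any(unicodedata.category(ch) == "Cc" for ch in word):
            suspicious.append(word)
    return suspicious


# A few (word, count) pairs copied from the listings above.
stemmed = [("de\x10m", 1), ("metier\x10bc256", 1), ("m\xe9tiers\x10c2567b", 1)]
unstemmed = [("de", 1), ("metier", 1), ("metiers", 1)]

print(find_suspicious_words(stemmed))    # every stemmed sample is flagged
print(find_suspicious_words(unstemmed))  # nothing is flagged
```

Accented letters such as the \xe9 in "métiers" are not control characters, so they do not trigger a false positive on their own.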
Sometimes reindexing the index raises the same error; at other times I get it when a newly created document or folder is indexed.
Grepping for UnicodeError in the log gives lines like these:
2004-06-17T15:06:13 PROBLEM(100) textindexng UnicodeDecodeError raised from cps_grc/portal_repository/315764013__0001 - ignoring unknown unicode characters
2004-06-17T15:06:36 PROBLEM(100) textindexng UnicodeDecodeError raised from cps_grc/portal_repository/1594125709__0001 - ignoring unknown unicode characters
2004-06-17T15:06:50 PROBLEM(100) textindexng UnicodeDecodeError raised from cps_grc/portal_repository/1594125709__0001 - ignoring unknown unicode characters
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-7: illegal encoding
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-7: illegal encoding
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5-10: unsupported Unicode code range
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 14: unexpected code byte
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5-10: unsupported Unicode code range
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 14: unexpected code byte
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-7: illegal encoding
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 24-27: illegal encoding
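The positions reported in these messages can also be read off the exception object, which helps to see exactly which bytes the codec rejected. The sketch below is written for Python 3 (the post itself uses Python 2, where the reason strings differ); describe_decode_failure is a hypothetical helper, and the sample bytes are made up to put a 0xff byte at position 14, as in the log.

```python
def describe_decode_failure(data, encoding="utf-8"):
    """Return None if data decodes cleanly, else a dict describing the failure."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return {
            "start": exc.start,    # index of the first offending byte
            "end": exc.end,        # index just past the offending bytes
            "reason": exc.reason,  # codec's explanation, wording varies by version
            "bad_bytes": exc.object[exc.start:exc.end],
        }
    return None


# Made-up sample: 14 ASCII bytes followed by a byte that is never valid in UTF-8.
sample = b"0123456789abcd\xff"
info = describe_decode_failure(sample)
print(info["start"], info["end"], info["bad_bytes"])
```

Logging the offending bytes next to the object path would show whether the bad data is plain Latin-1 text or something that was never text at all.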
As I use CPS, I use CPSSchemas, whose field machinery has functions converting to and from UTF-8 (the kind of errors I get rings a bell: they might come from there). The code looks like this:
def toUTF8(s):
    if not isinstance(s, UnicodeType):
        s = unicode(s, default_encoding)
    return s.encode('utf-8')

def fromUTF8(s):
    return unicode(s, 'utf-8').encode(default_encoding)
It is not fully symmetric: toUTF8 accepts both Unicode and non-Unicode strings, whereas fromUTF8 always returns a string encoded in default_encoding, never a Unicode string. Could this be a problem in our case? After all, TextIndexNG without stemming works fine, so maybe it has nothing to do with this.
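To check what the two helpers actually return, here is a rough Python 3 port (a sketch, not the CPSSchemas code itself). The names to_utf8 and from_utf8 are renamed ports, and the value of default_encoding is an assumption, since the post does not show it.

```python
default_encoding = "iso-8859-15"  # assumption: the real value is not shown in the post


def to_utf8(s):
    # Port of toUTF8: accepts text, or bytes encoded in default_encoding.
    if isinstance(s, bytes):
        s = s.decode(default_encoding)
    return s.encode("utf-8")


def from_utf8(s):
    # Port of fromUTF8: UTF-8 bytes in, default_encoding bytes out.
    return s.decode("utf-8").encode(default_encoding)


word = "m\xe9tiers"
utf8_bytes = to_utf8(word)
round_tripped = from_utf8(utf8_bytes)
print(type(round_tripped).__name__, round_tripped)

# Passing bytes that are already in default_encoding to from_utf8 fails,
# because a lone 0xe9 byte is not valid UTF-8.
try:
    from_utf8(round_tripped)
    double_decode_failed = False
except UnicodeDecodeError:
    double_decode_failed = True
print(double_decode_failed)
```

Under these assumptions the round trip yields an encoded byte string rather than text, and applying from_utf8 to data that is not UTF-8 raises the same kind of error as in the log.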
Does anyone have ideas on how to debug this further?
Thanks,
-- Damien
_______________________________________________
Zope maillist - Zope@zope.org
http://mail.zope.org/mailman-20/listinfo/zope
** No cross posts or HTML encoding! **
(Related lists -
http://mail.zope.org/mailman-20/listinfo/zope-announce
http://mail.zope.org/mailman-20/listinfo/zope-dev )