[Zope] UnicodeError with TextIndexNG2 and stemming

Andreas Jung andreas at andreas-jung.com
Thu Jun 17 11:09:29 EDT 2004


According to Damien, this problem is resolved by installing TextIndexNG
2.0.8 with a special fix for Python installations with wide Unicode
support.
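
For anyone unsure which kind of interpreter they have, here is a minimal
sketch of the usual check. The helper name is only illustrative. Note that
Python 3.3 and later always report the wide value, so the distinction only
matters on older interpreters such as the ones discussed in this thread.

```python
import sys

def is_wide_unicode_build():
    """Return True if this interpreter handles code points above U+FFFF natively."""
    # Narrow (UCS-2) builds report 0xFFFF; wide (UCS-4) builds report 0x10FFFF.
    return sys.maxunicode > 0xFFFF

print("wide build" if is_wide_unicode_build() else "narrow build")
print(hex(sys.maxunicode))
```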

-aj

--On Thursday, 17 June 2004 16:02 +0200 Damien Wyart <dw at nuxeo.com> 
wrote:

> Hello,
>
> I'm getting the very same error as the one described in
> http://www.dzug.org/mailinglisten/zope/archive/2004/2004-02/1077812031382
> (in German, but I couldn't find other references to this problem)
>
>     * Module Products.TextIndexNG2.TextIndexNG, line 220, in index_object
>     * Module Products.TextIndexNG2.TextIndexNG, line 370, in _index_object
>     * Module Products.TextIndexNG2.lexicons.StandardLexicon, line 60, in getWordIdList
>     * Module ZODB.Connection, line 562, in setstate
>     * Module ZODB.Connection, line 601, in _set_ghost_state
>
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 40-45:
> unsupported Unicode code range
>
> Looking at ZODB.Connection, I see it is related to cPickle... Not easy
> to debug!
>
> I am not fluent in German, but could guess the main points of the
> discussion. The production machine runs Red Hat 9. The setup is quite
> complicated (CMF/CPS, LDAP connections and such), so it is hard to
> reproduce. This doesn't seem to happen when LDAP is disabled, but I'm
> not sure of this.
>
> Andreas Jung, contacted directly, doesn't think this is directly related
> to TextIndexNG2, but perhaps to the ZODB level. He has never encountered
> this problem himself.
>
> I noticed this only happens on indexes where the stemmer is enabled.
> What I then see in the catalog are index contents like this:
>
>       SearchableText1 [(u'de\x10m', 1), (u'metier\x10bc256', 1),
>       (u'unpublish\x10567b6500', 1), (u'machin\x102567b', 1),
>       (u'truc\x10n\x102', 1),
>       (u'000000fbc2567b6500c000a800fb001439444adf\x10439444adf
>       \u0149000000fbc2567b6500c000a800fb', 1), (u'zobi\x100fb', 1),
>       (u'bidule\x10bc256', 1), (u'root\x10e\x10b', 1), (u'all\x10\x10e',
>       1), (u'chambre\x10c2567b', 1), (u'm\xe9tiers\x10c2567b', 1),
>       (u'zobi\x10lis', 1)]
>
>       SearchableText   [(u'de', 1), (u'chambre', 1), (u'metiers', 1),
>       (u'unpublish', 1), (u'all', 1), (u'root', 1), (u'metier', 1),
>       (u'000000fbc2567b6500c000a800fb001439444adf', 1), (u'machin', 1),
>       (u'zobi', 1), (u'truc', 1), (u'bidule', 1)]
>
> Here, SearchableText1 has stemming enabled, SearchableText has not.
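
The stemmed entries above differ from the clean ones mainly by embedded
control characters such as \x10. A rough sketch for flagging such entries
in a dump of (word, count) pairs follows. The function name and the
shortened sample lists are only illustrative, not part of TextIndexNG.

```python
import unicodedata

def suspicious_entries(entries):
    """Return the words that contain control characters (Unicode category Cc)."""
    return [word for word, _count in entries
            if any(unicodedata.category(ch) == "Cc" for ch in word)]

# Shortened samples taken from the two index dumps quoted above.
stemmed = [(u"de\x10m", 1), (u"metier\x10bc256", 1), (u"m\xe9tiers\x10c2567b", 1)]
unstemmed = [(u"de", 1), (u"metier", 1), (u"chambre", 1)]

print(suspicious_entries(stemmed))
print(suspicious_entries(unstemmed))
```

Accented letters such as \xe9 are not flagged, so legitimate French words
pass while the corrupted tails are caught.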
>
> Sometimes reindexing the index raises the same error; otherwise I get
> it when a newly created document or folder is indexed.
>
> Grepping for UnicodeError in the log gives lines like these:
>
> 2004-06-17T15:06:13 PROBLEM(100) textindexng UnicodeDecodeError raised
> from cps_grc/portal_repository/315764013__0001 - ignoring unknown
> unicode characters
> 2004-06-17T15:06:36 PROBLEM(100) textindexng UnicodeDecodeError raised
> from cps_grc/portal_repository/1594125709__0001 - ignoring unknown
> unicode characters
> 2004-06-17T15:06:50 PROBLEM(100) textindexng UnicodeDecodeError raised
> from cps_grc/portal_repository/1594125709__0001 - ignoring unknown
> unicode characters
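
To see which objects trigger the problem most often, the log entries can
be tallied per path. This is only a sketch: it assumes each entry sits on
a single line in the real log file (the mail client wrapped them here),
and the pattern and function name are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout: one entry per line, as in the excerpt above once unwrapped.
PATTERN = re.compile(r"UnicodeDecodeError raised from (\S+)")

def count_failures(lines):
    """Count UnicodeDecodeError reports per object path."""
    matches = (PATTERN.search(line) for line in lines)
    return Counter(m.group(1) for m in matches if m)

log = [
    "2004-06-17T15:06:13 PROBLEM(100) textindexng UnicodeDecodeError raised "
    "from cps_grc/portal_repository/315764013__0001 - ignoring unknown unicode characters",
    "2004-06-17T15:06:36 PROBLEM(100) textindexng UnicodeDecodeError raised "
    "from cps_grc/portal_repository/1594125709__0001 - ignoring unknown unicode characters",
    "2004-06-17T15:06:50 PROBLEM(100) textindexng UnicodeDecodeError raised "
    "from cps_grc/portal_repository/1594125709__0001 - ignoring unknown unicode characters",
]

for path, n in count_failures(log).most_common():
    print(n, path)
```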
>
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-7:
> illegal encoding
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-7:
> illegal encoding
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5-10:
> unsupported Unicode code range
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 14:
> unexpected code byte
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5-10:
> unsupported Unicode code range
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 14:
> unexpected code byte
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-7:
> illegal encoding
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 24-27:
> illegal encoding
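
When chasing errors like these, it helps to pull out the exact bytes the
codec rejected. A small sketch, written for a current Python (the error
wording differs from the 2004 messages above); the helper name and sample
data are illustrative only:

```python
def describe_decode_failure(data, encoding="utf-8"):
    """Return None if data decodes cleanly, else (start, end, reason, bad bytes)."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # start/end are the same positions the error message reports.
        return exc.start, exc.end, exc.reason, data[exc.start:exc.end]
    return None

broken = b"abc\xffdef"                      # 0xff can never appear in UTF-8
clean = u"m\xe9tiers".encode("utf-8")

print(describe_decode_failure(broken))
print(describe_decode_failure(clean))
```

Run against the raw value that failed to index, this shows whether the
bytes look like another encoding (for example Latin-1) or plain garbage.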
>
>
>
> As I use CPS, I use CPSSchemas, whose field machinery has functions
> converting to and from UTF-8 (the kind of errors I get suggests the
> problem might come from there). The code looks like this:
>
>
> def toUTF8(s):
>     if not isinstance(s, UnicodeType):
>         s = unicode(s, default_encoding)
>     return s.encode('utf-8')
>
> def fromUTF8(s):
>     return unicode(s, 'utf-8').encode(default_encoding)
>
>
> It is not fully symmetric, as toUTF8 accepts both Unicode and
> non-Unicode strings, whereas fromUTF8 always returns a string encoded
> in default_encoding. Could this be a problem in our case? After all,
> TextIndexNG without stemming works fine, so maybe it has nothing to do
> with this.
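
To make the asymmetry concrete, here is a rough Python 3 transliteration
of the two helpers. It is a sketch, not the CPSSchemas code: the names are
changed, and default_encoding is passed as a parameter with an assumed
value of latin-1, since its real value is not shown above.

```python
def to_utf8(s, default_encoding="latin-1"):
    # Mirrors toUTF8: byte strings are first decoded with default_encoding.
    if isinstance(s, bytes):
        s = s.decode(default_encoding)
    return s.encode("utf-8")

def from_utf8(s, default_encoding="latin-1"):
    # Mirrors fromUTF8: the result is re-encoded, so it is bytes, not text.
    return s.decode("utf-8").encode(default_encoding)

word = u"m\xe9tiers"
utf8_bytes = to_utf8(word)
print(type(from_utf8(utf8_bytes)))

# With an ASCII default encoding, the same round trip fails on the accent.
try:
    from_utf8(utf8_bytes, default_encoding="ascii")
except UnicodeEncodeError as exc:
    print("ascii default fails:", exc.reason)
```

So the round trip only works for characters that default_encoding can
represent, and the result is an encoded byte string rather than a Unicode
object. Whether that is related to the stemmer problem is a separate
question.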
>
>
> Does anyone have ideas on how to debug this further?
>
>
> Thanks,
>
> --
> Damien
>
