ZCatalog: updateMetadata and comparing string and unicode
Hi all,

I have an item in the portal_catalog of my Plone site that has some string as description. The real object meanwhile has had a code change so the description field now returns unicode. When I now recatalog that object it throws an error:

  Module Products.ZCatalog.Catalog, line 359, in catalogObject
  Module Products.ZCatalog.Catalog, line 318, in updateMetadata
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 159: ordinal not in range(128)

/home/maurits/buildout/projectdeploy/parts/zope2/lib/python/Products/ZCatalog/Catalog.py(318)updateMetadata()
-> if data.get(index, 0) != newDataRecord:
This happens when the current data in the catalog is compared to the new data. If there is a difference, the new data is stored. But to compare the old string with the new unicode, the string is first converted to unicode with the default 'ascii' codec. This fails because the string has non-ASCII characters in it. So basically what happens is this error:
>>> unicode("ä", 'utf-8') == u"ä"
True
>>> "ä" == u"ä"
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
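The implicit conversion behind that session can be spelled out explicitly. This is only a minimal sketch: the helper name implicit_compare is hypothetical, and it is written with explicit decode calls so that it also runs on Python 3, where comparing bytes with text simply returns False instead of raising.

```python
# Sketch of what Python 2 does behind the scenes when a byte string is
# compared with a unicode string: decode with the default codec, then compare.
# "implicit_compare" is a hypothetical name used for illustration only.

old_value = u"\xe4".encode("utf-8")  # byte string as stored in the catalog: '\xc3\xa4'
new_value = u"\xe4"                  # unicode value now returned by the object


def implicit_compare(byte_string, text, default_encoding="ascii"):
    """Decode the byte string with the default encoding, then compare."""
    return byte_string.decode(default_encoding) == text


try:
    implicit_compare(old_value, new_value)
    ascii_failed = False
except UnicodeDecodeError:
    # Byte 0xc3 is outside the ASCII range, so the decode step fails
    # before any comparison takes place.
    ascii_failed = True

# With the correct encoding the two values turn out to be equal.
utf8_equal = implicit_compare(old_value, new_value, "utf-8")
```

The failure comes from the decode step, not from the comparison itself, which is why knowing the real encoding of the stored string makes the problem disappear.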
Logical enough. This can be fixed in ZCatalog:

maurits@hobb:~/svn/Zope-210/lib/python/Products/ZCatalog $ svn diff
Index: Catalog.py
===================================================================
--- Catalog.py (revision 84388)
+++ Catalog.py (working copy)
@@ -304,7 +304,15 @@
             # meta_data is stored as a tuple for efficiency
             data[index] = newDataRecord
         else:
-            if data.get(index, 0) != newDataRecord:
+            try:
+                changed = data.get(index, 0) != newDataRecord
+            except UnicodeDecodeError:
+                # Converting some string to unicode fails. This
+                # conversion happens when a string and a unicode need
+                # to be compared. Those two are not the same, so
+                # logically there has been a change, so:
+                changed = True
+            if changed:
                 data[index] = newDataRecord
 
         return index
Index: tests/testCatalog.py
===================================================================
--- tests/testCatalog.py (revision 84388)
+++ tests/testCatalog.py (working copy)
@@ -1,3 +1,4 @@
+# -*- coding: utf-8 -*-
 ##############################################################################
 #
 # Copyright (c) 2002 Zope Corporation and Contributors. All Rights Reserved.
@@ -177,6 +177,13 @@
     def __nonzero__(self):
         self.fail("__nonzero__() was called")
 
+class zdummyText(ExtensionClass.Base):
+    def __init__(self, text):
+        self.text = text
+
+    def title(self):
+        return self.text
+
 class FakeTraversalError(KeyError):
     """fake traversal exception for testing"""
 
@@ -261,6 +268,12 @@
         data = self._catalog.getMetadataForUID('1')
         self.assertEqual(data['title'], '1')
 
+        text = zdummyText('A string with an accent: \xc3\xa4.')
+        self._catalog.catalog_object(text, '1')
+        text.text = unicode("A simple unicode.")
+        self._catalog.catalog_object(text, '1')
+
+
     def testReindexIndexDoesntDoMetadata(self):
         self.d['0'].num = 9999
         self._catalog.reindexIndex('title', {})
===================================================================

With that change it works: on the live site I can edit and save that item without errors.
Without the change to the code, the added test fails at precisely the point where the change should be done. But if I change the code the test still fails because something similar goes wrong in the KeywordIndex, with this traceback:

===================================================================
Error in test testUpdateMetadata (Products.ZCatalog.tests.testCatalog.TestZCatalog)
Traceback (most recent call last):
  File "unittest.py", line 260, in run
    testMethod()
  File "/home/maurits/svn/Zope-210/lib/python/Products/ZCatalog/tests/testCatalog.py", line 274, in testUpdateMetadata
    self._catalog.catalog_object(text, '1')
  File "/home/maurits/svn/Zope-210/lib/python/Products/ZCatalog/ZCatalog.py", line 536, in catalog_object
    update_metadata=update_metadata)
  File "/home/maurits/svn/Zope-210/lib/python/Products/ZCatalog/Catalog.py", line 368, in catalogObject
    blah = x.index_object(index, object, threshold)
  File "/home/maurits/svn/Zope-210/lib/python/Products/PluginIndexes/common/UnIndex.py", line 235, in index_object
    res += self._index_object(documentId, obj, threshold, attr)
  File "/home/maurits/svn/Zope-210/lib/python/Products/PluginIndexes/KeywordIndex/KeywordIndex.py", line 85, in _index_object
    fdiff = difference(oldKeywords, newKeywords)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 25: ordinal not in range(128)
===================================================================

This is a bit trickier to fix, as the variable fdiff that is calculated here is needed later on.

But at this point I would like to ask: is this a solution direction I want to explore? Is the basic fix above sane? Or should we change nothing here and should add-on developers just be careful of what they let end up in the catalog?

The fix/workaround from a user's point of view is: clear and rebuild the catalog, as that gets rid of any old data so no comparison needs to be done anymore. That solves the problem for me.
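For the KeywordIndex failure in the traceback above, one conceivable direction (not what the patch does, and not existing KeywordIndex code) would be to bring both keyword collections to unicode before computing the difference. The sketch below uses plain Python sets as a stand-in for the BTrees "difference" call; the names to_text and keyword_difference and the utf-8 default are assumptions for illustration.

```python
# Hypothetical sketch: normalise old and new keywords to text before
# diffing them. Plain sets stand in for the OOSet/difference machinery
# that the real KeywordIndex uses.


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings with the given encoding."""
    if isinstance(value, bytes):  # "bytes" is an alias for "str" on Python 2
        return value.decode(encoding)
    return value


def keyword_difference(old_keywords, new_keywords, encoding="utf-8"):
    """Keywords present in old_keywords but absent from new_keywords."""
    old = set(to_text(keyword, encoding) for keyword in old_keywords)
    new = set(to_text(keyword, encoding) for keyword in new_keywords)
    return old - new


# Old data holds byte strings, new data holds unicode.
removed = keyword_difference([b"caf\xc3\xa9", b"tea"], [u"caf\xe9"])
```

This only works if the encoding of the stored byte strings is known, which is the same precondition as for the migration approaches discussed later in the thread.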
For reference, the PoiIssue from above was created when its class had this method (simplified):

    def Description(self):
        # return the contents of the details field
        return self.getRawDetails()

And currently the code is this:

    def Description(self):
        details = self.getRawDetails()
        if not isinstance(details, unicode):
            encoding = getSiteEncoding(self)
            details = unicode(details, encoding)
        return details

That change is meant to solve another occasional unicode error when adding issues in Japanese: http://plone.org/products/poi/issues/135

I am the maintainer of Poi btw, and I am writing some migration code now that triggers this error. So writing some other migration to first fix that recatalog issue specifically for the Poi content is doable too.

--
Maurits van Rees | http://maurits.vanrees.org/
Work | http://zestsoftware.nl/
"This is your day, don't let them take it away." [Barlow Girl]
Maurits van Rees wrote at 2008-3-5 23:57 +0000:
... I have an item in the portal_catalog of my Plone site that has some string as description. The real object meanwhile has had a code change so the description field now returns unicode. When I now recatalog that object it throws an error:
  Module Products.ZCatalog.Catalog, line 359, in catalogObject
  Module Products.ZCatalog.Catalog, line 318, in updateMetadata
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 159: ordinal not in range(128)

/home/maurits/buildout/projectdeploy/parts/zope2/lib/python/Products/ZCatalog/Catalog.py(318)updateMetadata()
-> if data.get(index, 0) != newDataRecord:
You must not mix "unicode" and "str" as keys in the same index. If you do, errors like the above are very likely.

You can try the following approaches:

 * If you know the encoding used by your "str" objects, you can set Python's default encoding to this encoding. Whenever "unicode" and "str" come together, the "str" is converted to "unicode" using this encoding (which hopefully is the correct one in all such cases).

   "sys.setdefaultencoding" is only available at startup. Thus, setting "defaultencoding" must happen in a "sitecustomize" or "site" module.

 * You switch completely to "unicode" for the given index and convert the BTrees used by the index.

   An index usually uses two BTrees: the so-called forward index (usually called "_index"), which maps the index terms to the sets of record ids indexed under each term, and the reverse index (usually called "_unindex"), which maps record ids to the values corresponding to these objects.

   You need to convert the keys of the forward index and the values of the reverse index. For a "FieldIndex", the value is the index term; for a "KeywordIndex", it is a sequence of index terms (all need to be converted).

   The forward index can be converted as follows:

     self._index = OOBTree(((s.decode(<your encoding>), v) for (s,v) in self._index.items()))

   The reverse index uses an IOBTree and is similar to the above. But the details depend on the index type.

--
Dieter
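The two conversions described above (keys of the forward index, values of the reverse index) can be sketched roughly as follows. To keep the example runnable without Zope, plain dicts and sets stand in for the OOBTree and IOBTree structures; all function names are hypothetical, and a real migration would have to work on the index's actual BTrees and its actual record-id containers.

```python
# Hypothetical sketch of converting an index from byte strings to unicode.
# Plain dicts and sets stand in for the OOBTree (forward index) and the
# IOBTree (reverse index) used by the real index classes.


def decode_term(term, encoding):
    """Decode a byte-string index term; leave unicode terms untouched."""
    if isinstance(term, bytes):  # "bytes" is an alias for "str" on Python 2
        return term.decode(encoding)
    return term


def convert_forward_index(forward, encoding):
    """Decode the keys of a mapping: index term -> set of record ids."""
    converted = {}
    for term, record_ids in forward.items():
        # A byte-string key and a unicode key may decode to the same term,
        # so their record ids are merged rather than overwritten.
        converted.setdefault(decode_term(term, encoding), set()).update(record_ids)
    return converted


def convert_reverse_field_index(reverse, encoding):
    """FieldIndex case: each value is a single index term."""
    return dict((rid, decode_term(term, encoding)) for rid, term in reverse.items())


def convert_reverse_keyword_index(reverse, encoding):
    """KeywordIndex case: each value is a sequence of index terms."""
    return dict((rid, [decode_term(term, encoding) for term in terms])
                for rid, terms in reverse.items())


forward = {b"caf\xc3\xa9": {1, 2}, u"caf\xe9": {3}, b"tea": {4}}
new_forward = convert_forward_index(forward, "utf-8")
new_field = convert_reverse_field_index({1: b"caf\xc3\xa9", 4: b"tea"}, "utf-8")
new_keyword = convert_reverse_keyword_index({1: [b"caf\xc3\xa9", b"tea"]}, "utf-8")
```

Merging record ids on key collisions is a design choice of this sketch; it matters only when an index already contains the same term in both "str" and "unicode" form.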
Dieter Maurer wrote:
"sys.setdefaultencoding" is only available at startup. Thus, setting "defaultencoding" must happen in a "sitecustomize" or "site" module.
Or if you're sufficiently devious, it's available any time (not that actually using it is a good idea, but...):
>>> import sys
>>> sys.setdefaultencoding
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'setdefaultencoding'
>>> del sys.modules['sys']
>>> import sys
>>> sys.setdefaultencoding
<built-in function setdefaultencoding>
--
Benji York
Senior Software Engineer
Zope Corporation
participants (3)
- Benji York
- Dieter Maurer
- Maurits van Rees