[Zope-Coders] Analysis: BTrees and Unicode and Python
Guido van Rossum
guido@python.org
Fri, 19 Oct 2001 11:52:01 -0400
Good job, Andreas!
> After lots of debugging here an explanation for the behaviour we have
> seen in the unittest:
>
> - The BTrees code calls PyObject_Compare() several times before the
> comparison that failed (unicode vs. unicode)
>
> - one of these earlier comparisons checks a Python string (containing
> an accented character) against a unicode string and raises a
> unicode exception (ASCII decoding error: ordinal not in range(128)).
> I assume because the default encoding is ascii.
Note that this was a conscious design decision. Not all the world
uses Latin-1, and many real-world programs and data use different
interpretations of 8-bit characters with the high bit set. Assuming
Latin-1 when comparing to Unicode would be wrong.
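To make the ambiguity concrete, here is a minimal sketch in modern Python 3, where bytes and str play the roles that str and unicode played in Python 2. The byte value and the helper name decode_ascii_or_none are chosen purely for illustration and do not come from the original report.

```python
raw = b"\xe9"  # a single 8-bit byte with the high bit set

# The same byte stands for a different character in each 8-bit encoding.
as_latin1 = raw.decode("latin-1")       # LATIN SMALL LETTER E WITH ACUTE
as_cyrillic = raw.decode("iso-8859-5")  # CYRILLIC SMALL LETTER SHCHA
as_koi8r = raw.decode("koi8-r")         # CYRILLIC CAPITAL LETTER I


def decode_ascii_or_none(data):
    """Hypothetical helper: decode as ASCII, or return None if that is impossible."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError:
        # ASCII only defines ordinals 0..127, so the high-bit byte is rejected.
        return None


print(as_latin1, as_cyrillic, as_koi8r, decode_ascii_or_none(raw))
```

Because the three decodings disagree, picking any one of them silently would be a guess; refusing to decode is the only choice that is never wrong.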
> - the BTree code does not check for an exception after
> PyObject_Compare(), so this error never got cleared
This should be fixed before proceeding.
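The pattern at issue can be sketched in Python as a simulation only. In C the return value of PyObject_Compare() cannot signal failure by itself, so the caller has to consult PyErr_Occurred(). Everything below (PendingError, c_style_compare, checked_compare, and the choice of -1 as the failure return value) is a hypothetical stand-in for that convention, not the BTrees code or the real C API.

```python
class PendingError:
    """Hypothetical stand-in for the interpreter's error indicator."""

    def __init__(self):
        self.exc = None


def c_style_compare(a, b, err):
    """Return -1, 0 or 1. On failure, return -1 and record the exception in err."""
    try:
        # Mimic the implicit ASCII decoding of 8-bit strings (assumption of this sketch).
        if isinstance(a, bytes):
            a = a.decode("ascii")
        if isinstance(b, bytes):
            b = b.decode("ascii")
    except UnicodeDecodeError as exc:
        err.exc = exc
        return -1  # indistinguishable from a genuine "less than" result
    return (a > b) - (a < b)


def checked_compare(a, b, err):
    """Compare, then inspect the error indicator before trusting the result."""
    result = c_style_compare(a, b, err)
    if err.exc is not None:
        exc, err.exc = err.exc, None  # clear the indicator and propagate
        raise exc
    return result


err = PendingError()
c_style_compare(b"caf\xe9", "caf\u00e9", err)  # fails, leaves an error behind
c_style_compare("a", "a", err)                 # succeeds, stale error is still set
print(type(err.exc).__name__)
```

In the unchecked path the stale error survives later successful comparisons, which matches the symptom of an exception surfacing far from the comparison that caused it.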
> - when trying to compare two identical unicode strings, Python
> calls default_3way_compare() and runs into the following code:
>
>
> static int
> default_3way_compare(PyObject *v, PyObject *w)
> {
>     int c;
>     char *vname, *wname;
>
>     if (v->ob_type == w->ob_type) {
>         /* When comparing these pointers, they must be cast to
>          * integer types (i.e. Py_uintptr_t, our spelling of C9X's
>          * uintptr_t). ANSI specifies that pointer compares other
>          * than == and != to non-related structures are undefined.
>          */
>         Py_uintptr_t vv = (Py_uintptr_t)v;
>         Py_uintptr_t ww = (Py_uintptr_t)w;
>         puts("\t\t\tdefcmp 1");
>         return (vv < ww) ? -1 : (vv > ww) ? 1 : 0;
>     }
>
> This code returns -1 for the two identical unicode strings.
>
> I am not sure if this code is able to compare two unicode strings.
> On the other hand it is still strange that the unittest works when
> the unicode string in the unittest's list of test data is replaced
> with self.s, as described earlier.
>
> Any ideas about that?
It is definitely a bug if comparison of two unicode strings ends up
calling default_3way_compare()!
This normally doesn't happen though -- the Unicode object's comparison
code is generally called.
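Why the fallback is wrong for strings can be shown with a small Python 3 sketch. The function identity_compare is a hypothetical imitation of the quoted pointer comparison, using id() in place of the object address (in CPython id() is the address, which is an implementation detail).

```python
def identity_compare(v, w):
    """Order two objects by identity only, ignoring their contents."""
    vv, ww = id(v), id(w)
    return -1 if vv < ww else (1 if vv > ww else 0)


# Build two strings at run time so that, in CPython, they are
# separate objects holding the same text.
a = "".join(["uni", "code"])
b = "".join(["uni", "code"])

print(a == b)                  # value comparison: the contents are equal
print(identity_compare(a, b))  # identity comparison: non-zero for distinct objects
print(identity_compare(a, a))  # the very same object always compares equal
```

A comparison by identity reports equality only when both arguments are the very same object. That would be consistent with the test passing once self.s is used on both sides, though this is an inference from the report and not something confirmed here.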
I'd like to see what's on the stack when default_3way_compare is
called with two Unicode objects.
Which Python version is this? 2.1 or 2.1.1?
--Guido van Rossum (home page: http://www.python.org/~guido/)