[ZODB-Dev] Towards ZODB on Python 3

Sun Mar 10 13:19:03 UTC 2013

On Fri, Mar 8, 2013 at 9:31 AM, Marius Gedminas <marius at gedmin.as> wrote:
> (Resending because I used the wrong From address and the mail got stuck
> in moderation.)
>
>
> Some goals, in order of decreasing priority
>
> 1. ZODB should work on Python 3
>
> 2. ZODB databases created on Python 2 should be loadable with ZODB on
>    Python 3.
>
> 3. ZODB databases created on Python 3 should be loadable with ZODB on
>    Python 2.
>
>
> This will be kinda longish, so please settle down.
>
>
> Now, ZODB is built on top of pickles.  And pickles in Python 2 know about
> two kinds of strings: str and unicode.  But there are actually *three*
> kinds of strings in Python-land:
>
>   * bytes
>   * unicode
>   * native strings (same as bytes in Python 2, same as unicode in Python 3)

I hadn't encountered that term before.  I see it informally used to
refer to ``str``, which is bytes in Python 2 and Unicode in Python 3.
This isn't a different kind of string.

> Unfortunately we cannot distinguish bytes from native strings in the
> pickles produced on Python 2: both kinds are pickled as STRING, BINSTRING
> or SHORT_BINSTRING opcodes.  If we assume they're native strings, we
> can break pickles that contain binary data, in one of three possible ways:
>
>   i.   assume 'ascii' and raise UnicodeDecodeError while loading
>
>   ii.  assume 'latin-1' and silently give applications unicode objects
>        where they expect byte strings
>
>   iii. assume 'utf-8' and combine the disadvantages of both of the above
>        methods: sometimes fail, sometimes return unicode where applications
>        expect bytes
>
> One very common example of binary data: persistent object references.
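
For illustration, the outcomes listed above can be reproduced on
Python 3 with the standard library pickle, whose load functions accept
an encoding argument (including the special value 'bytes').  This is
only a minimal sketch: py2_str_pickle and try_load are hypothetical
helpers, and the pickle is assembled by hand to match what Python 2
writes for a short str (minus the memo opcode), not produced by ZODB.

```python
import pickle


def py2_str_pickle(data):
    """Hand-assemble a protocol 2 pickle holding one SHORT_BINSTRING.

    Hypothetical helper: mimics what Python 2 emits for a short str.
    """
    assert len(data) < 256
    # PROTO 2, SHORT_BINSTRING <length> <data>, STOP
    return b'\x80\x02' + b'U' + bytes([len(data)]) + data + b'.'


def try_load(pickled, encoding):
    """Load with the given encoding, reporting decode failures by name."""
    try:
        return pickle.loads(pickled, encoding=encoding)
    except UnicodeDecodeError:
        return 'UnicodeDecodeError'


# An 8-byte OID whose last byte is outside the ASCII range.
oid = b'\x00' * 7 + b'\x80'
pickled = py2_str_pickle(oid)

results = {
    enc: try_load(pickled, enc)
    for enc in ('ascii', 'latin-1', 'utf-8', 'bytes')
}
```

With this OID, 'ascii' and 'utf-8' fail, 'latin-1' yields an
8-character str, and only 'bytes' returns the original value.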
>
> What if we break stride with the standard library pickle, do our own
> pickle[1] and load BINSTRINGs as bytes?
>
>   iv.  assume bytes [2]
>
> Then we break *every object instance* by putting byte strings into the
> instance __dict__ on Python 3:
>
>    >>> obj.__dict__[b'attr'] = value
>    >>> obj.attr
>    Traceback ...
>    AttributeError: ...
>
> What if we try to detect which SHORT_BINSTRINGs are bytes and which ones
> are native strings?
>
>   v.   try to decode 'ascii', if that fails, return bytes [3]
>
> Then we, again, get the disadvantage of approach (ii), only in a very
> inconsistent manner: sometimes pickled binary data unpickles into
> unicode.  Half of your OIDs are now u'\0\0\0\0\0\0\0\x7f', the other
> half is b'\0\0\0\0\0\0\0\x80'.  ZODB itself can cope with that [4], but
> will someone think of the childre^H^H^H^H^H applications?
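
The inconsistency of approach (v) can be sketched in a few lines.
load_binstring is a hypothetical stand-in for the decoding step, not
zodbpickle's actual code; it only assumes "try ASCII, fall back to
bytes" as described above.

```python
def load_binstring(data):
    """Decode as ASCII when possible, otherwise keep the raw bytes."""
    try:
        return data.decode('ascii')
    except UnicodeDecodeError:
        return data


# Two adjacent OIDs end up with different types after loading.
oid_127 = load_binstring(b'\x00' * 7 + b'\x7f')
oid_128 = load_binstring(b'\x00' * 7 + b'\x80')
```

Here oid_127 comes back as text and oid_128 as bytes, so any
application code comparing or concatenating them has to cope with both.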
>
> What if we introduce a way for applications to specify whether they want
> bytes or unicode?
>
>   vi.  define an explicit schema of some kind for each Persistent subclass,
>        e.g. _p_load_as_bytes = ('names', 'of', 'attributes'); advanced
>        users can override __setstate__ and do type fixups in there
>
> I don't know.  I haven't had the time to think this through yet.  It
> sounds like a huge amount of work for everyone.
>
>   [1] https://github.com/zopefoundation/zodbpickle
>   [2] zodbpickle.pickle.Unpickler(encoding='bytes')
>   [3] zodbpickle.pickle.Unpickler(encoding='ascii', errors='bytes')
>   [4] this is the status quo of the 'py3' branch in the ZODB repo
>
> That's the situation with loading.  I've implemented approach (v) in the
> ZODB py3 branch, but I'm by no means certain it is acceptable.  But
> that's not all, there's more fun to be had on the dumping side too!
>
>
> We want pickles created by ZODB to be
>
>   a) reasonably short
>   b) round-trippable (what you dump, you get back on load)
>   c) compatible with Python 2
>   d) noload()able [5]
>
>   [5] i.e. we want to be able to do garbage collection without actually
>       instantiating user-defined classes (think of a ZEO server that
>       doesn't have the right modules in sys.path, or standalone zodbgc
>       processing), which is why we added noload() back into zodbpickle.
>       noload() must be able to crawl the pickles and get back OIDs from
>       persistent references.
>
> There are problems with each of these requirements, and solutions for
> those problems make the other requirements impossible to implement.
>
>   * Python 3 pickles bytestrings using a fancy REDUCE opcode, as a
>     function call to codecs.encode(u'decoded bytestring', 'latin-1').
>     This makes them large and breaks (a), and our noload() copied from
>     Python 2.x stdlib is unable to handle them, breaking (d). [8]
>
>   * Why does Python 3 pickle bytestrings this way?  Because that's the
>     only way to get round-trippability with Python's interpretation of
>     BINSTRING opcodes as unicode, if you use pickle protocols 0, 1, or
>     2.  Pickle protocol 3 has separate opcodes for all three kinds of
>     strings (bytes, unicode, native -- remember?), but it's incompatible
>     with Python 2, breaking requirement (c).
>
>   * We could implement a custom pickler [6] and pickle bytestrings as
>     SHORT_BINSTRING, fulfilling requirement (a) and (c) and (d), but
>     this breaks (b), i.e. round-tripping.
>
>   [6] zodbpickle.pickle.Pickler(bytes_as_strings=True) [7]
>   [7] this is the status quo of the 'py3' branch in the ZODB repo
>   [8] OTOH we could implement special support for REDUCE of
>       codecs.encode() in our noload -- I almost got that working before
>       Jim suggested a different approach, which is [6].
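
The size and opcode difference described in the bullets above can be
checked on Python 3 with the standard library alone.  A minimal
sketch, assuming stock pickle rather than zodbpickle; opcode_names is
a hypothetical helper built on pickletools.genops.

```python
import pickle
import pickletools


def opcode_names(pickled):
    """Return the opcode names in a pickle without executing it."""
    return [op.name for op, arg, pos in pickletools.genops(pickled)]


data = b'\x00\x00\x00\x00\x00\x00\x00\x80'

# Protocol 2 has no bytes opcode, so Python 3 emits a function call.
proto2 = pickle.dumps(data, protocol=2)
# Protocol 3 has a dedicated opcode for short bytes objects.
proto3 = pickle.dumps(data, protocol=3)

names2 = opcode_names(proto2)
names3 = opcode_names(proto3)
```

Both pickles round-trip on Python 3, but the protocol 2 one is longer
and relies on GLOBAL and REDUCE, while the protocol 3 one uses
SHORT_BINBYTES, which Python 2 cannot read.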
>
> At least there's some nice symmetry: no matter if you pickle your stuff
> on Python 2 or Python 3, you get to deal with bytes becoming unicode
> when you unpickle.  These kinds of guessing games are inevitable when
> you're migrating pickles from Python 2 to Python 3, but do we want to
> make them mandatory for day-to-day operation?
>
> Perhaps we ought to drop our original goal (3) and require an explicit
> one-time possibly-lossy conversion process for goal (2), then use pickle
> protocol 3 on Python 3 and have short pickles, perfect roundtripping of
> bytestrings?
>
>
> Then there's ZEO, which uses pickles for both payloads _and_ for
> marshalling in its RPC layer.  That's also fun, but I think we can at
> least declare that ZEO server and client must be on the same Python
> version, perhaps by bumping the protocol version.
>
>
> So, this is where things stand right now.  Plus a few relatively minor
> matters like adding missing noload() tests to zodbpickle and making
> zodbpickle work on Python 3.2 [9]
>
>   [9] https://mail.zope.org/pipermail/checkins/2013-March/065813.html
>
> Other than that, the ZODB py3 branch works on Python 3.3 [10].  As long as
> you're prepared to deal with bytestrings magically transforming into
> unicodes.
>
>   [10] Stephan reported running an actual small demo application with it.
>
>
> Where do we go from here?

Is this an issue for anything but names (object attributes and global
names)?

I don't think there's a "native strings" issue.  There *does* seem to
be a name issue.  In Python 2 and Python 3, (non-buggy) unicode-aware
applications use bytes and unicode the same way, unicode for text,
bytes for data.

AFAICT, Python 3 has (admirably) changed the way names are implemented
to use unicode, rather than ASCII.

Am I missing something?

This is a somewhat thorny, but still fairly restricted problem.  I
would hazard a guess that 99.923% of persistent classes pickle their
state using their instance dictionaries.  99.9968% for regular Python
classes.  We know when we're pickling and unpickling instances and we
can apply transformations necessary for the target platforms.

I think the fix is pretty straightforward.

In the default __setstate__ provided by Persistent, and when loading
non-persistent instances:

- On Python 2, ASCII encode unicode attribute names.

- On Python 3, ASCII decode byte attribute names.

The same transformation is necessary when looking up global names.
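
A rough sketch of that transformation, to make the idea concrete.
The names normalize_attribute_names and Record are hypothetical, and
this is not the actual Persistent.__setstate__; it only assumes the
state is a dictionary keyed by attribute name.

```python
text_type = type(u'')   # unicode on Python 2, str on Python 3
PY2 = str is bytes      # True only on Python 2


def normalize_attribute_names(state):
    """Return a copy of state whose keys are native strings."""
    normalized = {}
    for name, value in state.items():
        if PY2 and isinstance(name, text_type):
            # Python 2: attribute names must be byte strings.
            name = name.encode('ascii')
        elif not PY2 and isinstance(name, bytes):
            # Python 3: attribute names must be text.
            name = name.decode('ascii')
        normalized[name] = value
    return normalized


class Record(object):
    """Stand-in for a class relying on the default __setstate__."""

    def __setstate__(self, state):
        self.__dict__.update(normalize_attribute_names(state))


record = Record()
record.__setstate__({b'title': u'hello', 'count': 3})
```

Only the keys are touched; attribute values keep whatever type the
pickle gave them.  A non-ASCII name raises a Unicode error here rather
than being guessed at.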

This will cover the vast majority of cases where the default
__setstate__ is used.  In rare cases where a custom __setstate__ is used,
or when Python 3 non-ASCII attribute names are used, databases
may not be sharable across Python versions.

There is also likely to be breakage in dictionaries or BTrees where
applications are sloppy about mixing Unicode and byte keys.  I don't
think we should try to compensate for this. These applications need to
be fixed.  One could write a database analysis script to detect this
kind of breakage (looking for mixed string and unicode keys).
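
The core check of such a script might look like the sketch below.
has_mixed_string_keys is a hypothetical helper that works on any
mapping that iterates over its keys; walking the database to find the
dictionaries and BTrees to feed it is left out.

```python
def has_mixed_string_keys(mapping):
    """Report whether a mapping has both bytes keys and text keys."""
    text_type = type(u'')   # unicode on Python 2, str on Python 3
    saw_bytes = False
    saw_text = False
    for key in mapping:
        if isinstance(key, text_type):
            saw_text = True
        elif isinstance(key, bytes):
            saw_bytes = True
    return saw_bytes and saw_text


mixed = has_mixed_string_keys({b'alice': 1, u'bob': 2})
clean = has_mixed_string_keys({u'alice': 1, u'bob': 2})
```

Keys of other types (integers, tuples) are ignored, since they are not
affected by the bytes versus unicode distinction.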

Jim

--
Jim Fulton
http://www.linkedin.com/in/jimfulton
Jerky is better than bacon! http://zo.pe/Kqm