[ZODB-Dev] Towards ZODB on Python 3

Fri Mar 8 14:31:51 UTC 2013

(Resending because I used the wrong From address and the mail got stuck
in moderation.)

Some goals, in order of decreasing priority:

1. ZODB should work on Python 3

2. ZODB databases created on Python 2 should be loadable with ZODB on
   Python 3.

3. ZODB databases created on Python 3 should be loadable with ZODB on
   Python 2.

This will be kinda longish, so please settle down.

Now, ZODB is built on top of pickles.  And pickles in Python 2 know about
two kinds of strings: str and unicode.  But there are actually *three*
kinds of strings in Python-land:

  * bytes
  * unicode
  * native strings (same as bytes in Python 2, same as unicode in Python 3)

Unfortunately we cannot distinguish bytes from native strings in the
pickles produced on Python 2: both kinds are pickled as STRING, BINSTRING
or SHORT_BINSTRING opcodes.  If we assume they're native strings, we
can break pickles that contain binary data, in one of three possible ways:

  i.   assume 'ascii' and raise UnicodeDecodeError while loading

  ii.  assume 'latin-1' and silently give applications unicode objects
       where they expect bytes

  iii. assume 'utf-8' and combine the disadvantages of both of the above
       methods: sometimes fail, sometimes return unicode where applications
       expect bytes
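
To make the three failure modes concrete, here is a small sketch using the
standard library pickle on Python 3.  The two inputs are hand-assembled
protocol 2 pickles of the kind Python 2 writes for a str; they are
illustrative, not taken from a real database, and try_load is a helper
made up for this example.

```python
import pickle

# Hand-assembled protocol 2 pickles, as Python 2 would write them for a str:
# PROTO 2, SHORT_BINSTRING <length> <data>, STOP.
OID_PICKLE = b'\x80\x02U\x08\x00\x00\x00\x00\x00\x00\x00\x80.'   # binary data
TEXT_PICKLE = b'\x80\x02U\x05caf\xc3\xa9.'                        # UTF-8 text


def try_load(data, encoding):
    """Return the unpickled value, or the exception class name on failure."""
    try:
        return pickle.loads(data, encoding=encoding)
    except UnicodeDecodeError as exc:
        return type(exc).__name__


# Load the same binary OID under each of the three assumed encodings.
results = {enc: try_load(OID_PICKLE, enc)
           for enc in ('ascii', 'latin-1', 'utf-8')}
```

With 'ascii' and 'utf-8' the OID fails to load; with 'latin-1' it loads,
but as a str rather than bytes.  The UTF-8 text pickle loads fine under
'utf-8', which is what makes (iii) only sometimes fail.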

One very common example of binary data: persistent object references.

What if we break stride with the standard library pickle, do our own
pickle[1] and load BINSTRINGs as bytes?

  iv.  assume bytes [2]

Then we break *every object instance* by putting byte strings into the
instance __dict__ on Python 3:

   >>> obj.__dict__[b'attr'] = value
   >>> obj.attr
   Traceback ...
   AttributeError: ...
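
A slightly fuller sketch of the same problem, using only the standard
library.  STATE_PICKLE is a hand-assembled protocol 2 pickle of the
Python 2 dict {'attr': 1}, standing in for an object's pickled state, and
Obj is a made-up plain class rather than a real Persistent subclass.

```python
import pickle

# Hand-assembled protocol 2 pickle of the Python 2 dict {'attr': 1}.
STATE_PICKLE = b'\x80\x02}q\x00U\x04attrq\x01K\x01s.'


class Obj:
    """Plain stand-in class; state is applied as the default __setstate__ would."""


# Approach (iv): every Python 2 str comes back as bytes, including dict keys.
state = pickle.loads(STATE_PICKLE, encoding='bytes')

obj = Obj()
obj.__dict__.update(state)        # the key is b'attr', not 'attr'

has_attr = hasattr(obj, 'attr')   # attribute lookup uses the str 'attr'
keys = list(obj.__dict__)
```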

What if we try to detect which SHORT_BINSTRINGs are bytes and which ones
are native strings?

  v.   try to decode 'ascii', if that fails, return bytes [3]

Then we, again, get the disadvantage of approach (ii), only in a very
inconsistent manner: sometimes pickled binary data unpickles into
unicode.  Half of your OIDs are now u'\0\0\0\0\0\0\0\x7f', the other
half is b'\0\0\0\0\0\0\0\x80'.  ZODB itself can cope with that [4], but
will someone think of the childre^H^H^H^H^H applications?
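
The inconsistency is easy to reproduce outside of any unpickler.  The
function below is a hypothetical helper that mimics the decision rule of
approach (v); it is not the zodbpickle implementation.

```python
def decode_or_bytes(raw):
    """Mimic approach (v): decode as ASCII if possible, else keep bytes."""
    try:
        return raw.decode('ascii')
    except UnicodeDecodeError:
        return raw


# Two adjacent OIDs end up with different types.
low = decode_or_bytes(b'\x00\x00\x00\x00\x00\x00\x00\x7f')
high = decode_or_bytes(b'\x00\x00\x00\x00\x00\x00\x00\x80')
```

Here low is a str and high is bytes, so any application code comparing or
concatenating OIDs has to handle both.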

What if we introduce a way for applications to specify whether they want
bytes or unicode?

  vi.  define an explicit schema of some kind for each Persistent subclass,
       e.g. _p_load_as_bytes = ('names', 'of', 'attributes'); advanced
       users can override __setstate__ and do type fixups in there

I don't know.  I haven't had the time to think this through yet.  It
sounds like a huge amount of work for everyone.
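
For the sake of discussion, one way such a schema might be consumed is
sketched below.  Nothing here exists in ZODB: Document is a plain class
standing in for a Persistent subclass, and the sketch assumes the state
was unpickled with 'latin-1', so that encoding back to bytes is lossless.

```python
class Document:
    """Stand-in for a Persistent subclass; no ZODB import is needed here."""

    # Names of attributes whose values must come back as bytes.
    _p_load_as_bytes = ('payload',)

    def __setstate__(self, state):
        fixed = {}
        for name, value in state.items():
            if name in self._p_load_as_bytes and isinstance(value, str):
                # Assumption: the unpickler decoded with latin-1, so this
                # recovers the original bytes exactly.
                value = value.encode('latin-1')
            fixed[name] = value
        self.__dict__.update(fixed)


# Simulate what an unpickler would do after loading the state dict.
doc = Document.__new__(Document)
doc.__setstate__({'title': 'Report', 'payload': '\x00\xff'})
```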

  [1] https://github.com/zopefoundation/zodbpickle
  [2] zodbpickle.pickle.Unpickler(encoding='bytes')
  [3] zodbpickle.pickle.Unpickler(encoding='ascii', errors='bytes')
  [4] this is the status quo of the 'py3' branch in the ZODB repo

That's the situation with loading.  I've implemented approach (v) in the
ZODB py3 branch, but I'm by no means certain it is acceptable.  But
that's not all, there's more fun to be had on the dumping side too!

We want pickles created by ZODB to be

  a) reasonably short
  b) round-trippable (what you dump, you get back on load)
  c) compatible with Python 2
  d) noload()able [5]

  [5] i.e. we want to be able to do garbage collection without actually
      instantiating user-defined classes (think of a ZEO server that
      doesn't have the right modules in sys.path, or standalone zodbgc
      processing), which is why we added noload() back into zodbpickle.
      noload() must be able to crawl the pickles and get back OIDs from
      persistent references.
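
      To illustrate what "crawling the pickles" means, here is a sketch
      that collects persistent references using only the standard
      library.  It is not noload(): it approximates the effect by
      substituting a dummy class for every global.  Ref, RefPickler,
      Placeholder, RefCollector and referenced_oids are all names made
      up for this example.

```python
import io
import pickle
import types


class Ref:
    """Hypothetical marker for a persistent reference, carrying an OID."""
    def __init__(self, oid):
        self.oid = oid


class RefPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Store Ref objects out of band, as persistent IDs.
        return obj.oid if isinstance(obj, Ref) else None


class Placeholder:
    """Substituted for every class named in the pickle; nothing is imported."""
    def __init__(self, *args, **kwargs):
        pass

    def __setstate__(self, state):
        pass


class RefCollector(pickle.Unpickler):
    def __init__(self, file):
        super().__init__(file)
        self.oids = []

    def persistent_load(self, pid):
        self.oids.append(pid)      # record the OID instead of loading it
        return None

    def find_class(self, module, name):
        return Placeholder         # never look up the real class


def referenced_oids(data):
    collector = RefCollector(io.BytesIO(data))
    collector.load()
    return collector.oids


buf = io.BytesIO()
record = types.SimpleNamespace(owner=Ref(b'\0' * 7 + b'\x01'),
                               parts=[Ref(b'\0' * 7 + b'\x02')])
RefPickler(buf, protocol=3).dump(record)
oids = referenced_oids(buf.getvalue())
```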

There are problems with each of these requirements, and solutions for
those problems make the other requirements impossible to implement.

  * Python 3 pickles bytestrings using a fancy REDUCE opcode, as a
    function call to codecs.encode(u'decoded bytestring', 'latin-1').
    This makes them large and breaks (a), and our noload() copied from
    Python 2.x stdlib is unable to handle them, breaking (d). [8]

  * Why does Python 3 pickle bytestrings this way?  Because that's the
    only way to get round-trippability with Python's interpretation of
    BINSTRING opcodes as unicode, if you use pickle protocols 0, 1, or
    2.  Pickle protocol 3 has separate opcodes for all three kinds of
    strings (bytes, unicode, native -- remember?), but it's incompatible
    with Python 2, breaking requirement (c).

  * We could implement a custom pickler [6] and pickle bytestrings as
    SHORT_BINSTRING, fulfilling requirements (a), (c), and (d), but
    this breaks (b), i.e. round-tripping.
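
The trade-offs in the list above can be observed with the standard library
pickle on Python 3.  The last pickle is hand-assembled, since the stdlib
pickler does not write SHORT_BINSTRING on Python 3; it stands in for what a
custom pickler would emit.  opcode_names is a helper made up for this
example.

```python
import pickle
import pickletools

oid = b'\x00\x00\x00\x00\x00\x00\x00\x80'

proto2 = pickle.dumps(oid, protocol=2)   # loadable on Python 2, but long
proto3 = pickle.dumps(oid, protocol=3)   # short, but Python 3 only


def opcode_names(data):
    """List the opcode names in a pickle without executing it."""
    return [op.name for op, arg, pos in pickletools.genops(data)]


ops2 = opcode_names(proto2)   # contains GLOBAL and REDUCE
ops3 = opcode_names(proto3)   # contains SHORT_BINBYTES

# A bytestring written as SHORT_BINSTRING (hand-assembled stand-in for the
# output of a custom pickler).  The stdlib unpickler decodes it as text.
as_string = b'\x80\x02U\x03abc.'
round_tripped = pickle.loads(as_string)
```

Comparing len(proto2) with len(proto3) shows the size cost behind (a), and
round_tripped comes back as the str 'abc' rather than b'abc', which is the
round-tripping loss behind (b).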

  [6] zodbpickle.pickle.Pickler(bytes_as_strings=True) [7]
  [7] this is the status quo of the 'py3' branch in the ZODB repo
  [8] OTOH we could implement special support for REDUCE of
      codecs.encode() in our noload -- I almost got that working before
      Jim suggested a different approach, which is [6].

At least there's some nice symmetry: no matter if you pickle your stuff
on Python 2 or Python 3, you get to deal with bytes becoming unicode
when you unpickle.  These kinds of guessing games are inevitable when
you're migrating pickles from Python 2 to Python 3, but do we want to
make them mandatory for day-to-day operation?

Perhaps we ought to drop our original goal (3) and require an explicit
one-time possibly-lossy conversion process for goal (2), then use pickle
protocol 3 on Python 3 and have short pickles and perfect round-tripping
of bytestrings?

Then there's ZEO, which uses pickles for both payloads _and_ for
marshalling in its RPC layer.  That's also fun, but I think we can at
least declare that ZEO server and client must be on the same Python
version, perhaps by bumping the protocol version.

So, this is where things stand right now.  Plus a few relatively minor
matters like adding missing noload() tests to zodbpickle and making
zodbpickle work on Python 3.2 [9].

  [9] https://mail.zope.org/pipermail/checkins/2013-March/065813.html

Other than that, the ZODB py3 branch works on Python 3.3 [10], as long as
you're prepared to deal with bytestrings magically transforming into
unicodes.

  [10] Stephan reported running an actual small demo application with it.

Where do we go from here?

Marius Gedminas
-- 
Basically, what "Ajax" means is "Javascript now works."
        -- Paul Graham