[ZODB-Dev] Changing the pickle protocol?

Sun May 23 14:12:50 EDT 2010

On Sun, May 23, 2010 at 5:45 PM, Jim Fulton <jim at zope.com> wrote:
> On Sat, May 22, 2010 at 12:30 PM, Hanno Schlichting <hanno at hannosch.eu> wrote:
>> - The code to make the protocol configurable on all levels (storage,
>> index, persistent cache, ...) is large and ugly,
>
> I'm puzzled.  Why were changes so extensive?  All existing code
> should be able to read protocol 2 pickles.  I would have expected a change
> in ZODB.serialize.ObjectWriter only. Can you explain why more extensive
> changes were necessary?

They weren't really necessary. I just made the protocol for all the
different things configurable. So a ZEO client could use a different
protocol than the storage. And the protocol for the ZEO client would
influence the persistent cache and the index for that cache and so on.
In total there are 17 different cPickle.Pickler objects, which all need
to figure out the protocol to use in some way and are currently
hardcoded to either protocol 0 or 1.

This was motivated by making it easy to test the different protocols
against each other in one codebase. If I were to do this for real, I
wouldn't make the protocol configurable at all or only at the storage
level.

>> - Protocol 2 is only more efficient at dealing with boolean values,
>> small tuples and longs - all infrequent in my type of data
>
> Hm, interesting.  I wasn't aware of those benefits.

This is the full list of new opcodes in protocol 2:

/* Protocol 2. */
#define PROTO    '\x80' /* identify pickle protocol */
#define NEWOBJ   '\x81' /* build object by applying cls.__new__ to argtuple */
#define EXT1     '\x82' /* push object from extension registry; 1-byte index */
#define EXT2     '\x83' /* ditto, but 2-byte index */
#define EXT4     '\x84' /* ditto, but 4-byte index */
#define TUPLE1   '\x85' /* build 1-tuple from stack top */
#define TUPLE2   '\x86' /* build 2-tuple from two topmost stack items */
#define TUPLE3   '\x87' /* build 3-tuple from three topmost stack items */
#define NEWTRUE  '\x88' /* push True */
#define NEWFALSE '\x89' /* push False */
#define LONG1    '\x8a' /* push long from < 256 bytes */
#define LONG4    '\x8b' /* push really big long */
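
To see some of these opcodes in practice, here is a minimal sketch. It
assumes Python 3's pickle and pickletools modules rather than the
Python 2 cPickle discussed in this thread, and the helper name
opcode_names is made up for illustration.

```python
import pickle
import pickletools


def opcode_names(obj, protocol):
    """Return the names of the opcodes used to pickle obj (hypothetical helper)."""
    data = pickle.dumps(obj, protocol)
    # genops yields (opcode_info, argument, position) for each opcode.
    return [op.name for op, arg, pos in pickletools.genops(data)]


# Compare how a boolean and a small tuple are written by protocols 1 and 2.
for value in (True, (1, 2)):
    print(repr(value))
    print("  protocol 1:", opcode_names(value, 1))
    print("  protocol 2:", opcode_names(value, 2))
```

Under protocol 2 the boolean should come out as NEWTRUE and the pair as
TUPLE2, while protocol 1 falls back to the older INT and MARK/TUPLE
opcodes.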

The most interesting is probably longs, quoting PEP 307 (and confirmed
in the code):

  "Pickling and unpickling Python longs takes time quadratic in the
  number of digits, in protocols 0 and 1. Under protocol 2, new opcodes
  support linear-time pickling and unpickling of longs."

Basically, before protocol 2 the decimal repr() of the long is stored,
and from protocol 2 on there are dedicated opcodes with a binary
representation.
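
The difference is easy to inspect. This is a small sketch assuming
Python 3's pickle module (where int plays the role of the Python 2
long); the value 2 ** 100 is just an arbitrary example.

```python
import pickle

big = 2 ** 100

p1 = pickle.dumps(big, 1)
p2 = pickle.dumps(big, 2)

# Protocol 1: LONG opcode ('L') followed by the decimal repr as text.
print(repr(p1))
# Protocol 2: PROTO header, then LONG1 ('\x8a'), a length byte and the
# value as little-endian two's-complement bytes.
print(repr(p2))
print(len(p1), len(p2))

# Both pickles load back to the same value.
assert pickle.loads(p1) == pickle.loads(p2) == big
```

The size difference is only a side effect here. The quadratic cost the
PEP mentions comes from converting to and from decimal text, which the
binary form avoids.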

But none of this is particularly exciting. I expect that protocol 3 as
used in Python 3 for unicode/bytes representation is going to be much
more interesting. But that's a whole different story. It might get
easier if we centralized the cPickle.Pickler creation in some helper
function, so it could be updated in one place instead of the current
17. But that's all nice-to-have.
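
Such a helper could look roughly like the sketch below. It is not
existing ZODB code: the names PICKLE_PROTOCOL and make_pickler are
hypothetical, and it uses Python 3's pickle module so that it runs on
a current interpreter.

```python
import io
import pickle

# Hypothetical single place where the protocol is decided.
PICKLE_PROTOCOL = 2


def make_pickler(file, protocol=None):
    """Create a Pickler writing to file, using the central default protocol."""
    if protocol is None:
        protocol = PICKLE_PROTOCOL
    return pickle.Pickler(file, protocol)


buf = io.BytesIO()
make_pickler(buf).dump({"answer": 42})
data = buf.getvalue()

# Readers need no matching setting: the unpickler detects the protocol
# from the stream itself.
print(pickle.loads(data))
```

Callers that really need a specific protocol can still pass one
explicitly; everything else picks up a change to the default in one
place.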

Hanno