[ZPT] Unicode and 8-bit string migration fix

Thu Oct 30 20:55:58 EST 2003

Barry Warsaw wrote:
> I think the basic philosophy ought to be: convert to and from unicode at
> the farthest boundaries you can get away with, treat everything
> internally as unicode always, and never mix unicode and 8-bit strings.

[because of the relevance to Formulator development, I cc-ed 
formulator-dev at lists.infrae.com as well]

Just noticed this thread. This is good advice; I agree completely, and
it is my basic philosophy too. At the same time, getting from here
to there in Zope 2 land is... non-trivial.

Here's a real world story for your edification, amusement and/or horror. :)

Earlier this year we at Infrae went through a rather involved process of
making sure the entire core of Silva only stores unicode; we'd been using
a hybrid mix before (unicode for XML, something else for non-XML), which was
of course a constant source of bugs. So as soon as Zope 2.6 made it possible,
we started to switch over.

We succeeded. We can now mix content in a variety of languages together,
and that's cool.

But I didn't update Formulator (I updated it to produce unicode upon
form submits if desired, but that's a separate story), which is often used 
with Silva. So, Formulator's content is still stored in whatever encoding
the user's browser was in, not unicode. This means that Formulator and Silva
now bite each other if you don't use plain ascii in your forms.

Unfortunately, I really can't change Formulator to just use unicode everywhere
(and have people convert once), as most of the programming world is still
like I was before 2002: no clue about unicode, encodings, or what is up
with Python's unicode errors. If Formulator started spitting out unicode
they'd be hopelessly lost. And understandably enough, most Zope applications
haven't been written to deal with unicode, so they may break and spit out
lots of encoding errors, scaring people with "ordinal not in range(128)"
and such. :)
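
For anyone who hasn't met that "ordinal not in range(128)" error yet, here
is a rough sketch of where it comes from. It is written for a current Python
rather than the Python 2 that Zope 2.6 runs on, where the same ascii decoding
happens implicitly when an 8-bit string is mixed with a unicode one. The
field value and the latin-1 encoding are made up for illustration.

```python
# Hypothetical form value, stored the way a latin-1 browser submitted it.
stored = "caf\u00e9".encode("latin-1")   # b'caf\xe9', an 8-bit string


def decode_as_ascii(data):
    """Decode with the ascii codec and report the failure as text."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError as exc:
        return "error: %s" % exc


# The ascii codec cannot handle byte 0xe9, so this reports an error.
result = decode_as_ascii(stored)
print(result)

# Decoding with the encoding the bytes were actually written in works fine.
text = stored.decode("latin-1")
print(text)
```

The point is only that the bytes themselves are fine; what goes wrong is
the guess about their encoding.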

So I hacked up Formulator to be able to spit out unicode through its API
if set into a certain mode, while keeping the text stored in whatever encoding
was specified. I thought this would be the quickest way to make it work,
also not requiring any upgrade scripts. This approach unfortunately turned
out to be an ugly, ugly, nasty hack with edge cases creating bugs all over.
I haven't released this hack yet, and I won't, as it's just too terrible;
parts of the API are used both by code that uses Formulator and internally
in Formulator, and some parts now need non-unicode in one case and unicode
in another... ugh. Basically, unicode problems still pop up all over.

So now I'm thinking about a two-mode Formulator, with a toggle to switch
between 'form property text is unicode' and 'form property text is encoded
in encoding so and so'. Users who wish to do so can turn on the unicode
mode. I realized that since Formulator can export and import its state as
XML, that's a good boundary to use to implement the conversion to unicode
and back. The toggle should also switch over all Formulator-internal forms
(used to edit its properties) to deliver unicode strings where necessary.
I now wish I'd done this before instead of wasting time on the hack, but
one lives and learns...
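
To make the XML-boundary idea a bit more concrete, here is a minimal sketch
of what I have in mind. None of this is Formulator's actual API: the
function names, the <form>/<property> layout and the example property are
all invented, and it uses the standard library's ElementTree on a current
Python just to show the shape of the conversion.

```python
import xml.etree.ElementTree as ET


def export_form(properties, stored_encoding=None):
    """Serialize properties to XML (hypothetical layout).

    If stored_encoding is given, the values are 8-bit strings in that
    encoding and are decoded to unicode on the way out.
    """
    root = ET.Element("form")
    for name, value in sorted(properties.items()):
        if stored_encoding is not None:
            value = value.decode(stored_encoding)
        field = ET.SubElement(root, "property", name=name)
        field.text = value
    return ET.tostring(root, encoding="utf-8")


def import_form(xml_bytes, target_encoding=None):
    """Load properties from XML.

    target_encoding=None means unicode mode; otherwise the values are
    encoded back into 8-bit strings in the given encoding.
    """
    properties = {}
    for field in ET.fromstring(xml_bytes).findall("property"):
        value = field.text or ""
        if target_encoding is not None:
            value = value.encode(target_encoding)
        properties[field.get("name")] = value
    return properties


# A legacy form whose text is stored as latin-1 bytes.
legacy = {"title": "Pr\u00e9nom".encode("latin-1")}
xml_bytes = export_form(legacy, stored_encoding="latin-1")

unicode_form = import_form(xml_bytes)                          # unicode mode
roundtrip = import_form(xml_bytes, target_encoding="latin-1")  # encoded mode
```

Switching a form's mode would then amount to an export in the old mode
followed by an import in the new one, so the conversion lives in exactly
one place.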

I guess making a magic StringIO that makes assumptions about the encoding
of non-unicode strings would be better than my hack. But most things
would be better than my hack. I do worry about unforeseen problems though...
I've seen plenty of those during my struggles with Formulator. So perhaps
I prefer the pain, so that I'm forced to implement a good solution. :)
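
For what it's worth, such a magic StringIO could look roughly like this.
It is only a sketch on a current Python, not what ZPT does; the class name
and the default encoding are assumptions of mine.

```python
import io


class AssumingStringIO(io.StringIO):
    """StringIO that also accepts 8-bit strings (hypothetical class).

    Bytes are decoded with an assumed encoding before being written,
    so the buffer itself only ever holds unicode text.
    """

    def __init__(self, assumed_encoding="latin-1"):
        super().__init__()
        self.assumed_encoding = assumed_encoding

    def write(self, data):
        if isinstance(data, bytes):
            data = data.decode(self.assumed_encoding)
        return super().write(data)


out = AssumingStringIO(assumed_encoding="latin-1")
out.write("unicode: \u00e9, ")                  # already unicode
out.write("8-bit: \u00e9".encode("latin-1"))    # decoded on the way in
rendered = out.getvalue()
print(rendered)
```

The catch is the assumption itself: latin-1 will decode any byte sequence
without complaint, so a wrong guess produces garbled text instead of an
error, which is exactly the kind of unforeseen problem I worry about.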

When I experienced the first of the pain last year I made sure to advocate
that Zope 3 follow a policy of storing text as unicode. This is now the
official policy. It probably needs more testing, but I believe (as
Stuart asks elsewhere) form marshalling indeed delivers unicode.

Regards,

Martijn