[ZPT] Fix for UnicodeError: ASCII decoding error: ordinal not in range(128)

24 Jan 2003 03:05:51 +0100

Hi Folks,

Sorry for the crosspost but this really covers ZPT and Localizer, and is
of great interest to the Plone i18n users. Please keep your answers to
the lists where they are legitimate -- and I'd appreciate being kept in
Cc.

Ok, I got down to the reason for the infamous "UnicodeError: ASCII
decoding error: ordinal not in range(128)". Thanks to all who cooperated
in that matter.

Readers wanting the quick solution without the rest of the discussion
can skip to the part bracketed by #######.

First a reminder of the problem for those not familiar with it.

In many situations, in a multilingual Plone site using Localizer, people
got the above error.

This in fact happened in the following circumstances:

- A page template like:
        <h1 i18n:translate="edit_type_header">
        Edit an object of type
          <span i18n:name="type">
            <span i18n:translate=""
                  tal:content="python:here.getTypeInfo().Title()"
                  tal:omit-tag="">Type</span>
          </span>
        </h1>

- A translation for edit_type_header of the form
        Éditer un objet de type ${type}
  where the translation contains non-ascii characters ("É" here),

- A substituted string for ${type} that itself has non-ascii characters,
  for instance "déjà".

What happens behind the scenes during the template evaluation is complex,
but at some point the <span i18n:translate> gets evaluated, the message
catalog gets consulted and a u'déjà', as Unicode, is returned.

At that point Localizer has a mechanism to convert all Unicode strings
to their final browser encoding, as a plain string of bytes, so for
instance using UTF-8 it would substitute 'd\xc3\xa9j\xc3\xa0'.

The problem here is that this string is not destined to go to the
browser yet, but will first be used further in the ZPT processing to be
substituted for ${type}. So later in the processing, we have to
substitute
     u'Éditer un objet de type ${type}'
using the mapping
     {u'type': 'd\xc3\xa9j\xc3\xa0'}

At that point, we have a mix of Unicode (which is legitimate) and a
plain string already encoded in the final output encoding. This encoding
came too soon! We would still like to have Unicode here... If we still
had it, the substitution would work.
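
The failure can be sketched in a few lines. In Python 2, combining a
Unicode string with a plain string makes the interpreter decode the
plain string as ASCII implicitly. The sketch below is written in Python 3
syntax (str is Unicode, bytes is the plain string) and makes that
implicit step explicit. The helpers coerce_like_python2 and substitute
are hypothetical names used for illustration only; they are not ZPT or
Localizer code.

```python
import re

def coerce_like_python2(value):
    """Model Python 2's implicit coercion: plain strings are decoded as ASCII."""
    if isinstance(value, bytes):
        return value.decode('ascii')  # fails for any byte >= 0x80
    return value

def substitute(template, mapping):
    """Replace ${name} placeholders (illustration only, not ZPT's real code)."""
    def repl(match):
        return coerce_like_python2(mapping[match.group(1)])
    return re.sub(r'\$\{(\w+)\}', repl, template)

template = 'Éditer un objet de type ${type}'

# Value already encoded to UTF-8 bytes, as after the early conversion.
try:
    substitute(template, {'type': b'd\xc3\xa9j\xc3\xa0'})
    early_result = 'ok'
except UnicodeDecodeError as exc:
    early_result = str(exc)

# Value still in Unicode: the substitution succeeds.
late_result = substitute(template, {'type': 'déjà'})
```

With the encoded value the decoding step fails with "ordinal not in
range(128)"; with the Unicode value the header comes out as expected.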

Fortunately, I kind of foresaw this sort of problem a few months ago,
and I included in Localizer a way to turn off its early conversion to
browser output encoding.

#######

To do that, you have to launch Zope with the LOCALIZER_USE_ZOPE_UNICODE
environment variable set to something not empty, for instance "yes".

#######
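
To make "something not empty" concrete, here is a minimal sketch of how
such a switch can be tested. The function name use_zope_unicode is an
assumption made for this example; it is not taken from Localizer's
source.

```python
import os

def use_zope_unicode(environ=os.environ):
    """Return True when LOCALIZER_USE_ZOPE_UNICODE is set and not empty."""
    return bool(environ.get('LOCALIZER_USE_ZOPE_UNICODE'))
```

Under that reading, "yes" enables the switch, while an unset or empty
variable leaves the early conversion on.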

Now, why did Localizer choose to do early encoding by default? The
problem is the following: during ZPT parsing, we're building something
from the concatenation of a list of strings, some of which are Unicode
if they come from a message catalog (or from some TALES expression
returning Unicode), and some of which are plain strings, like most of
the page template itself.

If all the plain strings are only ever pure ASCII, then there's no
problem doing a join of all of them with something Unicode, and the
result will be Unicode. That's what pure Zope 2.6 does by default. It
then, in ZPublisher, proceeds to encode that resulting Unicode string in
the preferred browser encoding and sends that. This mode is what you get
if you define LOCALIZER_USE_ZOPE_UNICODE.
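
A rough sketch of that "join first, encode once at the end" strategy,
again in Python 3 syntax with the ASCII decoding spelled out. The
function name join_then_encode is hypothetical and only models the
behavior described above; it is not the actual ZPublisher code.

```python
def join_then_encode(fragments, output_encoding):
    """Join fragments into Unicode, then encode once at the very end."""
    unicode_parts = []
    for fragment in fragments:
        if isinstance(fragment, bytes):
            # Python 2 performs this ASCII decoding implicitly during the join.
            fragment = fragment.decode('ascii')
        unicode_parts.append(fragment)
    return ''.join(unicode_parts).encode(output_encoding)

# Pure-ASCII template text mixed with a Unicode translation.
fragments = [b'<h1>', 'Éditer un objet de type déjà', b'</h1>']
body = join_then_encode(fragments, 'utf-8')
```

This works only as long as every plain string is pure ASCII; a single
non-ASCII byte in the template text makes the decoding step fail.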

But when Localizer was introduced, it was to be used by people who had
localized their page templates by hand and thus included a lot of
non-ASCII characters in them, in their preferred encoding, say, UTF-8,
together with a RESPONSE.setHeader('Content-Type') with that encoding.
So because of those non-ASCII characters, the strategy of the previous
paragraph wouldn't work. So Localizer decided to encode all Unicode
strings to the preferred encoding (assumed to be the same as the browser
encoding) as soon as it saw them inside the ZPT parsing.
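
The early-encoding strategy can be sketched as follows. The function
name early_encode_join is made up for this example and the code is only
a model of the behavior described, not Localizer's implementation.

```python
def early_encode_join(fragments, preferred_encoding):
    """Encode every Unicode fragment as soon as it is seen, then join bytes."""
    encoded = []
    for fragment in fragments:
        if isinstance(fragment, str):
            fragment = fragment.encode(preferred_encoding)
        encoded.append(fragment)
    return b''.join(encoded)

# Hand-localized template text (already UTF-8 bytes) plus a Unicode
# string coming from the message catalog.
fragments = [b'<p>d\xc3\xa9j\xc3\xa0 vu: ', 'Éditer', b'</p>']
page = early_encode_join(fragments, 'utf-8')
```

No ASCII decoding ever happens here, so non-ASCII template text is safe,
provided the template really is in the preferred encoding.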

Unfortunately, as we saw at the beginning, this can't work in the
presence of i18n:name substitutions.

As a conclusion, I recommend that Localizer use the standard Zope
behavior by default, and only enable its early conversion when some new
environment variable, for instance LOCALIZER_UNICODE_CONVERSION, is set.
This will only be useful to people who have half-translated their site
(some Unicode from the message catalog, and still some non-ASCII in the
templates).

A final digression about ZPT:

I think the correct way to build the result of a ZPT would be to build a
Unicode string as soon as TALInterpreter detects a non-ASCII string. It
would then decode the non-ASCII strings to Unicode using some kind of
site- or page-default encoding. This would avoid most of our problems,
and would anyway be more robust. It would simply mean replacing
StringIO's (actually FasterStringIO's) getvalue method with an
intelligent join that does the conversion I just outlined if needed.
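
A minimal sketch of such an intelligent join, under the assumption that
a default encoding is known. The name smart_join and its signature are
hypothetical; this is not a patch against TALInterpreter.

```python
def smart_join(fragments, default_encoding):
    """Join plain and Unicode fragments, converting only if needed.

    If every fragment is a plain string, the result stays a plain string.
    Otherwise plain strings are decoded with default_encoding (instead of
    ASCII) and the result is Unicode.
    """
    if all(isinstance(fragment, bytes) for fragment in fragments):
        return b''.join(fragments)
    parts = []
    for fragment in fragments:
        if isinstance(fragment, bytes):
            fragment = fragment.decode(default_encoding)
        parts.append(fragment)
    return ''.join(parts)

mixed = smart_join([b'<p>d\xc3\xa9j\xc3\xa0 ', 'Éditer', b'</p>'], 'utf-8')
plain = smart_join([b'<p>', b'ok', b'</p>'], 'utf-8')
```

The mixed case yields Unicode even though the template text contains
non-ASCII bytes; the all-plain case is passed through untouched.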

There remains the problem of deciding which is the default encoding to
use...

Thanks for any comments (and please watch where you send them!).

Florent

-- 
Florent Guillaume, Nuxeo (Paris, France)
+33 1 40 33 79 87  http://nuxeo.com  mailto:fg@nuxeo.com