[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Martijn Faassen
faassen at startifact.com
Mon Jan 15 16:15:46 EST 2007
Andreas Jung wrote:
> --On 15 January 2007 15:44:01 +0100 Martijn Faassen
> <faassen at startifact.com> wrote:
>> On 1/15/07, Andreas Jung <lists at zopyx.com> wrote:
>> [snip]
>>> ok, got it. But this problem can be solved easily by changing the
>>> encoding within the preamble.
>>
>> I would say refusing to guess and bailing out with an error message is
>> better in this case. The Zen of Python:
>>
>> In the face of ambiguity, refuse the temptation to guess.
>>
>
> Sorry but I don't get your point. What's happening with an XML inside a ZPT?
My point is that:
u'<?xml version="1.0" encoding="ISO-8859-1"?><foo>Some non-ascii text</foo>'
is confusing at best. One part of this says it's a unicode string, the
other part says it's encoded as latin-1. What is it? What happens to
this if you recode this to, say, UTF-8? What happens to this if you
parse and *then* serialize it? What does the developer expect will
happen? What do users expect when they enter XML in a form and include
an encoding declaration?
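To make the recoding question concrete, here is a minimal sketch. It assumes Python 3 and uses the standard library's xml.etree.ElementTree as a stand-in parser, not zope.tal.xmlparser, and the sample text "café" is only an illustration:

```python
import xml.etree.ElementTree as ET

# Text that declares ISO-8859-1 but is held in a unicode string.
text = '<?xml version="1.0" encoding="ISO-8859-1"?><foo>caf\u00e9</foo>'

# Encoded the way the declaration says: the parser recovers the text.
as_latin1 = text.encode("iso-8859-1")
good = ET.fromstring(as_latin1).text

# Recoded to UTF-8 with the declaration left alone: the parser still
# trusts the declaration and reads each UTF-8 byte as a Latin-1 character.
as_utf8 = text.encode("utf-8")
mangled = ET.fromstring(as_utf8).text

print(repr(good))
print(repr(mangled))
```

The second parse succeeds without any error, which is what makes the mismatch easy to miss.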
I proposed that we spare everybody the worry by simply not accepting such input.
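A rough sketch of what such a refusal could look like, assuming Python 3. The function name check_xml_input and the regular expression are hypothetical and only illustrate the idea; this is not the actual zope.tal code:

```python
import re

# Hypothetical pattern: an XML declaration at the start of the text that
# carries an encoding pseudo-attribute.
_DECLARED_ENCODING = re.compile(
    r'^\s*<\?xml[^>]*\bencoding\s*=\s*["\'][^"\']+["\']'
)


def check_xml_input(source):
    """Refuse unicode text that also claims a byte encoding.

    Byte strings pass through untouched (their declaration is meaningful),
    as does unicode text without an encoding declaration.
    """
    if isinstance(source, str) and _DECLARED_ENCODING.match(source):
        raise ValueError(
            "unicode input must not carry an XML encoding declaration"
        )
    return source
```

With this check, the ambiguous string is rejected up front, while the same document passed as bytes is still accepted.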
> - XML data encoded as XXX comes in (either by editing the XML file through
> the ZMI or FTP/WebDAV upload)
>
> - ZPT converts the encoded string to unicode based on the encoding in
> the preamble
>
> - for parsing it is up to the application to decide what to do with the
> data. It is not up to the editor to decide how the ZPT engine should
> deal with XML internally. The ZPT engine decides to serialize the
> unicode string as utf-8 and to fix the XML preamble (which will result
> in a valid XML file which should be identical to the original file,
> except that the encoding might be different).
> I still don't see what should be ambiguous with this approach.
Ambiguous in that the string seems to say it's in two encodings at once.
You're then "guessing": you're letting the Python string type trump the
declaration. Then, since we've shown that leads to bugs, you propose to
actually change the encoding declaration of the XML document. I wonder
what people then expect to happen upon serialization. In effect, your
proposal would, I think, serialize to UTF-8 only, right? (in which case
the encoding declaration can be dropped as it's the default)
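As an illustration of the UTF-8-only outcome, here is a small sketch, again assuming Python 3 and the standard library's xml.etree.ElementTree rather than the ZPT engine. It parses a Latin-1 document and serializes it as UTF-8:

```python
import xml.etree.ElementTree as ET

# A byte string that really is encoded the way its declaration says.
source = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><foo>caf\u00e9</foo>'
).encode("iso-8859-1")

root = ET.fromstring(source)

# Serialize as UTF-8. ElementTree omits the XML declaration for UTF-8
# output by default, since UTF-8 is what a parser assumes without one.
out = ET.tostring(root, encoding="utf-8")

print(out)
```

The output round-trips to the same text, but the original ISO-8859-1 declaration is gone, so the serialized file is no longer byte-for-byte what came in.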
Regards,
Martijn