[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Martijn Faassen
faassen at startifact.com
Mon Jan 15 07:26:16 EST 2007
Andreas Jung wrote:
[snip]
[Bernd Dorn]
>> IMHO it should only accept strings, because in the value should be a xml
>> string and therefore always has to be encoded in 'utf-8' or in the
>> encoding specified in the processing instruction.
>>
>
> I disagree with that. Since Zope 3 is supposed to use unicode internally
> (at least that's the legend) it should support unicode also at the
> parser level. Other languages like Java store XML also as unicode
> strings and support parsing it.
Bernd Dorn raises a good point though, and it's one you need to think
about carefully. To say "languages like Java store XML also as unicode"
is rather ambiguous. While I'm not aware of the details of Java,
serialized XML is typically stored in some encoded form, most commonly
UTF-8 (the default 8-bit encoding), but Latin-1 is also supported, and
there are also other multi-byte encodings. *Parsed* XML exposed through
a DOM is exposed as unicode strings. I'm sure Java supports this usage
pattern, as naturally files on disk need to be parsable.
Here you are talking about parsing XML, so the position that the input
should be encoded is a reasonable one. This is how, for instance,
Python's ElementTree operates (parse encoded input, expose the API as
unicode (or pure ascii)). It was designed by Fredrik Lundh, who, as you
may know, was instrumental in developing Python's unicode support.
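To make "parse encoded, expose unicode" concrete, here is a small sketch. It assumes Python 3's standard-library xml.etree.ElementTree (not the Python 2 era API discussed in this thread), and the sample document is made up for illustration:

```python
import xml.etree.ElementTree as ET

# Serialized XML: bytes on the wire or on disk, encoded as the
# declaration says (ISO-8859-1 here). The sample document is invented.
encoded = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><foo>caf\u00e9</foo>'
).encode("iso-8859-1")

# The parser reads the declaration and decodes the bytes itself.
root = ET.fromstring(encoded)

# The parsed tree exposes text as a unicode string, not as bytes.
text = root.text
print(type(encoded).__name__, type(text).__name__)
```

The encoding is a property of the serialized bytes only; once parsed, the caller never sees it.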
How would you propose to parse the following unicode string?
u'<?xml version="1.0" encoding="ISO-8859-1"?><foo />'
If you are going to allow the parsing of unicode strings, I would
strongly recommend *rejecting* any unicode string that itself declares
an encoding as ambiguous: refuse to guess.
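The ambiguity can be shown with a short sketch. It assumes Python 3's standard-library xml.etree.ElementTree, and the sample string is invented. A parser handed a unicode string that declares an encoding has to pick a byte encoding somehow, and the two obvious guesses give different results:

```python
import xml.etree.ElementTree as ET

# A unicode string that nevertheless declares a byte encoding.
declared = '<?xml version="1.0" encoding="ISO-8859-1"?><foo>caf\u00e9</foo>'

# Guess 1: trust the declaration and serialize as ISO-8859-1.
as_latin1 = ET.fromstring(declared.encode("iso-8859-1")).text

# Guess 2: ignore the declaration and serialize as UTF-8. The parser
# still believes the declaration, so it decodes the two UTF-8 bytes of
# the accented character as two separate ISO-8859-1 characters.
as_utf8 = ET.fromstring(declared.encode("utf-8")).text

print(repr(as_latin1))
print(repr(as_utf8))
```

Neither guess is obviously the one the caller meant, which is the argument for refusing to guess.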
With lxml (which is an extension of the ElementTree API) we've taken
this approach: it's possible to pass a unicode string into the parser,
but if that string contains an encoding declaration, there will be an
error. Underneath we actually re-encode this string back to UTF-8, as
that's what the libxml2 parser expects. We made this change over the
objections of Fredrik Lundh, by the way - we felt user errors would be
mostly prevented because it refuses to guess.
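A rough sketch of that policy follows. This is not lxml's actual code: it uses Python 3's standard-library ElementTree as a stand-in parser, and the helper name parse_unicode and the regular expression are hypothetical, written only to illustrate "reject a declared encoding, otherwise re-encode to UTF-8":

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical check: an XML declaration at the start of the string
# that carries an encoding pseudo-attribute.
_ENCODING_DECL = re.compile(r"\s*<\?xml\s[^>]*\bencoding\s*=")


def parse_unicode(text):
    """Parse a unicode string, refusing to guess about declared encodings."""
    if not isinstance(text, str):
        raise TypeError("expected a unicode string")
    if _ENCODING_DECL.match(text):
        # The string is already decoded, so a declared byte encoding is
        # ambiguous: refuse rather than guess.
        raise ValueError("unicode string carries an encoding declaration")
    # No declaration: re-encode to UTF-8 and hand the bytes to the parser.
    return ET.fromstring(text.encode("utf-8"))


print(parse_unicode("<foo>caf\u00e9</foo>").text)
```

Calling parse_unicode with the ISO-8859-1 example string from earlier in this message raises ValueError instead of returning a tree.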
Regards,
Martijn