[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Martijn Faassen
faassen at startifact.com
Mon Jan 15 16:08:58 EST 2007
Tres Seaver wrote:
> Andreas Jung wrote:
>> --On 14. Januar 2007 18:14:45 +0000 Chris Withers <chris at simplistix.co.uk>
>> wrote:
>>
>>> Dieter Maurer wrote:
>>>> A halfway intelligent parser would accept Unicode when it gets it
>>>> and concentrate on the remaining part of its task: either reporting
>>>> structural events or building a parse tree.
>>> The trivial fix I use in Twiddler is as follows:
>>>
>>> if isinstance(source, unicode):
>>>     source = source.encode('utf-8')
>>>
>>> Of course, this assumes a heading of either <?xml version="1.0"
>>> encoding="utf-8"?> or a missing encoding attribute, in which case the xml
>>> spec states that the string must be utf-8 encoded.
>> The encoding of the XML preamble should not matter when parsing an XML
>> document stored as a unicode string.
>
> That encoding is a *lie*, which is the real problem. Parsers expect it
> to be *correct*, and if missing, expect the text to be encoded as UTF-8,
> per the spec (if the document comes from an HTTP request, then the
> application may supply the encoding from the request headers).
>
> Nothing in the XML specs allows or specifies any behavior for XML
> documents serialized as unicode, because such serializations are
> *programming language specific*.
While I agree that the encoding declaration is ambiguous at best and
should be rejected, you can find a bit in the spec which supports XML as
Python unicode strings. A Python unicode string can be seen as a string
with "external character encoding information": it's the native encoding
of Python. Therefore we can make sense of it in an XML parser. For my
previous analysis of the spec see here:
http://codespeak.net/pipermail/lxml-dev/2006-May/001137.html
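To make the "external character encoding information" reading concrete, here is a minimal sketch in present-day Python, where `str` plays the role of `unicode`. It assumes a document with no encoding declaration at all. The function name `element_names` is made up for illustration; the `encoding` argument of `expat.ParserCreate` is documented to override whatever encoding the document itself implies.

```python
from xml.parsers import expat


def element_names(source):
    """Hypothetical helper: parse text input and return its element names.

    The text is encoded to UTF-8 by us, so we know the encoding for certain
    and tell the parser about it explicitly instead of letting it guess.
    """
    names = []
    # The encoding given here is the "external" information: it comes from
    # the caller, not from anything written inside the document.
    parser = expat.ParserCreate(encoding="utf-8")
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(source.encode("utf-8"), True)
    return names
```

Calling `element_names('<doc><p>text</p></doc>')` would report the elements in document order. What to do when the text also carries its own declaration is a separate question.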
What is bad and evil, however, is to just ignore conflicting encoding
declarations in an XML document itself. I'd choose either one of:
* bail with a clear error when unicode is supplied at all
* bail with a clear error when unicode is supplied with any explicit
encoding declaration in the XML.
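
The second option could be sketched roughly as follows, again in present-day Python where `str` stands in for `unicode`. The helper name `check_unicode_source` and the regular expression are assumptions of mine, not anything zope.tal or lxml actually ships. The check only looks at the XML declaration at the very start of the text.

```python
import re

# An XML declaration at the very start of the text that carries an
# explicit encoding pseudo-attribute, e.g. <?xml version="1.0" encoding="...
_ENCODING_DECL = re.compile(r'\A<\?xml\s[^>]*?\bencoding\s*=\s*["\']')


def check_unicode_source(source):
    """Hypothetical guard: refuse text input that declares its own encoding.

    Byte strings pass through untouched, since for them the declaration
    is meaningful and the parser should honour it.
    """
    if isinstance(source, str) and _ENCODING_DECL.match(source):
        raise ValueError(
            "unicode input must not carry an XML encoding declaration")
    return source
```

A parser front end would call this before doing anything else, so the caller gets a clear error instead of a silently ignored declaration.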
>> It is of importance as soon as you
>> convert the document back to a stream e.g. when we deliver the content
>> back to a browser or a FTP client. The ZPublisher (for Zope 2) deals with
>> that by changing the encoding parameter of the preamble for XML documents
>> based on the desired output encoding. utf-8 is always a good choice; however,
>> other encodings like iso-8859-15 might raise UnicodeEncodeErrors. The Zope 2
>> publisher "avoids" this problem by converting the unicode result using
>> errors='replace' (which is likely something we might discuss :-))
>
> Unicode XML is not only problematic for streaming. For instance, you
> *can't* pass a Unicode string to libxml2 *at all*, unless you want
> a core dump. The API requires that you pass it strings encoded as UTF-8.
You can in lxml. :) libxml2 as a C API doesn't even support any unicode
string type as far as I am aware.
Regards,
Martijn