[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Tres Seaver
tseaver at palladion.com
Mon Jan 15 16:57:05 EST 2007
Martijn Faassen wrote:
> Tres Seaver wrote:
>> Andreas Jung wrote:
>>> --On 14. Januar 2007 18:14:45 +0000 Chris Withers <chris at simplistix.co.uk>
>>> wrote:
>>>
>>>> Dieter Maurer wrote:
>>>>> A halfway intelligent parser would accept Unicode when it gets it
>>>>> and concentrate on the remaining part of its task: either reporting
>>>>> structural events or building a parse tree.
>>>> The trivial fix I use in Twiddler is as follows:
>>>>
>>>>   if isinstance(source, unicode):
>>>>       source = source.encode('utf-8')
>>>>
>>>> Of course, this assumes a heading of either <?xml version="1.0"
>>>> encoding="utf-8"?> or a missing encoding attribute, in which case the xml
>>>> spec states that the string must be utf-8 encoded.
>>> The encoding of the XML preamble should not matter when parsing an XML
>>> document stored as a unicode string.
>> That encoding is a *lie*, which is the real problem. Parsers expect it
>> to be *correct*, and if missing, expect the text to be encoded as UTF-8,
>> per the spec (if the document comes from an HTTP request, then the
>> application may supply the encoding from the request headers).
>>
>> Nothing in the XML specs allows or specifies any behavior for XML
>> documents serialized as unicode, because such serializations are
>> *programming language specific*.
>
> While I agree that the encoding declaration is ambiguous at best and
> should be rejected, you can find a bit in the spec which supports XML as
> Python unicode strings. A Python unicode string can be seen as a string
> with "external character encoding information": it's the native encoding
> of Python. Therefore we can make sense of it in an XML parser. For my
> previous analysis of the spec see here:
>
> http://codespeak.net/pipermail/lxml-dev/2006-May/001137.html
>
> What however is bad and evil is to just ignore conflicting encoding
> declarations in an XML document itself. I'd choose either one of:
>
> * bail with a clear error when unicode is supplied at all
>
> * bail with a clear error when unicode is supplied with any explicit
> encoding declaration in the XML.
>
>>> It is of importance as soon as you
>>> convert the document back to a stream, e.g. when we deliver the content
>>> back to a browser or an FTP client. The ZPublisher (for Zope 2) deals with
>>> that by changing the encoding parameter of the preamble for XML documents
>>> based on the desired output encoding. utf-8 is always a good choice; however,
>>> other encodings like iso-8859-15 might raise UnicodeEncodeErrors. The Zope 2
>>> publisher "avoids" this problem by converting the unicode result using
>>> errors='replace' (which is likely something we might discuss :-))
>> Unicode XML is not only problematic for streaming. For instance, you
>> *can't* pass a Unicode string to libxml2 *at all*, unless you want
>> a core dump. The API requires that you pass it strings encoded as UTF-8.
>
> You can in lxml. :) libxml2 as a C API doesn't even support any unicode
> string type as far as I am aware.
It *requires* UTF-8-encoded strings. See http://xmlsoft.org/xml.html

  12. So what is this funky "xmlChar" used all the time?

      It is a null terminated sequence of utf-8 characters. And only
      utf-8! You need to convert strings encoded in different ways to
      utf-8 before passing them to the API. This can be accomplished
      with the iconv library for instance.

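
To make the options in this thread concrete, here is a minimal sketch of
"encode to UTF-8 before parsing, but refuse text input that carries its own
encoding declaration". It is written for Python 3, where `str` plays the role
of the Python 2 `unicode` type discussed above, and it uses the standard
library's expat binding rather than libxml2 or zope.tal. The names
`to_parser_input` and `collect_text` are hypothetical helpers, not part of any
of the libraries mentioned.

```python
import re
import xml.parsers.expat

# Matches an encoding pseudo-attribute inside a leading XML declaration.
_ENCODING_DECL = re.compile(
    r"""<\?xml[^>]*\bencoding\s*=\s*["'][^"']*["']"""
)


def to_parser_input(source):
    """Return bytes suitable for a byte-oriented XML parser.

    Bytes are passed through untouched, so any declaration they carry
    is left for the parser to honor. Text is encoded as UTF-8, but only
    if it does not declare an encoding of its own (assumption: we bail
    out rather than silently ignore a possibly conflicting declaration).
    """
    if isinstance(source, bytes):
        return source
    if _ENCODING_DECL.match(source):
        raise ValueError(
            "text input must not carry an XML encoding declaration"
        )
    return source.encode("utf-8")


def collect_text(source):
    """Parse `source` with expat and return its character data."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(to_parser_input(source), True)
    return "".join(chunks)
```

A stricter variant would reject text input altogether; which of the two is
preferable is exactly the judgment call debated in the quoted messages.
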
Frankly, I don't get the desire to *store* a complete XML document (as
opposed to the extracted contents of attributes or nodes) as unicode:
it isn't as though it can be easily processed in that form without
re-encoding (even if lxml is the one doing the re-encoding). It isn't
"discourse", in the Zope3 sense of "text intended for human
consumption", and the tools people use with it are all going to expect
some kind of validly-encoded string.
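
For the output side raised earlier in the thread, a small sketch of producing
a "validly-encoded string" whose declaration matches its bytes. This is an
illustration in Python 3, not what ZPublisher does; `serialize` is a
hypothetical helper. It assumes the body has no declaration of its own and
that any unencodable characters sit in character data or attribute values,
the only places where character references are legal.

```python
def serialize(body, encoding="utf-8"):
    """Encode an XML body (text) with a declaration that tells the truth.

    Characters the target encoding cannot represent become numeric
    character references instead of raising UnicodeEncodeError or
    turning into '?' as errors='replace' would do.
    """
    declaration = '<?xml version="1.0" encoding="%s"?>\n' % encoding
    return (declaration + body).encode(encoding, "xmlcharrefreplace")
```

With `encoding="iso-8859-15"`, a euro sign is encoded directly, while a
character outside that charset is written as a reference such as `&#20013;`.
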
Tres.
--
===================================================================
Tres Seaver +1 540-429-0999 tseaver at palladion.com
Palladion Software "Excellence by Design" http://palladion.com