[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Martijn Faassen
faassen at startifact.com
Tue Jan 16 08:12:46 EST 2007
Andreas Jung wrote:
>
>
> --On 15. Januar 2007 22:15:46 +0100 Martijn Faassen
[snip]
>>> I still don't see what should be ambiguous with this approach.
>>
>> Ambiguous in that the string seems to say it's in two encodings at once.
>> You're then "guessing": you're letting the Python string type trump the
>> declaration. Then, since we've shown that leads to bugs, you propose
>> actually changing the encoding declaration of the XML document. I wonder
>> what people then expect to happen upon serialization. In effect, your
>> proposal would, I think, serialize to UTF-8 only, right? (in which case
>> the encoding declaration can be dropped as it's the default.)
>
> When you download a ZPT through FTP/WebDAV then the unicode representation
> of the XML will be converted using the 'output_encoding' property of the
> corresponding ZPT which is set when uploading a new XML document (and taken
> from the preamble). So when you upload a latin1 XML file you should get
> it back as valid latin1 through FTP/WebDAV.
Okay, understood, this makes sense in the case of the FTP/WebDAV
support, though recoding to UTF-8 and ripping off the encoding
declaration would also be pretty safe in case of XML.
> When you download text/xml content through the ZPublisher then the
> ZPublisher will convert unicode textual content to some encoding which is
> either taken from an already set 'content-type: text/...; charset=XXXXX'
> HTTP Header or as fallback from the zpublisher-default-encoding property
> as defined in the zope.conf file.
And the same behavior actually applies to HTML content, right?
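To make the publisher rule above concrete, here is a minimal sketch of the
selection logic as described: take the charset from an already-set
Content-Type header, otherwise fall back to a configured default. The
function names and the DEFAULT_ENCODING constant are made up for
illustration and this is not ZPublisher's actual code. It is written for
Python 3, where str plays the role of unicode.

```python
import re

# Hypothetical stand-in for zpublisher-default-encoding from zope.conf.
DEFAULT_ENCODING = "utf-8"

# Finds "charset=..." in a Content-Type value, with or without quotes.
_CHARSET_RE = re.compile(r'charset="?([\w.:-]+)', re.IGNORECASE)


def choose_output_encoding(content_type, default=DEFAULT_ENCODING):
    """Return the charset named in the header, else the default."""
    match = _CHARSET_RE.search(content_type or "")
    return match.group(1) if match else default


def encode_body(text, content_type=None):
    """Encode unicode text for the response using the chosen charset."""
    return text.encode(choose_output_encoding(content_type))
```

Nothing in this sketch looks at the content itself, so it would treat
text/xml and text/html alike.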
> So the application can specify in both cases the encoding of the serialized
> XML content. Where is the problem?
What I'm trying to express here is that this stuff should not be treated
as "where is the problem?" but should be thought through carefully as
this is extremely easy to do wrong. I'll think it through carefully
here. Let's list some cases:
A) FTP download: stored XML gets downloaded through FTP/WebDAV support.
B) FTP upload: external XML gets uploaded through FTP/WebDAV.
C) parse: stored XML is parsed inside of Zope by the page template engine.
D) publisher download: stored XML is downloaded as text/xml directly
through the publisher
E) ZPT inclusion: stored XML is included in another page template, for
instance to present it in a text area.
F) form submit: Text area is then saved and needs to be stored again.
Andreas Jung proposal (speculation)
===================================
As far as I understand it you're proposing:
* store XML as unicode text
* separately store the encoding on the page template object
* also keep the encoding="" bit in the XML preamble when storing.
Let's go through the cases
A) FTP download: encode this to whatever encoding is stored on the ZPT
object using Python unicode support. No encoding mangling necessary.
B) FTP upload: read encoding="" bit and store this on ZPT. Then decode
to unicode using that encoding. Could not be implemented by a
parse/serialization step without extra encoding="" manipulation
afterwards (after decoding to unicode).
C) parse: Rip out the 'encoding=""' bit before you send it into the
parser. Encode to UTF-8 just before entering the parser.
D) publisher download: Rip out the 'encoding=""' bit. Then encode
according to response header (or zope.conf). Then add back an encoding=""
bit stating the encoding if the output is non-UTF-8 (not Python names like
'latin1' but encoding identifiers XML is aware of).
E) ZPT inclusion: Send the unicode text to the page template.
encoding="" bit will be presented in the editor.
F) form submit: decode to unicode according to encoding of page that
displayed edit form and store it. Read 'encoding=' bit and store it in
ZPT object. Don't manipulate 'encoding=""' bit in XML.
encoding="" removal: C, D
encoding="" adding: D
encoding="" reading: B, F
encode from unicode: A, C, D
decode to unicode: B, F
no encoding="" manipulation required: A, E
no recoding required: E
straightforward: E
The forms editor scenario (E and F) is potentially confusing as the user
may be tempted by the ability to use encoding="" to paste latin-1 XML
text. Editor could say it only wants it in whatever encoding the page is
in, though.
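A rough sketch of cases A, B and C under this proposal, as I understand
it. The StoredTemplate class and its method names are hypothetical (this
is not the real ZPT class), and the sniffing regex assumes the upload is
in an ASCII-compatible encoding. Python 3: str stands for unicode, bytes
for encoded data.

```python
import re

# Reads the encoding name out of an XML declaration in raw bytes.
_SNIFF_RE = re.compile(rb'^<\?xml[^>]*?encoding=["\']([A-Za-z][A-Za-z0-9._-]*)["\']')
# Matches the encoding attribute in unicode text so it can be dropped.
_STRIP_RE = re.compile(r'^(<\?xml[^>]*?)\s+encoding=["\'][^"\']*["\']')


class StoredTemplate:
    """Hypothetical stand-in for a ZPT object; not Zope's real class."""

    def __init__(self):
        self.text = ""                  # unicode, declaration left intact
        self.output_encoding = "utf-8"  # remembered from the last upload

    def upload(self, raw):
        """Case B: read the declared encoding, store it, decode."""
        match = _SNIFF_RE.match(raw)
        self.output_encoding = match.group(1).decode("ascii") if match else "utf-8"
        self.text = raw.decode(self.output_encoding)

    def download(self):
        """Case A: encode with the stored encoding, declaration untouched."""
        return self.text.encode(self.output_encoding)

    def parser_input(self):
        """Case C: drop the encoding attribute, hand over UTF-8 bytes."""
        return _STRIP_RE.sub(r"\1", self.text, count=1).encode("utf-8")
```

With this, a latin-1 upload comes back byte for byte on download, while
the parser only ever sees UTF-8 without a conflicting declaration.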
Martijn Faassen proposal
========================
If you rip out the encoding before data is stored in the page template
and then store as unicode, then we have the following cases:
A) FTP download: Encode to UTF-8, output in UTF-8 only. No encoding
mangling necessary.
B) FTP upload: read encoding="" bit and decode to unicode accordingly.
Rip out encoding="". Could be done by a parse/serialization step, then
decode result to unicode.
C) parse: encode to UTF-8 just before entering the parser.
D) publisher download: Encode according to response header or zope.conf.
Add in encoding="" if output is non-UTF-8 using XML names for encoding.
E) ZPT inclusion: send unicode text to the page template. No encoding=""
bit will be in the XML presented in the editor.
F) form submit: Rip out any encoding="" before storing, ignoring it as
XML was in output encoding, then convert to unicode using input encoding.
encoding="" removal: B, F
encoding="" adding: D
encoding="" reading: B
encode from unicode: A, C, D
decode to unicode: B, F
no encoding="" manipulation required: A, C, E
no recoding required: E
straightforward: E
No storage of encoding information on ZPT object is necessary.
Case B) is potentially confusing, as upon re-download the XML document
will be recoded to UTF-8 (though XML editors should be able to deal with
this as it's the default).
Form edit is still potentially confusing as the encoding="" bit
disappears, but at least the user is not led to believe that information
*presented* in a textarea is in a particular encoding specified in the
encoding="" bit.
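A minimal sketch of cases B and D under this proposal. The function
names and the small XML_NAMES table are made up for illustration, the
table covers only two codecs, and the upload is assumed to be in an
ASCII-compatible encoding. Python 3: str stands for unicode.

```python
import codecs
import re

_SNIFF_RE = re.compile(rb'^<\?xml[^>]*?encoding=["\']([A-Za-z][A-Za-z0-9._-]*)["\']')
_STRIP_RE = re.compile(r'^(<\?xml[^>]*?)\s+encoding=["\'][^"\']*["\']')
_VERSION_RE = re.compile(r'^(<\?xml\s+version=["\'][^"\']*["\'])')

# Illustrative subset: Python codec name -> name XML tools expect.
XML_NAMES = {"iso8859-1": "ISO-8859-1", "ascii": "US-ASCII"}


def on_upload(raw):
    """Case B: decode by the declared encoding, then drop the attribute."""
    match = _SNIFF_RE.match(raw)
    encoding = match.group(1).decode("ascii") if match else "utf-8"
    return _STRIP_RE.sub(r"\1", raw.decode(encoding), count=1)


def on_publish(text, encoding="utf-8"):
    """Case D: encode for output, declaring the encoding only if non-UTF-8."""
    codec = codecs.lookup(encoding).name
    if codec != "utf-8":
        attr = ' encoding="%s"' % XML_NAMES.get(codec, encoding)
        if _VERSION_RE.match(text):
            # Insert right after version, where XML requires it.
            text = _VERSION_RE.sub(lambda m: m.group(1) + attr, text, count=1)
        else:
            text = '<?xml version="1.0"%s?>\n%s' % (attr, text)
    return text.encode(encoding)
```

The stored text never carries an encoding attribute here, so the only
place one gets written is the output step.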
Tres Seaver proposal (speculation)
==================================
Storage in UTF-8.
A) FTP download: output in UTF-8 only, can be done directly.
B) FTP upload: read encoding="" bit and, if not UTF-8, decode to unicode
accordingly. Then recode to UTF-8. Rip out encoding="". Could be done by
an XML parse/serialization step.
C) parse: just pass UTF-8 to parser.
D) publisher download: Decode to unicode. Then recode to desired output
encoding (with XML names for encoding added in the encoding="" bit).
E) ZPT inclusion: Decode text to unicode. No encoding="" bit will be in
the XML presented in the editor.
F) form submit: Rip out any encoding="" before storing, ignoring it as
XML was in output encoding, then convert to unicode using that encoding,
then convert again to UTF-8.
encoding="" removal: B, F
encoding="" adding: D
encoding="" reading: B
encode from unicode: B, D, F
decode to unicode: B, D, E, F
no encoding="" manipulation required: A, C, E
no recoding required: A, C (B and F if UTF-8 uploaded)
straightforward: A, C
No storage of encoding information in ZPT object is necessary.
Case B) is potentially confusing, as upon re-download the XML document
will be recoded to UTF-8 (though XML editors should be able to deal with
this as it's the default).
Form edit is still potentially confusing as the encoding="" bit
disappears, but at least the user is not led to believe that information
*presented* in a textarea is in a particular encoding specified in the
encoding="" bit.
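A sketch of cases B and C with UTF-8 storage, assuming I have this
proposal right. The function names are hypothetical, the upload is
assumed to be ASCII-compatible, and the standard library's ElementTree
stands in for the page template parser. Python 3: bytes is the stored
form.

```python
import codecs
import re
import xml.etree.ElementTree as ET

_SNIFF_RE = re.compile(rb'^<\?xml[^>]*?encoding=["\']([A-Za-z][A-Za-z0-9._-]*)["\']')
_STRIP_RE = re.compile(rb'^(<\?xml[^>]*?)\s+encoding=["\'][^"\']*["\']')


def to_stored_utf8(raw):
    """Case B: return UTF-8 bytes with the encoding attribute removed."""
    match = _SNIFF_RE.match(raw)
    encoding = match.group(1).decode("ascii") if match else "utf-8"
    body = _STRIP_RE.sub(rb"\1", raw, count=1)
    if codecs.lookup(encoding).name == "utf-8":
        return body  # already UTF-8, no recoding needed
    return body.decode(encoding).encode("utf-8")


def parse_stored(stored):
    """Case C: the stored bytes go to the parser as they are."""
    return ET.fromstring(stored)
```

Recoding only happens on upload of non-UTF-8 input; the parse step does
no encoding work at all.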
Just store the XML text
=======================
Store the XML text literally as received. Maybe this is actually what Tres
meant. :)
A) FTP download: output can be done directly.
B) FTP upload: store input directly
C) parse: just pass text to parser.
D) publisher download: Decode to unicode using encoding="" bit. Remove
encoding bit. Then recode to desired output encoding (with XML names for
encoding added in the encoding="" bit).
E) ZPT inclusion: Decode text to unicode using encoding="" bit.
F) form submit: Encode text in form from unicode according to
encoding="" bit.
encoding="" removal: D
encoding="" adding: D
encoding="" reading: B, D, E, F
encode from unicode: D, F
decode to unicode: D, E
no encoding="" manipulation required: A, C (but B, E, F only reading)
no recoding required: A, B, C
straightforward: A, C
No storage of encoding information in ZPT object is necessary, though
it could be done to optimize extraction of encoding="".
Form edit potentially confusing as in Andreas Jung scenario.
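A sketch of the store-as-received idea, covering the cached encoding
lookup and cases E and F. RawXML and its method names are hypothetical,
and the declaration is assumed to be readable as ASCII. Python 3: bytes
is what gets stored, str is what the edit form sees.

```python
import re

_SNIFF_BYTES_RE = re.compile(rb'^<\?xml[^>]*?encoding=["\']([A-Za-z][A-Za-z0-9._-]*)["\']')
_SNIFF_TEXT_RE = re.compile(r'^<\?xml[^>]*?encoding=["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


class RawXML:
    """Hypothetical holder for XML bytes stored exactly as received."""

    def __init__(self, raw):
        self.raw = raw          # cases A, B and C use this untouched
        self._encoding = None   # filled in on first use

    @property
    def encoding(self):
        """Declared encoding, read once and then cached."""
        if self._encoding is None:
            match = _SNIFF_BYTES_RE.match(self.raw)
            self._encoding = match.group(1).decode("ascii") if match else "utf-8"
        return self._encoding

    def as_text(self):
        """Case E: decode for display using the declared encoding."""
        return self.raw.decode(self.encoding)

    def update_from_form(self, text):
        """Case F: encode submitted text per its own declaration."""
        match = _SNIFF_TEXT_RE.match(text)
        self._encoding = match.group(1) if match else "utf-8"
        self.raw = text.encode(self._encoding)
```

Note that update_from_form trusts whatever declaration the user typed,
which is exactly where the form edit confusion comes from.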
..............
Any use cases I missed or got wrong? The scenarios are all complicated. :)
The "Andreas Jung" scenario has "leave the XML text alone except make it
unicode" goal in mind, but actually ends up messing about with
"encoding=""" more than the other scenarios.
The "Martijn Faassen" scenario tries to follow the rule: decode to
unicode on input, get rid of encoding="" in XML, and encode only on
output as much as possible, with the exception of the parser call.
The Tres Seaver scenario as I sketched it has the "turn the XML into
UTF-8" goal. It needs to do recoding less frequently than the other
scenarios, though more frequently than one would hope.
The "just store the XML" scenario is surprisingly nice. It only needs
attention to encoding and decoding in the always complicated ZPublisher
direct output scenario, and in the edit form scenario.
The "just store XML" proposal starts to look attractive. It requires
very little actual XML text manipulation, only in D, and while it does
require more reading of the encoding="" bit, this can be cached and at
least doesn't require string manipulation. Care can be taken that there
is an API to represent the XML as unicode strings - this is done for
display purposes only (clearly human readable text) and this is the only
case where the encoding="" bit is rather misleading.
Regards,
Martijn