Zope 2.6.1 and UTF-8
Trying again to bring it on list ;-)
Chris Withers wrote:
(bringing on-list in case others are interested)
Toby Dickenson wrote:
I've got some stuff that's in strings, so I guess not unicode, but which is UTF-8 encoded, and I'm wondering how I make sure Zope does "the right thing" here. Are there any docs about?
(and just to be clear, I'm using Zope 2.6.1 with ZODB 3.1, what differences will that make?)
I've submitted a chapter to one of the books that Chris M maintains... last I looked it still wasn't merged :( There is some info at http://zope.org/Members/htrd/howto/unicode and http://zope.org/Members/htrd/howto/unicode-zdg-changes
Just had a read of these, very interesting...
1. convert your strings to either unicode objects or latin-1, so that dtml or zpt can do the right thing when combining them. (I've *still* not used zpt for this, but I assume it works).
I will be using ZPT for this, what changes did you make so that ZPTs return unicode strings?
I recommend converting all language strings to unicode at the earliest opportunity as a general principle.
Hmmm, that's interesting. I'd been planning on keeping everything as UTF-8 encoded strings rather than actual unicode. What leads you to suggest storing everything as unicode?
2. set a 'Content-Type' header with the value 'text/html; charset=UTF-8' (or whatever you prefer, but anything other than utf8 has other complications) so that ZPublisher knows how to transmit the unicode response over http.
What are these complications? (luckily I'm going to be using UTF-8 ;-)
3. If there are http forms on those pages, you need to add extra marshalling tags so that ZPublisher knows what encoding your browser used when submitting the form.
If I do, do I then end up with unicode or strings encoded with the character set I specify?
Finally, is ZCTextIndex compatible with either unicode or strings that contain UTF-8 encoding?
cheers,
Chris
On Wednesday 10 September 2003 15:46, Chris Withers wrote:
Trying again to bring it on list ;-)
Chris Withers wrote:
(bringing on-list in case others are interested)
Toby Dickenson wrote:
I've got some stuff that's in strings, so I guess not unicode, but which is UTF-8 encoded, and I'm wondering how I make sure Zope does "the right thing" here. Are there any docs about?
(and just to be clear, I'm using Zope 2.6.1 with ZODB 3.1, what differences will that make?)
I've submitted a chapter to one of the books that Chris M maintains... last I looked it still wasn't merged :( There is some info at http://zope.org/Members/htrd/howto/unicode and http://zope.org/Members/htrd/howto/unicode-zdg-changes
Just had a read of these, very interesting...
1. convert your strings to either unicode objects or latin-1, so that dtml or zpt can do the right thing when combining them. (I've *still* not used zpt for this, but I assume it works).
I will be using ZPT for this, what changes did you make so that ZPTs return unicode strings?
I didn't, but I believe someone was reproducing my dtml semantics in ZPT. I forget who was working on this...
I recommend converting all language strings to unicode at the earliest opportunity as a general principle.
Hmmm, that's interesting. I'd been planning on keeping everything as UTF-8 encoded strings rather than actual unicode. What leads you to suggest storing everything as unicode?
It's a question of choosing the right data type to represent your data. Doesn't it make sense for string methods, character indexing, etc, to work on your data as a sequence of unicode characters? You wouldn't consider using an 8-bit string to store something that is logically an integer, simply because you originally read it from a file or socket in 8-bit string form. Why do the same to a unicode string? (perl programmers need not reply ;-)
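To make the data-type argument concrete, here is a minimal sketch. It is written for Python 3, which is an assumption (the thread predates it): there `str` plays the role of the unicode type discussed above and `bytes` the role of the 8-bit string. The sample text is made up.

```python
# Python 3 sketch: bytes stands in for the 8-bit string, str for the unicode type.
text = "na\u00efve caf\u00e9"       # made-up sample, 10 characters
encoded = text.encode("utf-8")      # the same text stored as UTF-8 bytes

char_count = len(text)              # counts characters
byte_count = len(encoded)           # counts bytes; each accented letter takes two

# Slicing the unicode value always cuts between characters.
first_five_chars = text[:5]

# Slicing the encoded value at byte 3 cuts through the middle of the "i with diaeresis".
truncated = encoded[:3]
try:
    truncated.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False

upper_text = text.upper()           # case-maps the accented letters as well
upper_bytes = encoded.upper()       # only touches the ASCII bytes
```

The string methods and indexing behave sensibly only on the decoded value; on the encoded bytes they operate on bytes, not characters.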
2. set a 'Content-Type' header with the value 'text/html; charset=UTF-8' (or whatever you prefer, but anything other than utf8 has other complications) so that ZPublisher knows how to transmit the unicode response over http.
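In Zope the header is usually set with something like RESPONSE.setHeader('Content-Type', 'text/html; charset=UTF-8'). What follows is only a rough sketch of the idea, assuming Python 3: `charset_of` and `encode_body` are hypothetical helpers invented for illustration, not ZPublisher code, and the latin-1 fallback is an assumption of the sketch.

```python
from email.message import Message


def charset_of(content_type, default="latin-1"):
    """Return the charset parameter of a Content-Type value.

    Hypothetical helper; the latin-1 default is an assumption of this sketch.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def encode_body(body, content_type):
    """Encode a unicode body using the charset named in the header."""
    return body.encode(charset_of(content_type))


page = "<p>Gr\u00fc\u00dfe, \u20ac5</p>"   # made-up page fragment with non-ASCII text
utf8_bytes = encode_body(page, "text/html; charset=UTF-8")
```

The point is simply that the charset in the header tells the publisher which encoding to apply to a unicode response before it goes out over HTTP.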
What are these complications? (luckily I'm going to be using UTF-8 ;-)
The rules for working out what encoding a browser will use when submitting a form are complicated, and depend on the encoding of the page that contained the form, POST/GET, and browser version. If your pages use UTF-8 then *all* form submissions come back in UTF-8. IMO it's a no-brainer choice if you have forms (or might ever add one).
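A small sketch of why knowing the submission encoding matters, assuming Python 3 and its standard urllib.parse module (nothing Zope-specific). The field name and value are made up.

```python
from urllib.parse import parse_qs, urlencode

# What a browser would send for a form on a UTF-8 page (made-up field and value).
submitted = urlencode({"description": "cr\u00e8me br\u00fbl\u00e9e"}, encoding="utf-8")

# Decoding with the encoding the page used gives the text back intact.
description = parse_qs(submitted, encoding="utf-8")["description"][0]

# Guessing the wrong encoding silently produces mojibake instead of an error.
misread = parse_qs(submitted, encoding="latin-1")["description"][0]
```

With UTF-8 pages the first decode is always the right one; with other page encodings the server has to work out which case it is in.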
3. If there are http forms on those pages, you need to add extra marshalling tags so that ZPublisher knows what encoding your browser used when submitting the form.
If I do, do I then end up with unicode or strings encoded with the character set I specify?
You get to choose the right data type.....
If you want to receive a unicode string from a form that will be submitted by the browser in utf8, then use <input name="description:utf8:ustring".....
If you want to receive a plain string containing latin-1 characters from a form that will be submitted by the browser in utf8, then use <input name="postcode:utf8:string".....
If you want to receive the bytes as the browser sent them over the wire: <input name="idontknowwhatthiswouldbefor:string".....
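The three cases above can be sketched roughly as follows. The `convert` function is a hypothetical stand-in written for illustration, not ZPublisher's actual converter code, and it handles only the three tag combinations mentioned in this message. It assumes Python 3, where `str` corresponds to the unicode string and `bytes` to the plain string.

```python
def convert(field_name, raw_bytes):
    """Split 'name:tag:tag' and convert the submitted bytes (illustration only)."""
    name, *tags = field_name.split(":")
    if tags == ["utf8", "ustring"]:
        # Browser sent UTF-8; hand back a unicode string.
        return name, raw_bytes.decode("utf-8")
    if tags == ["utf8", "string"]:
        # Browser sent UTF-8; hand back a plain string holding latin-1 characters.
        return name, raw_bytes.decode("utf-8").encode("latin-1")
    if tags == ["string"]:
        # No encoding tag: the bytes exactly as they came over the wire.
        return name, raw_bytes
    raise ValueError("unsupported tags: %r" % (tags,))


wire = "caf\u00e9".encode("utf-8")   # made-up submission: b'caf\xc3\xa9'

as_unicode = convert("description:utf8:ustring", wire)
as_latin1 = convert("postcode:utf8:string", wire)
as_sent = convert("idontknowwhatthiswouldbefor:string", wire)
```

Note that the latin-1 case raises UnicodeEncodeError in this sketch if the submitted text contains characters outside latin-1.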
Finally, is ZCTextIndex compatible with either unicode or strings that contain UTF-8 encoding?
No idea. -- Toby Dickenson
Most of this discussion is over my head. But there's one pretty basic misunderstanding exhibited:
I've got some stuff that's in strings, so I guess not unicode, but which is UTF-8 encoded, and I'm wondering how I make sure Zope does "the right thing" here. Are there any docs about?
and
Hmmm, that's interesting. I'd been planning on keeping everything as UTF-8 encoded strings rather than actual unicode. What leads you to suggest storing everything as unicode?
and
Finally, is ZCTextIndex compatible with either unicode or strings that contain UTF-8 encoding?
UTF-8 is one way of encoding Unicode character-sets. They are not different things. When you use UTF-8 you are using Unicode. UTF-8 exists to allow systems to migrate gently, because it translates a very large character set into a format that will not normally break file systems which expect 8-bit character data. There are 16-bit and 32-bit representations of Unicode.
UTF-8's representation of ASCII is identical to ASCII's. So for applications which internally process only ASCII, the encoding is moot. But if you have user input you need to watch out: Windows and MacOS support UTF-8 input in browser windows for forms input. This input can seriously break old apps.
best
Mark Barratt
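The two factual claims here (ASCII is unchanged under UTF-8, and there are wider encodings of the same characters) can be checked with a short sketch, assuming Python 3 and its built-in codecs. The little-endian codec variants are used only so that no byte-order mark is counted.

```python
ascii_text = "hello"

# For pure ASCII text the UTF-8 bytes and the ASCII bytes are the same.
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# One non-ASCII character (e with acute accent) in three encodings of Unicode.
char = "\u00e9"
sizes = {
    codec: len(char.encode(codec))
    for codec in ("utf-8", "utf-16-le", "utf-32-le")
}

# Every encoding decodes back to the same character.
round_trips = all(char.encode(codec).decode(codec) == char for codec in sizes)
```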
On Wednesday 10 September 2003 16:56, Mark Barratt wrote:
Hmmm, that's interesting. I'd been planning on keeping everything as UTF-8 encoded strings rather than actual unicode. What leads you to suggest storing everything as unicode?
UTF-8 is one way of encoding Unicode character-sets. They are not different things.
I think Chris was referring to Python's unicode type, rather than Unicode in general. He was planning to store his Unicode values utf8 encoded in plain 8-bit string objects, rather than as unicode objects.
When you use UTF-8 you are using Unicode.
>>> type(u'hello')
<type 'unicode'>
>>> type(u'hello'.encode('utf8'))
<type 'str'>
But I'm sure you knew that. -- Toby Dickenson
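For readers on Python 3 (an assumption; the session above is Python 2), the same distinction still exists under different names: the text type is `str` and the encoded form is `bytes`. A minimal sketch:

```python
value = "hello"                   # text, the counterpart of Python 2's unicode
stored = value.encode("utf8")     # UTF-8 encoded, the counterpart of the 8-bit str

value_is_text = type(value) is str
stored_is_bytes = type(stored) is bytes

# Decoding the stored bytes recovers the original text.
restored = stored.decode("utf8")
```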
Toby Dickenson wrote:
On Wednesday 10 September 2003 16:56, Mark Barratt wrote:
When you use UTF-8 you are using Unicode.
>>> type(u'hello')
<type 'unicode'>
>>> type(u'hello'.encode('utf8'))
<type 'str'>
But I'm sure you knew that.
Nice of you to think so. In fact my ignorance about Python is almost unbounded. (I know enough to guess that there is some sort of programmer's joke potential in 'unbounded' but not enough to make it into one). best Mark Barratt : Text Matters Helping explain things using language, design, systems and process improvement phone +44 118 986 8313 web http://www.textmatters.com email markb@textmatters.com
Hi Chris,
-----Original Message-----
From: zope-bounces@zope.org [mailto:zope-bounces@zope.org] On Behalf Of Chris Withers
Sent: Wednesday, 10 September 2003 16:46
Cc: Toby Dickenson; zope@zope.org
Subject: [Zope] Zope 2.6.1 and UTF-8
Trying again to bring it on list ;-) [..snip..]
Finally, is ZCTextIndex compatible with either unicode or strings that contain UTF-8 encoding?
On this I don't have a definite answer, but I can tell you what I learned while developing a multilanguage XML CMS application where unicode issues and search capabilities are critical. ZCTextIndex seems to work with unicode and encoded strings but, occasionally, I got strange errors from the Lexicon when using form field data encoded as UTF-8 and containing non-ASCII characters. Initially I thought it was a problem in my code, but after double-checking all the encoding issues (exactly along the lines of what Toby says) I came to the conclusion that the Lexicon code has something broken when using UTF-8 encoded strings (I didn't have time to test it with unicode types instead of encoded strings). I then gave ZCTextIndexNG a try and all my problems were resolved automagically (i.e. the searches which gave errors using ZCTextIndex started to work as expected without any change in my Python code). Moreover, I found ZCTextIndexNG is faster and richer in features than ZCTextIndex. For me, therefore, the conclusion is: use ZCTextIndexNG.
cheers,
Chris
Hope this helps, __peppo
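As a general illustration only, and explicitly not a diagnosis of the Lexicon errors described above: one common way byte-oriented text processing goes wrong with UTF-8 data is word splitting. The sketch assumes Python 3 and the standard re module, and the sample text is made up.

```python
import re

text = "caf\u00e9 cr\u00e8me"     # made-up sample with two accented words

# On text, \w matches the accented letters, so the words stay whole.
words_from_unicode = re.findall(r"\w+", text)

# On UTF-8 bytes, \w matches ASCII only, so each accented letter splits its word.
words_from_bytes = re.findall(rb"\w+", text.encode("utf-8"))
```

Whether anything like this is behind the ZCTextIndex behaviour reported here is not known from this thread.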
participants (4): Chris Withers, Giuseppe Bonelli, Mark Barratt, Toby Dickenson