[Zope-dev]
Re: [Archetypes-devel] Unicode in Zope 2 (ZMI, Archetypes, Plone,
Formulator)
Raphael Ritz
r.ritz at biologie.hu-berlin.de
Mon Apr 26 06:39:10 EDT 2004
Bjorn Stabell wrote:
>>--On Monday, 26 April 2004 10:53 +0200 David Convent
>><david.convent at naturalsciences.be> wrote:
>>
>>
>>>I always believed that Unicode and UTF-8 were the same encoding, but
>>>reading your message makes me think I was wrong.
>>>Can you tell me what the difference is between Unicode and UTF-8?
>>
>
> Andreas Jung wrote:
>
>>Unicode is a common database for almost all characters. UTF-8
>>is an *encoding* that allows you to represent any element of
>>this character database as a sequence of 1, 2, 3 or 4 bytes. There
>>are also other encodings, e.g. UTF-16, that encode an
>>element in a different way... so we are talking about
>>completely different things.
>
>
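To make the "1, 2, 3 or 4 bytes" point concrete, here is a minimal
sketch in Python 3 syntax (where str is the Unicode type). The sample
characters and the helper name encoded_lengths are chosen only for
illustration; they are not from the thread.

```python
# One code point per sample, chosen to need 1, 2, 3 and 4 bytes in UTF-8.
samples = ["A", "\u00e4", "\u20ac", "\U0001d11e"]  # A, a-umlaut, euro sign, G clef

def encoded_lengths(ch):
    """Return (UTF-8 length, UTF-16 length) in bytes for one character."""
    # "utf-16-be" is used so that no byte-order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))

for ch in samples:
    utf8_len, utf16_len = encoded_lengths(ch)
    print("U+%04X  utf-8: %d bytes  utf-16: %d bytes"
          % (ord(ch), utf8_len, utf16_len))
```

The same code point gets a different byte sequence, and often a
different length, in each encoding; the code point itself does not
change.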
> Yes, the difference is that Python treats Unicode strings (type(u""))
> quite differently from text in some character encoding (e.g., UTF-8,
> GB18030, ISO-8859-1, ASCII, stored as type("")). Python will of course
> represent these Unicode strings internally in some way (16-bit or 32-bit
> code units, depending on how the interpreter was built), but we don't
> need to know what that is like. All we need to know is that this is a
> string that can contain any character on the planet, and that we can
> reasonably expect normal text operations to work on it.
>
> UTF-8 is, like ISO-8859-1 (latin1), just a character encoding. It
> (and UTF-16 and UCS-4) is only special in that it is defined as part of
> the Unicode/ISO 10646 standards and can encode any Unicode character;
> UCS-2 covers only the 16-bit range. ISO-8859-1 (for example), being
> only 8 bits, can only encode characters used in Western Europe.
> GB18030, to take another extreme, is an encoding mandated by the
> Chinese government that uses one, two or four bytes per character; it
> maps every Unicode code point, including the non-Chinese ones, so in
> function it is similar to UTF-8 or UCS-4.
>
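A quick way to check which encodings can represent a given text is to
try them with Python's built-in codecs. This is only a sketch in
Python 3 syntax; the sample strings and the helper name can_encode are
assumptions made for the example.

```python
text = "\u4e2d\u6587"  # two Chinese characters

def can_encode(s, codec):
    """Return True if every character of s exists in the given encoding."""
    try:
        s.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# ISO-8859-1 has no Chinese characters; GB18030 and UTF-8 both do.
results = {codec: can_encode(text, codec)
           for codec in ("iso-8859-1", "gb18030", "utf-8")}
print(results)

# GB18030 is variable width: count the bytes for an ASCII letter,
# a common Chinese character and a character outside the 16-bit range.
gb_lengths = [len(ch.encode("gb18030"))
              for ch in ("A", "\u4e2d", "\U0001d11e")]
print(gb_lengths)
```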
> Internally, we want to work with Unicode strings (where str[4] is the
> fifth character, since indexing starts at 0) instead of UTF-8 encoded
> text strings (where str[4], being the fifth byte, has little semantic
> meaning).
>
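The indexing difference is easy to see with a word that contains
non-ASCII letters. A minimal sketch in Python 3 syntax, where str is the
Unicode type and bytes plays the role of the Python 2 str; the sample
word and the helper is_valid_utf8 are made up for the example.

```python
text = "gr\u00fc\u00dfe"      # "gruesse" spelled with u-umlaut and sharp s
data = text.encode("utf-8")   # the same text as UTF-8 bytes

def is_valid_utf8(chunk):
    """Return True if chunk decodes as UTF-8 on its own."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(len(text), len(data))    # characters versus bytes
print(repr(text[2]))           # a whole character (u-umlaut)
print(hex(data[2]))            # only the first byte of that character
print(is_valid_utf8(data[2:3]))  # a lone lead byte is not valid text
```

Slicing the byte string can cut a character in half, which is why text
operations belong on the Unicode object.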
To illustrate this by way of an example, consider this Python
session (copied from a recent posting on plone.devel but included
here again for the record):
<begin:quote>
This is a common misunderstanding when it comes to
Unicode in Python.
Consider
string1 = u"This is a unicode string"
string2 = string1.encode('utf-8')
Here, type(string1) is unicode whereas type(string2) is str,
i.e., string1 is a proper Python unicode string object whereas
string2 is an ordinary Python string object holding UTF-8 encoded bytes.
Or consider the following Python session:
[ritz@hardy ritz]$ python
Python 2.3.3 (#5, Dec 30 2003, 15:25:24)
[GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s1 = u"äüö"
>>> print repr(s1)
u'\xc3\xa4\xc3\xbc\xc3\xb6'
>>> s2 = s1.encode('utf-8')
>>> print repr(s2)
'\xc3\x83\xc2\xa4\xc3\x83\xc2\xbc\xc3\x83\xc2\xb6'
>>> type(s1)
<type 'unicode'>
>>> type(s2)
<type 'str'>
>>>
Maybe that clarifies things a bit?
<end:quote>
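Note that repr(s1) in the session shows six code points rather than
three. That suggests the interpreter read the terminal's UTF-8 bytes as
Latin-1 before building the unicode object (an assumption; the session
itself does not say which terminal encoding was in use). The sketch
below, in Python 3 syntax, writes the umlauts as escapes so the result
does not depend on the terminal, and then reproduces the doubly encoded
bytes seen above.

```python
# Three code points, written as escapes: a-umlaut, u-umlaut, o-umlaut.
s1 = "\u00e4\u00fc\u00f6"
s2 = s1.encode("utf-8")        # six bytes, two per character
print(len(s1), repr(s2))

# What the session appears to show: UTF-8 bytes misread as Latin-1 ...
misread = s2.decode("iso-8859-1")   # six characters instead of three
# ... and then encoded to UTF-8 a second time.
double_encoded = misread.encode("utf-8")
print(len(misread), repr(double_encoded))

# Undoing the two steps in reverse order recovers the original text.
recovered = double_encoded.decode("utf-8").encode("iso-8859-1").decode("utf-8")
print(recovered == s1)
```

Either way the point of the session stands: s1 is a Unicode object and
s2 is a byte string produced by encoding it.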
Raphael
> Bye,