[Zope-dev] Re: [Archetypes-devel] Unicode in Zope 2 (ZMI, Archetypes, Plone, Formulator)

Mon Apr 26 06:39:10 EDT 2004

Bjorn Stabell wrote:
>>--On Monday, 26 April 2004 10:53 +0200 David Convent 
>><david.convent at naturalsciences.be> wrote:
>>
>>
>>>I always believed that Unicode and UTF-8 were the same encoding, but 
>>>reading your message makes me think I was wrong.
>>>Can you tell me what the difference between Unicode and UTF-8 is?
>>
> 
> Andreas Jung wrote: 
> 
>>Unicode is a common database for almost all characters. UTF-8 
>>is an *encoding* that allows you to represent any element of 
>>this character database as a sequence of 1, 2, 3 or 4 bytes. 
>>There are also other encodings, e.g. UTF-16, that encode an 
>>element in a different way....so we are talking about 
>>completely different things.
> 
> 
> Yes, the difference is that Python has a whole different understanding of
> Unicode strings (type(u"")) than it has of text in some character encoding
> (e.g., UTF-8, GB18030, ISO-8859-1, ASCII, stored as type("")).  Python will
> of course represent these unicode strings internally in some way (maybe as
> 16-bit integers?), but we don't need to know what that is like.  All we need
> to know is that this is a string that can contain any character on the
> planet, and that we can reasonably expect normal text operations to work on
> it.
> 
> UTF-8 is, like ISO-8859-1 (latin1), just a character encoding.  It
> (and UTF-16 and UCS-4) is only special in that it is defined in the Unicode
> standard and can encode any Unicode character (UCS-2 covers only the 16-bit
> Basic Multilingual Plane).  ISO-8859-1 (for example), being only 8 bits,
> can by contrast only encode 256 characters, mostly those used in Western
> Europe.  GB18030, to take another extreme, is a variable-width encoding
> (1, 2 or 4 bytes per character) mandated by the Chinese government; it
> defines a mapping for every Unicode code point, so it can encode any
> Unicode character, including non-Chinese ones.  In this respect it is
> similar in function to UTF-8 or UCS-4.
> 
> Internally, we want to work with Unicode strings (where str[4] is the 5th
> character) instead of UTF-8 encoded text strings (where str[4], being the
> 5th byte, has little semantic meaning).
> 
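The points quoted above (UTF-8 using 1 to 4 bytes per character, and
indexing characters versus indexing bytes) can be sketched as follows.
This is only an illustration and assumes Python 3, where `str` is the
Unicode text type and `bytes` holds encoded data (the roles played by
`unicode` and the 8-bit `str` in Python 2). The sample characters and
variable names are made up for the example.

```python
# Python 3 assumed: str is Unicode text, bytes is encoded data.

# One character from each UTF-8 width class: A, a-umlaut, euro sign, an emoji.
samples = [u"A", u"\u00e4", u"\u20ac", u"\U0001F600"]
utf8_widths = [len(ch.encode("utf-8")) for ch in samples]

# GB18030 can also represent all of them, so a round trip is lossless.
gb18030_round_trip = all(
    ch.encode("gb18030").decode("gb18030") == ch for ch in samples
)

# Indexing text yields characters; indexing encoded data yields bytes.
text = u"na\u00efve caf\u00e9"   # "naive cafe" with i-diaeresis and e-acute
data = text.encode("utf-8")      # the two accented letters take 2 bytes each

fifth_char = text[4]             # fifth character of the text
fifth_byte = data[4:5]           # fifth byte; not the same letter any more

print(utf8_widths, gb18030_round_trip)
print(len(text), len(data), repr(fifth_char), repr(fifth_byte))
```

Because the third character occupies two bytes, every byte offset after
it is shifted by one relative to the character offset, which is why
byte indexes carry little meaning for text operations.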
And to illustrate this by way of an example, consider this Python
session (copied from a recent posting on plone.devel but included
here again for the record):

<begin:quote>
This is a common misunderstanding when it comes to
unicode in Python.

Consider

string1 = u"This is a unicode string"

string2 = string1.encode('utf-8')

Here, type(string1) is unicode whereas type(string2) is str,
i.e., string1 is a proper Python unicode string object whereas
string2 is an ordinary Python (byte) string holding UTF-8 encoded text.
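
For readers on a current interpreter, here is a minimal sketch of the
same two statements, assuming Python 3. There the type names differ:
`str` is the Unicode text type and `bytes` is the encoded form. The
extra variable names are only for illustration.

```python
# Python 3 assumed: the u prefix is accepted but optional.
string1 = u"This is a unicode string"   # Unicode text
string2 = string1.encode("utf-8")       # the same text, encoded as UTF-8

text_type = type(string1).__name__      # name of the text type
encoded_type = type(string2).__name__   # name of the encoded type

# Decoding with the same encoding recovers the original text.
round_trip = string2.decode("utf-8")

print(text_type, encoded_type, round_trip == string1)
```

The direction is always the same: encode() goes from text to bytes,
decode() goes from bytes back to text.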

Or consider the following Python session:

[ritz at hardy ritz]$ python
Python 2.3.3 (#5, Dec 30 2003, 15:25:24)
[GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
 >>> s1 = u"äüö"
 >>> print repr(s1)
u'\xc3\xa4\xc3\xbc\xc3\xb6'
 >>> s2 = s1.encode('utf-8')
 >>> print repr(s2)
'\xc3\x83\xc2\xa4\xc3\x83\xc2\xbc\xc3\x83\xc2\xb6'
 >>> type(s1)
<type 'unicode'>
 >>> type(s2)
<type 'str'>
 >>>

Maybe that clarifies things a bit?
<end:quote>
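
Note that s1 in the session holds six characters rather than three.
One plausible reading, which is an assumption on my part and not
something the quoted post states, is that the terminal sent UTF-8
bytes and the interpreter took each byte as one Latin-1 character.
A small sketch, assuming Python 3 (`str` for text, `bytes` for encoded
data), reproduces the byte values shown above under that assumption:

```python
# Python 3 assumed. Escapes are used so the source file encoding does not matter.
intended = u"\u00e4\u00fc\u00f6"           # the three letters a-, u-, o-umlaut
typed_bytes = intended.encode("utf-8")     # what a UTF-8 terminal would send

# Assumption: each incoming byte is read as a single Latin-1 character.
s1 = typed_bytes.decode("iso-8859-1")      # six characters, matches repr(s1)
s2 = s1.encode("utf-8")                    # twelve bytes, matches repr(s2)

print(len(intended), len(s1), len(s2))
print(repr(s2))
```

Decoding the input with the encoding it was really written in
(typed_bytes.decode("utf-8") here) gives the intended three characters.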

Raphael
