According to Giuseppe Bonelli:
Sorry if this is not Zope-specific, but can someone please explain to me the following behaviour when trying to convert an iso-8859-1 string read from a file to a utf-8 encoded one?
s='\x93test\x94' #an iso-8859-1 string
#\x93 and \x94 are left and right
#double quotation marks,
#as seen in a browser set to iso-8859-1
\x93 and \x94 are *not* iso-8859-1 quotation marks; iso-8859-1 has no printable characters in the range \x80-\x9f. See for example
http://en.wikipedia.org/wiki/ISO_8859-1

Instead they seem to be from the Windows-125X (X=0,1,...) codepages:
http://www.microsoft.com/globaldev/reference/sbcs/1250.mspx
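To make the difference concrete, here is a minimal sketch of what the first byte decodes to under each codec. It is written in Python 3 syntax, which is an assumption (the thread itself uses Python 2's unicode()), and the helper name describe is made up for this illustration.

```python
import unicodedata

raw = b'\x93test\x94'  # the bytes from the question


def describe(data, codec):
    """Return (code point, Unicode name) of the first decoded character.

    Hypothetical helper for illustration only. Control characters have
    no Unicode name, so '<control>' is used as a fallback label.
    """
    ch = data.decode(codec)[0]
    return 'U+%04X' % ord(ch), unicodedata.name(ch, '<control>')


for codec in ('iso-8859-1', 'cp1250', 'cp1252'):
    print(codec, describe(raw, codec))
```

Under iso-8859-1 the byte maps to the control character U+0093, while both cp1250 and cp1252 map it to U+201C, the left double quotation mark.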
ss=unicode(s,'iso-8859-1').encode('utf-8')
gives
ss='\xc2\x93test\xc2\x94'
which is wrong (as seen in a browser set to utf-8)!
but:
>>> unicode(s,'cp1250').encode('utf-8')
'\xe2\x80\x9ctest\xe2\x80\x9d'
is right.
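The two conversions can be compared side by side with a small sketch. Again this assumes Python 3, where the file contents would be a bytes object and decode() takes the place of unicode(); the function name to_utf8 is hypothetical.

```python
raw = b'\x93test\x94'  # bytes as read from the file


def to_utf8(data, source_encoding):
    """Decode bytes with the given source encoding and re-encode as UTF-8.

    Hypothetical helper; the result is only correct if source_encoding
    is the encoding the bytes were really written in.
    """
    return data.decode(source_encoding).encode('utf-8')


wrong = to_utf8(raw, 'iso-8859-1')  # treats \x93 and \x94 as control characters
right = to_utf8(raw, 'cp1250')      # treats them as curly double quotes

print(wrong)
print(right)
print(right.decode('utf-8'))
```

If the file actually comes from a Western European Windows system, cp1252 may be the more likely source encoding than cp1250; for these two bytes both give the same result.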
Do I have to explicitly replace all characters above \x7F ?
No, you have to use the right encodings ;-)

\wlang{}

--
Willi.Langenberger@wu-wien.ac.at Fax: +43/1/31336/9207
Zentrum fuer Informatikdienste, Wirtschaftsuniversitaet Wien, Austria