Problems with unicode and other encodings.
Hi List,

I don't know much about encodings, so please excuse any ignorance I display.

I am having a problem with encodings. I have restructuredText objects, and text fields in a database that are interpreted as restructuredText. Either of these can be edited by lay people, and in many cases will be copied wholesale from Word. This produces a problem when these pages are viewed, because they contain improper characters, such as the ones mentioned in this article: http://effbot.org/zone/unicode-gremlins.htm

This article presents a way to convert these characters to Unicode, which seems to work quite well, in and of itself. However, if I retrieve database fields, convert them, and then attempt to reinsert them with the Unicode characters, I get a UnicodeEncodeError, because it is attempting to encode these characters as ASCII before inserting them in the database.

What are possible solutions to these problems? Is there a standard practice that needs to be followed? Should I maintain the data as it is, and simply convert it to Unicode before display? Alternatively, should I enforce a policy where those characters cannot be used? Is Unicode the encoding I should be using?

Thanks for any help.

Alec Munro
Hi Alec,

as you discovered, character encodings can be a pain. Before I give any advice, let us know some more about your application.

- Do your users use more than one language?
- Do your users use more than one encoding? (Russian, Chinese, ... have several encodings for the same language.)
- Have you specified which encoding Zope should use when answering requests?
- What database do you use?
- What is the default encoding for your database? What encoding is the data stored in?
- Did you specify your client encoding for the database connection?

Usually a browser will send data back in the encoding of the page that contained the form. So it would be a good idea to specify the encoding in your page header. Please check your web server setup too; sometimes the web server specifies the encoding in the HTTP header.

Use Unicode strings in your scripts, and don't return printed; return a Unicode string instead. That will allow Zope to convert the string to the desired encoding to answer a request.

Databases often convert encodings too: from their storage encoding to the client's encoding and vice versa. Therefore it is important to specify both correctly.

If you use only one encoding, try to use it everywhere, even as the database encoding. Otherwise it would be a good idea to use UTF-8 as the database encoding.

Ulrich

--
Ulrich Wisser
RELEVANT TRAFFIC SWEDEN AB, Riddarg 17A, SE-114 57 Sthlm, Sweden
Direct (+46)86789755 || Cell (+46)704467893 || Fax (+46)86789769
________________________________________________________________
http://www.relevanttraffic.com
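
To make the failure mode in this thread concrete, here is a minimal sketch of what probably happens to text pasted from Word. It assumes the pasted bytes are in Windows-1252 (cp1252), which the thread does not confirm, and it is written for Python 3; in the Python 2 of the original setup the same steps apply to `str` and `unicode` objects. The sample bytes and variable names are made up for illustration.

```python
# Bytes as Word might paste them: cp1252 "smart quotes" around a word.
# (Assumption: the source encoding really is cp1252.)
raw = b"\x93quoted\x94 text"

# Decode using the encoding the bytes are assumed to be in.
text = raw.decode("cp1252")

# Encoding to ASCII fails, because curly quotes have no ASCII equivalent.
# This is the same kind of UnicodeEncodeError described in the question.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 succeeds and can be decoded back without loss.
utf8_bytes = text.encode("utf-8")
```

The point of the sketch is only that the error comes from an implicit ASCII encode; naming a capable encoding such as UTF-8 explicitly avoids it.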
Alec Munro wrote at 2004-11-16 12:18 -0400:
... http://effbot.org/zone/unicode-gremlins.htm
This article presents a way to convert these characters to Unicode, which seems to work quite well, in and of itself. However, if I retrieve database fields, convert them, and then attempt to reinsert them, with the unicode characters, I get a UnicodeEncodeError, because it is attempting to encode these characters as ascii before inserting them in the database.
What are possible solutions to these problems?
Few systems are ready to store unicode directly. Usually, you must choose an encoding when you want to store unicode text. Thus, decide which encoding your database should use. When you store unicode text, you use the chosen encoding to encode the text; when you retrieve text from the database, you use the encoding to convert back to unicode.

-- Dieter
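
The encode-on-store, decode-on-retrieve pattern described above could look roughly like the following. This is a sketch only: it uses Python 3 and an in-memory SQLite database as a stand-in for whatever database is actually in use, and `DB_ENCODING`, `store_text`, `load_texts`, and the `pages` table are hypothetical names chosen for the example.

```python
import sqlite3

# Assumption: UTF-8 is the encoding chosen for the database.
DB_ENCODING = "utf-8"


def store_text(conn, text):
    """Encode Unicode text with the chosen encoding before storing it."""
    conn.execute(
        "INSERT INTO pages (body) VALUES (?)",
        (text.encode(DB_ENCODING),),
    )


def load_texts(conn):
    """Read the stored bytes back and decode them to Unicode text."""
    rows = conn.execute("SELECT body FROM pages ORDER BY rowid").fetchall()
    return [row[0].decode(DB_ENCODING) for row in rows]


# In-memory database, used here only so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (body BLOB)")

store_text(conn, "\u201csmart quotes\u201d and an ellipsis\u2026")
loaded = load_texts(conn)
```

Keeping the encode and decode steps in one pair of helpers means the rest of the application can work with Unicode strings only, and the choice of encoding lives in a single place.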
participants (3)
- Alec Munro
- Dieter Maurer
- Ulrich Wisser