SUMMARY: strange unicode behaviour, unicode and ZCTextIndex
I post this for the mail archives. Everything is pretty obvious, but I think it can save others some headaches.

ABSTRACT:
I hit some unicode-related problems when testing a unicode-aware, multilingual XML repository Zope product. What follows is what I learned and what is working (which now is, luckily enough, almost everything).

The following has been tested on Zope 2.6.1/Py2.1.3 installed from binaries on win32 (but it should be platform independent).

THE QUICK ROUTE
1. have a sys.setdefaultencoding call in sitecustomize.py
2. use <dtml-call "RESPONSE.setHeader('content-type','text/html;charset=<dtml-var your_preferred_encoding>')"> in _both_ your user _and_ ZMI pages
3. let Python/Zope do the encoding/decoding for you (i.e. don't use .encode(your_preferred_encoding) unless you know what you are doing)
4. you don't have to start Zope with any locale to have ZCTextIndex work nicely with unicode content.

FOR THE BRAVE AND CURIOUS

**1. sys.setdefaultencoding**
<quote from="Toby Dickenson">
<original msg from="Giuseppe Bonelli">
I have utf-8 as sys.defaultencoding and I do not load any locale when starting Zope.
</original>
That is old advice that predates Zope 2.6. It was never a particularly good idea, because it affects all of Python's internals. You only need to encode your unicode as utf-8 (or other encoding) before sending it over the network, and ZPublisher is capable of doing that itself if you tell it the encoding in the header.
</quote>

That's true, but you will definitely need to set a default encoding if you are going to use Python code. If not set, the default encoding is ascii and you will get the usual "encode error, ordinal not in range(128)" error when doing something as simple as print string_with_some_special_chars_inside.

To set a default encoding: create (or edit) a sitecustomize.py file in your zope_install_dir\bin\lib (or in the Python used by Zope) and use:

    import sys
    sys.setdefaultencoding(my_encoding)

**2. content-type**
This is trivial for your user interface pages: just add

    <dtml-call "RESPONSE.setHeader('content-type','text/html;charset=<dtml-var your_preferred_encoding>')">

in the <head/> of your standard_html_header.

I found it non-trivial for the ZMI pages, as I discovered that the default encoding in the ZMI pages is governed by a variable named management_page_charset, which has a default of iso-8859-1 (and I was using utf-8 for automatically generated title properties ...). If you need to change this default you can use a property named management_page_charset in the top folder of your app. This works, but is not future proof (see the manage_page_header source under lib/python/App/dtml for details on this). The best option would probably be to use a

    <dtml-call "REQUEST.set('management_page_charset','your_preferred_encoding')">

Why REQUEST.set and not just use a meta "http-equiv=content-type"?

<quote from="Tino Wildenhain">
<original msg from="Dieter Maurer">
I never understood why the meta "http-equiv=content-type" did not work, just recognized that it did not work reliably.
</original>
This influencing of HTTP headers via HTML is very problematic because
1) there are often real HTTP headers, and there seems to be no definition of which takes precedence over the other
2) downstream proxies cannot read HTML-embedded HTTP headers, but base their caching strategy on the real headers. This will sometimes lead to confusing experiences.
In general, if you have control over the real HTTP headers, you should use it and not include something like that in HTML. With Zope we are in the happy position to have control, as opposed to a "web-business-card" where you just dump a couple of HTML files onto a hoster's server. A patched ZPT could transport information from HTML meta to REQUEST... interesting idea.
</quote>

**3. ZCTextIndex**
My original ZCTextIndex problems were due to a combination of the above and leftover words from indexes removed during testing (Heisenberg Uncertainty Principle applied to s/w at play here: during testing, if you change something, the testing itself is influencing the system). If you still experience problems, delete the lexicon, recreate it and reindex. If problems persist, double check that you are not mixing unicode/non-unicode content in your indexes (if you followed quick route #3 above, this should not be the case).

(H)ACKNOWLEDGMENTS
Thanks to all who helped (Dieter, Tino, Toby, Hannu, Hugo).

END NOTE
As always, a debugging session is not fun, but you end up with some new Python/Zope insights.

__peppo
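
Quick-route points 2 and 3 can be sketched roughly as follows. This is written for a current Python 3 rather than the Py2.1 used above, and `render_page` is a hypothetical helper for illustration only, not a Zope or ZPublisher API. The idea is to keep text as unicode internally, encode it once where it leaves the program, and announce the same charset in the real HTTP header.

```python
def render_page(body, charset="utf-8"):
    """Return (headers, payload) with the charset declared and applied once."""
    # Hypothetical helper: the header and the payload use the same charset.
    headers = {"Content-Type": "text/html;charset=%s" % charset}
    payload = body.encode(charset)  # the only place where encoding happens
    return headers, payload


headers, payload = render_page("<p>caf\u00e9</p>")
print(headers["Content-Type"])  # text/html;charset=utf-8
print(payload)                  # b'<p>caf\xc3\xa9</p>'
```

Because the charset is a single parameter, the header and the bytes cannot drift apart, which is the mismatch that caused the ZMI problem described in section 2.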
On Friday 25 July 2003 11:09, Giuseppe Bonelli wrote:
**1. sys.setdefaultencoding** <quote from="Toby Dickenson"> <original msg from="Giuseppe Bonelli"> I have utf-8 as sys.defaultencoding and I do not load any locale when starting Zope. </original> That is old advice that predates Zope 2.6. It was never a particularly good idea, because it affects all of Python's internals. You only need to encode your unicode as utf-8 (or other encoding) before sending it over the network, and ZPublisher is capable of doing that itself if you tell it the encoding in the header. </quote>
That's true, but you will definitely need to set a default encoding if you are going to use Python code. If not set, the default encoding is ascii and you will get the usual "encode error, ordinal not in range(128)" error when doing something as simple as print string_with_some_special_chars_inside.
you won't get any error when doing....

    print string_with_some_special_chars_inside

... but you may if doing.....

    print unicode_string_with_some_non_ascii_chars_inside

I find this inconvenient too, but that is the way that the Python language is defined. There is code inside Zope, and other libraries, that assumes Python behaves this way. Your call to sys.setdefaultencoding will break these libraries, because it changes the Python language globally. You need....

    print unicode_string_with_some_non_ascii_chars_inside.encode('utf-8')

--
Toby Dickenson - http://www.geminidataloggers.com/people/tdickenson
Want a job like mine? http://www.geminidataloggers.com/jobs for Software Engineering jobs at Gemini Data Loggers in Chichester, West Sussex, England
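
The explicit-encoding approach can be sketched as below. This uses Python 3 syntax, where `str` plays the role of the old `unicode` type, and `to_output` is a hypothetical helper name. The ascii case reproduces the "ordinal not in range(128)" failure; the explicit utf-8 case avoids it without touching any global setting.

```python
def to_output(text, encoding="utf-8"):
    """Encode text explicitly at the point where it leaves the program."""
    return text.encode(encoding)


text = "na\u00efve"

try:
    to_output(text, "ascii")
except UnicodeEncodeError as exc:
    # The ascii codec cannot represent U+00EF.
    print("ascii failed:", exc.reason)  # ascii failed: ordinal not in range(128)

print(to_output(text))  # b'na\xc3\xafve'
```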
Toby Dickenson wrote at 2003-7-25 12:32 +0100:
... ...
That's true, but you will definitely need to set a default encoding if you are going to use Python code. If not set, the default encoding is ascii and you will get the usual "encode error, ordinal not in range(128)" error when doing something as simple as print string_with_some_special_chars_inside.
...
you won't get any error when doing....

    print string_with_some_special_chars_inside

... but you may if doing.....

    print unicode_string_with_some_non_ascii_chars_inside
I find this inconvenient too, but that is the way that the Python language is defined. There is code inside Zope, and other libraries, that assumes Python behaves this way.
I do not want to believe this. Can you give an example? What do these libraries do when they get a UnicodeError encoding exception?
Your call to sys.setdefaultencoding will break these libraries, because it changes the Python language globally.
I live in the "iso-8859-1" area and have (accordingly) defined the default encoding as "iso-8859-1". I have not met any library that has had problems with this -- neither Zope nor any other Python library I am using.

Due to this default encoding, I save myself from myriads of encoding errors and make interactive debugging feasible. Surely you will understand that I do not want to add an "encode('iso-8859-1')" to every value I output with "print" during interactive debugging.

Dieter
On Friday 25 July 2003 19:29, Dieter Maurer wrote:
I find this inconvenient too, but that is the way that the Python language is defined. There is code inside Zope, and other libraries, that assumes Python behaves this way.
I do not want to believe this. Can you give an example?
pDocumentTemplate contains the code below, which relies on 'string'.join raising a UnicodeError exception if a list contains a mix of unicode strings and non-ascii plain strings:

    def join_unicode(rendered):
        """join a list of plain strings into a single plain string,
        a list of unicode strings into a single unicode string,
        or a list containing a mix into a single unicode string with
        the plain strings converted from latin-1
        """
        try:
            return ''.join(rendered)
        except UnicodeError:
            # A mix of unicode string and non-ascii plain strings.
            # Fix up the list, treating normal strings as latin-1
            rendered = list(rendered)
            for i in range(len(rendered)):
                if type(rendered[i]) is StringType:
                    rendered[i] = unicode(rendered[i],'latin-1')
            return u''.join(rendered)

[note that actually Zope uses the "C optimised" cDocumentTemplate alternative, which contains equivalent logic]
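
For readers on a current Python, a rough Python 3 analogue of the same pattern is sketched below. It is an illustrative translation under stated assumptions, not Zope code: Python 3 has no implicit default-encoding conversion, so mixing bytes and str raises TypeError rather than UnicodeError, and `join_mixed` is a hypothetical name. The structure is the same as above: try the cheap join first and rely on the exception to detect a mixed list.

```python
def join_mixed(rendered):
    """Join str items; if bytes are mixed in, decode them as latin-1 first."""
    try:
        return "".join(rendered)
    except TypeError:
        # At least one item is bytes: treat those as latin-1, mirroring
        # how the original treats non-ascii plain strings.
        fixed = [
            item.decode("latin-1") if isinstance(item, bytes) else item
            for item in rendered
        ]
        return "".join(fixed)


# ascii() keeps the output printable on any terminal encoding.
print(ascii(join_mixed(["caf", b"\xe9"])))  # 'caf\xe9'
```

Unlike the original, this sketch also decodes a list made up only of bytes, because "".join rejects bytes items in Python 3. The fallback is only reached when the first join raises, which is the behaviour a changed global default would alter in the Python 2 code.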
Due to this default encoding, I save myself from myriads of encoding errors and make interactive debugging feasible. Surely you will understand that I do not want to add an "encode('iso-8859-1')" to every value I output with "print" during interactive debugging.
Yes. I work in a mostly utf-8 world, and I originally wanted Python's unicode support to work somewhat like you are using it when first pioneering unicode in Zope. Guido convinced me otherwise: http://aspn.activestate.com/ASPN/Mail/Message/i18n-sig/581409 (last paragraph in particular)
I have not met any library that has had problems with this -- neither Zope nor any other Python library I am using.
I think that is similar to how many people used those pentiums with the fdiv bug without noticing a problem.

I hope this helps,

--
Toby Dickenson - http://www.geminidataloggers.com/people/tdickenson
Toby Dickenson wrote at 2003-7-29 19:02 +0100:
On Friday 25 July 2003 19:29, Dieter Maurer wrote:
... Due to this default encoding, I save myself from myriads of encoding errors and make interactive debugging feasible. Surely you will understand that I do not want to add an "encode('iso-8859-1')" to every value I output with "print" during interactive debugging.
Yes. I work in a mostly utf-8 world, and I originally wanted Python's unicode support to work somewhat like you are using it when first pioneering unicode in Zope. Guido convinced me otherwise:
http://aspn.activestate.com/ASPN/Mail/Message/i18n-sig/581409 (last paragraph in particular)
I read the explanation but I am not convinced.

While I fully agree that modules and packages destined to be used worldwide (such as Zope) should not make any assumptions about the default encoding (I think "[cp]DocumentTemplate" is buggy in this respect), I feel strongly that Python should provide a means to determine the default encoding and not fix it to a US standard.

I live in an "ISO-8859-15" world. Terminal, file system, servers all use this encoding. Especially for interactive use of the Python interpreter, it would be *really* nasty to have to explicitly encode each (potential) unicode string in the *true* default encoding of my environment.

Python currently restricts the use of "setdefaultencoding" to initialization time. I can live with this restriction as I would not change it afterwards anyway.

I faintly remember that Guido has reservations about locale support. Nevertheless, it is a good step to let software adapt to the environment it is used in. In my view, the default encoding is a similar device which allows Python's Unicode support to adapt to the defaults employed where Python programs are executed (rather than favour US usage only).
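
One way to get environment-driven output without changing a process-wide default is to look up the locale's preferred encoding and encode explicitly when printing. The sketch below uses Python 3 and the standard `locale.getpreferredencoding`. `encode_for_terminal` is a hypothetical helper, and replacing unencodable characters with "?" is an assumption chosen for debugging convenience, not something proposed in this thread.

```python
import locale


def encode_for_terminal(text, encoding=None):
    """Encode text for the local terminal without raising on odd characters."""
    if encoding is None:
        # Ask the environment instead of hard-coding an ascii default.
        encoding = locale.getpreferredencoding(False)
    return text.encode(encoding, errors="replace")


print(encode_for_terminal("\u20ac 10", "iso-8859-15"))  # b'\xa4 10'
print(encode_for_terminal("\u20ac 10", "ascii"))        # b'? 10'
```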
I have not met any library that has had problems with this -- neither Zope nor any other Python library I am using.
I think that is similar to how many people used those pentiums with the fdiv bug without noticing a problem.
I know that no device is completely faultless. Nevertheless, I am using many of them. I would not have been worried had I observed that I use a Pentium with the "fdiv" bug, as I do not control power plants, airlines or similarly sensitive systems but just develop software. The worst thing that may happen is that I spend some hours trying to find the reason for an apparently unexplainable behaviour.

Dieter
Participants (3): Dieter Maurer, Giuseppe Bonelli, Toby Dickenson