[Zope] SUMMARY: strange unicode behaviour, unicode and ZCTextIndex

25 Jul 2003

I am posting this for the mail archives. It is all fairly obvious, but I
think it can save others some headaches.

ABSTRACT:
I hit some unicode-related problems when testing a unicode-aware,
multilingual XML repository Zope product. What follows is what I learned
and what is working (which now is, luckily enough, almost everything).

The following has been tested on Zope 2.6.1/Py2.1.3 installed from
binaries on win32 (but it should be platform independent).

THE QUICK ROUTE
1. call sys.setdefaultencoding in sitecustomize.py
2. use RESPONSE.setHeader('content-type', 'text/html;charset=' +
your_preferred_encoding) in _both_ your user _and_ ZMI pages
3. let python/Zope do the encoding/decoding for you (i.e. don't use
.encode(your_preferred_encoding), unless you know what you are doing)
4. You don't have to start Zope with any locale to have ZCTextIndex work
nicely with unicode content.

FOR THE BRAVE AND CURIOUS

***1. sys.setdefaultencoding
<quote from="Toby Dickenson">
<original msg from="Giuseppe Bonelli">
I have utf-8 as sys.defaultencoding and I do not load any locale when
starting Zope.
</original>
That is old advice that predates Zope 2.6. It was never a particularly
good idea, because it affects all of pythons internals. You only need to
encode your unicode as utf-8 (or other encoding) before sending it over
the network, and ZPublisher is capable of doing that itself if you tell
it the encoding in the header.
</quote>

That's true, but you will definitely need to set a default encoding if
you are going to use Python code. If none is set, the default encoding is
ascii and you will get the usual "encode error, ordinal not in
range(128)" error when doing something as simple as
    print string_with_some_special_chars_inside
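
The failure mode above can be sketched as follows. This is only an
illustration, written in present-day Python syntax rather than the
Python 2.1 this post was tested on, and it encodes explicitly instead of
relying on the default encoding. The helper try_encode and the sample
text are made up for the example.

```python
# Hypothetical sample: unicode text with one non-ASCII character (e-acute).
text = u"caf\xe9"


def try_encode(value, encoding):
    """Return the encoded bytes, or None if the codec cannot hold the text."""
    try:
        return value.encode(encoding)
    except UnicodeError:
        # This is the "ordinal not in range(128)" situation for ascii.
        return None


ascii_result = try_encode(text, "ascii")  # ascii has no slot for U+00E9
utf8_result = try_encode(text, "utf-8")   # utf-8 can represent it

print(ascii_result)
print(utf8_result == b"caf\xc3\xa9")
```

The ascii attempt fails while the utf-8 one succeeds, which is why the
codec used for implicit conversions matters as soon as the text leaves
the 7-bit range.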

To set a default encoding, create (or edit) a sitecustomize.py file in
your zope_install_dir\bin\lib (or in the Python used by Zope) and use:
    import sys
    # my_encoding is a placeholder for an encoding name such as 'utf-8'
    sys.setdefaultencoding(my_encoding)

***2. content-type
This is trivial for your user interface pages: just add
    <dtml-call "RESPONSE.setHeader('content-type', 'text/html;charset=' + your_preferred_encoding)">
in the <head/> of your standard_html_header (your_preferred_encoding
being a variable that holds the encoding name).

I found it less trivial for the ZMI pages, as I discovered that the
encoding of the ZMI pages is governed by a variable named
management_page_charset, which has a default of iso-8859-1 (and I was
using utf-8 for automatically generated title properties ...).
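
To show why that mismatch hurts, here is a small sketch of what a
browser effectively does when the bytes are utf-8 but the page declares
iso-8859-1. It is plain Python with no Zope involved, and the variable
names are invented for the example.

```python
# Hypothetical title containing one non-ASCII character (e-acute).
title = u"caf\xe9"

# What goes over the wire when the title is encoded as utf-8.
wire_bytes = title.encode("utf-8")

# What the reader sees if the page charset says iso-8859-1 ...
as_latin1 = wire_bytes.decode("iso-8859-1")
# ... and what they see if the charset matches the actual encoding.
as_utf8 = wire_bytes.decode("utf-8")

print(as_latin1 == title)  # False: the two-byte sequence shows up as two characters
print(as_utf8 == title)    # True
```

The declared charset and the encoding actually used for the bytes have
to agree, on the ZMI pages as much as on the user pages.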

If you need to change this default you can use a property named
management_page_charset in the top folder of your app. This works, but
is not future proof (see the manage_page_header source under
lib/python/App/dtml for details).
The best option would probably be to use a
<dtml-call
"REQUEST.set('management_page_charset','your_preferred_encoding')">.

Why REQUEST.set and not just a meta "http-equiv=content-type" tag?
<quote from="Tino Wildenhain">
<original msg from="Dieter Maurer">
I never understood why the meta "http_equiv=content-type" did not
work, just recognized that it did not work reliably.
</original>
This influencing of HTTP-headers via HTML is very problematic
because
1) there are often real HTTP-headers, there seems to be no
    definition which takes precedence over the other
2) Downstream proxys cannot read HTML embedded HTTP-header, but
    base their caching strategy on the real headers. This will
    sometimes lead to confusing experiences

In general, if you have control over the real HTTP headers, you
should use it and not include something like that in HTML.
With zope we are in the happy position to have control as
opposite to a "web-business-card" where you just dump a couple
of HTML files onto a hosters server.

A patched ZPT could transport information from HTML meta
to REQUEST... interesting idea.
</quote>

***3. ZCTextIndex
My original ZCTextIndex problems were due to a combination of the above
and of leftover words from indexes removed during testing (the Heisenberg
Uncertainty Principle applied to software: whatever you change while
testing influences the very system you are testing).

If you still experience problems, delete the lexicon, recreate it and
reindex. If problems persist, double check that you are not mixing
unicode/non unicode content in your indexes (if you followed quick route
#3 above, this should not be the case).
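
As a rough picture of what mixing does, here is a toy sketch. It is NOT
the ZCTextIndex lexicon, just a Python set standing in for one, and
to_text is a made-up helper. The same word stored once as unicode text
and once as utf-8 bytes ends up as two different entries unless
everything is decoded on the way in.

```python
def to_text(value, encoding="utf-8"):
    """Decode byte strings to unicode text; pass unicode text through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# The same word, once as unicode text and once as utf-8 encoded bytes.
raw_words = [u"caf\xe9", u"caf\xe9".encode("utf-8")]

# Toy stand-ins for a lexicon: one fed mixed input, one fed normalised input.
mixed_lexicon = set(raw_words)
clean_lexicon = set(to_text(word) for word in raw_words)

print(len(mixed_lexicon), len(clean_lexicon))
```

With mixed input the toy lexicon holds two entries for one word, and a
search for one form will not find documents indexed under the other.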

(H)ACKNOWLEDGMENTS
Thanks to all who helped (Dieter, Tino, Toby, Hannu, Hugo).

END NOTE
As always, a debugging session is not fun, but you end up with some new
python/Zope insights.

__peppo