Tobias Kiesling wrote:
Hello all,
I've got some problems with indexing in ZCatalog.
First of all, it doesn't seem possible to index words containing any non-ASCII characters (most importantly, German umlauts). On zope.org the only info on character sets I found was that some are supported, but I couldn't find out which ones.
How is it possible to have a correctly indexed catalog on a German site?
You must set your locale environment variable correctly. See 'man locale' on Linux, for example. Also, you can pass z2.py the -L argument to explicitly set a locale. This will allow you to index German characters.
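As a quick check of what locale a Python process actually picks up from the environment, something like the following sketch may help. It uses only the standard `locale` module; the helper name `active_ctype_locale` is made up for this illustration and is not part of Zope or z2.py.

```python
import locale

def active_ctype_locale():
    """Adopt the environment's locale and return the resulting LC_CTYPE name."""
    try:
        # An empty string means "use LANG / LC_* from the environment".
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale that is not installed; keep the current one.
        pass
    # Called without a new value, setlocale only reports the current setting.
    return locale.setlocale(locale.LC_CTYPE)

print(active_ctype_locale())
```

If this reports 'C' or 'POSIX', no language-specific locale is in effect, and the environment variables (or the -L argument) are probably worth a second look.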
The second problem is related to the first one, but not restricted to German characters. Words containing a hyphen, e.g. names like 'Hans-Dietrich', aren't indexed at all: in this example, 'Hans-Dietrich' cannot be found on the site, and neither can 'Hans' or 'Dietrich'.
Hyphens are hard-coded in the text index as word separators. This is because the text splitter is not very flexible and is very English-centric. To fix this, you must either 1) create your own splitter, or 2) update to the latest CVS, create your own vocabulary, and provide a new splitter. The CVS version of Catalog is much more language-neutral, and there are now two kinds of vocabulary object, English and Globbing-English. Soon someone I'm in contact with will be providing a Japanese vocabulary. If you have the patience, you should develop a German vocabulary to address these issues.
-Michel
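To illustrate the kind of splitting rule a custom splitter could apply, here is a minimal standalone sketch in plain Python. It is not the ZCatalog splitter or vocabulary interface; the names `WORD_RE` and `split_words` are hypothetical, and the choice to index both the hyphenated word and its parts is an assumption about what a German site would want.

```python
import re

# Hypothetical illustration only; this is not the ZCatalog Splitter interface.
# A word is a run of letters (any script, so umlauts count), optionally
# joined to further runs of letters by single hyphens.
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

def split_words(text):
    """Return lowercased index terms, keeping hyphenated words and their parts."""
    terms = []
    for match in WORD_RE.finditer(text):
        word = match.group(0).lower()
        terms.append(word)
        if "-" in word:
            # Also index the parts so 'Hans' and 'Dietrich' match on their own.
            terms.extend(word.split("-"))
    return terms

print(split_words("Hans-Dietrich wohnt in München"))
```

With a rule like this, a search for 'Hans-Dietrich', 'Hans', or 'Dietrich' would each find the document, and words with umlauts are kept whole instead of being broken at the non-ASCII character.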