[Zope-dev] SearchIndex Splitter lowercase indexes?
E. Seifert
e.seifert@gmx.net
Fri, 25 May 2001 09:17:28 +0200
Hi Michel,
Michel Pelletier wrote:
>The splitter should really be a modular component. That's what
>vocabularies were originally for, to store language specific artifacts
>like word lists and splitters. For example, stripping the "ing" suffix
>obviously only makes sense in English. So if you want to change this
>behavior, make your own vocabulary with its own custom splitter.
>
>This is because each language has very different splitting requirements,
>and even different meanings of the word "word". Imagine, for example,
>splitting Japanese or one of the Chinese languages (based textually on
>Kanji).
Just imagine German! There are composite words without spaces or other
non-alphanumeric characters between them.
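
To make the modular-splitter idea concrete, here is a rough Python sketch.
It is not the real SearchIndex or Vocabulary API: the names Vocabulary,
simple_splitter and english_splitter are made up for illustration, and the
stop-word lists are tiny placeholders. It only shows a vocabulary carrying
its own splitter, and that a splitter based on non-alphanumeric characters
leaves a German compound as a single token.

```python
import re

# Hypothetical names throughout; this is NOT the actual Zope Vocabulary API.

def simple_splitter(text):
    """Split on runs of non-alphanumeric characters and lowercase each word."""
    return [w.lower() for w in re.split(r"\W+", text) if w]


def english_splitter(text):
    """Like simple_splitter, but also strips a trailing "ing" (English only)."""
    words = []
    for w in simple_splitter(text):
        # Crude suffix stripping; the length check avoids mangling short words.
        if w.endswith("ing") and len(w) > 5:
            w = w[:-3]
        words.append(w)
    return words


class Vocabulary:
    """Holds language-specific artifacts: a splitter and a stop-word list."""

    def __init__(self, splitter, stop_words=()):
        self.splitter = splitter
        self.stop_words = frozenset(stop_words)

    def words(self, text):
        """Return the indexable words of text, minus stop words."""
        return [w for w in self.splitter(text) if w not in self.stop_words]


# Placeholder stop-word lists, assumed for the example.
english = Vocabulary(english_splitter, stop_words=["the", "a", "is"])
german = Vocabulary(simple_splitter, stop_words=["der", "die", "das"])

print(english.words("The splitter is indexing text"))
print(german.words("Die Donaudampfschifffahrt"))
```

The German vocabulary returns the whole compound as one word, so a search
for one of its parts would never match it.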
>Identifying words in Kanji is a very hard problem. In romance languages,
>it's easy, words are separated by spaces, but in Kanji words are
>differentiated by the context of the surrounding characters, there are no
>"spaces". Splitting Kanji text requires a pre-existing dictionary and some
>interesting heuristic matching algorithms. And that's only half of
>Japanese itself, really, since there are two other alphabets (hiragana and
>katakana) that *are* character-phonetic like romance languages, and all
>three alphabets are commonly mixed together in the same sentence! Chinese
>language may also have these phonetic alphabets.
The same applies to German: You'd need a huge dictionary with word stems,
exceptions, and stop words.
Stems of many words change in different cases, too.
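
As a rough illustration of the dictionary approach, here is a small Python
sketch of greedy longest-match segmentation with backtracking. It is only
one simple heuristic, not something taken from Zope, and the function name
segment and both word sets are invented for the example. A real splitter
would need a far larger dictionary plus handling for stems, inflected
forms, and stop words.

```python
def segment(text, dictionary):
    """Segment text that has no spaces into words from dictionary.

    Tries the longest matching prefix first and backtracks to shorter
    prefixes if the remainder cannot be segmented.
    Returns a list of words, or None if no complete segmentation exists.
    """
    if not text:
        return []
    for end in range(len(text), 0, -1):
        head = text[:end]
        if head in dictionary:
            rest = segment(text[end:], dictionary)
            if rest is not None:
                return [head] + rest
    return None


# Toy dictionaries, assumed purely for illustration.
german_words = {"donau", "dampf", "schiff", "fahrt", "schifffahrt"}
japanese_words = {"東京", "京都", "東", "都"}

print(segment("donaudampfschifffahrt", german_words))
print(segment("東京都", japanese_words))
```

Preferring the longest match is a design choice, not a rule: the second
example could equally be cut as "東" + "京都", and deciding between such
readings is where the context and the heuristics come in. The plain
recursion can also get slow on long inputs; memoizing on the remaining
text would be the usual fix.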
>In other words, it's not an easy problem! There is going to be an
>unimaginable culture clash when Asian and other non-romance languages
>catch up to the volume of romance language content on the web.
Well, English and German aren't in fact Romance languages; they're Germanic
:-)
Eric