SearchIndex Splitter lowercase indexes?
Hi Michel, Michel Pelletier wrote:
The splitter should really be a modular component. That's what vocabularies were originally for: to store language-specific artifacts like word lists and splitters. For example, stripping the "ing" suffix obviously only makes sense in English. So if you want to change this behavior, make your own vocabulary with its own custom splitter.
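To make the idea of a vocabulary carrying its own splitter concrete, here is a minimal sketch. The `Vocabulary` class and both splitter functions are hypothetical names invented for illustration, not the actual SearchIndex or vocabulary API, and the "ing" stripping is deliberately crude.

```python
import re


def english_splitter(text):
    """Lowercase, split on non-letters, and strip a trailing "ing".

    The suffix rule is a crude, English-only heuristic (assumption:
    only words longer than five letters are stripped).
    """
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-3] if w.endswith("ing") and len(w) > 5 else w for w in words]


def plain_splitter(text):
    """Lowercase and split on non-word characters, with no suffix stripping."""
    return re.findall(r"\w+", text.lower())


class Vocabulary:
    """Hypothetical container pairing a word list with one splitter."""

    def __init__(self, splitter):
        self.splitter = splitter
        self.words = {}  # word -> integer id, assigned in order of first use

    def index(self, text):
        """Split text with this vocabulary's splitter and return word ids."""
        return [self.words.setdefault(w, len(self.words))
                for w in self.splitter(text)]


english = Vocabulary(english_splitter)
german = Vocabulary(plain_splitter)
print(english.index("Splitting words"))
print(sorted(german.index("Wörter trennen")))
```

Changing the splitting behavior then means passing a different function to the vocabulary rather than touching the index itself.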
This is because each language has very different splitting requirements, and even different meanings of the word "word". Imagine, for example, splitting Japanese or one of the Chinese languages (based textually on Kanji).
Just imagine German! There are composite words without spaces or other non-alphanumeric characters between them.
Identifying words in Kanji is a very hard problem. In romance languages it's easy: words are separated by spaces. In Kanji, words are differentiated by the context of the surrounding characters; there are no "spaces". Splitting Kanji text requires a pre-existing dictionary and some interesting heuristic matching algorithms. And that's only half of Japanese itself, really, since there are two other alphabets (hiragana and katakana) that *are* character-phonetic like romance languages, and all three alphabets are commonly mixed together in the same sentence! Chinese languages may also have these phonetic alphabets.
The same applies for German: You'd need a huge dictionary with word stems, exceptions, and stop words. Stems of many words change in different cases, too.
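As a rough illustration of why a dictionary is needed, here is a small sketch of dictionary-based compound splitting. The function name `split_compound` and the four-word `LEXICON` are made up for this example. It ignores linking letters, changed stems, exceptions, and stop words, which is exactly where a real German splitter gets hard.

```python
def split_compound(word, lexicon):
    """Return lexicon entries that concatenate to `word`, or None.

    Backtracking search: longer prefixes are tried first, and a prefix
    is only accepted if the remainder can be split as well.
    """
    word = word.lower()
    if not word:
        return []
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head in lexicon:
            rest = split_compound(word[end:], lexicon)
            if rest is not None:
                return [head] + rest
    return None  # no full cover of the word with known entries


# Toy word list (assumption); a usable one would be far larger.
LEXICON = {"donau", "dampf", "schiff", "fahrt"}

print(split_compound("Donaudampfschifffahrt", LEXICON))
print(split_compound("Donauwelle", LEXICON))
```

The second call returns None because "welle" is not in the toy lexicon, which shows how completely the result depends on the size and quality of the dictionary.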
In other words, it's not an easy problem! There is going to be an unimaginable culture clash when Asian and other non-romance languages catch up to the volume of romance language content on the web.
Well, English and German aren't in fact Romance languages; they're Germanic :-)
Eric
participants (1): E. Seifert