[REQ] Support for multi-lingual components of TextIndexNG wanted
Hi folks,

the next version of TextIndexNG will focus on multi-lingual issues (and has full Unicode support). I need some support from the community for components that are language-dependent:

- stopwords
  Stopwords are words that are removed during the indexing process because they are very common, e.g. 'a', 'the', 'for' in English.

- normalization
  Normalization means the translation of special characters or a sequence of characters to a simpler form, e.g. 'Ä' -> 'Ae', 'ä' -> 'ae', 'ß' -> 'ss', or a more radical reduction like 'Ä' -> 'A', 'ä' -> 'a', 'ß' -> 's'. Such a reduction allows more fault-tolerant searching.

At the moment TextIndexNG supports only German and English. If you would like to see more languages supported by TextIndexNG, feel free to contribute lists of stopwords for your language and/or translation rules for the normalization step.

Thanks,
Andreas
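For readers who want to see the two steps concretely, here is a minimal Python sketch. It is not TextIndexNG code: the names `STOPWORDS_EN`, `NORMALIZE_DE`, `NORMALIZE_DE_RADICAL`, `normalize` and `index_terms` are hypothetical, the tables hold only the examples quoted in the message, and the order (normalize first, then drop stopwords) is an assumption. It also handles only single-character rules, not character sequences.

```python
# Illustrative only: tiny sample tables, not TextIndexNG's actual data.
STOPWORDS_EN = {"a", "the", "for"}

# Conservative and radical German reductions taken from the message above.
NORMALIZE_DE = {"Ä": "Ae", "ä": "ae", "ß": "ss"}
NORMALIZE_DE_RADICAL = {"Ä": "A", "ä": "a", "ß": "s"}


def normalize(word, rules):
    """Replace every character that has a rule; keep all others unchanged."""
    return "".join(rules.get(ch, ch) for ch in word)


def index_terms(text, stopwords, rules):
    """Split on whitespace, normalize and lowercase each word, drop stopwords."""
    terms = []
    for word in text.split():
        term = normalize(word, rules).lower()
        if term not in stopwords:
            terms.append(term)
    return terms
```

With the conservative table, `index_terms("the Bär for a Straße", STOPWORDS_EN, NORMALIZE_DE)` yields `['baer', 'strasse']`, so a search for "strasse" would also find "Straße".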
On Mon, Jun 17, 2002 at 08:51:26AM -0400, Andreas Jung wrote:
- normalization
Normalization means the translation of special characters or a sequence of characters to a more simpler form, e.g. 'Ä' -> 'Ae', 'ä' -> 'ae', 'ß' -> 'ss' or a more radical reduction like 'Ä' -> 'A', 'ä' -> 'a', 'ß' -> 's'. Such a reduction allows more fault tolerant searching.
At the moment TextIndexNG supports only German and English.
What about non-ISO-8859 languages? How can I create normalization rules if my language does not have any mapping to the Latin alphabet?

Oleg.
--
Oleg Broytmann http://www.zope.org/Members/phd/ phd@phd.pp.ru
Programmers don't die, they just GOSUB without RETURN.
--On Monday, June 17, 2002 17:03 +0400 Oleg Broytmann <phd@phd.pp.ru> wrote:
What about non-ISO-8859 languages? How can I create normalization rules if my language does not have any mapping to the Latin alphabet?
In the current implementation, normalizers can be specified through a text file. Inside the file you can declare the language and the encoding used, e.g.

    # german normalizer
    # $Id: de.txt,v 1.2.2.1 2002/06/13 12:50:08 ajung Exp $
    # language = german
    # encoding = iso-8859-1
    Ä Ae
    Ö Oe
    Ü Ue
    ä ae
    ö oe
    ü ue
    ß ss

When the file is parsed, every rule is translated to Unicode using the specified encoding.

-aj
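The parsing step described above could look roughly like the following Python sketch. This is not the TextIndexNG parser: `parse_normalizer` and `DIRECTIVE` are hypothetical names, and the sketch assumes that the language and encoding are declared in comment lines of the form `# key = value`, that those header lines are plain ASCII, and that each rule is a source string and a replacement separated by whitespace.

```python
import re

# Assumed directive syntax: a comment line such as "# encoding = iso-8859-1".
DIRECTIVE = re.compile(r"#\s*(language|encoding)\s*=\s*(\S+)")


def parse_normalizer(raw):
    """Return (language, rules) parsed from the raw bytes of a normalizer file.

    The rules are decoded to Unicode strings using the declared encoding.
    """
    language = None
    encoding = "ascii"  # assumption: used until a directive says otherwise
    rules = {}
    for line in raw.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith(b"#"):
            # Comment line: it may carry a language or encoding declaration.
            match = DIRECTIVE.match(stripped.decode("ascii", "replace"))
            if match:
                key, value = match.groups()
                if key == "language":
                    language = value
                else:
                    encoding = value
            continue
        # Rule line: decode with the declared encoding, then split once.
        parts = stripped.decode(encoding).split(None, 1)
        if len(parts) == 2:
            source, target = parts
            rules[source] = target
    return language, rules
```

Because each rule is decoded with the encoding named in the file itself, a file for a non-Latin script would only need to declare its own encoding (for example `koi8-r`); what the rules should map to in that case is a separate question that the sketch does not answer.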