[Zope-dev] SearchIndex Splitter lowercase indexes?

Michel Pelletier michel@digicool.com
Thu, 24 May 2001 15:02:26 -0700 (PDT)


On Thu, 24 May 2001, Christian Robottom Reis wrote:

> Hi, I've been testing SearchIndex's Splitter here, and I'm finding the
> behaviour only a tiny bit strange: it converts the words it splits to
> lowercase. Is this intentional? 

Yes.

> Example:
> 
> >>> import SearchIndex.Splitter
> >>> import SearchIndex.Lexicon
> >>> s = SearchIndex.Splitter.Splitter("Foo Bar Baz",
> 	SearchIndex.Lexicon.stop_word_dict)
> >>> s[0]
> 'foo'
> >>> s.indexes('foo')
> [0]
> 
> Why does this happen? 

This is a very common indexing strategy to save space and make searches
more relevant.  Otherwise 'Dog' and 'dog' would return two completely
different result sets.  

The splitter also removes single-character words, splits words on
non-alphanumeric characters based on your locale (like '-'), and trims
off common English suffixes like 's' and 'ing' so that 'walk' and
'walking' find the same words.  The splitter can also be passed a
mapping of synonyms, so you can tell the splitter that "automobile",
"ford", and "lisp" are all synonymous with the word "car".
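To make that concrete, here is a toy sketch of the same strategy in
plain Python -- this is not the actual SearchIndex.Splitter code, and
the stop words, suffix list, and synonym map below are made up purely
for illustration:

  import re

  STOP_WORDS = {'the': None, 'in': None, 'a': None}  # hypothetical
  SYNONYMS = {'automobile': 'car', 'ford': 'car'}    # hypothetical

  def toy_split(text):
      words = []
      for word in re.split(r'[^a-zA-Z0-9]+', text):
          word = word.lower()                  # 'Dog' and 'dog' collapse
          if len(word) < 2:                    # drop single-char words
              continue
          if word in STOP_WORDS:               # drop common noise words
              continue
          for suffix in ('ing', 's'):          # naive suffix trimming
              if word.endswith(suffix) and len(word) > len(suffix) + 2:
                  word = word[:-len(suffix)]
                  break
          words.append(SYNONYMS.get(word, word))  # canonicalize synonyms
      return words

  print(toy_split("Walking the Dogs in a Ford"))  # ['walk', 'dog', 'car']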

> It makes TextIndex's position() call behave
> unexpectedly until you do some tests with the Splitter itself!

position() is currently unimplemented, isn't it?  So does it
matter?  Also, I don't know what you're doing with position(), but any
time you want to look up things in a text index, use the same splitter
to munge the content before querying the index; otherwise you may end
up not finding what you're looking for.
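For example, using the same calls from the session you quoted (the
actual index lookup is left out here, since it depends on how you are
querying the index):

  import SearchIndex.Splitter
  import SearchIndex.Lexicon

  s = SearchIndex.Splitter.Splitter("Walking",
          SearchIndex.Lexicon.stop_word_dict)
  munged = s[0]  # likely 'walk', after lowercasing and suffix trimming
  # now hand `munged` (not the raw "Walking") to your index query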

The splitter should really be a modular component.  That's what
vocabularies were originally for: to store language-specific artifacts
like word lists and splitters.  For example, stripping the "ing" suffix
obviously only makes sense in English.  So if you want to change this
behavior, make your own vocabulary with its own custom splitter.
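Here's a minimal sketch of what such a custom splitter could look
like, assuming only the interface visible in your session above
(constructed from text plus a stop-word dict, indexable, with an
indexes() method); the real vocabulary plumbing is not shown:

  import re

  class NoStemSplitter:
      """Case-folds and splits, but does no English suffix stripping."""
      def __init__(self, text, stop_words=None):
          stop_words = stop_words or {}
          self._words = [w.lower()
                         for w in re.split(r'[^a-zA-Z0-9]+', text)
                         if len(w) > 1 and w.lower() not in stop_words]

      def __getitem__(self, i):
          return self._words[i]

      def __len__(self):
          return len(self._words)

      def indexes(self, word):
          # positions where `word` occurs in the split text
          return [i for i, w in enumerate(self._words) if w == word]

  s = NoStemSplitter("Walking the Dog")
  print(s[0])               # 'walking' -- suffix left intact
  print(s.indexes('dog'))   # [2]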

This is because each language has very different splitting
requirements, and even different meanings of the word "word".
Imagine, for example, splitting Japanese or one of the Chinese
languages (textually based on Kanji).

Identifying words in Kanji is a very hard problem.  In romance
languages it's easy: words are separated by spaces.  But in Kanji,
words are differentiated by the context of the surrounding characters;
there are no "spaces".  Splitting Kanji text requires a pre-existing
dictionary and some interesting heuristic matching algorithms.  And
that's only half of Japanese itself, really, since there are two other
alphabets (hiragana and katakana) that *are* character-phonetic like
romance languages, and all three alphabets are commonly mixed together
in the same sentence!  Chinese languages may also have these phonetic
alphabets.
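To give a flavor of what "dictionary plus heuristics" means, here's a
toy greedy longest-match segmenter -- the dictionary below is
fabricated, and real segmenters need far larger word lists and much
smarter disambiguation:

  def segment(text, dictionary):
      words, i = [], 0
      while i < len(text):
          # try the longest dictionary entry starting at position i
          for j in range(len(text), i, -1):
              if text[i:j] in dictionary or j == i + 1:
                  words.append(text[i:j])  # single char is the fallback
                  i = j
                  break
      return words

  toy_dict = {'東京', '大学', '東京大学'}   # stand-in for a real word list
  print(segment('東京大学に', toy_dict))    # ['東京大学', 'に']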

In other words, it's not an easy problem!  There is going to be an
unimaginable culture clash when Asian and other non-romance languages
catch up to the volume of romance-language content on the web.

-Michel