[Zope-dev] SearchIndex Splitter lowercase indexes?

Christian Robottom Reis kiko@async.com.br
Thu, 24 May 2001 20:24:23 -0300 (BRT)


On Thu, 24 May 2001, Michel Pelletier wrote:

> This is a very common indexing strategy to save space and make searches
> more relevant.  Otherwise 'Dog' and 'dog' would return two completely
> different result sets.

Fine. However:

>>> s.indexes('Foo')
[]

Is _this_ supposed to happen, too? Ah, I guess so. It's the problem with
using this outside of Zope. :-) I couldn't figure out what it was for.

> find the same words.  The splitter can also be passed a mapping of
> synonyms, so you can tell the splitter that "automobile" "ford" and "lisp"
> are all synonymous to the word "car".

Yes, I've seen this in stop_words_dict from Lexicon.py.
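
To make the synonym idea concrete, here is a minimal sketch in plain Python. It is not the Splitter or Lexicon code: the apply_synonyms helper is hypothetical, and treating a value of None as "drop this word" is an assumption made for this sketch only.

```python
def apply_synonyms(words, synonyms):
    """Lowercase each word, then replace it with its synonym if one is mapped.

    Assumption for this sketch: a word mapped to None is treated as a
    stop word and dropped from the result.
    """
    result = []
    for word in words:
        word = word.lower()
        if word in synonyms:
            word = synonyms[word]
            if word is None:
                continue  # stop word: skip it entirely
        result.append(word)
    return result


# Hypothetical mapping, using the example words from the message above.
synonyms = {"automobile": "car", "ford": "car", "lisp": "car", "the": None}

terms = apply_synonyms(["The", "Ford", "and", "the", "Automobile"], synonyms)
print(terms)
```

With a mapping like this, a search for "car" and a search for "Ford" would end up looking for the same indexed term.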

> > It makes TextIndex's position() call behave
> > unexpectedly until you do some tests with the Splitter itself!
>
> position() is currently unimplemented, isn't it?  So does it
> matter?  Also, I don't know what you're doing with position() but anytime

Uhhh, no, it _is_ implemented. It just didn't work like I'd expect :-)

>>> index.positions(1,['crazy'])
[2]
>>> index.positions(1,'crazy')
[]
>>> index.positions(1,['Crazy'])
[]

So it does look up lowercase words. Of course, this is an artifact of the
following point you make:

> you want to look up things in a text index, use the same splitter to munge
> the content before querying the index, otherwise, you may end up not
> finding what you're looking for.

This makes sense:

>>> s = Splitter("Crazy")
>>> index.positions(1,s)
[2]

Ahhm. Okay. Will update my documentation with this important point.
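
For the documentation, the point can be sketched with a toy index. This is only an illustration under stated assumptions: split_words and ToyTextIndex are hypothetical stand-ins, not the real Splitter or TextIndex, and the only behaviour modelled is that words are lowercased before they are stored.

```python
import re


def split_words(text):
    """Hypothetical stand-in for a splitter: lowercase, split on non-word chars."""
    return [w for w in re.split(r"\W+", text.lower()) if w]


class ToyTextIndex:
    """Toy positional index; stores only words produced by split_words()."""

    def __init__(self):
        self._positions = {}  # (doc_id, word) -> list of word positions

    def index_document(self, doc_id, text):
        for pos, word in enumerate(split_words(text)):
            self._positions.setdefault((doc_id, word), []).append(pos)

    def positions(self, doc_id, words):
        found = []
        for word in words:
            found.extend(self._positions.get((doc_id, word), []))
        return sorted(found)


index = ToyTextIndex()
index.index_document(1, "My dog is Crazy about bones")

raw = index.positions(1, ["Crazy"])                 # query not munged
munged = index.positions(1, split_words("Crazy"))   # same splitter as indexing
print(raw, munged)
```

Here the unmunged query finds nothing, because only "crazy" was ever stored, while running the query through the same splitter finds the word.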

> In other words, it's not an easy problem!  There is going to be an
> unimaginable culture clash when asian and other non-romance languages
> catch up to the volume of romance language content on the web.

Fascinating points on i18n and l10n of the indexing mechanism. Makes me
wonder how far the current implementation will go before having to be
rewritten, and if the world will survive east-meets-the-west of computing
text.

But I believe the Splitter could stay the same for western languages, from
what I've seen of the code. Can't really see the ing-cutting (stemming)
stuff here.

Take care,
--
/\/\ Christian Reis, Senior Engineer, Async Open Source, Brazil
~\/~ http://async.com.br/~kiko/ | [+55 16] 274 4311