OK, so the TextIndex of a ZCatalog says that it "stems and stops" the words before indexing them and, one would hope, before searching for them. I always thought that "stem" meant "derive the stem of the word" (so as to make the index smaller). I just peeked at the Splitter.c source code for the first time, and that sure ain't it. The American phrase would be "truncate and stop", I think. In any case, "stem" in the source code comments means truncate at MAX_WORD, which is 64 characters. That's an aside.

Now, about stopping. There is a list of "stop words" that don't get indexed. Fine. I'm having quite a bit of trouble figuring out exactly where this is happening on the indexing side, but let's ignore that for now. It happens, and that's enough. Now, what happens to stop words in an input search string?
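For the curious, here is a toy Python sketch of the "truncate and stop" behavior described above, as I understand it from the Splitter.c comments. The names (MAX_WORD, STOP_WORDS, split_for_index) are mine for illustration, not the actual Splitter API:

```python
# Illustrative sketch only -- not the real Splitter.c interface.
MAX_WORD = 64  # "stem" in the source comments means truncate at this length
STOP_WORDS = {'and', 'or', 'the', 'a', 'an', 'of'}  # hypothetical stop list

def split_for_index(text):
    """Truncate each word at MAX_WORD and drop stop words."""
    words = []
    for word in text.lower().split():
        word = word[:MAX_WORD]      # "stem": truncate, not derive the stem
        if word not in STOP_WORDS:  # "stop": too-common words are dropped
            words.append(word)
    return words

# e.g. split_for_index('The car and the truck') -> ['car', 'truck']
```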
From my single stepping the code, stopwords are still in the query string while it is being parsed, and get looked up in the index.
So, here is the heart of my problem: consider the search string 'someword and someotherword'. Suppose 'someword' is a stopword. It doesn't get indexed because it is considered too common. Now, I would think that if this search string is submitted, the result would be the hits for 'someotherword'. This might, however, not be other people's opinion. So, is the fact that TextIndex appears to return the null set in this case a bug or a feature?

I say 'appears' because I actually get 2 hits (out of about 2000 with the keyword 'car') in my database when I search on 'car and the'. I tried to single step through the logic using the debugger, but when the call is made to the splitter with the stopword passed in, python core dumps. I can do 'from SearchIndex.Splitter import Splitter', call Splitter, and see that stopwords are not removed, but I can't do 'from SearchIndex.UnTextIndex import Splitter' because it complains about not being able to import Persistent from Persistence. (*That* problem was reported by someone else in another context not too long ago.)

However, it's pretty clear that this null set return is what is happening: when the evaluate subroutine is entered, the stop word is in the partially parsed string, and is in fact passed to the Splitter in the __getitem__ of the text index. If the splitter stopped it, the returned result set would be None. If the splitter doesn't stop it, the text index is still going to return a null set as the result for that word, since by definition it doesn't appear in the index. An 'and' of any result set with None is going to be the null set.

So it looks like the thing was designed this way: the stop words get "deleted" from the search string by not being in the index, and by therefore returning null sets when looked up. This works fine for 'or' logic, but not for 'and' logic, IMO. Contrary opinions? Helpful hints? If I'm right and this needs fixing, it's going to be a bit of a bear to do, I think.
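To make the failure mode concrete, here is a toy model of what I believe is going on (names and data are made up, this is not the TextIndex code): the stopped word is never in the index, so looking it up yields the null set, and intersecting that with anything yields the null set.

```python
# Toy index: 'the' was stopped, so it simply has no entry.
index = {'car': {1, 2, 3}, 'someotherword': {2, 3}}

def lookup(word):
    # A stopped (or merely absent) word returns the null set.
    return index.get(word, set())

def query_and(*words):
    """Naive 'and' query: intersect the result sets of every word."""
    hits = lookup(words[0])
    for word in words[1:]:
        hits = hits & lookup(word)  # null set poisons the intersection
    return hits

# query_and('car', 'the') comes back empty, even though 'car' has hits --
# which is exactly the behavior I'm questioning for 'and' logic.
```

An 'or' query (union instead of intersection) is unharmed by the empty set, which is why this design works fine for 'or' but not for 'and'.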
(Where those two hits are coming from is a *real* mystery, but one I'm going to ignore for a little while yet since I can't yet get the debugger to work for me without crashing. I have a sneaking suspicion it is related to my confusion about where stopwords get removed in the indexing process, but it will probably take a while for me to prove or disprove that notion.) --RDM