"stemmed and stopped": problems with stopwords and the 'and' operator
OK, so the TextIndex of a ZCatalog says that it "stems and stops" the words before indexing them and, one would hope, before searching for them. I always thought that "stem" meant "derive the stem of the word" (so as to make the index smaller). I just peeked at the Splitter.c source code for the first time, and that sure ain't it. The American phrase would be "truncate and stop", I think. In any case, "stem" in the source code comments means truncate at MAX_WORD, which is 64 characters. That's an aside.
Now, about stopping. There is a list of "stop words" that don't get indexed. Fine. I'm having quite a bit of trouble figuring out exactly where this is happening, but let's ignore that for now on the indexing side. It happens, that's enough for now. Now, what happens to stop words in an input search string?
From my single stepping the code, stopwords are still in the query string while it is being parsed, and get looked up in the index.
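To make the "truncate and stop" reading concrete, here is a minimal pure-Python sketch. It is not the Splitter.c code: the function name `split_words`, the word-splitting regular expression, and the contents of `STOP_WORDS` are assumptions made up for illustration. Only the 64-character MAX_WORD limit comes from the description above.

```python
import re

MAX_WORD = 64  # truncation length quoted above from Splitter.c

# Hypothetical stop list, for illustration only (not the real Zope list).
STOP_WORDS = frozenset({"the", "and", "a"})


def split_words(text, stop_words=frozenset()):
    """Lowercase the text, split it into words, truncate, then drop stop words."""
    words = []
    for raw in re.findall(r"\w+", text.lower()):
        word = raw[:MAX_WORD]          # "stem" in the sense of "truncate"
        if word not in stop_words:     # "stop": skip words on the stop list
            words.append(word)
    return words
```

Calling `split_words(text, STOP_WORDS)` models the indexing side, while calling `split_words(text)` with no stop list models what the single stepping suggests happens to a query string: the stop words stay in.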
So, here is the heart of my problem: consider the search string
    someword and someotherword
Suppose 'someword' is a stopword. It doesn't get indexed because it is considered too common. Now, I would think that if this search string is submitted, the result would be to return the hits for 'someotherword'. This might, however, not be other people's opinion. So, is the fact that TextIndex appears to return the null set in this case a bug or a feature?
I say 'appears' because I actually get 2 hits (out of about 2000 with the keyword 'car') in my database when I search on 'car and the'. I tried to single step through the logic using the debugger, but when the call is made to the splitter with the stopword passed in, python core dumps. I can do 'from SearchIndex.Splitter import Splitter', and call Splitter, and see that stopwords are not removed, but I can't do 'from SearchIndex.UnTextIndex import Splitter' because it complains about not being able to import Persistent from Persistence. (*That* problem was reported by someone else in another context not too long ago.)
However, it's pretty clear that this null set return is what is happening, since when the evaluate subroutine is entered, the stop word is in the partially parsed string, and is in fact passed to the Splitter in the __getitem__ of the text index. If the splitter stopped it, the returned result set would be None. If the splitter doesn't stop it, the text index is still going to return a null set as the result for that word, since it doesn't appear in the index by definition. An 'and' of any result set with None is going to be the null set.
So it looks like the thing was designed this way: the stop words get "deleted" from the search string by not being in the index and by therefore returning null sets when looked up. This works fine for 'or' logic, but not for 'and' logic, IMO. Contrary opinions? Helpful hints? If I'm right and this needs fixing, it's going to be a bit of a bear to do, I think.
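The 'and' problem can be shown with a toy index built on plain Python sets. This is only a sketch of the semantics being argued about, not the UnTextIndex code and not the patch mentioned later in the thread; `lookup`, `and_current`, and `and_lenient` are hypothetical names, and the index contents are invented.

```python
def lookup(index, word, stop_words):
    """Return the set of document ids for a word, or None for a stop word."""
    if word in stop_words:
        return None
    return set(index.get(word, ()))


def and_current(left, right):
    """Behaviour described above: a stop word acts like an empty result set."""
    return (left or set()) & (right or set())


def and_lenient(left, right):
    """Alternative: a stop word places no constraint on the other operand."""
    if left is None:
        return right
    if right is None:
        return left
    return left & right


# Invented data: three documents contain 'car'; 'the' is never indexed.
index = {"car": [1, 2, 3]}
stop_words = {"the"}

car = lookup(index, "car", stop_words)
the = lookup(index, "the", stop_words)
```

Here `and_current(car, the)` is the empty set, while `and_lenient(car, the)` gives back the hits for 'car'. For 'or' the two readings agree, which is why the design only hurts 'and' queries.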
(Where those two hits are coming from is a *real* mystery, but one I'm going to ignore for a little while yet since I can't yet get the debugger to work for me without crashing. I have a sneaking suspicion it is related to my confusion about where stopwords get removed in the indexing process, but it will probably take a while for me to prove or disprove that notion.) --RDM
On Thu, Aug 17, 2000 at 02:23:59AM -0400, R. David Murray wrote:
I can do 'from SearchIndex.Splitter import Splitter', and call Splitter, and see that stopwords are not removed, but I can't do 'from SearchIndex.UnTextIndex import Splitter' because it complains about not being able to import Persistent from Persistence. (*That* problem was reported by someone else in another context not too long ago.)
No clues as to where you'll find the stopword code, but the Persistence thingy is caused by the magic that ZODB performs: it initializes the correct Persistence module when it itself is imported. This way Jim managed to have ZODB3 and BoboPOS2 exist in the same Zope distribution.
Do an import ZODB before you do your Splitter import, and all will be dandy.
--
Martijn Pieters
| Software Engineer  mailto:mj@digicool.com
| Digital Creations  http://www.digicool.com/
| Creators of Zope   http://www.zope.org/
| ZopeStudio:        http://www.zope.org/Products/ZopeStudio
-----------------------------------------------------
On Thu, 17 Aug 2000, Martijn Pieters wrote:
No clues as to where you'll find the stopword code, but the Persistence thingy is caused by the magic that ZODB performs: it initializes the correct Persistence module when it itself is imported. This way Jim managed to have ZODB3 and BoboPOS2 exist in the same Zope distribution.
Do an import ZODB before you do your Splitter import, and all will be dandy.
Thanks, worked like a charm. I think I've found the stopword code. To cement my understanding I'm going to write this up. Maybe somebody will find it useful <grin>.
UnTextIndex accesses the splitter through the Splitter method of the Lexicon associated with the index. That Lexicon instance is created when the Vocabulary or Catalog is created. (Comments in the code indicate that in the future each TextIndex could have its own Lexicon, which makes sense to me.) A Lexicon instance can be passed a list of stop words (and/or synonyms) when it is initialized. Vocabulary does this for Lexicon (but not for GlobbingLexicon, which internal comments indicate does not use stopwords). The Lexicon instance stores this list in a property, and passes it to the real Splitter when its Splitter method is called.
So the fix that I submitted earlier today to the collector for the 'and' involving stopwords should work for 'listed' stopwords as well as the punctuation and numbers that I was able to test it on. (In my comments in the patch I said I wasn't sure.) I still can't test it because I'm using a Globbing lexicon <wry grin>. In perusing the code I'm also feeling more confident that the change I made to __getitem__ in that fix is in fact semantically correct. Or at least consistent with the rest of the __getitem__ code.
GlobbingLexicon not using stopwords also explains the few hits on 'the and car' that I got and was confused by. Those entries really must have 'the' as an indexed term, unlike the rest.
Oh, by the way, the comments in TextIndex seem to agree with me as to the conventional meaning of the word 'stemmed' <grin>. --RDM
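The delegation described in that write-up can be sketched as follows. These are toy classes, not the Zope ones: `ToyLexicon`, `ToyGlobbingLexicon`, and `toy_splitter` are hypothetical names, and the real constructor and Splitter signatures may well differ. The sketch only mirrors the shape of the design: the lexicon stores the stop list and hands it to the splitter, and the globbing variant never supplies one.

```python
import re


def toy_splitter(text, stop_words=None):
    """Stand-in for the real splitter: split into words, drop any stop words."""
    stop_words = stop_words or ()
    return [w for w in re.findall(r"\w+", text.lower()) if w not in stop_words]


class ToyLexicon:
    """Stores a stop list and passes it to the splitter on every call."""

    def __init__(self, stop_words=None):
        self.stop_words = frozenset(stop_words or ())

    def Splitter(self, text):
        return toy_splitter(text, self.stop_words)


class ToyGlobbingLexicon(ToyLexicon):
    """Globbing variant: no stop list, so every word reaches the index."""

    def __init__(self):
        super().__init__(stop_words=None)

    def Splitter(self, text):
        return toy_splitter(text)
```

With `ToyLexicon(["the"])`, the phrase "the car" splits to just 'car'; with `ToyGlobbingLexicon()`, 'the' survives as an indexable term, which matches the explanation given for the stray 'the and car' hits.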