OK, so the TextIndex of a ZCatalog says that it "stems and stops" the words before indexing them and, one would hope, before searching for them. I always thought that "stem" meant "derive the stem of the word" (so as to make the index smaller). I just peeked at the Splitter.c source code for the first time, and that sure ain't it. The American phrase would be "truncate and stop", I think. In any case, "stem" in the source code comments means truncate at MAX_WORD, which is 64 characters. That's an aside.

Now, about stopping. There is a list of "stop words" that don't get indexed. Fine. I'm having quite a bit of trouble figuring out exactly where this is happening on the indexing side, but let's ignore that for now. It happens, and that's enough. Now, what happens to stop words in an input search string?
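For the curious, here is a toy Python sketch of the "truncate and stop" behavior described above, as I understand it from the Splitter.c comments. The names (MAX_WORD, STOP_WORDS, split_for_index) are mine for illustration, not the actual Splitter API:

```python
# Illustrative sketch only -- not the real Splitter.c interface.
MAX_WORD = 64  # "stem" in the source comments means truncate at this length
STOP_WORDS = {'and', 'or', 'the', 'a', 'an', 'of'}  # hypothetical stop list

def split_for_index(text):
    """Truncate each word at MAX_WORD and drop stop words."""
    words = []
    for word in text.lower().split():
        word = word[:MAX_WORD]      # "stem": truncate, not derive the stem
        if word not in STOP_WORDS:  # "stop": too-common words are dropped
            words.append(word)
    return words

# e.g. split_for_index('The car and the truck') -> ['car', 'truck']
```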
From my single stepping the code, stopwords are still in the query string while it is being parsed, and get looked up in the index.
So, here is the heart of my problem: consider the search string 'someword and someotherword'. Suppose 'someword' is a stopword. It doesn't get indexed because it is considered too common. Now, I would think that if this search string is submitted, the result would be the hits for 'someotherword'. This might, however, not be other people's opinion. So, is the fact that TextIndex appears to return the null set in this case a bug or a feature?

I say 'appears' because I actually get 2 hits (out of about 2000 with the keyword 'car') in my database when I search on 'car and the'. I tried to single step through the logic using the debugger, but when the call is made to the splitter with the stopword passed in, python core dumps. I can do 'from SearchIndex.Splitter import Splitter', call Splitter, and see that stopwords are not removed, but I can't do 'from SearchIndex.UnTextIndex import Splitter' because it complains about not being able to import Persistent from Persistence. (*That* problem was reported by someone else in another context not too long ago.)

However, it's pretty clear that this null set return is what is happening: when the evaluate subroutine is entered, the stop word is in the partially parsed string, and is in fact passed to the Splitter in the __getitem__ of the text index. If the splitter stopped it, the returned result set would be None. If the splitter doesn't stop it, the text index is still going to return a null set as the result for that word, since by definition it doesn't appear in the index. An 'and' of any result set with None is going to be the null set.

So it looks like the thing was designed this way: the stop words get "deleted" from the search string by not being in the index, and by therefore returning null sets when looked up. This works fine for 'or' logic, but not for 'and' logic, IMO. Contrary opinions? Helpful hints? If I'm right and this needs fixing, it's going to be a bit of a bear to do, I think.
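To make the failure mode concrete, here is a toy model of what I believe is going on (names and data are made up, this is not the TextIndex code): the stopped word is never in the index, so looking it up yields the null set, and intersecting that with anything yields the null set.

```python
# Toy index: 'the' was stopped, so it simply has no entry.
index = {'car': {1, 2, 3}, 'someotherword': {2, 3}}

def lookup(word):
    # A stopped (or merely absent) word returns the null set.
    return index.get(word, set())

def query_and(*words):
    """Naive 'and' query: intersect the result sets of every word."""
    hits = lookup(words[0])
    for word in words[1:]:
        hits = hits & lookup(word)  # null set poisons the intersection
    return hits

# query_and('car', 'the') comes back empty, even though 'car' has hits --
# which is exactly the behavior I'm questioning for 'and' logic.
```

An 'or' query (union instead of intersection) is unharmed by the empty set, which is why this design works fine for 'or' but not for 'and'.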
(Where those two hits are coming from is a *real* mystery, but one I'm going to ignore for a little while yet since I can't yet get the debugger to work for me without crashing. I have a sneaking suspicion it is related to my confusion about where stopwords get removed in the indexing process, but it will probably take a while for me to prove or disprove that notion.) --RDM