[Zope] Follow up on Zcatalog weirdness.

Stuart Woolford stuartw@newmail.net
Mon, 13 Sep 1999 11:12:30 +1200


On Sun, 12 Sep 1999, Martijn Pieters wrote:
> At 14:57 11-9-99 , Kuraiken wrote:
> >It seems it's not words less than 4 letters. I also have a CD titled "Hi
> >Ka Ri", and searching for "hi", "ka" and "ri" respectively works as
> >advertised. Which leads me to suspect that, for some reason, the words
> >"for", "you", "me" and "to" are "special", in that search refuses to find
> >them.
> >
> >I'd sleep better if others could confirm this. (Perhaps the daemon under the
> >hood is lazy or something... :-))
> 
> The ZCatalog indeed ignores certain words. They are listed in 
> lib\python\SearchIndex\TextIndex.py (at the end), and they are called Stop 
> Words. ZCatalog does not index these because they are considered to be part 
> of the 'fluff' of text; they are not relevant keywords in most texts. They 
> are also too common in texts to be of any use to pinpoint a particular 
> document. If ZCatalog indexed these, your index would blow up like a 
> balloon with irrelevant and useless data. This is something all text 
> indexers do.
> 
> You also said you couldn't get your ZClasses to update the Catalog. Make 
> sure you have chosen CatalogAware as the first base class (it should be 
> listed as the second class on the Basic tab as _ZClass_for_CatalogAware), 
> and you should call index_object after every change on the object.

A bit off the top of my head here, but if these words are being removed from
the indexing, should the search engine not assume they are present in all
searched data? It's been a while since I last wrote a text search engine, but
that is what I ended up doing for 'common' class words, otherwise people end
up not finding what they search for :( - this can of course be simulated by
removing them in the correct manner from the search criteria.
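
To make that concrete, here is a rough sketch of the idea in Python. It is not
how ZCatalog works internally; the stop list, the function names and the tiny
in-memory index are all made up for illustration. Stop words are skipped at
index time, and skipped again at query time, which has the same effect as
assuming they occur in every document.

```python
# Hypothetical stop list taken from the words mentioned in this thread;
# the real list in TextIndex.py is longer.
STOP_WORDS = {"for", "you", "me", "to"}

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive splitter."""
    return text.lower().split()

def build_index(docs):
    """Map each non-stop word to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in tokenize(text):
            if word not in STOP_WORDS:
                index.setdefault(word, set()).add(doc_id)
    return index

def search(index, all_ids, query):
    """AND-search in which stop words are treated as present everywhere."""
    result = set(all_ids)
    for word in tokenize(query):
        if word in STOP_WORDS:
            continue  # assumed to match every document, so it never narrows
        result &= index.get(word, set())
    return result

docs = {1: "Songs for you", 2: "Hi Ka Ri", 3: "A song to me"}
index = build_index(docs)
print(search(index, docs, "songs for you"))  # only document 1 matches
print(search(index, docs, "for you"))        # every document matches
```

One consequence of this choice is that a query made up only of stop words
matches everything, which may or may not be what the user expects.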

A better idea I found was to make the treatment of common words depend on the
length of the string being indexed (e.g. index everything in a 5 word string,
index only rare words in a 10000 word string) - this helps by supplying a lot
more context for the short (and therefore harder to find) strings.
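
A minimal sketch of the length-based approach, again with made-up names. The
common word list and the cutoff of 5 words are assumptions chosen to match the
example above, and a real engine would probably want a sliding scale rather
than a single cutoff.

```python
# Hypothetical list of common words and a hypothetical cutoff, in words.
COMMON_WORDS = {"for", "you", "me", "to", "the", "a"}
SHORT_TEXT_LIMIT = 5

def words_to_index(text, limit=SHORT_TEXT_LIMIT):
    """Return the set of words worth indexing for this piece of text."""
    words = text.lower().split()
    if len(words) <= limit:
        # Short strings carry little context, so keep every word.
        return set(words)
    # Longer strings have plenty of rarer words to match on.
    return {w for w in words if w not in COMMON_WORDS}

long_text = "a letter to you about the music we heard"
print(sorted(words_to_index("To you")))   # short title: both words are kept
print(sorted(words_to_index(long_text)))  # long text: common words are dropped
```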

Good text matching is very, very hard :( People are never satisfied unless you
find EXACTLY what they want on the first go :(

------------------------------------------------------------
Stuart Woolford, stuartw@newmail.net
Unix Consultant.
Software Developer.
Supra Club of New Zealand.
------------------------------------------------------------