Follow-up on ZCatalog weirdness.
It seems it's not words less than 4 letters. I also have a CD titled "Hi Ka Ri" and searching for "hi", "ka" and "ri" respectively works as advertised. Which leads me to suspect that, for some reason, the words: for, you, me and to are "special". In that search refuses to find them.

I'd sleep better if others could confirm this. (Perhaps the daemon under the hood is lazy or something... :-))

--
-----------------------------------------
Kuraiken - Python fanatic.
-----------------------------------------
Python. Try it. It'll swallow you whole!
-----------------------------------------
At 14:57 11-9-99 , Kuraiken wrote:
It seems it's not words less than 4 letters. I also have a CD titled "Hi Ka Ri" and searching for "hi", "ka" and "ri" respectively works as advertised. Which leads me to suspect that, for some reason, the words: for, you, me and to are "special". In that search refuses to find them.
I'd sleep better if others could confirm this. (Perhaps the daemon under the hood is lazy or something... :-))
The ZCatalog indeed ignores certain words. They are listed in lib\python\SearchIndex\TextIndex.py (at the end), and they are called Stop Words. ZCatalog does not index these because they are considered to be part of the 'fluff' of text; they are not relevant keywords in most texts. They are also too common in texts to be of any use to pinpoint a particular document. If ZCatalog indexed these, your index would blow up like a balloon with irrelevant and useless data. This is something all text indexers do.

You also said you couldn't get your ZClasses to update the Catalog. Make sure you have chosen CatalogAware as the first base class (it should be listed as the second class on the Basic tab as _ZClass_for_CatalogAware), and you should call index_object after every change on the object.

--
Martijn Pieters, Web Developer
| Antraciet http://www.antraciet.nl
| T: +31 35 7502100 F: +31 35 7502111
| mj@antraciet.nl http://www.antraciet.nl/~mj
| PGP: http://wwwkeys.nl.pgp.net:11371/pks/lookup?op=get&search=0xA8A32149
---------------------------------------------
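To make the stop-word behaviour concrete, here is a minimal sketch of an indexer that skips stop words. It is not the code in TextIndex.py: the `STOP_WORDS` set, the helper names and the second CD title are all assumptions made for illustration (only "for", "you", "me" and "to" come from the report above).

```python
import re
from collections import defaultdict

# Illustrative stop-word list (an assumption); the real list is the one
# at the end of TextIndex.py.
STOP_WORDS = {"for", "you", "me", "to", "the", "and", "a", "of"}


def split_words(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each non-stop word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in split_words(text):
            if word not in STOP_WORDS:  # stop words never reach the index
                index[word].add(doc_id)
    return index


def search(index, word):
    """Return the sorted ids of documents indexed under a single word."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical catalog: one title from the thread, one made up.
cds = {1: "Hi Ka Ri", 2: "Songs for You and Me"}
index = build_index(cds)
```

With this sketch, `search(index, "hi")` finds the first CD because short words are indexed like any other, while `search(index, "for")` finds nothing because the word was never stored.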
On Sun, 12 Sep 1999, Martijn Pieters wrote:
The ZCatalog indeed ignores certain words. They are listed in lib\python\SearchIndex\TextIndex.py (at the end), and they are called Stop Words. ZCatalog does not index these because they are considered to be part of the 'fluff' of text; they are not relevant keywords in most texts. They are also too common in texts to be of any use to pinpoint a particular document. If ZCatalog indexed these, your index would blow up like a balloon with irrelevant and useless data. This is something all text indexers do.
You also said you couldn't get your ZClasses to update the Catalog. Make sure you have chosen CatalogAware as the first base class (it should be listed as the second class on the Basic tab as _ZClass_for_CatalogAware), and you should call index_object after every change on the object.
A bit off the top of my head here, but if these words are being removed from the indexing, should not the search engine assume they are present in all searched data? It's been a while since I last wrote a text search engine, but that is what I ended up doing for 'common' class words, otherwise people end up not finding what they search for :( This can of course be simulated by removing them in the correct manner from the search criteria.

A better idea I found was to have a threshold length of a string, and index common words based on this (i.e. index everything in a 5 word string, index only rare words in a 10000 word string). This helps by supplying a lot more context for the short (and therefore harder to find) strings.

Good text matching is very, very hard :( People are never satisfied unless you find EXACTLY what they want on the first go :(

------------------------------------------------------------
Stuart Woolford, stuartw@newmail.net
Unix Consultant. Software Developer.
Supra Club of New Zealand.
------------------------------------------------------------
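The first suggestion above, treating stop words as present in every document by dropping them from the search criteria, might look roughly like this. It is only a sketch under assumptions: the `STOP_WORDS` set, the function names and the sample titles are hypothetical, and nothing here is ZCatalog's actual query code.

```python
import re

# Assumed stop-word list, taken from the four words reported in the thread.
STOP_WORDS = {"for", "you", "me", "to"}


def split_words(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Index every word except stop words: word -> set of document ids."""
    index = {}
    for doc_id, text in documents.items():
        for word in split_words(text):
            if word not in STOP_WORDS:
                index.setdefault(word, set()).add(doc_id)
    return index


def search_all(index, all_ids, query):
    """AND-search that behaves as if stop words occur in every document."""
    # Dropping stop words from the query is equivalent to assuming that
    # each of them matches all documents.
    terms = [w for w in split_words(query) if w not in STOP_WORDS]
    result = set(all_ids)  # a query of only stop words matches everything
    for term in terms:
        result &= index.get(term, set())
    return sorted(result)


# Hypothetical catalog for illustration.
docs = {1: "Hi Ka Ri", 2: "Songs for You and Me"}
index = build_index(docs)
```

Here `search_all(index, docs, "songs for you")` returns the second CD instead of nothing. The trade-off is that a query such as "hi for" also matches the first CD even though it does not contain "for", which is exactly what assuming stop words are everywhere implies.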
Stuart Woolford wrote:
A better idea I found was to have a threshold length of a string, and index common words based on this (i.e. index everything in a 5 word string, index only rare words in a 10000 word string). This helps by supplying a lot more context for the short (and therefore harder to find) strings.
That's a pretty good idea, I'll think about that.
Good text matching is very, very hard :( People are never satisfied unless you find EXACTLY what they want on the first go :(
I'm glad you said that, maybe I'll get a raise! -Michel
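As a rough illustration of the length-threshold idea discussed above, the following sketch indexes every word of a short string but skips common words in longer ones. It simplifies the suggestion to a single cutoff rather than a sliding scale, and `COMMON_WORDS`, `SHORT_TEXT_LIMIT` and the sample texts are assumptions, not anything taken from ZCatalog.

```python
import re
from collections import defaultdict

# Assumed list of common words and an assumed cutoff of 5 words.
COMMON_WORDS = {"for", "you", "me", "to"}
SHORT_TEXT_LIMIT = 5


def split_words(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def words_to_index(text, limit=SHORT_TEXT_LIMIT):
    """Pick the words worth indexing, depending on the text's length."""
    words = split_words(text)
    if len(words) <= limit:
        # Short strings (titles, names): every word carries context.
        return set(words)
    # Longer strings: common words add bulk but little selectivity.
    return {w for w in words if w not in COMMON_WORDS}


def build_index(documents):
    """Map each selected word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in words_to_index(text):
            index[word].add(doc_id)
    return index


# Hypothetical data: a short title and a longer description.
docs = {
    1: "For You",
    2: "A long liner note written for you about the making of the record",
}
index = build_index(docs)
```

In this sketch a search for "for" finds the short title "For You" but not the long liner note, so the index stays small while short strings remain findable.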
participants (4)
- Kuraiken
- Martijn Pieters
- Michel Pelletier
- Stuart Woolford