Follow up on Zcatalog weirdness. - Zope - Zope lists

Kuraiken

11 Sep 1999

12:57 p.m.

It seems it's not words less than 4 letters. I also have a CD titled "Hi Ka Ri", and searching for "hi", "ka" and "ri" respectively works as advertised. This leads me to suspect that, for some reason, the words "for", "you", "me" and "to" are "special", in that the search refuses to find them.

I'd sleep better if others could confirm this. (Perhaps the daemon under the hood is lazy or something... :-))

--
Kuraiken - Python fanatic.
Python. Try it. It'll swallow you whole!


Martijn Pieters

11 Sep 1999

4:13 p.m.

At 14:57 11-9-99, Kuraiken wrote:

> It seems it's not words less than 4 letters. I also have a CD titled "Hi Ka Ri", and searching for "hi", "ka" and "ri" respectively works as advertised. This leads me to suspect that, for some reason, the words "for", "you", "me" and "to" are "special", in that the search refuses to find them.
>
> I'd sleep better if others could confirm this. (Perhaps the daemon under the hood is lazy or something... :-))

The ZCatalog indeed ignores certain words. They are listed in lib\python\SearchIndex\TextIndex.py (at the end), and they are called Stop Words. ZCatalog does not index these because they are considered to be part of the 'fluff' of text; they are not relevant keywords in most texts. They are also too common in texts to be of any use in pinpointing a particular document. If ZCatalog indexed these, your index would blow up like a balloon with irrelevant and useless data. This is something all text indexers do.

You also said you couldn't get your ZClasses to update the Catalog. Make sure you have chosen CatalogAware as the first base class (it should be listed as the second class on the Basic tab as _ZClass_for_CatalogAware), and you should call index_object after every change on the object.

--
Martijn Pieters, Web Developer
| Antraciet http://www.antraciet.nl
| T: +31 35 7502100 F: +31 35 7502111
| mj@antraciet.nl http://www.antraciet.nl/~mj
| PGP: http://wwwkeys.nl.pgp.net:11371/pks/lookup?op=get&search=0xA8A32149
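
To make the stop-word behaviour described above concrete, here is a minimal sketch in plain Python. It is not ZCatalog's code: the stop-word list, the CD titles, and the helper names (split_words, build_index, search) are all made up for illustration, and the real list is the one in TextIndex.py. It only shows why "hi" can be found while "for" cannot.

```python
import re

# Hypothetical stop-word list for illustration only; the real list is
# the one at the end of TextIndex.py.
STOP_WORDS = {"for", "you", "me", "to", "the", "and", "a", "of"}


def split_words(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each non-stop word to the set of document ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        for word in split_words(text):
            if word in STOP_WORDS:
                continue  # stop words never reach the index
            index.setdefault(word, set()).add(doc_id)
    return index


def search(index, word):
    """Return the sorted ids of documents indexed under a single word."""
    return sorted(index.get(word.lower(), set()))


# Made-up CD titles standing in for the catalogued objects.
cds = {1: "Hi Ka Ri", 2: "Songs for You and Me"}
index = build_index(cds)

print(search(index, "hi"))     # [1]  short, but not a stop word
print(search(index, "for"))    # []   dropped at indexing time
print(search(index, "songs"))  # [2]
```

Word length plays no part here; only membership in the stop-word list decides whether a word is searchable.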


Stuart Woolford

12 Sep 1999

11:12 p.m.

On Sun, 12 Sep 1999, Martijn Pieters wrote:

> At 14:57 11-9-99, Kuraiken wrote:
>
> ...
>> It seems it's not words less than 4 letters. I also have a CD titled "Hi Ka Ri", and searching for "hi", "ka" and "ri" respectively works as advertised. This leads me to suspect that, for some reason, the words "for", "you", "me" and "to" are "special", in that the search refuses to find them.
>>
>> I'd sleep better if others could confirm this. (Perhaps the daemon under the hood is lazy or something... :-))
>
> The ZCatalog indeed ignores certain words. They are listed in lib\python\SearchIndex\TextIndex.py (at the end), and they are called Stop Words. ZCatalog does not index these because they are considered to be part of the 'fluff' of text; they are not relevant keywords in most texts. They are also too common in texts to be of any use in pinpointing a particular document. If ZCatalog indexed these, your index would blow up like a balloon with irrelevant and useless data. This is something all text indexers do.
>
> You also said you couldn't get your ZClasses to update the Catalog. Make sure you have chosen CatalogAware as the first base class (it should be listed as the second class on the Basic tab as _ZClass_for_CatalogAware), and you should call index_object after every change on the object.

A bit off the top of my head here, but if these words are being removed from the indexing, should the search engine not assume they are present in all searched data? It's been a while since I last wrote a text search engine, but that is what I ended up doing for 'common' class words; otherwise people end up not finding what they search for :( This can of course be simulated by removing them in the correct manner from the search criteria.

A better idea I found was to have a threshold length of a string, and index common words based on this (i.e. index everything in a 5-word string, index only rare words in a 10000-word string). This helps by supplying a lot more context for the short (and therefore harder to find) strings.

Good text matching is very, very hard :( People are never satisfied unless you find EXACTLY what they want on the first go :(

--
Stuart Woolford, stuartw@newmail.net
Unix Consultant. Software Developer. Supra Club of New Zealand.
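
The two ideas above can be sketched together in plain Python. This is only one possible reading of them, not ZCatalog code and not Stuart's actual engine: the common-word list, the 5-word limit, the sample texts, and every function name are assumptions made for illustration. Short strings are indexed in full; long strings have their common words dropped, and a search then assumes those dropped words are present in every long string.

```python
import re

# Hypothetical list of 'common' words and a hypothetical length limit.
COMMON_WORDS = {"for", "you", "me", "to", "the", "and", "a", "of"}
SHORT_TEXT_LIMIT = 5  # texts with at most this many words are indexed in full


def split_words(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents, short_limit=SHORT_TEXT_LIMIT):
    """Index every word of short texts, but only rare words of long ones.

    Returns the index plus the ids of the long texts whose common words
    were dropped, so a search can assume those words are present there.
    """
    index = {}
    filtered_docs = set()
    for doc_id, text in documents.items():
        words = split_words(text)
        if len(words) > short_limit:
            filtered_docs.add(doc_id)
            words = [w for w in words if w not in COMMON_WORDS]
        for word in words:
            index.setdefault(word, set()).add(doc_id)
    return index, filtered_docs


def search_all(index, filtered_docs, all_ids, query):
    """AND search over the query words.

    A common word matches the short texts indexed under it and is
    assumed to occur in every long text that had common words removed.
    An empty query matches everything.
    """
    result = set(all_ids)
    for word in split_words(query):
        matches = set(index.get(word, set()))
        if word in COMMON_WORDS:
            matches |= filtered_docs
        result &= matches
    return sorted(result)


# Made-up sample data: two short titles and one long text.
cds = {
    1: "Hi Ka Ri",
    2: "For You",
    3: "Liner notes for the reissue, a long essay written for you and me",
}
index, filtered = build_index(cds)

print(search_all(index, filtered, cds, "for you"))    # [2, 3]
print(search_all(index, filtered, cds, "hi"))         # [1]
print(search_all(index, filtered, cds, "essay for"))  # [3]
```

The trade-off is precision: a long text is returned for a common word whether or not it actually contains it, which is the price of keeping those words out of the index.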


Michel Pelletier

13 Sep 1999

1:33 p.m.

Stuart Woolford wrote:

> A better idea I found was to have a threshold length of a string, and index common words based on this (ie: index everything in a 5 word string, index only rare words in a 10000 word string) - this helps by supplying a lot more context for the short (and therefore harder to find) strings.

That's a pretty good idea, I'll think about that.

> Good text matching is very very hard :( people are never satisfied unless you find EXACTLY what they want on the first go :(

I'm glad you said that, maybe I'll get a raise!

-Michel
