Re: [Zope-dev] ZCatalog and 'fuzzy logic'
Morten W. Petersen writes:
Is there anyone who could try to give an estimate of how long it would take to add fuzzy logic (regexp-like) searching capability to the ZCatalog?
I do not think that "fuzzy logic" is strongly related to "regexp-like". Anyway.
Fuzzy searching often means "finding matches with characters omitted, replaced or inserted". Zope's globbing vocabularies support the wildcards '*' and '?'. To implement wildcard-based searches efficiently, they index words under their two-letter constituents. When you get a pattern, you derive from it which two-letter constituents the matching words must have and retrieve the words indexed under them. This defines a candidate word set. Then you check whether each retrieved word really matches the expression. You can extend this algorithm to get fuzzy searches.
Dieter
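The candidate-then-verify scheme Dieter describes can be sketched roughly as follows. This is only an illustration under stated assumptions, not Zope's actual globbing vocabulary code: the function names (digrams, build_index, required_digrams, glob_search) are hypothetical, and patterns are assumed to contain only the '*' and '?' wildcards.

```python
import re
from collections import defaultdict

def digrams(text):
    """Return the set of adjacent two-letter pairs in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def build_index(words):
    """Map each two-letter constituent to the set of words containing it."""
    index = defaultdict(set)
    for word in words:
        for pair in digrams(word):
            index[pair].add(word)
    return index

def required_digrams(pattern):
    """Pairs of adjacent literal characters; every match must contain them."""
    return {a + b for a, b in zip(pattern, pattern[1:])
            if a not in "*?" and b not in "*?"}

def glob_to_regex(pattern):
    """Translate '*' and '?' into a regular expression; escape the rest."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def glob_search(pattern, words, index):
    """Narrow to candidates via the digram index, then verify each one."""
    needed = required_digrams(pattern)
    if needed:
        candidates = set.intersection(*(index.get(p, set()) for p in needed))
    else:
        candidates = set(words)  # no literal pair: every word is a candidate
    regex = glob_to_regex(pattern)
    return sorted(w for w in candidates if regex.fullmatch(w))

words = ["program", "programmer", "progress", "diagram", "grammar"]
index = build_index(words)
print(glob_search("pro*am", words, index))  # ['program']
```

Here "pro*am" requires the pairs "pr", "ro" and "am", which leaves "program" and "programmer" as candidates; the verification step then rejects "programmer".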
I do not think that "fuzzy logic" is strongly related to "regexp-like". Anyway.
Fuzzy searching often means "finding matches with characters omitted, replaced or inserted".
It seems I misunderstood the term fuzzy logic myself. Fuzzy logic means if I search for a word, for example 'programmer', it will return matches to the words 'program', 'programming','programmable' etc. I.e., it will somewhat intelligently return words that are similar in what they mean, using grammar rules (chopping off endings of words and making them match others). Hmm. Cheers, Morten
On Wed, 10 Jan 2001, Morten W. Petersen wrote:
I do not think that "fuzzy logic" is strongly related to "regexp-like". Anyway.
Fuzzy searching often means "finding matches with characters omitted, replaced or inserted".
It seems I misunderstood the term fuzzy logic myself. Fuzzy logic means if I search for a word, for example 'programmer', it will return matches to the words 'program', 'programming','programmable' etc.
I think you're talking about something else. Last I checked, "fuzzy logic" was a logical algebra based on the existence of intermediate truth states between "true" and "false". It has little or nothing to do with approximate searching, though I guess you could use it to make assertions about the approximations. I think what you all are talking about is "fuzzy matching".
I.e., it will somewhat intelligently return words that are similar in what they mean, using grammar rules (chopping off endings of words and making them match others).
There are also matching mechanisms like soundex, that account for misspelling by translating words to phonetic-equivalent normalized codes, and comparing on that basis. Ken klm@digicool.com
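For illustration, here is a minimal sketch of the common American Soundex variant, which is one way to compute the phonetic codes Ken mentions. The function name soundex and the table name CODES are choices made for this example; nothing here is part of ZCatalog.

```python
# Consonant groups that share a Soundex digit; vowels, h, w and y have none.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return the four-character Soundex code of word ('' if no letters)."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            encoded += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (encoded + "000")[:4]  # pad or truncate to one letter, three digits

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Indexing words under such codes lets a query for one spelling find words that sound alike, at the cost of some false matches.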
Morten W. Petersen writes:
It seems I misunderstood the term fuzzy logic myself. Fuzzy logic means if I search for a word, for example 'programmer', it will return matches to the words 'program', 'programming','programmable' etc.
This, usually, is called "stemming". Though, your examples indicate quite a strong form of it.
If you have some tool, maybe LinguistX, that maps from a word to its stem and then from the stem to all words with that stem (or directly gives the stem equivalence class of a word), then it is quite easy to incorporate that in Zope's catalog. However, to do that cleanly, you will need good algorithms and/or large dictionaries. These, usually, are not free of charge. Dieter
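The word-to-stem and stem-to-words mapping can be sketched as below. The stemmer here, crude_stem, is a deliberately naive suffix stripper invented for this example (it also mangles words like "hammer"); a real deployment would plug in a proper stemming tool in its place. The names build_stem_classes and expand_query are likewise hypothetical and not part of Zope.

```python
from collections import defaultdict

# Toy suffix list, checked in this order (longer suffixes first); an assumption.
SUFFIXES = ("mable", "ming", "mer", "able", "ing", "er", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def build_stem_classes(words, stem):
    """Map each stem to the set of indexed words that share it."""
    classes = defaultdict(set)
    for word in words:
        classes[stem(word)].add(word)
    return classes

def expand_query(term, classes, stem):
    """Return all indexed words in the same stem class as term."""
    return sorted(classes.get(stem(term), {term}))

vocabulary = ["program", "programming", "programmable", "programmer", "progress"]
classes = build_stem_classes(vocabulary, crude_stem)
print(expand_query("programmer", classes, crude_stem))
# ['program', 'programmable', 'programmer', 'programming']
```

A search for the expanded word list can then be run as an ordinary OR query against the existing word index.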
"Morten W. Petersen" wrote:
It seems I misunderstood the term fuzzy logic myself. Fuzzy logic means if I search for a word, for example 'programmer', it will return matches to the words 'program', 'programming','programmable' etc.
I.e., it will somewhat intelligently return words that are similar in what they mean, using grammar rules (chopping off endings of words and making them match others).
Hmmm. This makes the Catalog language-specific, since different languages need different stemming rules. (I'm not sure every language uses stemming at all; to my extremely limited knowledge, Mandarin Chinese, for example, does not.) So you need to indicate the language both when inserting text into the Catalog (so it can index the base form) and when searching. Alternatively, you indicate the language only at search time and have the search engine look for all derived forms of the same base. That is probably slower but easier to implement. Jan
participants (4)
- Dieter Maurer
- Jan H. Haul
- Ken Manheimer
- Morten W. Petersen