[Zope] ZCatalog searching questions

Michel Pelletier michel@digicool.com
Thu, 30 Sep 1999 01:55:23 -0400


Dave Kimmel wrote:
> 
> Hello all
> 
> I'm trying to re-implement the internal policy manual here at work using
> Zope (it's currently hyperlinked WordPerfect documents), and after demoing
> it to management there are now two things I need to solve with the search
> feature.
> 
> First, is there any way I can search by regular expression or search for
> part of a word?  Specifically, I need to search the content of a bunch of
> DTML Documents (these only contain the text of the policies and references
> to standard_html_header and standard_html_footer) for partial word matches.
> For example, "vac" needs to match "vacation", "vacations", "vaccination",
> etc.

No.  This is a vastly harder problem than it sounds.  Think of a
real book index: how would you arrange the index 'keys' so that you could
do something like '*ing'?  The immediate solution is to store two
'sub-keys' per word, one with the word forward ('walking') and one with
the word backward ('gniklaw').  This way you can say walk* or *ing and
get 'walking' in both instances (along with 'walked', 'zopeing', etc.).
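
To make the two-sub-keys idea concrete, here is a rough sketch in plain
Python.  It is not the Catalog's lexicon code; the GlobLexicon class and
its method names are made up for illustration, and it assumes the whole
word list fits in two sorted in-memory lists.

```python
import bisect


class GlobLexicon:
    """Hypothetical lexicon that keeps every word forward and reversed."""

    def __init__(self, words):
        unique = set(words)
        self.forward = sorted(unique)
        # The second set of 'sub-keys': each word spelled backward.
        self.backward = sorted(word[::-1] for word in unique)

    @staticmethod
    def _with_prefix(keys, prefix):
        # Keys sharing a prefix sit next to each other in a sorted list,
        # so binary-search to the first one and walk until they stop.
        i = bisect.bisect_left(keys, prefix)
        matches = []
        while i < len(keys) and keys[i].startswith(prefix):
            matches.append(keys[i])
            i += 1
        return matches

    def glob(self, pattern):
        """Support 'walk*' and '*ing' only; no * in the middle."""
        if pattern.endswith('*'):
            return self._with_prefix(self.forward, pattern[:-1])
        if pattern.startswith('*'):
            reversed_suffix = pattern[1:][::-1]
            hits = self._with_prefix(self.backward, reversed_suffix)
            return sorted(hit[::-1] for hit in hits)
        # No wildcard: plain exact lookup.
        i = bisect.bisect_left(self.forward, pattern)
        found = i < len(self.forward) and self.forward[i] == pattern
        return [pattern] if found else []


lexicon = GlobLexicon(['walking', 'walked', 'zopeing',
                       'vacation', 'vaccination', 'policy'])
print(lexicon.glob('walk*'))
print(lexicon.glob('*ing'))
```

A trailing * searches the forward list and a leading * searches the
reversed list, which is why both lists have to be kept.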

The problem is now your 'lexicon' has doubled in size.  Try it yourself.

Some people think, 'why not use re (the Python regex module)?'  Because
searching for something like '*ing' would require iterating over all the
keys, and a linear search like this could take multiple orders of
magnitude more time than a non-regex search.
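
The difference shows up if you count how many keys each approach has to
look at.  The following is only an illustration: regex_scan and
prefix_lookup are hypothetical helper names, the keys are assumed to be
a sorted Python list, and the count for the prefix lookup leaves out the
handful of comparisons made inside the binary search itself.

```python
import bisect
import re


def regex_scan(keys, pattern):
    """Try a regex against every key; return (matches, keys examined)."""
    compiled = re.compile(pattern)
    examined = 0
    matches = []
    for key in keys:
        examined += 1  # every key is visited, match or not
        if compiled.search(key):
            matches.append(key)
    return matches, examined


def prefix_lookup(sorted_keys, prefix):
    """Binary-search to the prefix; return (matches, keys examined)."""
    i = bisect.bisect_left(sorted_keys, prefix)
    examined = 0
    matches = []
    while i < len(sorted_keys):
        examined += 1
        if not sorted_keys[i].startswith(prefix):
            break  # sorted order means no later key can match
        matches.append(sorted_keys[i])
        i += 1
    return matches, examined


keys = sorted(['policy', 'vacation', 'vaccination',
               'walked', 'walking', 'zope'])
print(regex_scan(keys, r'ing$'))
print(prefix_lookup(keys, 'vac'))
```

With six keys the gap is trivial, but the regex scan grows with the size
of the whole lexicon while the prefix walk grows only with the number of
matching keys.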

There is a pretty good compromise solution called n-grams, but they also
result in a lexicon increase, and a much more complicated algorithm.  I
can refer you to a good book that describes them.
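
As a rough idea of what an n-gram index looks like, here is a small
sketch using trigrams (n=3).  The NGramIndex class and its method names
are invented for this example, it holds everything in memory, and it
cannot answer queries shorter than n characters.

```python
from collections import defaultdict


def ngrams(text, n=3):
    """Return the set of all n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class NGramIndex:
    """Hypothetical index mapping each n-gram to the words containing it."""

    def __init__(self, words, n=3):
        self.n = n
        self.index = defaultdict(set)
        for word in words:
            for gram in ngrams(word, n):
                self.index[gram].add(word)

    def containing(self, fragment):
        """Return the words that contain fragment anywhere inside them."""
        grams = ngrams(fragment, self.n)
        if not grams:
            raise ValueError('fragment is shorter than n')
        # A matching word must contain every n-gram of the fragment.
        candidates = set.intersection(
            *(self.index.get(gram, set()) for gram in grams))
        # Sharing the n-grams is necessary but not sufficient, so verify.
        return sorted(word for word in candidates if fragment in word)


index = NGramIndex(['vacation', 'vacations', 'vaccination', 'walking'])
print(index.containing('vac'))
print(index.containing('cation'))
```

Every word is stored once per distinct n-gram it contains, which is
where the lexicon increase comes from.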

The recent abstraction of the Catalog's 'lexicon' will eventually allow
you to plug custom lexicon objects into the catalog, giving you hooks to
implement this service yourself, if someone doesn't pay us to do it
first.
 
> Second, is there any way to use a list of synonyms when searching?  For
> example, a search for "headstone" should also search for "monument" and
> "gravestone", and likewise a search for "gravestone" should also search for
> "monument" and "headstone".

Yes.  The 'lexicon' has a hardwired 'synonym and stopword' dictionary in
lib/python/SearchIndex/Lexicon.py.  This is also projected to be
improved by allowing through-the-web lexicon management (like specifying
stopwords and synonyms).  Someone also suggested interfacing it to some
kind of synonym database; you'd have to search through the archives to
find the reference.
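
For the symmetric behaviour you describe, the general shape is a table
that maps every word in a group to the whole group.  This sketch does
not show the actual layout of the dictionary in Lexicon.py;
build_synonyms and expand_query are hypothetical names, and the expanded
word list would still have to be handed to whatever search you run.

```python
def build_synonyms(groups):
    """Map each word to the full set of words in its synonym group."""
    table = {}
    for group in groups:
        members = set(group)
        for word in members:
            # Every member points at the whole group, so lookups are
            # symmetric: headstone -> gravestone and gravestone -> headstone.
            table.setdefault(word, set()).update(members)
    return table


def expand_query(words, synonyms):
    """Replace each query word by itself plus its synonyms, no duplicates."""
    expanded = []
    for word in words:
        for term in sorted(synonyms.get(word, {word})):
            if term not in expanded:
                expanded.append(term)
    return expanded


synonyms = build_synonyms([['headstone', 'monument', 'gravestone']])
print(expand_query(['headstone'], synonyms))
print(expand_query(['gravestone', 'policy'], synonyms))
```

Words with no entry in the table pass through unchanged.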
 
> Am I asking too much of this?  Should I be buying a Python book and adding
> this functionality myself?  Should I be using something other than ZCatalog?
> Should I be using something other than Zope?  (Please say no, I happen to
> like Zope!)

Go for it, but don't give up on ZCatalog or Zope.  I'd be surprised if
you found fully featured regex searching in another package that would
be less of a headache to use than just implementing a simple
'reversed' lexicon that lets you do globbing (like DOS wildcards, no *s
in the middle of words, etc.).

-Michel

> Thank you!
> -- Dave Kimmel
> Systems Analyst
> Office of the Public Trustee, Alberta Justice