Dave Kimmel wrote:
Hello all
I'm trying to re-implement the internal policy manual here at work using Zope (it's currently hyperlinked WordPerfect documents), and after demoing it to management there are now two things I need to solve with the search feature.
First, is there any way I can search by regular expression or search for part of a word? Specifically, I need to search the content of a bunch of DTML Documents (these only contain the text of the policies and references to standard_html_header and standard_html_footer) for partial word matches. For example, "vac" needs to match "vacation", "vacations", "vaccination", etc.
No. This is a vastly harder problem than it sounds. Think of a real book index: how would you arrange the index 'keys' so that you could do something like '*ing'? The immediate solution is to store two 'sub-keys' per word, one with the word forwards ('walking') and one with the word backwards ('gniklaw'). This way you can say walk* or *ing and get 'walking' in both instances (and 'walked', 'zopeing', etc.). The problem is that your 'lexicon' has now doubled in size. Try it yourself. Some people think, 'why not use re (the Python regex module)?' Because searching like '*ing' would require iterating over all the keys, and a linear search like this could take multiple orders of magnitude more time than a non-regex search. There is a pretty good compromise solution called n-grams, but they also increase the size of the lexicon and need a much more complicated algorithm. I can refer you to a good book that describes them. The recent abstraction of the Catalog's 'lexicon' will eventually allow you to plug custom lexicon objects into the catalog, giving you hooks to implement this service yourself, if someone doesn't pay us to do it first.
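To make the forwards/backwards idea concrete, here is a rough sketch in plain Python. It is not ZCatalog code and not the Catalog's lexicon interface: the class name GlobLexicon and its glob method are made up for illustration, and it assumes the lexicon is simply a sorted list of words held in memory. Each word is kept once forwards and once reversed, so 'vac*' becomes a prefix lookup on the forward keys and '*ing' becomes a prefix lookup on the reversed keys.

```python
import bisect


class GlobLexicon:
    """Hypothetical lexicon: every word is stored forwards and reversed."""

    def __init__(self, words):
        unique = set(words)
        # Two sorted key lists, which is why the lexicon doubles in size.
        self.forward = sorted(unique)
        self.backward = sorted(word[::-1] for word in unique)

    @staticmethod
    def _with_prefix(sorted_keys, prefix):
        # Keys sharing a prefix are contiguous in a sorted list, so a
        # binary search finds the first one and we walk until they stop.
        i = bisect.bisect_left(sorted_keys, prefix)
        matches = []
        while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
            matches.append(sorted_keys[i])
            i += 1
        return matches

    def glob(self, pattern):
        """Support 'word', 'prefix*' and '*suffix' only (no '*' mid-word)."""
        if "*" not in pattern:
            i = bisect.bisect_left(self.forward, pattern)
            found = i < len(self.forward) and self.forward[i] == pattern
            return [pattern] if found else []
        if pattern.endswith("*") and "*" not in pattern[:-1]:
            return self._with_prefix(self.forward, pattern[:-1])
        if pattern.startswith("*") and "*" not in pattern[1:]:
            reversed_suffix = pattern[1:][::-1]
            hits = self._with_prefix(self.backward, reversed_suffix)
            # Flip the reversed keys back into readable words.
            return sorted(hit[::-1] for hit in hits)
        raise ValueError("only 'prefix*' and '*suffix' globs are supported")


lexicon = GlobLexicon(["vacation", "vacations", "vaccination",
                       "walking", "walked", "zopeing", "policy"])
print(lexicon.glob("vac*"))
print(lexicon.glob("*ing"))
```

In a real catalog the words returned here would then be looked up in the index to find the matching documents; that step is left out of the sketch.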
Second, is there any way to use a list of synonyms when searching? For example, a search for "headstone" should also search for "monument" and "gravestone", and likewise a search for "gravestone" should also search for "monument" and "headstone".
Yes. The 'lexicon' has a hardwired 'synonym and stopword' dictionary in lib/python/SearchIndex/Lexicon.py. This is also projected to be improved by allowing through-the-web lexicon management (like specifying stopwords and synonyms). Someone also suggested interfacing it to some kind of synonym database; you'd have to search through the archives to find the reference.
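If you end up handling synonyms in your own code before the query reaches the catalog, the idea could look roughly like the sketch below. This does not show how the dictionary in Lexicon.py is actually laid out; build_synonym_map and expand_query are hypothetical helpers, and the sketch assumes synonyms are symmetric groups, so that any member of a group expands to the whole group.

```python
def build_synonym_map(groups):
    """Map every word in a group to the full group (hypothetical helper)."""
    table = {}
    for group in groups:
        members = frozenset(word.lower() for word in group)
        for word in members:
            table[word] = members
    return table


def expand_query(term, table):
    """Return the term plus its synonyms, or just the term if it has none."""
    term = term.lower()
    return sorted(table.get(term, {term}))


synonyms = build_synonym_map([["headstone", "monument", "gravestone"]])
print(expand_query("headstone", synonyms))
print(expand_query("gravestone", synonyms))
```

Each expanded term would then be searched for separately and the results merged, which gives the behaviour you describe for 'headstone' and 'gravestone'.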
Am I asking too much of this? Should I be buying a Python book and adding this functionality myself? Should I be using something other than ZCatalog? Should I be using something other than Zope? (Please say no, I happen to like Zope!)
Go for it, but don't give up on ZCatalog or Zope. I'd be surprised if you found fully featured regex searching in another package that would be less of a headache to use than just implementing a simple 'reversed' lexicon that lets you do globbing (like DOS wildcards, no *s in the middle of words, etc.). -Michel
Thank you! -- Dave Kimmel Systems Analyst Office of the Public Trustee, Alberta Justice