ZCatalog searching questions
Hello all
I'm trying to re-implement the internal policy manual here at work using Zope (it's currently hyperlinked WordPerfect documents), and after demoing it to management there are now two things I need to solve with the search feature.
First, is there any way I can search by regular expression or search for part of a word? Specifically, I need to search the content of a bunch of DTML Documents (these only contain the text of the policies and references to standard_html_header and standard_html_footer) for partial word matches. For example, "vac" needs to match "vacation", "vacations", "vaccination", etc.
Second, is there any way to use a list of synonyms when searching? For example, a search for "headstone" should also search for "monument" and "gravestone", and likewise a search for "gravestone" should also search for "monument" and "headstone".
Am I asking too much of this? Should I be buying a Python book and adding this functionality myself? Should I be using something other than ZCatalog? Should I be using something other than Zope? (Please say no, I happen to like Zope!)
Thank you!
--
Dave Kimmel
Systems Analyst
Office of the Public Trustee, Alberta Justice
Dave Kimmel wrote:
Hello all
I'm trying to re-implement the internal policy manual here at work using Zope (it's currently hyperlinked WordPerfect documents), and after demoing it to management there are now two things I need to solve with the search feature.
First, is there any way I can search by regular expression or search for part of a word? Specifically, I need to search the content of a bunch of DTML Documents (these only contain the text of the policies and references to standard_html_header and standard_html_footer) for partial word matches. For example, "vac" needs to match "vacation", "vacations", "vaccination", etc.
No. This is a vastly harder problem than it sounds. Think of a real book index: how would you arrange the index 'keys' so that you could do something like '*ing'? The immediate solution is to store two 'sub-keys' per word, one with the word forwards ('walking') and one with the word backwards ('gniklaw'). This way you can say walk* or *ing and get walking in both instances (and 'walked', 'zopeing', etc.). The problem is that your 'lexicon' has now doubled in size. Try it yourself. Some people think, 'why not use re (the Python regex module)?' Because searching for something like '*ing' would require iterating over all the keys, and a linear search like this could take multiple orders of magnitude more time than a non-regex search. There is a pretty good compromise solution called n-grams, but they also increase the size of the lexicon and need a much more complicated algorithm. I can refer you to a good book that describes them. The recent abstraction of the Catalog's 'lexicon' will eventually allow you to plug custom lexicon objects into the catalog, giving you hooks to implement this service yourself, if someone doesn't pay us to do it first.
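As a rough illustration of the forward/backward sub-key idea, here is a minimal standalone sketch. It is not ZCatalog code: the class name GlobLexicon and its glob method are hypothetical, and the lexicon is just a pair of sorted Python lists searched with the standard bisect module.

```python
import bisect


class GlobLexicon:
    """Toy lexicon (hypothetical, not part of ZCatalog).

    Every word is stored twice: once forwards and once reversed,
    so that both 'walk*' and '*ing' become prefix lookups.
    """

    def __init__(self, words):
        unique = set(words)
        self.forward = sorted(unique)
        # Reversing each word turns a suffix match into a prefix match.
        self.backward = sorted(w[::-1] for w in unique)

    @staticmethod
    def _prefix_range(keys, prefix):
        # In a sorted list, all keys sharing a prefix sit next to each other.
        lo = bisect.bisect_left(keys, prefix)
        hi = lo
        while hi < len(keys) and keys[hi].startswith(prefix):
            hi += 1
        return keys[lo:hi]

    def glob(self, pattern):
        # Trailing star: prefix search on the forward keys.
        if pattern.endswith("*") and "*" not in pattern[:-1]:
            return self._prefix_range(self.forward, pattern[:-1])
        # Leading star: prefix search on the reversed keys.
        if pattern.startswith("*") and "*" not in pattern[1:]:
            reversed_suffix = pattern[:0:-1]  # '*ing' -> 'gni'
            hits = self._prefix_range(self.backward, reversed_suffix)
            return sorted(w[::-1] for w in hits)
        # No star at all: exact lookup.
        if "*" not in pattern:
            i = bisect.bisect_left(self.forward, pattern)
            found = i < len(self.forward) and self.forward[i] == pattern
            return [pattern] if found else []
        raise ValueError("only one leading or trailing * is supported")


lexicon = GlobLexicon(["walking", "walked", "zopeing", "vacation",
                       "vacations", "vaccination", "monument"])
print(lexicon.glob("vac*"))
print(lexicon.glob("*ing"))
```

The second sorted list is the doubling in lexicon size described above; stars in the middle of a word are deliberately rejected.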
Second, is there any way to use a list of synonyms when searching? For example, a search for "headstone" should also search for "monument" and "gravestone", and likewise a search for "gravestone" should also search for "monument" and "headstone".
Yes. The 'lexicon' has a hardwired 'synonym and stopword' dictionary in lib/python/SearchIndex/Lexicon.py. This is also projected to be improved by allowing through-the-web lexicon management (like specifying stopwords and synonyms). Someone also suggested interfacing it to some kind of synonym database; you'd have to search through the archives to find the reference.
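For the symmetric behaviour asked about (any word in a group should find all the others), one possible approach is to expand the query term before it reaches the index. The sketch below assumes nothing about the actual layout of the dictionary in Lexicon.py; build_synonym_table and expand_query are hypothetical helper names.

```python
def build_synonym_table(groups):
    """Map every word to the full set of words in its synonym group."""
    table = {}
    for group in groups:
        members = {word.lower() for word in group}
        for word in members:
            # Each member points at the whole group, itself included,
            # so the relation works in every direction.
            table.setdefault(word, set()).update(members)
    return table


def expand_query(term, table):
    """Return the term plus its synonyms, or just the term if it has none."""
    term = term.lower()
    return sorted(table.get(term, {term}))


table = build_synonym_table([["headstone", "monument", "gravestone"]])
print(expand_query("headstone", table))
print(expand_query("vacation", table))
```

Each expanded term would then be searched for separately and the results merged.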
Am I asking too much of this? Should I be buying a Python book and adding this functionality myself? Should I be using something other than ZCatalog? Should I be using something other than Zope? (Please say no, I happen to like Zope!)
Go for it, but don't give up on ZCatalog or Zope. I'd be surprised if you found fully featured regex searching in another package that would be less of a headache to use than just implementing a simple 'reversed' lexicon that lets you do globbing (like DOS wildcards, no *s in the middle of words, etc.). -Michel
Thank you! -- Dave Kimmel Systems Analyst Office of the Public Trustee, Alberta Justice
On Thu, 30 Sep 1999, Michel Pelletier wrote:
Some people think, 'why not use re (the Python regex module)?' Because searching for something like '*ing' would require iterating over all the keys, and a linear search like this could take multiple orders of magnitude more time than a non-regex search.
But this is not a problem for many of us who aren't trying to index libraries. Perhaps there could be a way of specifying 'extended' searches to ZCatalog (while still allowing normal, quick searches), implemented as a linear regex search through the index. And if this was too slow for some people, it would provide the hook they need to replace Catalog with something that met their requirements:
GlimpseCatalog - indexes are dumped to text files that are indexed using Glimpse. Extended syntax would be the fuzzy regexp-based matches used by Glimpse (as implemented by agrep) for those sites whose visitors can't spell :-)
RDBCatalog - indexes are stored in a backend RDBMS, and substring matches are done using SQL.
ConTextCatalog - interfaces to Oracle and the ConText option for people who require funky soundex matches and the various weird options it provides.
And Ultraseek uses Python, doesn't it? Might be a trivial mating.
 ___
   //     Zen (alias Stuart Bishop)     Work: zen@cs.rmit.edu.au
  // E N  Senior Systems Alchemist      Play: zen@shangri-la.dropbear.id.au
 //__     Computer Science, RMIT        WWW: http://www.cs.rmit.edu.au/~zen
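The 'extended' linear regex search suggested above might look roughly like the following sketch. It assumes the index can be viewed as a plain mapping from each word to the set of document ids containing it; that dict shape and the function name regex_search are assumptions for illustration, not the Catalog's real interface.

```python
import re


def regex_search(index, pattern):
    """Linear scan: test every word in the index against the pattern.

    `index` is assumed to be a dict of word -> set of document ids.
    The cost grows with the number of distinct words, which is the
    trade-off discussed in this thread.
    """
    compiled = re.compile(pattern)
    hits = set()
    for word, doc_ids in index.items():
        if compiled.search(word):
            hits.update(doc_ids)
    return hits


index = {
    "vacation": {1, 2},
    "vaccination": {3},
    "monument": {2, 4},
}
print(sorted(regex_search(index, r"^vac")))
print(sorted(regex_search(index, r"ment$")))
```

For a small policy manual the number of distinct words is modest, so a scan like this may well be fast enough in practice.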