I just came across this post of yours. Curious: Have you guys looked at ternary search trees [1] as an alternative dictionary algorithm? Compared to B-trees and their variants, ternary trees are extremely space-efficient, and lend themselves greatly to partial searching -- such as finding words starting with a certain substring, or pure wildcard matches -- and nearest-neighbour searching. They're not just simple in structure, they're lightning fast, too: Searching for a string of length k in a tree containing n strings will require at most O(log n+k) comparisons. A semirecent article [2] explained the concept in detail. I've successfully employed them and can personally verify their usefulness and speed. Because they're light, the I/O required to process searches is negible compared to other methods such as hash tables or binary trees (and their variants). Unlike binary trees, non-recursive search functions are just as simple to write as recursive ones (the only stack space theoretically needed is the search string and your node representation); balancing is generally not needed, although it is possible to enforce; finally, ternary trees have low overhead for insertion and successful searches. Ternary trees get their names from having three children per node, contrary to the binary tree family, which use two. Each node contains a "split character" and three "low", "equal", and "high" child pointers. The split character is an 8-bit (or 16-bit UNICODE) character value. Let's say we store the words FOO, FRED, FRUIT, ZOPE, and BAR, in that order. The three built is thus (I use "." to denote a null splitchar): F /|\ B O \ | |\ \ A O R \ | | | \ R . E \ | |\ Z . D U | | | O . I | | P T | | E . | . Let's say we're searching for the word "FRUIT\0" (a terminating null character is used to avoid matching with stuff like "FRUITCAKE"; if we leave out the terminating null, this becomes a partial-match search). First, we find the root node, "F". It's equal to our starting character, so we increment our character pointer and go to the node's "equal" child, "O". R > O, so we proceed to the "high" child and find "R". R == R, so we increment our pointer, proceed to the "equal" child, and find "E". U > E, so we proceed to the "high" child, and we find "U". U == U, so we increment our pointer, and so on. We know we've found a singular entry for "FRUIT" because the final splitchar is null. How to traverse, and insert into, this tree should be obvious to anyone. Notice how easily partial/expression searches can be done, since the tree itself lends itself to string-like operations. .. [1] "Fast Algorithms for Sorting and Searching Strings", Jon Bentley/Bob Sedgewick, Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, January 1997. .. [2] Ternary Search Trees, Jon Bentley/Bob Sedgewick, Dr. Dobb's Journal, April 1998. I wrote a ternary tree module in Python a while ago. You can get it here: http://www.mop.no/~alex/download/pytrees.tgz. It does not serve to show any benefits in speed or footprint: Python's memory management, string handling, and garbage collection facilities are not tailored to handle this stuff efficiently, so in Python binary trees have the upper hand. It should, however, serve as a nice demo of the algorithm itself. -- Alexander Staubo http://alex.mop.no/ "In the end, we all assume room temperature." --John Maynard Keynes
-----Original Message----- From: Michel Pelletier [mailto:michel@digicool.com] Sent: 30. september 1999 07:55 To: Dave Kimmel Cc: 'zope@zope.org' Subject: Re: [Zope] ZCatalog searching questions
Dave Kimmel wrote:
Hello all
I'm trying to re-implement the internal policy manual here
at work using
Zope (it's currently hyperlinked WordPerfect documents), and after demoing it to management there are now two things I need to solve with the search feature.
First, is there any way I can search by regular expression or search for part of a word? Specifically, I need to search the content of a bunch of DTML Documents (these only contain the text of the policies and references to standard_html_header and standard_html_footer) for partial word matches. For example, "vac" needs to match "vacation", "vacations", "vaccination", etc.
No. This is a vastly harder problem than it sounds like. Think of a real book index, how would arrange the index 'keys' so that you could do something like '*ing'? The immediate solution is to store two 'sub-keys' per word, one with the word forward 'walking' and one with the word backwards ('gniklaw'). This way you can say walk* or *ing and get walking in both instances (and 'walked', 'zopeing' etc...).
The problem is now your 'lexicon' has doubled in size. Try it yourself.
Some people thing, 'why not use re (the Python regex module)?', because searching like '*ing' would require iterating over all the keys, a linear search like this could take multiple order of maginitude more time than a non-regex search.
There is a pretty good compromise solution called n-grams, but they also result in a lexicon increase, and a much more complicated algorithm. I can refer you to a good book that describes them.
The recent abstraction of the Catalog's 'lexicon' will eventually allow you plug custom lexicon objects into the catalog, giving you hooks to impliment this service yourself, if someone doesn't pay us to do it first.
Second, is there any way to use a list of synonyms when searching? For example, a search for "headstone" should also search for "monument" and "gravestone", and likewise a search for "gravestone" should also search for "monument" and "headstone".
Yes. The 'lexicon' has a hardwired 'synonym and stopword' dictionary in lib/python/SearchIndex/Lexicon.py. This is also projected to be improved by allowing through-the-web lexicon managment (like specifying stopwords and synonmys). Someone also suggested interfacing it to some kind of synonym database, you'd have to search through the arvhives to find the reference.
Am I asking too much of this? Should I be buying a Python book and adding this functionality myself? Should I be using something other than ZCatalog? Should I be using something other than Zope? (Please say no, I happen to like Zope!)
Go for it, but don't give up on ZCatalog or Zope, I'd be surprised if you found fully featured regex searching in another package that would take less of a headache to use than just implimenting a simple 'reversed' lexicon that let's you do globbing (like dos wildcard, no *s in the middle of words, etc.).
-Michel
Thank you! -- Dave Kimmel Systems Analyst Office of the Public Trustee, Alberta Justice
_______________________________________________ Zope maillist - Zope@zope.org http://www.zope.org/mailman/listinfo/zope
(Related lists - please, no cross posts or HTML encoding!
To receive general Zope announcements, see: http://www.zope.org/mailman/listinfo/zope-announce
For developer-specific issues, zope-dev@zope.org - http://www.zope.org/mailman/listinfo/zope-dev )