[Zope] ZCatalog searching questions
Alexander Staubo
alex@mop.no
Thu, 14 Oct 1999 03:09:54 +0200
I just came across this post of yours. Curious: Have you guys looked at
ternary search trees [1] as an alternative dictionary algorithm?
Compared to B-trees and their variants, ternary trees are extremely
space-efficient, and lend themselves greatly to partial searching --
such as finding words starting with a certain substring, or pure
wildcard matches -- and nearest-neighbour searching. They're not just
simple in structure, they're lightning fast, too: Searching for a string
of length k in a tree containing n strings will require at most O(log
n+k) comparisons. A semirecent article [2] explained the concept in
detail. I've successfully employed them and can personally verify their
usefulness and speed.
Because they're light, the I/O required to process searches is negible
compared to other methods such as hash tables or binary trees (and their
variants). Unlike binary trees, non-recursive search functions are just
as simple to write as recursive ones (the only stack space theoretically
needed is the search string and your node representation); balancing is
generally not needed, although it is possible to enforce; finally,
ternary trees have low overhead for insertion and successful searches.
Ternary trees get their names from having three children per node,
contrary to the binary tree family, which use two. Each node contains a
"split character" and three "low", "equal", and "high" child pointers.
The split character is an 8-bit (or 16-bit UNICODE) character value.
Let's say we store the words FOO, FRED, FRUIT, ZOPE, and BAR, in that
order. The three built is thus (I use "." to denote a null splitchar):
F
/|\
B O \
| |\ \
A O R \
| | | \
R . E \
| |\ Z
. D U |
| | O
. I |
| P
T |
| E
. |
.
Let's say we're searching for the word "FRUIT\0" (a terminating null
character is used to avoid matching with stuff like "FRUITCAKE"; if we
leave out the terminating null, this becomes a partial-match search).
First, we find the root node, "F". It's equal to our starting character,
so we increment our character pointer and go to the node's "equal"
child, "O". R > O, so we proceed to the "high" child and find "R". R ==
R, so we increment our pointer, proceed to the "equal" child, and find
"E". U > E, so we proceed to the "high" child, and we find "U". U == U,
so we increment our pointer, and so on. We know we've found a singular
entry for "FRUIT" because the final splitchar is null.
How to traverse, and insert into, this tree should be obvious to anyone.
Notice how easily partial/expression searches can be done, since the
tree itself lends itself to string-like operations.
.. [1] "Fast Algorithms for Sorting and Searching Strings", Jon
Bentley/Bob Sedgewick, Proceedings of the 8th Annual ACM-SIAM Symposium
on Discrete Algorithms, January 1997.
.. [2] Ternary Search Trees, Jon Bentley/Bob Sedgewick, Dr. Dobb's
Journal, April 1998.
I wrote a ternary tree module in Python a while ago. You can get it
here: http://www.mop.no/~alex/download/pytrees.tgz. It does not serve to
show any benefits in speed or footprint: Python's memory management,
string handling, and garbage collection facilities are not tailored to
handle this stuff efficiently, so in Python binary trees have the upper
hand. It should, however, serve as a nice demo of the algorithm itself.
--
Alexander Staubo http://alex.mop.no/
"In the end, we all assume room temperature." --John Maynard Keynes
> -----Original Message-----
> From: Michel Pelletier [mailto:michel@digicool.com]
> Sent: 30. september 1999 07:55
> To: Dave Kimmel
> Cc: 'zope@zope.org'
> Subject: Re: [Zope] ZCatalog searching questions
>
>
> Dave Kimmel wrote:
> >
> > Hello all
> >
> > I'm trying to re-implement the internal policy manual here
> at work using
> > Zope (it's currently hyperlinked WordPerfect documents),
> and after demoing
> > it to management there are now two things I need to solve
> with the search
> > feature.
> >
> > First, is there any way I can search by regular expression
> or search for
> > part of a word? Specifically, I need to search the content
> of a bunch of
> > DTML Documents (these only contain the text of the policies
> and references
> > to standard_html_header and standard_html_footer) for
> partial word matches.
> > For example, "vac" needs to match "vacation", "vacations",
> "vaccination",
> > etc.
>
> No. This is a vastly harder problem than it sounds like. Think of a
> real book index, how would arrange the index 'keys' so that
> you could do
> something like '*ing'? The immediate solution is to store two
> 'sub-keys' per word, one with the word forward 'walking' and one with
> the word backwards ('gniklaw'). This way you can say walk*
> or *ing and
> get walking in both instances (and 'walked', 'zopeing' etc...).
>
> The problem is now your 'lexicon' has doubled in size. Try
> it yourself.
>
> Some people thing, 'why not use re (the Python regex
> module)?', because
> searching like '*ing' would require iterating over all the keys, a
> linear search like this could take multiple order of maginitude more
> time than a non-regex search.
>
> There is a pretty good compromise solution called n-grams,
> but they also
> result in a lexicon increase, and a much more complicated
> algorithm. I
> can refer you to a good book that describes them.
>
> The recent abstraction of the Catalog's 'lexicon' will
> eventually allow
> you plug custom lexicon objects into the catalog, giving you hooks to
> impliment this service yourself, if someone doesn't pay us to do it
> first.
>
> > Second, is there any way to use a list of synonyms when
> searching? For
> > example, a search for "headstone" should also search for
> "monument" and
> > "gravestone", and likewise a search for "gravestone" should
> also search for
> > "monument" and "headstone".
>
> Yes. The 'lexicon' has a hardwired 'synonym and stopword'
> dictionary in
> lib/python/SearchIndex/Lexicon.py. This is also projected to be
> improved by allowing through-the-web lexicon managment (like
> specifying
> stopwords and synonmys). Someone also suggested interfacing
> it to some
> kind of synonym database, you'd have to search through the arvhives to
> find the reference.
>
> > Am I asking too much of this? Should I be buying a Python
> book and adding
> > this functionality myself? Should I be using something
> other than ZCatalog?
> > Should I be using something other than Zope? (Please say
> no, I happen to
> > like Zope!)
>
> Go for it, but don't give up on ZCatalog or Zope, I'd be surprised if
> you found fully featured regex searching in another package that would
> take less of a headache to use than just implimenting a simple
> 'reversed' lexicon that let's you do globbing (like dos
> wildcard, no *s
> in the middle of words, etc.).
>
> -Michel
>
> > Thank you!
> > -- Dave Kimmel
> > Systems Analyst
> > Office of the Public Trustee, Alberta Justice
>
> _______________________________________________
> Zope maillist - Zope@zope.org
> http://www.zope.org/mailman/listinfo/zope
>
> (Related lists - please, no cross posts or HTML encoding!
>
> To receive general Zope announcements, see:
> http://www.zope.org/mailman/listinfo/zope-announce
>
> For developer-specific issues, zope-dev@zope.org -
> http://www.zope.org/mailman/listinfo/zope-dev )
>