RE: [Zope] ZCatalog searching questions

14 Oct 1999

      I just came across this post of yours. Curious: Have you guys looked at
ternary search trees [1] as an alternative dictionary algorithm?

Compared to B-trees and their variants, ternary trees are extremely
space-efficient, and lend themselves greatly to partial searching --
such as finding words starting with a certain substring, or pure
wildcard matches -- and nearest-neighbour searching. They're not just
simple in structure, they're lightning fast, too: Searching for a string
of length k in a tree containing n strings will require at most O(log
n+k) comparisons. A semirecent article [2] explained the concept in
detail. I've successfully employed them and can personally verify their
usefulness and speed.

Because they're light, the I/O required to process searches is negible
compared to other methods such as hash tables or binary trees (and their
variants). Unlike binary trees, non-recursive search functions are just
as simple to write as recursive ones (the only stack space theoretically
needed is the search string and your node representation); balancing is
generally not needed, although it is possible to enforce; finally,
ternary trees have low overhead for insertion and successful searches.

Ternary trees get their names from having three children per node,
contrary to the binary tree family, which use two. Each node contains a
"split character" and three "low", "equal", and "high" child pointers.
The split character is an 8-bit (or 16-bit UNICODE) character value.

Let's say we store the words FOO, FRED, FRUIT, ZOPE, and BAR, in that
order. The three built is thus (I use "." to denote a null splitchar):

                        F
                       /|\
                      B O \
                      | |\ \
                      A O R \
                      | | |  \
                      R . E   \
                      |   |\   Z
                      .   D U  |
                          | |  O
                          . I  |
                            |  P
                            T  |
                            |  E
                            .  |
                               .

Let's say we're searching for the word "FRUIT\0" (a terminating null
character is used to avoid matching with stuff like "FRUITCAKE"; if we
leave out the terminating null, this becomes a partial-match search).
First, we find the root node, "F". It's equal to our starting character,
so we increment our character pointer and go to the node's "equal"
child, "O". R > O, so we proceed to the "high" child and find "R". R ==
R, so we increment our pointer, proceed to the "equal" child, and find
"E". U > E, so we proceed to the "high" child, and we find "U". U == U,
so we increment our pointer, and so on. We know we've found a singular
entry for "FRUIT" because the final splitchar is null.

How to traverse, and insert into, this tree should be obvious to anyone.
Notice how easily partial/expression searches can be done, since the
tree itself lends itself to string-like operations.

.. [1] "Fast Algorithms for Sorting and Searching Strings", Jon
Bentley/Bob Sedgewick, Proceedings of the 8th Annual ACM-SIAM Symposium
on Discrete Algorithms, January 1997.

.. [2] Ternary Search Trees, Jon Bentley/Bob Sedgewick, Dr. Dobb's
Journal, April 1998.

I wrote a ternary tree module in Python a while ago. You can get it
here: http://www.mop.no/~alex/download/pytrees.tgz. It does not serve to
show any benefits in speed or footprint: Python's memory management,
string handling, and garbage collection facilities are not tailored to
handle this stuff efficiently, so in Python binary trees have the upper
hand. It should, however, serve as a nice demo of the algorithm itself.

-- 
Alexander Staubo             http://alex.mop.no/
"In the end, we all assume room temperature." --John Maynard Keynes
...
-----Original Message-----
From: Michel Pelletier [mailto:michel@digicool.com]
Sent: 30. september 1999 07:55
To: Dave Kimmel
Cc: 'zope@zope.org'
Subject: Re: [Zope] ZCatalog searching questions
Dave Kimmel wrote:
...
Hello all
I'm trying to re-implement the internal policy manual here
at work using
...
Zope (it's currently hyperlinked WordPerfect documents), 
and after demoing
it to management there are now two things I need to solve 
with the search
feature.
First, is there any way I can search by regular expression 
or search for
part of a word?  Specifically, I need to search the content 
of a bunch of
DTML Documents (these only contain the text of the policies 
and references
to standard_html_header and standard_html_footer) for 
partial word matches.
For example, "vac" needs to match "vacation", "vacations", 
"vaccination",
etc.
No.  This is a vastly harder problem than it sounds like.  Think of a
real book index, how would arrange the index 'keys' so that 
you could do
something like '*ing'?  The immediate solution is to store two
'sub-keys' per word, one with the word forward 'walking' and one with
the word backwards ('gniklaw').  This way you can say walk* 
or *ing and
get walking in both instances (and 'walked', 'zopeing' etc...).
The problem is now your 'lexicon' has doubled in size.  Try 
it yourself.
Some people thing, 'why not use re (the Python regex 
module)?', because
searching like '*ing' would require iterating over all the keys, a
linear search like this could take multiple order of maginitude more
time than a non-regex search.
There is a pretty good compromise solution called n-grams, 
but they also
result in a lexicon increase, and a much more complicated 
algorithm.  I
can refer you to a good book that describes them.
The recent abstraction of the Catalog's 'lexicon' will 
eventually allow
you plug custom lexicon objects into the catalog, giving you hooks to
impliment this service yourself, if someone doesn't pay us to do it
first.
...
Second, is there any way to use a list of synonyms when 
searching?  For
example, a search for "headstone" should also search for 
"monument" and
"gravestone", and likewise a search for "gravestone" should 
also search for
"monument" and "headstone".
Yes.  The 'lexicon' has a hardwired 'synonym and stopword' 
dictionary in
lib/python/SearchIndex/Lexicon.py.  This is also projected to be
improved by allowing through-the-web lexicon managment (like 
specifying
stopwords and synonmys).  Someone also suggested interfacing 
it to some
kind of synonym database, you'd have to search through the arvhives to
find the reference.
...
Am I asking too much of this?  Should I be buying a Python 
book and adding
this functionality myself?  Should I be using something 
other than ZCatalog?
Should I be using something other than Zope?  (Please say 
no, I happen to
like Zope!)
Go for it, but don't give up on ZCatalog or Zope, I'd be surprised if
you found fully featured regex searching in another package that would
take less of a headache to use than just implimenting a simple
'reversed' lexicon that let's you do globbing (like dos 
wildcard, no *s
in the middle of words, etc.).
-Michel
...
Thank you!
-- Dave Kimmel
Systems Analyst
Office of the Public Trustee, Alberta Justice
_______________________________________________
Zope maillist  -  Zope@zope.org
http://www.zope.org/mailman/listinfo/zope
(Related lists - please, no cross posts or HTML encoding!
To receive general Zope announcements, see:
http://www.zope.org/mailman/listinfo/zope-announce
For developer-specific issues, zope-dev@zope.org -
http://www.zope.org/mailman/listinfo/zope-dev )