[Zope-dev] Catalog improvements

Andreas Jung <andreas@zope.com>
Wed, 28 Nov 2001 09:43:27 -0500


----- Original Message -----
From: "Chris Withers" <chrisw@nipltd.com>
To: "Matt Hamilton" <matth@netsight.co.uk>
Cc: "Casey Duncan" <c.duncan@nlada.org>; "Steve Alexander"
<steve@cat-box.net>; "Wolfram Kerber" <wk@gallileus.de>; <zope-dev@zope.org>
Sent: Wednesday, November 28, 2001 09:27
Subject: Re: [Zope-dev] Catalog improvements


> Matt Hamilton wrote:
> >
> > I would like in on that too :)  About a year or so ago I was working on a
> > full-text indexing system for indexing several gigabytes of text (mailing
> > list archives).  Most of it was written in C and uses quite a lot of cool
> > algorithms from various information retrieval papers and books.  I have
> > been hoping to have the time to take parts of it and work it into the new
> > PluginIndex architecture.  The existing code uses BerkeleyDB files to hold
> > the index structures, but I would like to use ZODB instead to give it a
> > bit more modularity.
>
> Hi Matt,
>
> Are any of these algorithms publicly available? I'd be _very_ interested in
> them :-)
>

I think the "MG" software from the book "Managing Gigabytes" is GPLed and
currently released as mg-1.21. Judging from the book's table of contents, it
seems to be a very detailed source on text processing and gives a lot of
information about different index types. However, I miss an explanation of
more recent data structures such as suffix arrays or suffix trees, which have
several advantages over B-Trees for text processing.
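
To illustrate what a suffix array offers, here is a minimal Python sketch. It
is not taken from MG or from the Zope catalog. The function names
build_suffix_array and find_all are made up for this example, and the naive
construction is only suitable for small inputs. Because every suffix of the
text is indexed, the array can answer arbitrary substring queries with a
binary search, whereas a word-keyed index only finds whole terms.

```python
def build_suffix_array(text):
    """Return the start offsets of all suffixes of text in sorted order."""
    # Naive construction that sorts full suffix slices: fine for a demo,
    # far too slow and memory-hungry for gigabytes of text.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text, sa, pattern):
    """Return the sorted offsets at which pattern occurs in text."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "managing gigabytes"
    sa = build_suffix_array(text)
    print(find_all(text, sa, "gi"))
```

Each lookup needs O(log n) string comparisons of at most len(pattern)
characters, and the pattern does not have to start on a word boundary.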

Andreas

    ---------------------------------------------------------------------
   -    Andreas Jung                            Zope Corporation       -
  -   EMail: andreas@zope.com                http://www.zope.com      -
 -  "Python Powered"                       http://www.python.org     -
  -   "Makers of Zope"                       http://www.zope.org      -
   -                  "Life is a fulltime occupation"                  -
    ---------------------------------------------------------------------