----- Original Message -----
From: "Chris Withers" <chrisw@nipltd.com>
To: "Matt Hamilton" <matth@netsight.co.uk>
Cc: "Casey Duncan" <c.duncan@nlada.org>; "Steve Alexander" <steve@cat-box.net>; "Wolfram Kerber" <wk@gallileus.de>; <zope-dev@zope.org>
Sent: Wednesday, November 28, 2001 09:27
Subject: Re: [Zope-dev] Catalog improvements

Matt Hamilton wrote:
I would like in on that too :) About a year or so ago I was working on a full-text indexing system for indexing several gigabytes of text (mailing list archives). Most of it was written in C and uses quite a lot of cool algorithms from various information retrieval papers and books. I have been hoping to have the time to take parts of it and work it into the new PluginIndex architecture. The existing code uses BerkeleyDB files to hold the index structures, but I would like to use ZODB instead to give it a bit more modularity.

Hi Matt,
Are any of these algorithms publicly available? I'd be _very_ interested in them :-)

I think the software "MG" from the book "Managing Gigabytes" is GPLed and currently released as mg-1.21. Going by the book's table of contents, it seems to be a very detailed source on text processing, with a lot of information about different index types. What I miss is some explanation of more recent data structures such as suffix arrays or suffix trees, which have several advantages over B-Trees for text processing.

Andreas

---------------------------------------------------------------------
-  Andreas Jung                     Zope Corporation                -
-  EMail: andreas@zope.com          http://www.zope.com             -
-  "Python Powered"                 http://www.python.org           -
-  "Makers of Zope"                 http://www.zope.org             -
-  "Life is a fulltime occupation"                                  -
---------------------------------------------------------------------
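
To make the suffix array remark a bit more concrete, here is a minimal sketch in Python. It is not taken from MG or from any Zope index; the function names build_suffix_array and find_all are hypothetical, and the construction is the naive sort-all-suffixes approach, which is only reasonable for small texts. It illustrates one commonly cited advantage: a sorted list of suffix offsets can answer arbitrary substring queries, not just whole-word lookups, using two binary searches.

```python
def build_suffix_array(text):
    """Return the start offsets of all suffixes of text, in sorted suffix order.

    Naive construction (assumption: small inputs only). Real systems use
    much faster construction algorithms.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text, sa, pattern):
    """Return the sorted offsets at which pattern occurs in text.

    All suffixes that start with pattern form one contiguous block in the
    suffix array, so two binary searches locate the block's boundaries.
    """
    m = len(pattern)

    # First position whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First position whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(sa)
    print(find_all(text, sa, "ana"))
```

Each lookup costs on the order of len(pattern) * log(len(text)) character comparisons. The price is one integer per character of text, which is part of why word-based inverted files remain attractive for very large collections.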