[Zope-dev] Catalog improvements

Matt Hamilton matth@netsight.co.uk
Wed, 28 Nov 2001 14:55:45 +0000 (GMT)


On Wed, 28 Nov 2001, Andreas Jung wrote:

> I think the software "MG" from the book "Managing Gigabytes" is GPLed and
> currently
> released as mg-1.21. Walking through the TOC of the book, it seems to be a
> very detailed
> sources about text processing and gives very much informations about
> different indexes types.
> But I miss some explanations about current data structures like suffix
> arrays or suffix tree
> that have several advantages for text processing compared to B-Trees.

Suffix Trees/Tries take up a *lot* of space.  But they are very fast, and
useful for searching for substrings.  The main gist of the stuff in
'Managing Gigabytes' is that it is possible to store an ascending list of
integers in a compressed form, such that on average each integer requires
only 4 bits to represent it.  This is obviously much more compact than a
straight list of 32 or 64 bit integers/longs (plus any overhead python
adds to its inbuild list type).  The other point is that you can read and
decode the lists very quickly (you don't need to decompress the entire
list first before reading it).  Also consecutive numbers only take 1 bit
of storage, this means that 'stopwords' that are normally omitted from
indexes due to their very high frequency (and hence bloat of the index)
can be stored very efficiently.

One problem is that all of the research done in MG is based on much older
hardware than is currently availible and they try to make certain
optimisations, which nowadays don't save much time.

-Matt

-- 
Matt Hamilton                                         matth@netsight.co.uk
Netsight Internet Solutions, Ltd.          Business Vision on the Internet
http://www.netsight.co.uk                               +44 (0)117 9090901
Web Hosting | Web Design  | Domain Names  |  Co-location  | DB Integration