[Zope-dev] Catalog improvements

Matt Hamilton matth@netsight.co.uk
Tue, 27 Nov 2001 15:23:43 +0000 (GMT)


On Tue, 27 Nov 2001, Andreas Jung wrote:

> Is this code available for public ?

Sort of :)  It used to be around, but the server it lives on is currently
offline and needs a new disk controller, so I don't have it to hand.  It is
also poorly commented :( and written in very highly optimised (read:
illegible) C.

The main bits needed from it are the routines to store and retrieve
compressed lists of ascending integers (i.e. the kind used in indexes).  I
want to write a Python wrapper around them and release a list-like Python
data structure that will allow efficient storage of indexes.  The other bit
is the code for the cosine similarity ranking, which ranks the documents in
order of relevance to a query.
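
As a rough illustration of the compressed-list idea (this is not the
Managing Gigabytes code, and not the C routines mentioned above), the
sketch below stores an ascending list of document numbers as the gaps
between neighbours, each gap written as a variable-length byte sequence.
Variable-byte coding is swapped in here only because it is short to show;
the function names encode_postings and decode_postings are made up for
the example.

```python
def encode_postings(doc_ids):
    """Pack a strictly ascending list of non-negative ints into bytes.

    Each value is stored as the gap from the previous one, and each gap is
    written 7 bits at a time (low bits first). The high bit of a byte is
    set when more bytes of the same gap follow.
    """
    out = bytearray()
    prev = -1
    for doc_id in doc_ids:
        gap = doc_id - prev
        if gap <= 0:
            raise ValueError("doc_ids must be non-negative and strictly ascending")
        prev = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_postings(data):
    """Reverse encode_postings, returning the original list of ints."""
    doc_ids = []
    prev = -1
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation byte: keep accumulating this gap
        else:
            prev += gap
            doc_ids.append(prev)
            gap = 0
            shift = 0
    return doc_ids
```

Small gaps take a single byte, so a term that appears in many documents
compresses well.  A list-like wrapper could keep the encoded bytes
internally and decode only when iterated.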

Most of the code is taken from the book/code 'Managing Gigabytes'
by Witten, Moffat & Bell (http://www.cs.mu.OZ.AU/mg/).  The code is quite
old now (1999) and designed for quite large systems, or relatively static
text (i.e. it doesn't do incremental indexing very well).  I worked on
developing a 'forward' index which could be easily updated, and then
inverted quite quickly on a regular basis (since it didn't need to parse
the source text again).
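
The forward-then-invert approach can be sketched roughly as follows.  This
is only an assumed shape for the data (a plain dict mapping a document
number to its terms), not the actual structure used in the C code, and
the name invert is hypothetical.

```python
from collections import defaultdict


def invert(forward_index):
    """Build term -> ascending doc-id list from doc_id -> iterable of terms.

    Documents are visited in ascending id order, so every posting list
    comes out already sorted and ready for gap compression.
    """
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):
        # set() drops repeated terms within one document
        for term in sorted(set(forward_index[doc_id])):
            inverted[term].append(doc_id)
    return dict(inverted)


# Updating a document only touches its own forward entry; the source
# text of the other documents is never parsed again.
forward = {
    1: ["zope", "catalog"],
    2: ["catalog", "index"],
}
forward[1] = ["zope", "zodb"]  # document 1 was edited and re-parsed
inverted = invert(forward)
```

Re-running invert on a schedule rebuilds the query-time index from the
stored term lists alone.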


-Matt

-- 
Matt Hamilton                                         matth@netsight.co.uk
Netsight Internet Solutions, Ltd.          Business Vision on the Internet
http://www.netsight.co.uk                               +44 (0)117 9090901
Web Hosting | Web Design  | Domain Names  |  Co-location  | DB Integration