[Zope-dev] Building A MailMan Search Interface

Kapil Thangavelu kvthan@wm.edu
Sun, 16 Jul 2000 16:46:29 -0700


I've come across an itch. I'm tired of having to go through the mailing
list archives to find what I need. The search interface at egroups is a
bit slow and cumbersome. The one at ntlpd is much nicer, but I'd like to
have my own so I can point an archiver/search interface at any mailman
mailing list.

So I decided I'd like to make a generic mailman search interface in
Zope. I've got a cronable retrieval script working that grabs archives.
The next step is pretty crucial and I thought I'd ask around for advice.

My question then becomes one of storage and parsing. I'm looking for
suggestions on how to do this in an efficient and speedy manner. I'm
willing to use Linux/*nix specific stuff if it helps performance.

options:
The first question, which applies to most of these approaches, is
whether to store mails as individual items or in the default format of
the text archive.
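
To make the "individual items vs. text archive" choice concrete, here is
a minimal sketch of splitting an archive into separate messages with
Python's standard mailbox module. It assumes the downloaded text archive
is in (or close to) mbox format, with each message starting on a "From "
line. The sample data, the example.org addresses and message ids, and
the split_archive name are made up for illustration.

```python
import mailbox

# Hypothetical two-message archive in mbox layout (illustration only).
SAMPLE_ARCHIVE = """\
From kvthan at wm.edu  Sun Jul 16 16:46:29 2000
From: kvthan at wm.edu (Kapil Thangavelu)
Date: Sun, 16 Jul 2000 16:46:29 -0700
Subject: [Zope-dev] Building A MailMan Search Interface
Message-ID: <hypothetical-1@example.org>

I've come across an itch.

From someone at example.org  Mon Jul 17 09:00:00 2000
From: someone at example.org (Hypothetical Poster)
Date: Mon, 17 Jul 2000 09:00:00 -0700
Subject: [Zope-dev] Re: Building A MailMan Search Interface
Message-ID: <hypothetical-2@example.org>

Have you looked at flat files?
"""


def split_archive(path):
    """Return a list of (subject, body) tuples, one per message in the file."""
    # create=False: fail instead of silently creating a missing archive.
    box = mailbox.mbox(path, create=False)
    try:
        return [(msg["Subject"], msg.get_payload()) for msg in box]
    finally:
        box.close()
```

Writing SAMPLE_ARCHIVE to a file and calling split_archive on it should
give back two items, which could then be stored one per object or row.
Keeping the archive whole skips this step but pushes the splitting work
into every search or display request.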

flat file:
This gives me a couple of parsing/searching options, like using grep,
the C regexp library, or the camel library (from Helix Code's
Evolution). Any other options for this format?

Downside: this introduces some minor hurdles with presentation.
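
As a rough illustration of the flat file route, the sketch below does a
grep-style regex scan over archive lines while remembering which
message each hit belongs to, which is one way around the presentation
hurdle. It assumes mbox-style "From " separator lines and a "Subject:"
header per message, and it only searches message bodies. The
grep_archive name is hypothetical.

```python
import re


def grep_archive(lines, pattern):
    """Return (subject, line_number, line) for each body line matching pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    subject = None
    in_headers = False
    hits = []
    for number, raw in enumerate(lines, 1):
        line = raw.rstrip("\n")
        if line.startswith("From "):
            # Assumed mbox separator: a new message and its headers begin.
            in_headers = True
            subject = None
            continue
        if in_headers:
            if line == "":
                in_headers = False  # a blank line ends the header block
            elif line.lower().startswith("subject:"):
                subject = line[len("subject:"):].strip()
            continue  # header lines themselves are not searched
        if regex.search(line):
            hits.append((subject, number, line))
    return hits
```

Passing an open file object as lines keeps memory use flat, since the
archive is read one line at a time. Every query still reads the whole
file, so the cost grows with archive size.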

zodb - btree folder
For parsing/indexing this basically forces me to use ZCatalog, which I
don't think will scale to the amount of raw text without lots of RAM. I
could be wrong (I haven't gone through the Catalog code), but this is my
working understanding of it.
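
For a feel of why a text index can get big, here is a toy inverted
index: a mapping from every distinct word to the set of messages that
contain it. This is only the general technique, not ZCatalog's actual
code or data structures, and the build_index and search names are made
up for illustration.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")


def build_index(messages):
    """messages: dict of message id -> text. Returns word -> set of ids."""
    index = defaultdict(set)
    for msg_id, text in messages.items():
        for word in WORD.findall(text.lower()):
            index[word].add(msg_id)
    return index


def search(index, query):
    """Return the ids of messages that contain every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

Lookups are quick because they only touch the sets for the query words,
but the index holds an entry for every distinct word in every message,
so its size grows along with the amount of raw text.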

If I store the emails as archives I could probably whip up a reasonably
speedy external method that would search through them.

One benefit will be the ease of the presentation logic, but this is
secondary to a speedy system.

rdbms (probably postgres - maybe mysql)
I'd prefer Postgres since I'll probably be doing some other work with
it. But a LIKE '%term%' match is probably one of the most expensive
operations you can use on a db, and it's pretty limited in syntax. If I
had a spare Oracle system then I'd use it in a heartbeat, with the
interMedia Text stuff for searching, but I'd hate to tie this to a very
expensive closed system. MySQL seems to excel at speed (perhaps because
it was designed for it :) but again the limitations of SQL search syntax
pop up. If anyone knows of any good ways to search through text in a db
I'd love to hear about 'em.
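
For reference, this is roughly what the LIKE approach looks like. The
sketch uses Python's bundled sqlite3 module purely as a stand-in so it
runs anywhere; Postgres or MySQL would need their own driver, and the
mail table, its columns, and the function names are hypothetical.

```python
import sqlite3


def build_db(messages):
    """messages: iterable of (subject, body). Returns an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mail (id INTEGER PRIMARY KEY, subject TEXT, body TEXT)"
    )
    conn.executemany("INSERT INTO mail (subject, body) VALUES (?, ?)", messages)
    conn.commit()
    return conn


def search_bodies(conn, term):
    """Return subjects of messages whose body contains term as a substring."""
    # Escape LIKE wildcards in the search term so they match literally.
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT subject FROM mail WHERE body LIKE ? ESCAPE '\\' ORDER BY id",
        ("%" + escaped + "%",),
    )
    return [subject for (subject,) in rows]
```

The leading % is what hurts: an ordinary index on body cannot help with
a pattern that may start anywhere in the text, so the database generally
ends up checking every row. The syntax limit also shows here, since this
is plain substring matching with no word boundaries or ranking.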


Right now, I'm leaning slightly towards flat file storage, but I'd
love to hear some suggestions.

Cheers

Kapil