[ZODB-Dev] ZODB for spambayes server-side filter?

Mon Jan 12 11:10:42 EST 2004

On Mon, 2004-01-12 at 09:44, Simone Piunno wrote:
> I'm working on a server-side spam filter based on spambayes.
> After some prototyping with BDB4, I've started to look at ZODB.
> I'm trying to understand if this is a good idea.

I think it's an excellent match.  When spambayes was young, I actually
worked on such an integration with ZODB4, but had to abandon it; no time
for spambayes or ZODB4.

> The project design is the following:

[skipping most of the details...]

>  - users can choose to receive all the traffic, simply tagged, or they can
>    choose to block spam and/or unsures.  They will receive a daily report
>    on blocked email, so that just skimming at the from/subject list in the
>    report they could decide if a correction is requested.  Blocked email could
>    be unblocked and/or trained manually through the web, if you do it before
>    automatic expiration timeout.

Just a small suggestion: Sort the report by score so that the user can
identify the hammiest messages without too much trouble.
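
A minimal sketch of that ordering, assuming each blocked message carries
a spambayes-style score between 0.0 (ham) and 1.0 (spam).  The
BlockedMessage record and daily_report() helper are hypothetical names
made up for illustration, not part of spambayes:

```python
from dataclasses import dataclass


@dataclass
class BlockedMessage:
    """Hypothetical record for one blocked email."""
    sender: str
    subject: str
    score: float  # assumed scale: 0.0 = certainly ham, 1.0 = certainly spam


def daily_report(blocked):
    """Return report lines sorted so the hammiest (lowest score) come first."""
    ordered = sorted(blocked, key=lambda m: m.score)
    return [f"{m.score:.2f}  {m.sender}  {m.subject}" for m in ordered]
```

The messages most likely to be false positives then sit at the top of
the report, where the user looks first.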

> I believe simple BDB is too flat to persist such a complex data structure, 
> therefore I've started looking at ZODB.  I'm fairly convinced that a 
> transactional storage is required here and it will be mostly read only: 
> writes will be only for training, stats update and configuration.
> After some benchmark, I got a 5-10x performance increase.

It sounds like you're right then :-).  ZODB should be more natural to
program with, and it's no surprise that it is faster than BDB.

> One main question is: how to avoid collision collapse?  I think at 1st approx 
> in case of transaction collision I can safely abort the SMTP connection and 
> wait for retry, but how can I be sure that more retries won't accumulate 
> collapsing the database?

This sounds like the most interesting question, but I'm not sure I
follow it.  What is "collision collapse?"  I think by collision, you
mean getting a conflict error.  If you use ZODB 3.3 and MVCC (more on
that below), you'll only see conflicts for messages that perform
updates.  You can retry those updates a few times.  At worst, you could
generate SMTP errors and bounce those messages back to the user -- or
queue the messages internally for later processing.
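
A rough sketch of such a bounded retry loop.  To keep it self-contained,
the ConflictError class below is a local stand-in for
ZODB.POSException.ConflictError, and update, commit and abort are
callables supplied by the caller (with ZODB they would wrap the training
code, transaction.commit() and transaction.abort()).  The function name
and the max_attempts default are assumptions:

```python
class ConflictError(Exception):
    """Local stand-in for ZODB.POSException.ConflictError."""


def run_with_retries(update, commit, abort, max_attempts=3):
    """Try update+commit up to max_attempts times.

    Returns True once a commit succeeds, False if every attempt hit a
    conflict.  On False the caller decides what to do next, for example
    return an SMTP error or queue the message for later processing.
    """
    for _ in range(max_attempts):
        try:
            update()
            commit()
            return True
        except ConflictError:
            # Discard the failed attempt's changes before trying again.
            abort()
    return False
```

Because the number of attempts is capped, a burst of conflicts turns
into deferred messages rather than an unbounded pile of retries.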

It doesn't seem like conflicts on writes should be all that common. 
They might occur in the shared word probability database, but then you
could probably design some conflict avoidance techniques along the lines
of queued catalog; see http://www.python.org/~jeremy/weblog/031031c.html
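
One way to picture that kind of conflict avoidance, as a sketch only:
training requests are appended to a queue, and a single process later
folds them into the shared counts in one batch, so concurrent trainers
never touch the same count objects.  QueuedTraining and its methods are
hypothetical names, plain Counter objects stand in for the persistent
word database, and counting each token once per message is a choice made
for this sketch:

```python
from collections import Counter


class QueuedTraining:
    """Buffer training requests instead of updating shared counts at once."""

    def __init__(self):
        self.pending = []  # list of (tokens, is_spam) awaiting processing

    def train(self, tokens, is_spam):
        # Cheap append; the shared word counts are not touched here.
        self.pending.append((tuple(tokens), is_spam))

    def flush(self, spam_counts, ham_counts):
        """Apply all queued requests in one batch; return how many ran."""
        applied = len(self.pending)
        for tokens, is_spam in self.pending:
            target = spam_counts if is_spam else ham_counts
            target.update(set(tokens))  # each token counted once per message
        self.pending = []
        return applied
```

In a real deployment the queue itself would have to be persistent and
spread across writers so that appends do not conflict with each other;
this sketch leaves that part out.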

The typical processing can be handled using MVCC, which will eliminate
any read conflicts.  Where a read would otherwise conflict, you get the
old data, as written before the conflicting update.  I don't think the
stale data would have much effect on your application, since you'd only
see it when a message arrived at the same time as an update.  In that
case, you're basically re-ordering the processing of concurrent
activities, making it look like the to-be-scored message arrived before
the update.
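
A toy model of that behaviour, not how ZODB implements it: every commit
gets an increasing transaction id, and a reader only sees values
committed at or before the id current when it began.  VersionedStore and
Snapshot are hypothetical names invented for this sketch:

```python
class VersionedStore:
    """Toy multi-version store: keeps every committed value per key."""

    def __init__(self):
        self._history = {}  # key -> list of (tid, value), ascending tid
        self._tid = 0       # id of the most recent commit

    def commit(self, changes):
        """Record a dict of changes as one new transaction."""
        self._tid += 1
        for key, value in changes.items():
            self._history.setdefault(key, []).append((self._tid, value))
        return self._tid

    def begin(self):
        """Start a reader pinned to the current state of the store."""
        return Snapshot(self, self._tid)


class Snapshot:
    """Read-only view of the store as of one transaction id."""

    def __init__(self, store, tid):
        self._store = store
        self._tid = tid

    def get(self, key, default=None):
        # Walk back to the newest value not newer than this snapshot.
        for tid, value in reversed(self._store._history.get(key, [])):
            if tid <= self._tid:
                return value
        return default
```

A scorer that called begin() before a training commit keeps reading the
older probabilities, which is the same outcome as if its message had
arrived just before the update.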

Jeremy