[Zope] Indexing: ZopeSplitter and numbers

sean.upton@uniontrib.com
Tue, 13 Nov 2001 13:26:07 -0800


I think I'd have to jump on the bandwagon and agree that numbers should not
be stripped.  I'll second the idea of a fish-bowl proposal.

In a full text search of classified ads, for example, one wants to search
for a 2000 Ford F150; in Zope 2.3.x, Splitter.c stripped out both 2000 and
F150.  The change was easy: just replace isalpha() with isalnum() in the
relevant part of the code.  I'm not sure what the story is in 2.4, but if
numbers are still dropped, people searching for a year 2000 truck are going
to find ads for ones built in 1982 as well.

I use a modified Splitter.so that allows numbers, as well as one-character
words, so people can search for "c programmer" in the classified ads.
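
To illustrate the two changes described above (accepting digits and keeping
one-character words), here is a minimal sketch in Python 3 rather than C. It
is not the Splitter.c code: the function name split_words and its keep and
min_len parameters are hypothetical, and the model is deliberately simplified
(any character that fails the test acts as a separator, and case is left
untouched).

```python
def split_words(text, keep=str.isalpha, min_len=2):
    """Split text into words made of characters accepted by `keep`.

    Simplified model, not the real Splitter.c logic: every character
    rejected by `keep` is a separator, and words shorter than `min_len`
    are dropped.
    """
    words, current = [], []
    for ch in text + " ":  # the trailing space flushes the last word
        if keep(ch):
            current.append(ch)
        else:
            if len(current) >= min_len:
                words.append("".join(current))
            current = []
    return words


ad = "2000 Ford F150"
strict = split_words(ad)                                 # letters only, 2+ chars
relaxed = split_words(ad, keep=str.isalnum, min_len=1)   # digits and 1-char words
print(strict)    # ['Ford']
print(relaxed)   # ['2000', 'Ford', 'F150']
print(split_words("c programmer", keep=str.isalnum, min_len=1))
```

With the letters-only test, both "2000" and "F150" disappear from the index;
swapping in the alphanumeric test and lowering the minimum length keeps them,
along with the "c" in "c programmer".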

I'm curious about a few other things (that I really haven't tested):
- How does Zope's splitter handle hyphenated words?
- Is there a way to split words with period characters reliably, supposing I
wanted to be able to search for terms like "yahoo.com" or "Splitter.so" or
"Microsoft .NET" in text?
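
One possible approach to the period and hyphen questions, sketched in Python 3
with a regular expression. This is an assumption about how such a splitter
could work, not a description of what ZopeSplitter does; the names TOKEN and
tokenize are hypothetical, and the pattern handles ASCII letters and digits
only.

```python
import re

# Hypothetical token pattern: an optional leading period (for ".NET"),
# then runs of letters/digits joined by single internal periods or hyphens.
# Trailing punctuation such as a sentence-ending period is not included.
TOKEN = re.compile(r"\.?[A-Za-z0-9]+(?:[.-][A-Za-z0-9]+)*")


def tokenize(text):
    """Return tokens, keeping internal periods and hyphens intact."""
    return TOKEN.findall(text)


sample = "Visit yahoo.com, load Splitter.so, or try Microsoft .NET. Use full-text search."
tokens = tokenize(sample)
print(tokens)
```

The trade-off is that "yahoo.com" is indexed as a single term, so a query for
plain "yahoo" would not match it unless the parts are indexed as well. Text
with a missing space after a sentence-ending period (such as "end.Next") would
also be joined into one token.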

I would think that the appropriate default behavior for ZopeSplitter would
be to be more relaxed about what it strips out.

Sean

-----Original Message-----
From: Andreas Jung [mailto:andreas@zope.com]
Sent: Tuesday, November 13, 2001 11:08 AM
To: Casey Duncan; richard@bizarsoftware.com.au; zope@zope.org
Subject: Re: [Zope] Indexing: ZopeSplitter and numbers


Zope 2.4.X supports multiple splitters, so you can write your own splitter.
The only disadvantage is that there is currently no official API for adding
custom splitters (short of monkeypatching), but there is already a proposal
in the fishbowl to address this problem.

Andreas
----- Original Message -----
From: "Casey Duncan" <c.duncan@nlada.org>
To: "Andreas Jung" <andreas@andreas-jung.com>;
<richard@bizarsoftware.com.au>; <zope@zope.org>
Sent: Tuesday, November 13, 2001 13:52
Subject: Re: [Zope] Indexing: ZopeSplitter and numbers


> On Tuesday 13 November 2001 07:05 am, Andreas Jung allegedly wrote:
> > The answer is - as always - in the sources ;-) The splitting algorithm
> > is pretty dumb. Roughly speaking, it splits the text into words but
> > drops pure numbers.
> > To test the splitter try this:
> >
> > from ZopeSplitter import ZopeSplitter
> > print list(ZopeSplitter('abc 123 t353 nmj'))
> >
> > gives ['abc', 't353', 'nmj']
> >
> >
> > Andreas
>
> Has there been any thought given to changing this behavior? I smell a
> fishbowl prop...
>
> /---------------------------------------------------\
>   Casey Duncan, Sr. Web Developer
>   National Legal Aid and Defender Association
>   c.duncan@nlada.org
> \---------------------------------------------------/
>



