[Zope] Indexing: ZopeSplitter and numbers
Richard Jones
richard@bizarsoftware.com.au
Wed, 14 Nov 2001 08:46:07 +1100
On Wednesday 14 November 2001 08:26, sean.upton@uniontrib.com wrote:
> I think I'd have to jump on the bandwagon and agree that numbers should not
> be stripped. I'll second the idea of a fish-bowl proposal.
>
> In a full text search of classified ads, for example, one wants to search
> for a 2000 Ford F150; in Zope 2.3.x, Splitter.c stripped out both 2000 and
> F150. The change was easy: just replace isalpha() with isalnum() in the
> relevant part of the code. I'm not sure what the story is in 2.4, but it
> sounds like people searching for a year 2000 truck are going to find ads
> for ones built in 1982.
This is the behaviour we want - have you experienced any negative
side-effects from doing this?
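
To make the isalpha()/isalnum() difference concrete, here is a rough Python
sketch. It is an illustration only, not the actual Splitter.c logic: the name
split_words and the keep/min_len parameters are made up for this example, and
the real splitter also handles stop words and other details that are ignored
here.

```python
def split_words(text, keep=str.isalpha, min_len=2):
    """Split text into lowercase words.

    A character is part of a word when keep(ch) is true; anything else
    ends the current word. Words shorter than min_len are dropped.
    (Hypothetical helper for illustration, not Zope's API.)
    """
    words = []
    current = []
    for ch in text:
        if keep(ch):
            current.append(ch.lower())
        else:
            if len(current) >= min_len:
                words.append("".join(current))
            current = []
    # Flush the last word if the text did not end on a separator.
    if len(current) >= min_len:
        words.append("".join(current))
    return words


# Letters only: the year and the model number are lost.
print(split_words("2000 Ford F150"))
# Letters and digits: all three terms survive.
print(split_words("2000 Ford F150", keep=str.isalnum))
# Allowing one-character words keeps "c" in "c programmer".
print(split_words("c programmer", min_len=1))
```

With the letters-only test, "F150" collapses to a lone "f", which the
minimum-length rule then discards, so both "2000" and "F150" vanish from the
index.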
> I use a modified Splitter.so that allows numbers, as well as one-character
> words, so people can search for "c programmer" in the classified ads.
>
> I'm curious about a few other things (that I really haven't tested):
> - How does Zope's splitter handle hyphenated words?
> - Is there a way to split words with period characters reliably, supposing
> I wanted to be able to search for terms like "yahoo.com" or "Splitter.so"
> or "Microsoft .NET" in text?
... or e-mail addresses. We currently replace the "@" and "." characters in
e-mail addresses with "_" so they are indexed usefully. In your more general
case, I'm not sure that'd be appropriate. If you only have "keywords" in your
TextIndex, I suppose the only stop characters you'd want are whitespace, and
everything else is in.
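
A minimal sketch of that kind of e-mail substitution, in Python. This is not
our actual code: the regular expression is a deliberately loose guess at what
an address looks like, the name protect_emails is invented for the example,
and it assumes the splitter treats "_" as part of a word.

```python
import re

# Loose pattern for an e-mail address (an assumption, not RFC-complete):
# a local part, "@", then two or more dot-separated domain labels.
EMAIL_RE = re.compile(r"[A-Za-z0-9_.+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def protect_emails(text):
    """Replace "@" and "." inside e-mail addresses with "_".

    Text outside of addresses is left untouched, so an ordinary
    sentence-ending period is not affected.
    """
    def _sub(match):
        return match.group(0).replace("@", "_").replace(".", "_")

    return EMAIL_RE.sub(_sub, text)


print(protect_emails("Contact richard@bizarsoftware.com.au for details."))
```

The text would be passed through this before it reaches the splitter, so the
whole address ends up as a single indexable term. Queries would need the same
treatment for a search on the address to match.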
> I would think that the appropriate default behavior for ZopeSplitter would
> be relaxed about stripping out things.
My concern is that there's _specific_ code in there that does this stuff, and
I want to know if there'll be any negative consequences of changing its
behaviour...
Richard