Indexing: ZopeSplitter and numbers

Richard Jones

13 Nov 2001

4:45 a.m.

We'd like to be able to have numeric-only fields that are searchable using TextIndex (e.g. ISBN, telephone numbers, post codes, ...). We're just wondering what the logic is behind ZopeSplitter (and ISO_8859_1) rejecting words that consist only of numbers.

Richard


Andreas Jung

13 Nov

12:05 p.m.

New subject: [Zope] Indexing: ZopeSplitter and numbers

The answer is - as always - in the sources ;-) The splitting algorithm is pretty dumb. Roughly speaking, it splits the text into words but not into numbers. To test the splitter try this:

from ZopeSplitter import ZopeSplitter
print list(ZopeSplitter('abc 123 t353 nmj'))

gives ['abc', 't353', 'nmj']

Andreas

----- Original Message -----
From: "Richard Jones" <richard@bizarsoftware.com.au>
To: <zope@zope.org>
Sent: Monday, November 12, 2001 23:45
Subject: [Zope] Indexing: ZopeSplitter and numbers

...

We'd like to be able to have numeric-only fields that are searchable using TextIndex (eg. ISBN, telephone numbers, post codes, ...). We're just wondering what the logic is behind ZopeSplitter (and ISO_8859_1) rejecting words that only consist of numbers.

Richard
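
The behaviour Andreas describes can be imitated in a few lines of Python. This is only a minimal sketch and not the actual ZopeSplitter source: the function name split_dropping_numbers and the regular expression are assumptions, chosen to reproduce the output quoted above for that one input.

```python
import re

def split_dropping_numbers(text):
    """Hypothetical stand-in for the reported behaviour, NOT the real ZopeSplitter."""
    # Assumption: a "word" is any run of ASCII letters and digits.
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    # Tokens made up only of digits are discarded, as in the quoted output.
    return [w for w in words if not w.isdigit()]

print(split_dropping_numbers('abc 123 t353 nmj'))
```

With this stand-in, 'abc 123 t353 nmj' yields ['abc', 't353', 'nmj']: the mixed token 't353' survives, while the purely numeric '123' is lost, which is why an ISBN or a post code stored as digits would never reach the index.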


Casey Duncan

6:52 p.m.

New subject: [Zope] Indexing: ZopeSplitter and numbers

On Tuesday 13 November 2001 07:05 am, Andreas Jung allegedly wrote:

...

The answer is - as always - in the sources ;-) The splitting algorithm is pretty dumb. Roughly speaking, it splits the text into words but not into numbers. To test the splitter try this:

from ZopeSplitter import ZopeSplitter
print list(ZopeSplitter('abc 123 t353 nmj'))

gives ['abc', 't353', 'nmj']

Andreas

Has there been any thought in changing this behavior? I smell a fish bowl prop...

/---------------------------------------------------\
Casey Duncan, Sr. Web Developer
National Legal Aid and Defender Association
c.duncan@nlada.org
\---------------------------------------------------/

Richard Jones

9:38 p.m.

New subject: [Zope] Indexing: ZopeSplitter and numbers

On Wednesday 14 November 2001 05:52, Casey Duncan wrote:

...

On Tuesday 13 November 2001 07:05 am, Andreas Jung allegedly wrote:

...
The answer is - as always - in the sources ;-) The splitting algorithm is pretty dumb. Roughly speaking, it splits the text into words but not into numbers. To test the splitter try this:

from ZopeSplitter import ZopeSplitter
print list(ZopeSplitter('abc 123 t353 nmj'))

gives ['abc', 't353', 'nmj']

Andreas

Has there been any thought in changing this behavior? I smell a fish bowl prop...

Amen to the change, but does it really require a proposal? I get the feeling this is leaning seriously towards "bug" territory.

Richard

Andreas Jung

10:37 p.m.

New subject: [Zope] Indexing: ZopeSplitter and numbers

I think this is deliberate behaviour on the part of the programmer (whoever wrote the code). I admit it is a limitation, but I would not declare it a bug. What we really need is a more open architecture in the ZCatalog for things like splitters, stemmers etc. I have some ideas in mind but they have not yet found their place in a proposal.

Andreas

----- Original Message -----
From: "Richard Jones" <richard@bizarsoftware.com.au>
To: "Casey Duncan" <c.duncan@nlada.org>; "Andreas Jung" <andreas@andreas-jung.com>; <zope@zope.org>
Sent: Tuesday, November 13, 2001 16:38
Subject: Re: [Zope] Indexing: ZopeSplitter and numbers

...

On Wednesday 14 November 2001 05:52, Casey Duncan wrote:

...
On Tuesday 13 November 2001 07:05 am, Andreas Jung allegedly wrote:

...
The answer is - as always - in the sources ;-) The splitting algorithm is pretty dumb. Roughly speaking, it splits the text into words but not into numbers. To test the splitter try this:

from ZopeSplitter import ZopeSplitter
print list(ZopeSplitter('abc 123 t353 nmj'))

gives ['abc', 't353', 'nmj']

Andreas

Has there been any thought in changing this behavior? I smell a fish bowl prop...

Amen to the change, but does it really require a proposal? I get the feeling this is leaning seriously towards "bug" territory.

Richard

Chris Withers

14 Nov

1:01 a.m.

New subject: [Zope] ZCatalog alternative

Andreas Jung wrote:

...

What we really need is a more open architecture in the ZCatalog for things like splitters, stemmers etc. I have some ideas in mind but they have not yet found their place in a proposal.

Well, I have a very open-architectured ZCatalog equivalent called Indexer. I was hoping to release it once I'd made it scale to full-text searching 40,000 1-5 page documents. However, ZODB seems not to be able to handle this and is holding up the release.

I actually have splitters written in Python (no, not one that works for Unicode yet...) since, if you think about it, splitting _doesn't_ need to be lightning quick: it only happens once when the document is indexed (and maybe when parsing the query, but queries are usually very short...)

If anyone is interested in this, help me make it scale. Then I can write docs and release it...

cheers,

Chris
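
As a rough illustration of the kind of splitter that could be written in pure Python and would keep numeric-only tokens, here is a minimal sketch. It is not Chris's Indexer code, which does not appear in this thread: split_keeping_numbers, its min_length parameter and the ASCII-only token pattern are all hypothetical.

```python
import re

# Assumption: a token is a run of lowercase ASCII letters and digits.
_TOKEN = re.compile(r"[a-z0-9]+")

def split_keeping_numbers(text, min_length=1):
    """Hypothetical pure-Python splitter that does not discard numbers."""
    tokens = _TOKEN.findall(text.lower())
    # Only length is filtered on; digit-only tokens are kept.
    return [t for t in tokens if len(t) >= min_length]

print(split_keeping_numbers('abc 123 t353 nmj'))
```

On the same test string this returns ['abc', '123', 't353', 'nmj']. Note that punctuation still breaks tokens, so a hyphenated ISBN or a phone number written with spaces would be indexed as several short numeric tokens rather than one; whether that is acceptable depends on how the field is queried.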

Casey Duncan

9:16 p.m.

New subject: [Zope] ZCatalog alternative

On Tuesday 13 November 2001 08:01 pm, Chris Withers allegedly wrote:

...

Andreas Jung wrote:

...
What we really need is a more open architecture in the ZCatalog for things like splitters, stemmers etc. I have some ideas in mind but they have not yet found their place in a proposal.

Well, I have a very open-architectured ZCatalog equivalent called Indexer. I was hoping to release it once I'd made it scale to full-text searching 40,000 1-5 page documents. However, ZODB seems not to be able to handle this and is holding up the release.

I actually have splitters written in Python (no, not one that works for Unicode yet...) since, if you think about it, splitting _doesn't_ need to be lightning quick: it only happens once when the document is indexed (and maybe when parsing the query, but queries are usually very short...)

If anyone is interested in this, help me make it scale. Then I can write docs and release it...

cheers,

Chris

Hell Yes. Release early, release often!!

/---------------------------------------------------\
Casey Duncan, Sr. Web Developer
National Legal Aid and Defender Association
c.duncan@nlada.org
\---------------------------------------------------/

Richard Jones

13 Nov

9:32 p.m.

New subject: [Zope] Indexing: ZopeSplitter and numbers

On Tuesday 13 November 2001 23:05, Andreas Jung wrote:

...

The answer is - as always - in the sources ;-) The splitting algorithm is pretty dumb. Roughly speaking, it splits the text into words but not into numbers. To test the splitter try this:

from ZopeSplitter import ZopeSplitter
print list(ZopeSplitter('abc 123 t353 nmj'))

gives ['abc', 't353', 'nmj']

Yes, I realise - my question was more "what's the reasoning behind it" ... which appears to be "it's dumb" and therefore, can we please fix it?

Richard
