We'd like to be able to have numeric-only fields that are searchable using TextIndex (e.g. ISBN, telephone numbers, post codes, ...). We're just wondering what the logic is behind ZopeSplitter (and ISO_8859_1) rejecting words that consist only of numbers.
Richard
The answer is, as always, in the sources ;-) The splitting algorithm is pretty dumb. Roughly speaking, it splits the text into words but drops tokens that consist only of digits. To test the splitter, try this:
    from ZopeSplitter import ZopeSplitter
    print list(ZopeSplitter('abc 123 t353 nmj'))
gives ['abc', 't353', 'nmj']
Andreas
----- Original Message -----
From: "Richard Jones" <richard@bizarsoftware.com.au>
To: <zope@zope.org>
Sent: Monday, November 12, 2001 23:45
Subject: [Zope] Indexing: ZopeSplitter and numbers
We'd like to be able to have numeric-only fields that are searchable using TextIndex (eg. ISBN, telephone numbers, post codes, ...). We're just wondering what the logic is behind ZopeSplitter (and ISO_8859_1) rejecting words that only consist of numbers.
Richard
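For illustration only, here is a minimal sketch that reproduces the output shown above. It is not the ZopeSplitter source; the name `split_dropping_numbers` and the regular expression are assumptions chosen to mimic the observed behaviour (keep alphanumeric tokens, discard tokens made only of digits).

```python
import re

# Hypothetical stand-in for the behaviour discussed in this thread.
# This is NOT the ZopeSplitter implementation, only a regex-based sketch
# that yields the same result for the example input.
WORD = re.compile(r"[A-Za-z0-9]+")


def split_dropping_numbers(text):
    """Return alphanumeric tokens, discarding tokens made only of digits."""
    return [tok for tok in WORD.findall(text) if not tok.isdigit()]


print(split_dropping_numbers("abc 123 t353 nmj"))  # ['abc', 't353', 'nmj']
```

Note that 't353' survives because it contains a letter, while '123' is dropped, which is exactly why ISBNs or phone numbers would never reach the index.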
On Tuesday 13 November 2001 07:05 am, Andreas Jung allegedly wrote:
The answer is, as always, in the sources ;-) The splitting algorithm is pretty dumb. Roughly speaking, it splits the text into words but drops tokens that consist only of digits. To test the splitter, try this:
    from ZopeSplitter import ZopeSplitter
    print list(ZopeSplitter('abc 123 t353 nmj'))
gives ['abc', 't353', 'nmj']
Andreas
Has there been any thought in changing this behavior? I smell a fish bowl prop...
/---------------------------------------------------\
Casey Duncan, Sr. Web Developer
National Legal Aid and Defender Association
c.duncan@nlada.org
\---------------------------------------------------/
On Wednesday 14 November 2001 05:52, Casey Duncan wrote:
On Tuesday 13 November 2001 07:05 am, Andreas Jung allegedly wrote:
The answer is, as always, in the sources ;-) The splitting algorithm is pretty dumb. Roughly speaking, it splits the text into words but drops tokens that consist only of digits. To test the splitter, try this:
    from ZopeSplitter import ZopeSplitter
    print list(ZopeSplitter('abc 123 t353 nmj'))
gives ['abc', 't353', 'nmj']
Andreas
Has there been any thought in changing this behavior? I smell a fish bowl prop...
Amen to the change, but does it really require a proposal? I get the feeling this is leaning seriously towards "bug" territory.
Richard
I think this was deliberate behaviour on the part of the programmer (whoever wrote the code). I admit it is a limitation, but I would not declare it a bug. What we really need is a more open architecture in ZCatalog for things like splitters, stemmers, etc. I have some ideas in mind, but they have not yet found their way into a proposal.
Andreas
----- Original Message -----
From: "Richard Jones" <richard@bizarsoftware.com.au>
To: "Casey Duncan" <c.duncan@nlada.org>; "Andreas Jung" <andreas@andreas-jung.com>; <zope@zope.org>
Sent: Tuesday, November 13, 2001 16:38
Subject: Re: [Zope] Indexing: ZopeSplitter and numbers
On Wednesday 14 November 2001 05:52, Casey Duncan wrote:
On Tuesday 13 November 2001 07:05 am, Andreas Jung allegedly wrote:
The answer is, as always, in the sources ;-) The splitting algorithm is pretty dumb. Roughly speaking, it splits the text into words but drops tokens that consist only of digits. To test the splitter, try this:
    from ZopeSplitter import ZopeSplitter
    print list(ZopeSplitter('abc 123 t353 nmj'))
gives ['abc', 't353', 'nmj']
Andreas
Has there been any thought in changing this behavior? I smell a fish bowl prop...
Amen to the change, but does it really require a proposal? I get the feeling this is leaning seriously towards "bug" territory.
Richard
Andreas Jung wrote:
What we really need is a more open architecture in ZCatalog for things like splitters, stemmers, etc. I have some ideas in mind, but they have not yet found their way into a proposal.
Well, I have a very open-architectured ZCatalog equivalent called Indexer. I was hoping to release it once I'd made it scale to full-text searching 40,000 documents of 1-5 pages each. However, ZODB seems unable to handle this, which is holding up the release.
I actually have splitters written in Python (no, not one that works for Unicode yet...) since, if you think about it, splitting _doesn't_ need to be lightning quick: it only happens once, when the document is indexed (and maybe when parsing the query, but queries are usually very short...).
If anyone is interested in this, help me make it scale. Then I can write docs and release it...
cheers,
Chris
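As a rough illustration of the point about pure-Python splitters, the sketch below uses one splitter both when indexing documents and when parsing a query, and keeps digit-only tokens so numeric fields become searchable. It is not taken from Indexer or ZCatalog; every name here (`split_keeping_numbers`, `build_index`, `search`) and the sample documents are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical pure-Python splitter: keeps digit-only tokens such as
# ISBNs or phone numbers. Not the Indexer or ZCatalog implementation.
TOKEN = re.compile(r"[A-Za-z0-9]+")


def split_keeping_numbers(text):
    """Return lowercased alphanumeric tokens, including pure numbers."""
    return [tok.lower() for tok in TOKEN.findall(text)]


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in split_keeping_numbers(text):
            index[tok].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term of the query."""
    terms = split_keeping_numbers(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Made-up sample data for the sketch.
docs = {1: "ISBN 1234567890", 2: "phone 5551234 postcode 3000"}
index = build_index(docs)
print(search(index, "5551234"))  # {2}
```

The splitter runs once per document at indexing time and once per (short) query, so a plain regular expression is likely fast enough for this purpose. Values containing hyphens or spaces would be split into several tokens by this sketch.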
On Tuesday 13 November 2001 08:01 pm, Chris Withers allegedly wrote:
Andreas Jung wrote:
What we really need is a more open architecture in ZCatalog for things like splitters, stemmers, etc. I have some ideas in mind, but they have not yet found their way into a proposal.
Well, I have a very open-architectured ZCatalog equivalent called Indexer. I was hoping to release it once I'd made it scale to full-text searching 40,000 documents of 1-5 pages each. However, ZODB seems unable to handle this, which is holding up the release.
I actually have splitters written in Python (no, not one that works for Unicode yet...) since, if you think about it, splitting _doesn't_ need to be lightning quick: it only happens once, when the document is indexed (and maybe when parsing the query, but queries are usually very short...).
If anyone is interested in this, help me make it scale. Then I can write docs and release it...
cheers,
Chris
Hell Yes. Release early, release often!!
/---------------------------------------------------\
Casey Duncan, Sr. Web Developer
National Legal Aid and Defender Association
c.duncan@nlada.org
\---------------------------------------------------/
On Tuesday 13 November 2001 23:05, Andreas Jung wrote:
The answer is, as always, in the sources ;-) The splitting algorithm is pretty dumb. Roughly speaking, it splits the text into words but drops tokens that consist only of digits. To test the splitter, try this:
    from ZopeSplitter import ZopeSplitter
    print list(ZopeSplitter('abc 123 t353 nmj'))
gives ['abc', 't353', 'nmj']
Yes, I realise - my question was more "what's the reasoning behind it" ... which appears to be "it's dumb" and therefore, can we please fix it?
Richard