RE: [Zope] zcatalog -- returning context of hits on fulltext


Jean Jordaan

14 Aug 2000

9:54 a.m.

Hi Geir

...

make a pythonmethod that returns the first 200 letters or something of the text,

I've already got a pretty structured-text "Abstract" field that tells about the document, but I'd like to *see* the sentence on page 67 or wherever in a document where my term matches, so I know whether it's mentioned in passing or really important .. -- jean


Chris Withers

14 Aug 2000

12:04 p.m.

New subject: [Zope] zcatalog -- returning context of hits on fulltext

Jean Jordaan wrote:

...

I've already got a pretty structured-text "Abstract" field that tells about the document, but I'd like to *see* the sentence on page 67 or wherever in a document where my term matches, so I know whether it's mentioned in passing or really important ..

erk... that's a little harder :S

I don't know if Catalog can do it, but at the very least you'd need a reference to your object to search the whole text, which means you lose the 'cool' metadata feature of not sucking up a lot of resources for search results.

cheers,

Chris

Toby Dickenson

12:39 p.m.


On Mon, 14 Aug 2000 13:04:49 +0100, Chris Withers <chrisw@nipltd.com> wrote:

...

Jean Jordaan wrote:

...
I've already got a pretty structured-text "Abstract" field that tells about the document, but I'd like to *see* the sentence on page 67 or wherever in a document where my term matches, so I know whether it's mentioned in passing or really important ..

erk... that's a little harder :S

I don't know if Catalog can do it, but at the very least you'd need a reference to your object to search the whole text, which means you lose the 'cool' metadata feature of not sucking up a lot of resources for search results.

If you really do have a 67 page document, it would be better to store each page in its own ZODB object, and index each page individually. With that scheme your search results page only has to load a few pages, rather than a few documents.

Toby Dickenson

tdickenson@geminidataloggers.com
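A minimal sketch of the page-per-object idea, in plain Python rather than real ZODB or ZCatalog calls. The names `index_pages` and `search` are hypothetical, and the in-memory dictionary only stands in for a catalog; the point is that a hit resolves to a (document, page) pair, so only that page needs loading.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")


def index_pages(doc_id, pages, index=None):
    """Add each page of a document to a word -> {(doc_id, page_no)} map.

    This is a stand-in for cataloguing one object per page; it is not
    the ZCatalog API.
    """
    if index is None:
        index = defaultdict(set)
    for page_no, text in enumerate(pages, start=1):
        for word in WORD.findall(text.lower()):
            index[word].add((doc_id, page_no))
    return index


def search(index, term):
    """Return sorted (doc_id, page_no) pairs whose page contains the term."""
    return sorted(index.get(term.lower(), ()))


pages = [
    "Introduction to loggers.",
    "Calibration of the sensor.",
    "Sensor drift over time.",
]
index = index_pages("manual", pages)
hits = search(index, "sensor")  # pages 2 and 3 of "manual"
```

With results keyed by page, a results template would only have to fetch the matching pages to show context, not the whole document.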

Jimmie Houchin

2:34 p.m.


Hello,

I may be clueless and out of my league here and I haven't read the sources so I don't know... Well enough of a disclaimer. :)

Is there anything in there which can provide the seek or byte position of the hit within text object? If so, it shouldn't be too difficult to read X bytes before and after the position and thereby provide what you're looking for.

This would be nice to have out of the box.

Just a thought.

Jimmie Houchin

Jean Jordaan wrote:

...

Hi Geir

...
make a pythonmethod that returns the first 200 letters or something of the text,

I've already got a pretty structured-text "Abstract" field that tells about the document, but I'd like to *see* the sentence on page 67 or wherever in a document where my term matches, so I know whether it's mentioned in passing or really important ..

-- jean

_______________________________________________ Zope maillist - Zope@zope.org http://lists.zope.org/mailman/listinfo/zope ** No cross posts or HTML encoding! ** (Related lists - http://lists.zope.org/mailman/listinfo/zope-announce http://lists.zope.org/mailman/listinfo/zope-dev )
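The "read X bytes before and after the position" suggestion above can be sketched roughly as follows. This assumes the full text is already in hand as a Python string and finds the positions itself with a regular expression; it does not use any position data from the catalog. The function name `context_snippets` and the `width` parameter are made up for illustration.

```python
import re


def context_snippets(text, term, width=40):
    """Return one snippet per case-insensitive occurrence of `term`.

    Each snippet holds up to `width` characters on either side of the
    match, clipped at the start and end of the text.
    """
    if not term:
        return []
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        lo = max(0, match.start() - width)
        hi = min(len(text), match.end() + width)
        snippets.append(text[lo:hi])
    return snippets


text = "The logger stores data. Calibrate the Logger weekly."
snippets = context_snippets(text, "logger", width=5)
```

A fixed character window can cut words in half; widening the window to the nearest whitespace or sentence boundary would be a natural refinement.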


R. David Murray

15 Aug 2000

3:10 a.m.


On Mon, 14 Aug 2000, Jimmie Houchin wrote:

...

I may be clueless and out of my league here and I haven't read the sources so I don't know... Well enough of a disclaimer. :)

I *have* read the ZCatalog/SearchIndex sources, but I don't understand this part of it yet (or really that much of it at all!). I think we're getting into zope-dev territory here...

...

Is there anything in there which can provide the seek or byte position of the hit within text object? If so, it shouldn't be too difficult to read X bytes before and after the position and thereby provide what you're looking for.

The standard TextIndex implementation records a notion of "position" for each occurrence of each word indexed. I *think* this position is a word count position, but I'm not sure. Part of the code references a 'row', but it isn't at all clear that that has any relationship to a source record. If it is a word count, the other thing you'd need to check would be whether it is a word count before or after splitter activity. I think it's the latter, which makes things more complicated. Or just means you have to use more fuzz in your context <grin>.
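To illustrate why a post-splitter word count complicates things, here is a small sketch. It assumes a splitter that drops stop words; the stop list and both function names are invented for the example and say nothing about what the real TextIndex splitter does.

```python
import re

# Hypothetical stop list; the real splitter's behaviour is not verified here.
STOP_WORDS = {"the", "a", "of", "in", "is"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def raw_positions(text):
    """Map each word to its positions counted over every word in the text."""
    positions = {}
    for pos, word in enumerate(tokenize(text)):
        positions.setdefault(word, []).append(pos)
    return positions


def split_positions(text):
    """Map each word to its positions counted after stop words are dropped."""
    kept = [w for w in tokenize(text) if w not in STOP_WORDS]
    positions = {}
    for pos, word in enumerate(kept):
        positions.setdefault(word, []).append(pos)
    return positions


text = "The drift of the sensor is large"
before = raw_positions(text)["sensor"]   # counted in the original text
after = split_positions(text)["sensor"]  # counted after the splitter
```

In this example the same word sits at a different offset in each scheme, so a context window computed from the post-splitter count would land too early in the original text unless it is widened.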

...

This would be nice to have out of the box.

The TextIndex 'position' information is intended to be used for the 'near' operator (...) (so you can search on multiple words "close" to each other for some definition of close). You could also use it to enforce word order (Maybe the "" operator does that?). Currently I think the result of applying the near operator is used to adjust the "weight" of the index match, which affects the order of the results returned. (I haven't tested to see if any of this works!)

So, the basic information you are looking for is there in some sense to establish the position, but you'd still have to retrieve the original sentences from the object itself, or from a full-text metadata field. Both of these are going to be memory intensive operations.

If you index based on, say, individual lines, you'd lose some of the benefits of the near operator, though. So I'd say indexing based on paragraphs would probably be your best approach. This would also help mask position errors introduced if the word count is indeed post-splitter.

Of course, you'll have to descend to python to get access to the methods that will return the actual position information. But at least the code to record it is already there. Take a look at lib/python/SearchIndex/TextIndex.py for source enlightenment.

--RDM
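A rough sketch of the paragraph-based approach described above, written as standalone Python and not against the TextIndex code. It assumes paragraphs are separated by blank lines, treats "near" as "within `max_gap` words inside the same paragraph", and returns the whole paragraph as the context. All names here are hypothetical.

```python
import re


def paragraphs(text):
    """Split text into non-empty paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def word_positions(paragraph):
    """Map each lower-cased word to its word-count positions in the paragraph."""
    positions = {}
    for pos, word in enumerate(re.findall(r"[a-z0-9]+", paragraph.lower())):
        positions.setdefault(word, []).append(pos)
    return positions


def near_hits(text, term_a, term_b, max_gap=5):
    """Return paragraphs where both terms occur within `max_gap` words."""
    hits = []
    for para in paragraphs(text):
        positions = word_positions(para)
        a = positions.get(term_a.lower(), [])
        b = positions.get(term_b.lower(), [])
        if any(abs(i - j) <= max_gap for i in a for j in b):
            hits.append(para)
    return hits


doc = (
    "Sensor drift is a known problem.\n\n"
    "The sensor is blue. Nothing else here about long term drift."
)
close = near_hits(doc, "sensor", "drift", max_gap=2)
loose = near_hits(doc, "sensor", "drift", max_gap=10)
```

Because the unit returned is a whole paragraph, small errors in the recorded word position matter less: the hit only has to fall in the right paragraph for the displayed context to be useful.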
