ZCatalog phrase indexing revisited
Hi all, I'm asking this again, in case anything has changed since the last time. I'm trying to search for a phrase in ZCatalog, but it doesn't seem to work. Let's say I want to search for "foo bar": if I ask ZCatalog to find it, it treats it as if I wrote "foo OR bar". Last time I asked this, SteveA kindly replied:

You should use quotes to group the phrase. ['"foo bar"'] Take a look in the source code: lib/python/SearchIndex/UnTextIndex.py Follow the flow of code through from line 550: def query( ...

However, this doesn't seem to work. You can try it even on the zope.org website using their search interface. Using the above suggested syntax (incl. the brackets) returns an error message. Can anyone help me? Thx. oren.
Quoted phrase searching doesn't really work. Well, it does, but not the way you'd expect. Zope versions below 2.3.1b2 (all the way down to 2.2.1 AFAICT) used to choke and error out on quoted-phrase searching. But quoted-phrase searching in Zopes past 2.3.1b2 is essentially turned into an "AND" query instead of erroring out. So if you do "foo bar", it's roughly equivalent to "foo AND bar".

This of course isn't what most people expect, but the machinery for NEAR searching in the catalog was used by the quoting operator (e.g. "foo bar" would become "foo NEAR bar"). Unfortunately, the NEAR searching machinery *never* worked (and still doesn't), so we had to turn the NEAR into an AND to get a reasonable, if misleading, result. Hopefully we can graft on real NEAR searching in the future. For now, I think "foo AND bar" is about as close as you're going to get to phrase searching without post-filtering results.

----- Original Message ----- From: "Oren Yosifon" <oren@mindcitetech.com> To: <zope-dev@zope.org> Sent: Thursday, March 29, 2001 9:35 AM Subject: [Zope-dev] ZCatalog phrase indexing revisited
_______________________________________________ Zope-Dev maillist - Zope-Dev@zope.org http://lists.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://lists.zope.org/mailman/listinfo/zope-announce http://lists.zope.org/mailman/listinfo/zope )
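The post-filtering mentioned above can be sketched roughly as follows. This is only an illustration in plain Python, not catalog code: the `candidates` list stands in for whatever a "foo AND bar" query returns, and `tokenize`, `contains_phrase` and `post_filter` are hypothetical helper names. The tokenizer is an assumed, simplistic stand-in for the real splitter.

```python
import re


def tokenize(text):
    """Lowercase and split on runs of non-word characters (assumed splitter)."""
    return [w for w in re.split(r"\W+", text.lower()) if w]


def contains_phrase(text, phrase):
    """Return True if the words of `phrase` occur consecutively in `text`."""
    words = tokenize(text)
    target = tokenize(phrase)
    n = len(target)
    if n == 0:
        return False
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


def post_filter(candidates, phrase):
    """Keep the ids of (doc_id, text) pairs whose text contains the phrase."""
    return [doc_id for doc_id, text in candidates if contains_phrase(text, phrase)]


# Hypothetical result of a "foo AND bar" query: (id, text) pairs.
candidates = [
    (1, "a foo bar in the wild"),
    (2, "bar comes before foo here"),
    (3, "Foo, bar and baz"),
]
print(post_filter(candidates, "foo bar"))  # [1, 3]
```

Document 2 matches the AND query but is dropped by the filter, which is the difference between "foo AND bar" and a real phrase search.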
On Thu, 29 Mar 2001, Chris McDonough wrote:
Hopefully we can graft on real NEAR searching in the future. For now, I think "foo AND bar" is about as close as you're going to get to phrase searching without post-filtering results.
Me got a patch: <URL:http://nittin.net/erik/software/PossitionIndex>. You'll have to do some importing in Products/ZCatalog/Catalog.py to make things work (and modify parse() and parse2() to work with AdjoinedBy). It isn't tested much and should really be put through a Fishbowl project first, but I've got no time for that, unfortunately. If someone else would like to do that, they are welcome to :) To be really useful I think the PossitionIndex' _proximity dictionary needs to be turned into a BTree of some sort, but apart from that I don't know what is missing. It was hard to get my head around SearchIndex and write a new index all in the same day, so there might be some "design-errors" in PossitionIndex/ResultList.py, although I didn't add much to it. And speed might be a problem; I haven't really tested that yet. Will during the weekend though.
On Thu, 14 Jun 2001, Erik Enge wrote:
Me got a patch: <URL:http://nittin.net/erik/software/PossitionIndex>.
And I should mention that it has only been tested on Zope 2.3.2. (BTW, thanks, Chris, for suggesting how to code it.)
Excellent! I haven't looked at it in detail, but thanks very much for contributing it! Maybe we can roll some of this work into a position-aware Text Index, or maybe even a new kind of Pluggable Index. - C ----- Original Message ----- From: "Erik Enge" <erik@thingamy.net> To: "Chris McDonough" <chrism@digicool.com> Cc: "Oren Yosifon" <oren@mindcitetech.com>; <zope-dev@zope.org> Sent: Thursday, June 14, 2001 12:45 PM Subject: Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
On Thu, 14 Jun 2001, Chris McDonough wrote:
Excellent! I haven't looked at it in detail, but thanks very much for contributing it! Maybe we can roll some of this work into a position-aware Text Index
It is actually a TextIndex on steroids. Remove the _proximity attribute and a couple of methods and what you are left with is a standard TextIndex. So I think what you already have is a position-aware TextIndex. That's how I'm planning to use it anyway :)
or maybe even a new kind of Pluggable Index.
:-)
On Thu, 14 Jun 2001, Erik Enge wrote:
To be really useful I think the PossitionIndex' _proximity dictionary needs to be turned into a BTree of some sort, but apart from that I don't know what is missing.
It's now using BTrees. And I renamed it to PositionIndex (thanks to Chris Withers for this :-).
And speed might be a problem, haven't really tested that yet. Will during the weekend though.
I indexed 30,000 objects using PositionIndex and searching (both exact-phrase and near) is very fast. It doesn't seem to be bloated, either (the _proximity attribute, that is). Do you guys have a test suite for indexes? Maybe some tests I can apply to this index of mine?
There is a *small* testsuite for testing TextIndex in Products/PluginIndexes/tests/testTextIndex.py Andreas ----- Original Message ----- From: "Erik Enge" <erik@thingamy.net> To: "Chris McDonough" <chrism@digicool.com> Cc: <zope-dev@zope.org> Sent: Friday, June 15, 2001 11:53 AM Subject: Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexing revisited)
Erik, Once you're satisfied with the implementation, would you be willing to submit the module to the collector? - C ----- Original Message ----- From: "Erik Enge" <erik@thingamy.net> To: "Chris McDonough" <chrism@digicool.com> Cc: <zope-dev@zope.org> Sent: Friday, June 15, 2001 11:53 AM Subject: Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
On Fri, 15 Jun 2001, Chris McDonough wrote:
Once you're satisfied with the implementation, would you be willing to submit the module to the collector?
Will do. Have you thought about how users actually are to use exact-phrase? What I'm thinking I will do here (currently I've only been testing explicitly with "adjoinedby" in the query) is to insert "adjoinedby" in phrased searches:

    "erik enge"   -> erik adjoinedby enge
    erik ... enge -> erik near enge

What do you think? I'll be submitting PositionIndex.py and ResultList.py in a day or two.
Erik Enge wrote:
On Fri, 15 Jun 2001, Chris McDonough wrote:
Once you're satisfied with the implementation, would you be willing to submit the module to the collector?
Will do. Have you thought about how users actually are to use exact-phrase? What I'm thinking I will do here (currently I've only been testing explicitly with "adjoinedby" in the query) is to insert "adjoinedby" in phrased searches:

    "erik enge"   -> erik adjoinedby enge
    erik ... enge -> erik near enge
What do you think?
These both look like good spellings, and I think "erik near enge" would be a good alias for "erik ... enge" as well. - C
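The two spellings discussed above could be mapped onto explicit operators with a small query rewrite. The sketch below is an assumption about how that might look, not the PositionIndex parser: `rewrite_query` is a hypothetical helper, and chaining "adjoinedby" between every word of a longer phrase is a guess at the intended behaviour.

```python
import re


def rewrite_query(query):
    """Rewrite quoted phrases and '...' into explicit operators (hypothetical)."""

    def phrase_to_adjoinedby(match):
        # "a b c" becomes: a adjoinedby b adjoinedby c (assumed chaining)
        return " adjoinedby ".join(match.group(1).split())

    query = re.sub(r'"([^"]*)"', phrase_to_adjoinedby, query)
    # a ... b becomes: a near b
    query = re.sub(r"\s*\.\.\.\s*", " near ", query)
    return query.strip()


print(rewrite_query('"erik enge"'))    # erik adjoinedby enge
print(rewrite_query("erik ... enge"))  # erik near enge
```

Treating "erik near enge" as an alias then costs nothing, since that query already passes through unchanged.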
On Fri, 15 Jun 2001, Chris McDonough wrote:
Once you're satisfied with the implementation, would you be willing to submit the module to the collector?
Do you think you (or someone else, for that matter) could have a look at [1], at the method that returns the position in the document - positionInDoc() - to see how it could be made to run much faster? Maybe it is how it is used... It is too slow to be very useful when indexing large amounts of data. Anyway, I suck at making Python fast (or using it the right way, whichever I've fallen prey to this time ;-), and any hints would be greatly appreciated. I've been indexing and searching a lot this weekend, and bar that problem with the indexing speed it seems OK, and I have no issues submitting it to the Collector. [1] <URL:http://nittin.net/erik/software/PositionIndex/PositionIndex.py>
On Sun, 17 Jun 2001 21:05:47 +0200 (CEST) Erik Enge <erik@thingamy.net> wrote:
On Fri, 15 Jun 2001, Chris McDonough wrote:
Once you're satisfied with the implementation, would you be willing to submit the module to the collector?
Do you think you (or someone else, for that matter) could have a look at [1], at the method that returns the position in the document - positionInDoc() - to see how it could be made to run much faster? Maybe it is how it is used... It is too slow to be very useful when indexing large amounts of data.
Erik,

It looks like you call proximityInsert for each item returned from the splitter on the doc source. Instead of looking for the position in the source document by splitting the source up again within proximityInsert, you can keep a simple counter while you iterate over the splitter return in index_object, because the splitter return has all the words in order, even the dupes... as you iterate, you can mutate the position entry for that word/documentId pair within proximityInsert. You never actually need to manually split the document source; instead just always rely on the splitter to bust up the doc, and manipulate the position list in place. This is not the most efficient way, but it's more efficient than your current way.

Therefore, the bit in index_object becomes:

    i = 0
    for word in splitter(source):
        self.proximityInsert(word, documentId, i)
        i = i + 1

The proximityInsert method becomes:

    def proximityInsert(self, word, documentId, i):
        """Insert proximity information about this wid (word id)
        in the index' proximity bucket."""
        wid = self.getWid(word)
        prox = self._proximity
        if not prox.has_key(wid):
            prox[wid] = IOBTree()
        docs = prox[wid]
        if not docs.has_key(documentId):
            # first occurrence of this word in this document; also
            # covers a word already seen in other documents only
            docs[documentId] = [i]
        elif i in docs[documentId]:
            return
        else:
            docs[documentId].append(i)
        self._p_changed = 1

.. and the positionInDoc method goes away. I didn't scan too hard for what else in the source this would break.
Anyway, I suck at making Python fast (or using it the right way, whichever I've fallen prey to this time ;-), and any hints would be greatly appreciated.
I've been indexing and searching a lot this weekend, and bar that problem with the indexing-speed it seems ok and I have no issues submitting it to the Collector.
Cool...
[1] <URL:http://nittin.net/erik/software/PositionIndex/PositionIndex.py>
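To make the counter idea concrete, here is a minimal, self-contained sketch in plain Python. It is not the PositionIndex code: ordinary dicts stand in for the BTrees, `str.split` stands in for the splitter, and `build_positions`, `adjoined_by` and `near` are hypothetical names. The meaning given to "near" (within `distance` words, in either order) is an assumption, since the thread does not define it.

```python
from collections import defaultdict


def build_positions(docs, splitter=str.split):
    """Map word -> {doc_id: [positions]} using a running counter per document."""
    positions = defaultdict(dict)
    for doc_id, source in docs.items():
        # The splitter returns all words in order, duplicates included,
        # so the enumerate counter is the word position.
        for i, word in enumerate(splitter(source)):
            positions[word].setdefault(doc_id, []).append(i)
    return positions


def adjoined_by(positions, left, right):
    """Doc ids where `right` occurs immediately after `left`."""
    hits = []
    for doc_id, left_pos in positions.get(left, {}).items():
        right_pos = set(positions.get(right, {}).get(doc_id, []))
        if any(p + 1 in right_pos for p in left_pos):
            hits.append(doc_id)
    return sorted(hits)


def near(positions, left, right, distance=5):
    """Doc ids where the two words lie within `distance` words (assumed meaning)."""
    hits = []
    for doc_id, left_pos in positions.get(left, {}).items():
        right_pos = positions.get(right, {}).get(doc_id, [])
        if any(abs(p - q) <= distance for p in left_pos for q in right_pos):
            hits.append(doc_id)
    return sorted(hits)


docs = {
    1: "erik enge wrote a patch",
    2: "enge thanked erik",
    3: "erik and chris and enge",
}
pos = build_positions(docs)
print(adjoined_by(pos, "erik", "enge"))       # [1]
print(near(pos, "erik", "enge", distance=2))  # [1, 2]
```

Because positions come straight from the splitter's output order, no second pass over the document source is needed at indexing time.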
It just occurred to me that depending on the splitter to do positions makes it impossible to alter the splitter without reindexing the whole text index... but I think this is a reasonable tradeoff. Other opinions welcome. On Sun, 17 Jun 2001 15:57:20 -0400 "Chris McDonough" <chrism@digicool.com> wrote:
Chris McDonough wrote:
It just occurred to me that depending on the splitter to do positions makes it impossible to alter the splitter without reindexing the whole text index... but I think this is a reasonable tradeoff. Other opinions welcome.
This raises the question of how dependent the splitter is on the particularities of the document source - I do not really see how different splitters could be useful for one single document. This is perhaps less obvious than it appears, as you may want to use different splitters for documents in different languages. Taken as a whole, I would say choosing a splitter is a decision that has to be taken at indexing time anyway. But perhaps it's just my imagination that is lacking. There is a much greater dependence on the lexicon here. And indeed several different lexicons could be applied to a set of documents, depending on what is wanted. my 2 cents Rik
Rik Hoekstra writes:
This raises the question of how dependent the splitter is on the particularities of the document source - I do not really see how different splitters could be useful for one single document. This is perhaps less obvious than it appears, as you may want to use different splitters for documents in different languages. Taken as a whole, I would say choosing a splitter is a decision that has to be taken at indexing time anyway. But perhaps it's just my imagination that is lacking.

There are lots of things you may want to change based on experience with your index:

* change the set of token boundary characters
  They define where words are broken out.

* change the set of removed characters
  They are removed from the words, usually for normalization. In German, e.g., you can write both "Auto-Lackierer" and "Autolackierer". You want to normalize these different spellings.

* change the set of "composing" characters
  German is very rich in composite terms. You may want to index under each component term. For this, you need the rules on how the composition is built. For text, it is usually '-'. But if you have computer sources, '_' or ':' may be relevant, too.

Of course, the search must follow the same splitting rules as the indexing did. Changing the rules (the splitter or its configuration) after indexing will make the index inconsistent.

Dieter
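The three knobs listed above can be illustrated with a toy splitter. This is a sketch under stated assumptions, not the Zope Splitter: `split_terms` and its `boundaries`, `removed` and `composing` parameters are hypothetical, and lowercasing every token is an extra normalization added for the example.

```python
def split_terms(text, boundaries=" \t\n.,;!?", removed="", composing=""):
    """Return index terms for `text`.

    boundaries: characters at which words are broken out
    removed:    characters deleted from words (normalization)
    composing:  characters joining composite terms; each component
                is indexed in addition to the whole term
    """
    table = str.maketrans({c: " " for c in boundaries})
    terms = []
    for token in text.lower().translate(table).split():
        if composing:
            # Break the token at every composing character.
            parts = [token]
            for c in composing:
                parts = [p for part in parts for p in part.split(c)]
            components = [p for p in parts if p]
            if len(components) > 1:
                terms.extend(components)
        # Normalize the whole term by deleting the removed characters.
        normalized = "".join(ch for ch in token if ch not in removed)
        if normalized:
            terms.append(normalized)
    return terms


# Both spellings normalize to the same term.
print(split_terms("Auto-Lackierer", removed="-"))  # ['autolackierer']
print(split_terms("Autolackierer", removed="-"))   # ['autolackierer']
# Composite terms are also indexed under each component.
print(split_terms("Auto-Lackierer", removed="-", composing="-"))
```

The same function and the same arguments have to be used for queries as for indexing; changing any argument after indexing is what makes the index inconsistent.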
These are good ideas to improve the TextIndex. I already encouraged Erik to put it all together into a Fishbowl proposal. Andreas ----- Original Message ----- From: "Dieter Maurer" <dieter@handshake.de> To: "Rik Hoekstra" <rik.hoekstra@inghist.nl> Cc: "Chris McDonough" <chrism@digicool.com>; "Erik Enge" <erik@thingamy.net>; <zope-dev@zope.org> Sent: Monday, June 18, 2001 4:59 PM Subject: Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
On Mon, 18 Jun 2001, Andreas Jung wrote:
These are good ideas to improve the TextIndex. I already encouraged Erik to put it all together into a Fishbowl proposal.
Which I would do, if I had time. Which I will have, but not for another two weeks. :-)
On Mon, 18 Jun 2001, Andreas Jung wrote:
These are good ideas to improve the TextIndex. I already encouraged Erik to put it all together into a Fishbowl proposal.
Which I would do, if I had time. Which I will have, but not for another two weeks. :-)
I'm guessing this is the point at which your problems become mine? ;-) *grinz* Chris
On Tue, 19 Jun 2001, Chris Withers wrote:
I'm guessing this is the point at which your problems become mine? ;-)
*evil laughter* Yes :-) We should write about it and publish it to the community...
Rik Hoekstra writes:
This raises the question of how dependent the splitter is on the particularities of the document source - I do not really see how different splitters could be useful for one single document. This is perhaps less obvious than it appears, as you may want to use different splitters for documents in different languages. Taken as a whole, I would say choosing a splitter is a decision that has to be taken at indexing time anyway. But perhaps it's just my imagination that is
Of course, the search must follow the same splitting rules as the indexing did. Changing the rules (the splitter or its configuration) after indexing will make the index inconsistent.
I agree; in fact I think we're saying the same. What is more interesting, is how (less than when) you decide to use which splitter. With heterogeneous documents I'd think it would be difficult to decide automagically... Rik
On Sun, 17 Jun 2001, Chris McDonough wrote:
index_object, because the splitter return has all the words in order, even the dupes... as you iterate, you can mutate
Is this part of the current formal Splitter Interface? If not, it needs to be if other code is going to depend on it. Oh, yeah, and where is the formal Splitter interface documented <grin>? I don't see anything in SearchIndex, and a search for "splitter interface" on zope.org didn't turn up anything useful. --RDM
The Splitter interface is not really documented. However, Zope 2.4 has much better support for 3rd party splitters. Andreas ----- Original Message ----- From: "R. David Murray " <bitz@bitdance.com> To: "Chris McDonough" <chrism@digicool.com> Cc: "Erik Enge" <erik@thingamy.net>; <zope-dev@zope.org> Sent: Monday, June 18, 2001 11:39 AM Subject: Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
Once you're satisfied with the implementation, would you be willing to submit the module to the collector?
Do you think you (or someone else, for that matter) could have a look at [1], at the method that returns the position in the document - positionInDoc() - to see how it could be made to run much faster? Maybe it is how it is used... It is too slow to be very useful when indexing large amounts of data.
Anyway, I suck at making Python fast (or using it the right way, whichever I've fallen prey to this time ;-), and any hints would be greatly appreciated.
I've been indexing and searching a lot this weekend, and bar that problem with the indexing-speed it seems ok and I have no issues submitting it to the Collector.
Doing something similar (in fact what I needed was citations of word usage), I took a two-step approach, with the idea that most of the actual returning of results would have to be done on a much smaller subset of documents than if you had to index all documents with word indexes and positions. I use a normal TextIndex for querying. Then, if a document is returned by the query, I start processing it. This requires parsing the query in a slightly different way (throw out the NOTs). The two-step approach has the advantage that you can postpone processing actual documents until you return the results for the specific documents.

Using your positionInDoc will require a _lot_ of processing (why does it use string.split, btw, and not Splitter? Why split on " " and not on string.whitespace?). I have used string.find for finding word positions, which is probably faster than looping over a list of words. BTW, I'd rather use Splitter, but word positions appeared not to be reliable (a bug, or something I didn't understand; anyhow, string.find works for me and is fast):

    import string

    def splitit(txt, word):
        # return the start offset of every occurrence of word in txt
        positions = []
        start = 0
        while 1:
            res = string.find(txt, word, start)
            if res == -1:
                break
            else:
                start = res+1
                positions.append(res)
        return positions

<sidenote>Using re would perhaps also be an option, but allowing regular expressions will complicate searching a lot, so I use the globbing lexicon for expanding and then do the matching on the expanded items (if necessary - not if using [wordpart]*)</sidenote>

Advantages of using this approach:
- it's faster.
- it splits up the query processing into different subparts, which also contributes to speeding things up.
- it's also more flexible, as you can divide searching and parsing over different web requests, and even make them dependent on the number of results. For example: why return text fragments from all documents if your users will not be able to see all the results anyway? Or why return all fragments containing word combinations from one single document when returning a few occurrences from different documents is more useful for your users? Note that this will mainly affect returning text fragments, which may or may not be useful.

There are also a couple of disadvantages (as I see them, but there may be more):
- it only works with exact character positions of words and not with word numbers in a text. The within-two-words approach may be remedied by using string.split on substrings, however, if really needed. Depending on your purposes, an even rougher approach is to assume some default length for words (this is a bit faster). These are not very elegant solutions, though.
- because the approach is not so coupled with (Z)Catalog, integration strategies are less obvious (at least for me).
- the PositionIndex might be used for further processing as is; in my approach this is less obvious.

another 2 cents Rik
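The two-step approach described above might be sketched like this in present-day Python. Everything here is illustrative: the `index` dict stands in for the ordinary text index query, `documents` for the stored sources, and `find_positions`, `fragments` and `search_with_fragments` are hypothetical names. Note that `str.find` matches substrings, so "cat" would also be found inside "catalog".

```python
def find_positions(text, word):
    """Return the character offset of every occurrence of `word` in `text`."""
    positions = []
    start = 0
    while True:
        found = text.find(word, start)
        if found == -1:
            break
        positions.append(found)
        start = found + 1
    return positions


def fragments(text, word, width=20, limit=2):
    """Return up to `limit` snippets of `text` around occurrences of `word`."""
    snippets = []
    for pos in find_positions(text, word)[:limit]:
        begin = max(0, pos - width)
        end = min(len(text), pos + len(word) + width)
        snippets.append(text[begin:end])
    return snippets


def search_with_fragments(index, documents, word, max_docs=10):
    """Step 1: cheap index lookup. Step 2: scan only the documents that hit."""
    hits = sorted(index.get(word, ()))[:max_docs]
    return {doc_id: fragments(documents[doc_id], word) for doc_id in hits}


documents = {
    1: "the catalog indexes text; the catalog is fast",
    2: "nothing here",
}
index = {"catalog": {1}}  # stand-in for a normal text index
result = search_with_fragments(index, documents, "catalog")
print(find_positions(documents[1], "catalog"))  # [4, 30]
print(result)
```

Capping both the number of documents and the number of fragments per document keeps the second step cheap, which is the flexibility argued for above.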
participants (9)
- Andreas Jung
- Andreas Jung
- Chris McDonough
- Chris Withers
- Dieter Maurer
- Erik Enge
- Oren Yosifon
- R. David Murray
- Rik Hoekstra