It just occurred to me that depending on the splitter to do positions makes it impossible to alter the splitter without reindexing the whole text index... but I think this is a reasonable tradeoff. Other opinions welcome.

On Sun, 17 Jun 2001 15:57:20 -0400 "Chris McDonough" <chrism@digicool.com> wrote:
On Sun, 17 Jun 2001 21:05:47 +0200 (CEST) Erik Enge <erik@thingamy.net> wrote:
On Fri, 15 Jun 2001, Chris McDonough wrote:
Once you're satisfied with the implementation, would you be willing to submit the module to the collector?
Do you think you (or someone else for that matter) could have a look at the method that returns the position in the document, positionInDoc() [1], to see how it could be made to run much faster? Maybe it is how it used... It is too slow to be very useful when indexing large amounts of data.
Erik,
It looks like you call proximityInsert for each item returned from the splitter on the doc source. Instead of looking for the position in the source document by splitting the source up again within proximityInsert, you can keep a simple counter while you iterate over the splitter return in index_object, because the splitter return has all the words in order, even the dupes... as you iterate, you can mutate the position entry for that word/documentId pair within proximityInsert. You never actually need to manually split the document source, instead just always rely on the splitter to bust up the doc, and manipulate the position list in place. This is not the most efficient way, but it's more efficient than your current way.
Therefore, the bit in index_object becomes:
    i = 0
    for word in splitter(source):
        self.proximityInsert(word, documentId, i)
        i = i + 1
The proximityInsert method becomes:
    def proximityInsert(self, word, documentId, i):
        """Insert proximity information about this wid (word id)
        in the index's proximity bucket."""
        wid = self.getWid(word)
        prox = self._proximity
        if not prox.has_key(wid):
            prox[wid] = IOBTree()
        docs = prox[wid]
        if not docs.has_key(documentId):
            # First time this word is seen in this document; without
            # this check a known word in a new document raises KeyError.
            docs[documentId] = [i]
        else:
            if i in docs[documentId]:
                return
            docs[documentId].append(i)
        self._p_changed = 1
.. and the positionInDoc method goes away.
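
For anyone who wants to try the idea outside Zope, here is a rough standalone sketch of the counter approach. It is not the PositionIndex code: simple_splitter and PositionIndexSketch are made-up names, plain dicts stand in for the BTrees, and the word itself is used as the key instead of a wid.

```python
import re


def simple_splitter(source):
    # Hypothetical stand-in for the real splitter: returns lowercased
    # words in document order, duplicates included.
    return re.findall(r"\w+", source.lower())


class PositionIndexSketch:
    def __init__(self):
        # word -> {documentId -> [positions]}; plain dicts are an
        # assumption made only to keep the sketch self-contained.
        self._proximity = {}

    def index_object(self, documentId, source):
        # The splitter output is already in order, so the loop counter
        # is the word position; the source is never split a second time.
        for i, word in enumerate(simple_splitter(source)):
            self.proximityInsert(word, documentId, i)

    def proximityInsert(self, word, documentId, i):
        docs = self._proximity.setdefault(word, {})
        positions = docs.setdefault(documentId, [])
        if i not in positions:
            positions.append(i)


index = PositionIndexSketch()
index.index_object(1, "the quick fox jumps over the lazy dog")
print(index._proximity["the"][1])  # [0, 5]
```

Each document is split exactly once here, so indexing cost grows with the number of words rather than with repeated scans of the source.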
I didn't scan too hard for what else in the source this would break.
Anyway, I suck at making Python fast (or using it the right way, whichever I've fallen prey to this time ;-), and any hints would be greatly appreciated.
I've been indexing and searching a lot this weekend, and bar that problem with the indexing-speed it seems ok and I have no issues submitting it to the Collector.
Cool...
[1] <URL:http://nittin.net/erik/software/PositionIndex/PositionIndex.py>