Re: PositionIndex (was: Re: [Zope-dev] ZCatalog phrase indexing revisited)

17 Jun 2001


      It just occurred to me that depending on the splitter to do
positions makes it impossible to alter the splitter without
reindexing the whole text index... but I think this is a
reasonable tradeoff.  Other opinions welcome.
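
To make the tradeoff concrete, here is a rough sketch in plain Python.
Both splitters below are hypothetical stand-ins (not the real ZCatalog
splitter), and the stop-word list is made up for illustration.  It only
shows that the same text yields different stored positions once the
splitter's behaviour changes, which is why positions recorded under the
old splitter would no longer line up.

```python
def plain_splitter(text):
    # Hypothetical splitter: lowercase and split on whitespace.
    return text.lower().split()

STOP_WORDS = {"the", "a", "of"}  # illustrative only

def stopword_splitter(text):
    # Hypothetical variant that additionally drops stop words.
    return [w for w in plain_splitter(text) if w not in STOP_WORDS]

def positions(splitter, text):
    # Map each word to the list of positions the splitter assigns it.
    result = {}
    for i, word in enumerate(splitter(text)):
        result.setdefault(word, []).append(i)
    return result

text = "The index of the catalog"
old = positions(plain_splitter, text)     # positions under the old splitter
new = positions(stopword_splitter, text)  # positions under the new splitter
```

Under the first splitter "catalog" sits at position 4; under the second
it sits at position 1, so anything indexed earlier would have to be
reindexed to stay consistent.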

On Sun, 17 Jun 2001 15:57:20 -0400
 "Chris McDonough" <chrism@digicool.com> wrote:
...
On Sun, 17 Jun 2001 21:05:47 +0200 (CEST)
 Erik Enge <erik@thingamy.net> wrote:
...
On Fri, 15 Jun 2001, Chris McDonough wrote:
...
Once you're satisfied with the implementation, would you be willing to
submit the module to the collector?
Do you think you (or someone else for that matter) could have a look at
[1], the method that returns the position in the document
(positionInDoc()), to see how it could be made to run much faster?  Maybe
it is how it is used...  It is too slow to be very useful when indexing
large amounts of data.
Erik,
It looks like you call proximityInsert for each item returned from the
splitter on the doc source.  Instead of looking for the position in the
source document by splitting the source up again within proximityInsert,
you can keep a simple counter while you iterate over the splitter return
in index_object, because the splitter return has all the words in order,
even the dupes... as you iterate, you can mutate the position entry for
that word/documentId pair within proximityInsert.  You never actually
need to manually split the document source; instead, just always rely on
the splitter to bust up the doc, and manipulate the position list in
place.  This is not the most efficient way, but it's more efficient than
your current way.
Therefore, the bit in index_object becomes:
i = 0
for word in splitter(source):
    self.proximityInsert(word, documentId, i)
    i = i + 1
The proximityInsert method becomes:
def proximityInsert(self, word, documentId, i):
    """Insert proximity information about this wid (word id) in
    the index's proximity bucket."""
    wid=self.getWid(word)
    prox=self._proximity
    if not prox.has_key(wid):
        prox[wid]=IOBTree()
    docs=prox[wid]
    if not docs.has_key(documentId):
        # first time this word shows up in this document
        docs[documentId]=[i]
    elif i in docs[documentId]:
        return
    else:
        docs[documentId].append(i)
    self._p_changed = 1
.. and the positionInDoc method goes away.
I didn't scan too hard for what else in the source this would break.
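
For anyone who wants to try the counter idea outside of Zope, here is a
minimal standalone sketch.  It is not the PositionIndex code: the class
name, the whitespace splitter, and the use of plain dicts in place of
the BTrees are all assumptions made for illustration, and it keys on
the word itself rather than on a wid.

```python
class PositionIndexSketch:
    """Toy stand-in for the index; plain dicts replace the BTrees."""

    def __init__(self, splitter):
        self._splitter = splitter
        self._proximity = {}  # word -> {documentId: [positions]}

    def index_object(self, documentId, source):
        # One pass over the splitter output; the loop counter is the
        # word's position, so the source is never split a second time.
        for i, word in enumerate(self._splitter(source)):
            self.proximityInsert(word, documentId, i)

    def proximityInsert(self, word, documentId, i):
        # Create the per-word and per-document entries on first use,
        # then record the position unless it is already there.
        docs = self._proximity.setdefault(word, {})
        positions = docs.setdefault(documentId, [])
        if i not in positions:
            positions.append(i)


def whitespace_splitter(text):
    # Hypothetical splitter: lowercase and split on whitespace.
    return text.lower().split()


index = PositionIndexSketch(whitespace_splitter)
index.index_object(1, "to be or not to be")
index.index_object(2, "not to worry")
```

After those two calls, "to" is recorded at positions 0 and 4 in document
1 and at position 1 in document 2, with the dupes kept in order.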
...
Anyway, I suck at making Python fast (or using it the right way,
whichever I've fallen prey to this time ;-), and any hints would be
greatly appreciated.
I've been indexing and searching a lot this weekend, and bar that problem
with the indexing speed it seems ok and I have no issues submitting it to
the Collector.
Cool...
...
[1] <URL:http://nittin.net/erik/software/PositionIndex/PositionIndex.py>