PositionIndex (was: Re: [Zope-dev] ZCatalog phrase indexing revisited)

Sun, 17 Jun 2001 15:59:37 -0400

It just occurred to me that depending on the splitter to do
positions makes it impossible to alter the splitter without
reindexing the whole text index... but I think this is a
reasonable tradeoff.  Other opinions welcome.
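
To make the tradeoff concrete, here is a rough sketch using plain
dicts rather than the real index structures.  The two splitters and
index_positions() are made-up names for illustration only; they are
not part of ZCatalog or PositionIndex.  The point is just that the
stored positions are whatever the splitter's ordering says they are,
so a different splitter yields different positions for the same text:

```python
def whitespace_splitter(text):
    """Hypothetical splitter: lowercase, split on whitespace."""
    return text.lower().split()

def stopword_splitter(text, stopwords=("the", "a", "of")):
    """Hypothetical splitter that additionally drops stop words."""
    return [w for w in text.lower().split() if w not in stopwords]

def index_positions(splitter, source):
    """Map each word to the positions at which the splitter emitted it."""
    positions = {}
    for i, word in enumerate(splitter(source)):
        positions.setdefault(word, []).append(i)
    return positions

source = "The cat sat on the mat"
plain = index_positions(whitespace_splitter, source)
stopped = index_positions(stopword_splitter, source)
```

With the first splitter "cat" lands at position 1; with the second it
lands at position 0, so positions recorded under one splitter cannot
be compared with positions recorded under the other.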

On Sun, 17 Jun 2001 15:57:20 -0400
 "Chris McDonough" <chrism@digicool.com> wrote:
> On Sun, 17 Jun 2001 21:05:47 +0200 (CEST)
>  Erik Enge <erik@thingamy.net> wrote:
> > On Fri, 15 Jun 2001, Chris McDonough wrote:
> > 
> > > Once you're satisfied with the implementation, would you be
> > > willing to submit the module to the collector?
> > 
> > Do you think you (or someone else for that matter) could have a
> > look at the method that returns the position in the document,
> > positionInDoc() [1], to see how it could be made to run much
> > faster?  Maybe it is how it is used...  It is too slow to be very
> > useful when indexing large amounts of data.
> 
> Erik,
> 
> It looks like you call proximityInsert for each item returned from
> the splitter on the doc source.  Instead of looking for the position
> in the source document by splitting the source up again within
> proximityInsert, you can keep a simple counter while you iterate
> over the splitter return in index_object, because the splitter
> return has all the words in order, even the dupes... as you iterate,
> you can mutate the position entry for that word/documentId pair
> within proximityInsert.  You never actually need to manually split
> the document source, instead just always rely on the splitter to
> bust up the doc, and manipulate the position list in place.  This is
> not the most efficient way, but it's more efficient than your
> current way.
> 
> Therefore, the bit in index_object becomes:
> 
> i = 0
> for word in splitter(source):
>     self.proximityInsert(word, documentId, i)
>     i = i + 1
> 
> The proximityInsert method becomes:
> 
> def proximityInsert(self, word, documentId, i):
>     """Insert proximity information about this wid (word id)
>     in the index's proximity bucket."""
>     wid = self.getWid(word)
>     prox = self._proximity
>     if not prox.has_key(wid):
>         prox[wid] = IOBTree()
>     docs = prox[wid]
>     if not docs.has_key(documentId):
>         # first time this word is seen in this document
>         docs[documentId] = [i]
>     else:
>         positions = docs[documentId]
>         if i in positions: return
>         positions.append(i)
>         # reassign so the BTree notices the mutated list
>         docs[documentId] = positions
>     self._p_changed = 1
> 
> .. and the positionInDoc method goes away.
> 
> I didn't scan too hard for what else in the source this would break.
> 
> > Anyway, I suck at making Python fast (or using it the right way,
> > whichever I've fallen prey to this time ;-), and any hints would
> > be greatly appreciated.
> > 
> > I've been indexing and searching a lot this weekend, and apart
> > from that problem with the indexing speed it seems OK, so I have
> > no issues submitting it to the Collector.
> 
> Cool...
> 
> > 
> > [1] <URL:http://nittin.net/erik/software/PositionIndex/PositionIndex.py>
> > 
>