PositionIndex (was: Re: [Zope-dev] ZCatalog phrase indexing revisited)
Chris McDonough
chrism@digicool.com
Sun, 17 Jun 2001 15:59:37 -0400
It just occurred to me that depending on the splitter to do
positions makes it impossible to alter the splitter without
reindexing the whole text index... but I think this is a
reasonable tradeoff. Other opinions welcome.
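
To make the tradeoff concrete, here is a rough sketch (not the ZCatalog
splitter; split_all, split_no_stop_words, STOP_WORDS and positions are
made-up names for illustration) of how the stored positions shift as soon
as the splitter starts dropping words:

```python
import re

# Hypothetical stand-ins for two splitter configurations; the real
# ZCatalog splitter is not used here.
STOP_WORDS = {"the", "a", "of"}

def split_all(text):
    """Return every lowercased word, in document order."""
    return re.findall(r"\w+", text.lower())

def split_no_stop_words(text):
    """Same as split_all, but drop stop words before positions are counted."""
    return [w for w in split_all(text) if w not in STOP_WORDS]

def positions(splitter, text):
    """Map each word to the ordinal positions the given splitter assigns it."""
    result = {}
    for i, word in enumerate(splitter(text)):
        result.setdefault(word, []).append(i)
    return result

source = "the cat sat on the mat"
old = positions(split_all, source)            # positions as first indexed
new = positions(split_no_stop_words, source)  # positions after the change
```

Positions recorded under the first splitter no longer line up with what
the second one would produce for the same document, so the two can't be
mixed in one index without reindexing.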
On Sun, 17 Jun 2001 15:57:20 -0400
"Chris McDonough" <chrism@digicool.com> wrote:
> On Sun, 17 Jun 2001 21:05:47 +0200 (CEST)
> Erik Enge <erik@thingamy.net> wrote:
> > On Fri, 15 Jun 2001, Chris McDonough wrote:
> >
> > > Once you're satisfied with the implementation, would you be willing
> > > to submit the module to the collector?
> >
> > Do you think you (or someone else for that matter) could have a look at
> > [1] the method that returns the position in the document
> > - positionInDoc() - to see how that could be made to run much faster?
> > Maybe it is how it is used... It is too slow to be very useful when
> > indexing large amounts of data.
>
> Erik,
>
> It looks like you call proximityInsert for each item returned from
> the splitter on the doc source. Instead of looking for the position
> in the source document by splitting the source up again within
> proximityInsert, you can keep a simple counter while you iterate over
> the splitter return in index_object, because the splitter return has
> all the words in order, even the dupes... As you iterate, you can
> mutate the position entry for that word/documentId pair within
> proximityInsert. You never actually need to manually split the
> document source; instead just always rely on the splitter to bust up
> the doc, and manipulate the position list in place. This is not the
> most efficient way, but it's more efficient than your current way.
>
> Therefore, the bit in index_object becomes:
>
> i = 0
> for word in splitter(source):
>     self.proximityInsert(word, documentId, i)
>     i = i + 1
>
> The proximityInsert method becomes:
>
> def proximityInsert(self, word, documentId, i):
>     """Insert proximity information about this wid (word id)
>     in the index's proximity bucket."""
>     wid = self.getWid(word)
>     prox = self._proximity
>     if not prox.has_key(wid):
>         prox[wid] = IOBTree()
>     docs = prox[wid]
>     if not docs.has_key(documentId):
>         # known word, new document: start a fresh position list
>         # (indexing docs[documentId] here would raise KeyError)
>         docs[documentId] = [i]
>     else:
>         if i in docs[documentId]: return
>         docs[documentId].append(i)
>     self._p_changed = 1
>
> ... and the positionInDoc method goes away.
>
> I didn't scan too hard for what else in the source this would break.
>
> > Anyway, I suck at making Python fast (or using it the right way,
> > whichever I've fallen prey to this time ;-), and any hints would be
> > greatly appreciated.
> >
> > I've been indexing and searching a lot this weekend, and bar that
> > problem with the indexing speed it seems OK and I have no issues
> > submitting it to the Collector.
>
> Cool...
>
> >
> > [1] <URL:http://nittin.net/erik/software/PositionIndex/PositionIndex.py>
> >
>
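
For anyone who wants to try the counter approach quoted above outside of
Zope, a minimal standalone sketch follows. It is not the PositionIndex
module: ProximitySketch is a made-up class, plain dicts stand in for the
IOBTree buckets, str.split stands in for the real splitter, and words are
used as keys directly instead of going through getWid.

```python
class ProximitySketch:
    """Toy stand-in for the index; plain dicts replace the IOBTree buckets."""

    def __init__(self):
        # word -> {documentId -> [positions]}
        self._proximity = {}

    def proximityInsert(self, word, documentId, i):
        """Record that `word` occurs at position `i` in `documentId`."""
        docs = self._proximity.setdefault(word, {})
        positions = docs.setdefault(documentId, [])
        if i not in positions:
            positions.append(i)

    def index_object(self, documentId, source):
        """Split the source once and number the words as they come out."""
        # str.split is an assumption standing in for the real splitter.
        for i, word in enumerate(source.split()):
            self.proximityInsert(word, documentId, i)


index = ProximitySketch()
index.index_object(1, "to be or not to be")
index.index_object(2, "be quick")
```

The document source is split exactly once per index_object call, and a
word that is already known from one document gets a fresh position list
the first time it shows up in another.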