Once you're satisfied with the implementation, would you be willing to submit the module to the Collector?
Do you think you (or someone else, for that matter) could have a look at [1] the method that returns the position in the document - positionInDoc() - to see how it could be made to run much faster? Maybe it is just how it is used... It is too slow to be very useful when indexing large amounts of data.
Anyway, I suck at making Python fast (or at using it the right way, whichever I've fallen prey to this time ;-), and any hints would be greatly appreciated.
I've been indexing and searching a lot this weekend, and apart from that problem with the indexing speed it seems OK, and I have no issues with submitting it to the Collector.
Doing something similar (in fact what I needed was citations of word usage), I took a two-step approach, with the idea that most of the actual returning of results would have to be done on a much smaller subset of documents than if you had to index all documents with word indexes and positions. I use a normal text index for querying. Then, if a document is returned by the query, I start processing that document. This requires parsing the query in a slightly different way (throw out the NOTs). The two-step approach has the advantage that you can postpone processing the actual documents until you return the results for those specific documents.

Using your positionInDoc will require a _lot_ of processing (why does it use string.split, by the way, and not Splitter? And why split on " " and not on string.whitespace?). I have used string.find for finding word positions, which is probably faster than looping over a list of words. BTW, I'd rather use Splitter, but its word positions appeared not to be reliable (a bug, or something I didn't understand; anyhow, string.find works for me and is fast):

    import string

    def splitit(txt, word):
        positions = []
        start = 0
        while 1:
            res = string.find(txt, word, start)
            if res == -1:  # note: "res is -1" is an identity test and unreliable
                break
            positions.append(res)
            start = res + 1
        return positions

(Side note: using re would perhaps also be an option, but allowing regular expressions will complicate searching a lot, so I use a globbing lexicon for expanding and then do the matching on the expanded items, if necessary - not if using [wordpart]*.)

Advantages of this approach:
- It's faster.
- It splits the query processing into different subparts, which also contributes to speeding things up.
- It's also more flexible, as you can divide searching and parsing over different web requests, and even make them dependent on the number of results. For example: why return text fragments from all documents if your users will not be able to see all the results anyway?
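One caveat with a plain string.find loop is that it matches substrings, so "cat" is also found inside "catalog". As a sketch of the re option mentioned in the side note (the function name here is mine, not part of the module), re.finditer with \b word boundaries avoids those false positives:

```python
import re

def word_positions(txt, word):
    # Match only at word boundaries, so that e.g. "cat" is not
    # found inside "catalog" the way a plain substring search would.
    pattern = re.compile(r"\b%s\b" % re.escape(word))
    return [m.start() for m in pattern.finditer(txt)]
```

Whether a compiled pattern is fast enough for indexing large texts would need measuring, but it is usually competitive with a string.find loop.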
Or why return all fragments containing word combinations from one single document, when returning a few occurrences from different documents is more useful for your users? Note that this will mainly affect returning text fragments, which may or may not be useful.

There are also a couple of disadvantages (as I see them, but there may be more):
- It only works with exact character positions, not with word numbers in a text. The "within two words" approach may be remedied by using string.split on substrings, however, if really needed. Depending on your purposes, an even rougher approach is to assume some default length for words (this is a bit faster). These are not very elegant solutions, though.
- Because this approach is not so tightly coupled with the (Z)Catalog, integration strategies are less obvious (at least for me).
- A position index might be used for further processing as is; in my approach this is less obvious.

another 2 cents

Rik
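To illustrate the rough "within two words" remedy mentioned above, here is a minimal sketch (the function name and signature are hypothetical, not from the module) that derives word numbers with a plain split and compares them:

```python
def within_n_words(txt, word_a, word_b, n=2):
    # Split on whitespace to get word numbers (not character positions),
    # then check whether any occurrence of word_a falls within n words
    # of an occurrence of word_b.
    words = txt.split()
    pos_a = [i for i, w in enumerate(words) if w == word_a]
    pos_b = [i for i, w in enumerate(words) if w == word_b]
    return any(abs(a - b) <= n for a in pos_a for b in pos_b)
```

This splits the whole text, so for large documents you would only want to run it on the candidate substrings around already-found positions, as suggested above.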