Matt Hamilton wrote:
I would like to help if I had time :) I think the most efficient way of doing what you want is to construct an index based on a 'suffix trie'. This essentially allows very fast matching of arbitrary substrings; the only problem is that it takes up a fair amount of space. The upside is that it can be updated and incrementally added to quite easily (unlike many inverted list implementations).
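To make the suffix trie idea concrete, here is a minimal sketch in Python. It is not Matt's code and not a pluggable index for any particular release; the names `Node`, `SuffixTrie`, `add` and `search` are made up for illustration. It inserts every suffix of every document, which is exactly why this kind of index is fast to query, easy to add to, and hungry for space.

```python
class Node:
    """One trie node: child links plus the ids of documents passing through."""
    __slots__ = ("children", "doc_ids")

    def __init__(self):
        self.children = {}
        self.doc_ids = set()


class SuffixTrie:
    """Naive suffix trie (hypothetical names, illustration only)."""

    def __init__(self):
        self.root = Node()

    def add(self, doc_id, text):
        # Insert every suffix of the text, so that any substring of the
        # text is a path starting at the root. Cost is O(len(text)^2).
        for start in range(len(text)):
            node = self.root
            for ch in text[start:]:
                child = node.children.get(ch)
                if child is None:
                    child = Node()
                    node.children[ch] = child
                child.doc_ids.add(doc_id)
                node = child

    def search(self, substring):
        # Walk one node per character; lookup time depends on the length
        # of the query, not on the number of documents indexed.
        node = self.root
        for ch in substring:
            node = node.children.get(ch)
            if node is None:
                return set()
        return set(node.doc_ids)


trie = SuffixTrie()
trie.add(1, "to be or not to be")
trie.add(2, "zope index")
print(trie.search("not to"))
print(trie.search("ndex"))
```

Documents can be added at any time without rebuilding anything, which is the incremental-update property mentioned above. A real implementation would most likely use a compacted suffix tree or a suffix array to keep the space under control.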
I confess I have not had the chance to look at the pluggable index types in 2.4, but I would really like to, as I want to port over some indexing code I was working on for another project that allows compressed storage of inverted lists for indexes. On average you can store a 32-bit document id/ref in around 4 bits, which means you save a lot of space and can afford to keep stopwords in the lexicon (as an example, try searching for 'to be or not to be' in an index that removes stopwords :). Not only do you save space, but because of the way the inverted list is read and decompressed, you also save time on disk access for large indexes, as there is less to physically read.
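The message does not say which compression scheme that code uses. One common way to get a few bits per document id is to store the gaps between sorted ids rather than the ids themselves, and to write each gap with a variable-length code such as Elias gamma. The sketch below assumes that scheme; the function names are hypothetical, and it uses strings of '0' and '1' for readability where a real index would pack bits into bytes.

```python
def gamma_encode(n):
    """Elias gamma code of an integer n >= 1, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("gamma codes are defined for n >= 1")
    binary = bin(n)[2:]
    # (len - 1) zeros announce how many bits follow the leading 1.
    return "0" * (len(binary) - 1) + binary


def encode_postings(doc_ids):
    """Encode strictly increasing, positive doc ids as gamma-coded gaps."""
    bits = []
    previous = 0
    for doc_id in doc_ids:
        bits.append(gamma_encode(doc_id - previous))
        previous = doc_id
    return "".join(bits)


def decode_postings(bits):
    """Inverse of encode_postings: read gaps and sum them back into ids."""
    doc_ids = []
    pos = 0
    previous = 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":
            zeros += 1
            pos += 1
        gap = int(bits[pos:pos + zeros + 1], 2)
        pos += zeros + 1
        previous += gap
        doc_ids.append(previous)
    return doc_ids


ids = [3, 4, 7, 8, 9, 12]
bits = encode_postings(ids)
print(bits, len(bits) / len(ids), "bits per id")
assert decode_postings(bits) == ids
```

Frequent terms, stopwords included, appear in many documents, so their gaps are small and their codes short. That is why a compressed index can afford to keep them, and why there is less to read from disk for long lists.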
Wow Matt, you seem to know what you're talking about :-) If you get a chance to implement the index I asked about, please gimme a shout, I'd love to try it out...

cheers,
Chris

PS: Whereabouts in the UK are you?