I believe it means within 8 words in the current implementation...
So, "word1 NEAR wordlongerthan8characters" wouldn't come up with anything? Or is it the number of characters in between?
8 words, not characters. But actually I just looked at the source and it's not even that. NEAR is treated essentially as an AND query, because the UnTextIndex code doesn't store any proximity information between words. (I knew this once, but I had forgotten.) So, to re-answer the question: it's not possible with the current text index implementation (please forget the 8-word assertion, I was totally wrong).

The UnTextIndex code would need to be modified to store proximity information about words in the documents that are indexed. This could be done by keeping a data structure called a proximity mapping inside the text index, which is basically:

    {wordid: {docid: [position, position, ...]}}

When a document is indexed, for example "The band Tool rocks rocks", the words would be run through the splitter, and the result would (likely) be ['band', 'tool', 'rocks', 'rocks']. The document (which is given an id) would then be stored normally in the text index, and in the course of that each word would be entered into the Vocabulary (lexicon) and assigned a word id. For this example, let's say the words ['band', 'tool', 'rocks'] turned into [42, 45, 78] and that the document id assigned to the indexed document "The band Tool rocks rocks" is 15. As part of the indexing process, the following would be added to the proximity mapping:

    {42: {15: [0]}}
    {45: {15: [1]}}
    {78: {15: [2, 3]}}

(...with 0, 1, 2, and 3 being the word positions within document id 15.)

Then later, when a phrase is looked up, for example "tool rocks rocks", the Catalog would treat this as (perhaps) "tool ADJOINEDBY rocks ADJOINEDBY rocks". Then perhaps the following happens:

1. The query searches the normal text index data structures for document ids containing word ids 45 and 78 (it asks the vocabulary to resolve the word ids). I forget exactly how this works currently, but it works right now for AND queries, obviously.

2. The query notices that there is an ADJOINEDBY in the query and subsequently asks the proximity mapping to return the document mappings for word ids 45 and 78 (the words "tool" and "rocks").

3. Each of these mappings is asked for the list of positions related to the document id it got in step 1 (which in our case is 15 only).

4. For each of the returned lists:

    [1]
    [2, 3]

   ... the code would (beginning at the end of the ADJOINEDBY component of the query) compare the positions of the last word pair, starting at "rocks ADJOINEDBY rocks". It might be code like this:

    def adjoinedby(poslist1, poslist2, difference=1):
        # in our example, poslist1 = [2, 3]
        #                 poslist2 = [2, 3]
        for x in poslist2:
            if (x - difference) in poslist1:
                return 1  # will return true
        return 0  # no two positions are exactly `difference` apart

   ... assuming we get a match, we move on to the prior word pair in the ADJOINEDBY chain, "tool ADJOINEDBY rocks", and so on.

If all the comparisons return true for the ADJOINEDBY chain, we return document id 15 to the Catalog (the query matched document id 15). We can also do NEAR searching this way, by increasing the "difference" (and accepting any gap up to it, rather than only an exact one).
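To make the outline above a bit more concrete, here is a rough, self-contained sketch in present-day Python. It is not UnTextIndex code: build_proximity, phrase_docs, and max_gap are names made up for illustration, the lexicon is a plain dict standing in for the Vocabulary, and documents are assumed to have been run through the splitter already. One deliberate difference from the pairwise check: comparing each word pair independently can match a document where "tool rocks" and "rocks rocks" both occur but in different places, so this version carries the matched positions forward through the chain. NEAR is simplified here to mean "the second word follows the first within max_gap positions".

```python
from collections import defaultdict


def build_proximity(docs, lexicon):
    """Build {wordid: {docid: [position, ...]}} from already-split documents.

    docs    -- {docid: [word, word, ...]} (hypothetical input shape)
    lexicon -- {word: wordid}, a stand-in for the Vocabulary
    """
    proximity = defaultdict(dict)
    for docid, words in docs.items():
        for position, word in enumerate(words):
            proximity[lexicon[word]].setdefault(docid, []).append(position)
    return proximity


def phrase_docs(proximity, wordids, max_gap=1):
    """Return sorted docids in which wordids occur in order, each word
    at most max_gap positions after the previous one.

    max_gap=1 behaves like an ADJOINEDBY chain; a larger value gives a
    simple ordered NEAR.
    """
    if not wordids:
        return []

    # Step 1: plain AND, keep only documents that contain every word.
    candidates = set(proximity.get(wordids[0], {}))
    for wid in wordids[1:]:
        candidates &= set(proximity.get(wid, {}))

    matches = []
    for docid in sorted(candidates):
        # Positions at which the part of the phrase matched so far can end.
        ends = set(proximity[wordids[0]][docid])
        for wid in wordids[1:]:
            positions = proximity[wid][docid]
            ends = {p for p in positions
                    if any(0 < p - e <= max_gap for e in ends)}
            if not ends:
                break
        if ends:
            matches.append(docid)
    return matches


# The example from this thread, plus a second document (id 16) that
# contains the same words but not the phrase "tool rocks rocks".
lexicon = {'band': 42, 'tool': 45, 'rocks': 78}
docs = {
    15: ['band', 'tool', 'rocks', 'rocks'],
    16: ['tool', 'rocks', 'band', 'rocks', 'rocks'],
}
proximity = build_proximity(docs, lexicon)

print(phrase_docs(proximity, [45, 78, 78]))         # [15]
print(phrase_docs(proximity, [42, 78], max_gap=8))  # [15, 16]
```

Document 16 is the case the independent pairwise comparison would get wrong: "tool" at position 0 is followed by "rocks" at 1, and "rocks" at 3 is followed by "rocks" at 4, but the three words never appear in a row.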
Hey, I'll be glad to help. If you could show me the ropes I'll hack away like, uhm, a hacker!
Well, I know the above is a little hairy, but it's an outline. The first thing to do is to write a fishbowl proposal explaining why it should be done and perhaps how it can be done... - C