Rik Hoekstra wrote:
Partial searching is available in the CVS.
Michel,
Is that just partial searching, or also wildcard and even regexp searching?
I define 'partial' and 'wildcard' as the same thing, but I'm using my own terminology so I could be wrong. I define partial as 'finding part or all of a word', which can be accomplished with wildcards: '*part*'. The CVS supports '*' and '?' wildcard characters (the actual character used is configurable, in case you really want to keep those question marks). This involved creating a new kind of Lexicon called a GlobbingLexicon ('globbing' is the use of * or ? to match patterns, for those who didn't know...).

The GlobbingLexicon is quite flexible and nice; the only disadvantage over a regular Lexicon (which does no partial searching) is that it consumes about three times as much memory, since each word is split into bi-grams and indexed in a mini 'lexical index'. It's really quite simple. Take the words 'flexible' and 'fleece'. When these words are added to a GlobbingLexicon, they are turned into:

  flexible -> ['$f', 'fl', 'le', 'ex', 'xi', 'ib', 'bl', 'le', 'e$']
  fleece   -> ['$f', 'fl', 'le', 'ee', 'ec', 'ce', 'e$']

('$' indicates the beginning or end of a word.) Each bi-gram is indexed against the words that that bi-gram occurs in:

  '$f' -> ['flexible', 'fleece']
  'fl' -> ['flexible', 'fleece']
  'le' -> ['flexible', 'fleece']
  'ex' -> ['flexible']
  ...
  'e$' -> ['flexible', 'fleece']

When you search for 'fle*', the Lexicon's query engine turns your query into:

  'fle*' -> ['$f', 'fl', 'le']

and then looks in the lexical index for words that contain those three bi-grams. It is possible for the word 'falafle' (no doubt wrongly spelled) to contain those three bi-grams; possible false matches like that are weeded out at the end. This is efficient, because at that point we have discarded all but a few words in the lexicon.

Regular expressions are not feasible in any searching system.
Although it may be possible, with the existing lexical analysis that globbing lexicons do, to implement a larger subset of regexp than just * and ?, it is not feasible to implement the entire regexp language.
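To make the idea concrete, here is a toy sketch of the bi-gram scheme described above. It is not Zope's actual GlobbingLexicon code; the class and method names are made up for illustration, and it uses Python's fnmatch module for the final weed-out step.

```python
import fnmatch
from collections import defaultdict

def bigrams(word):
    """Split a word into bi-grams, using '$' to mark word boundaries."""
    padded = '$' + word + '$'
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

class GlobbingLexiconSketch:
    """Toy model of the bi-gram index behind a globbing lexicon."""

    def __init__(self):
        self._index = defaultdict(set)   # bi-gram -> set of words

    def add_word(self, word):
        for bg in bigrams(word):
            self._index[bg].add(word)

    def query(self, pattern):
        # Keep only the literal part of the pattern, and keep the '$'
        # anchors only on the ends that are not wildcarded.
        literal = pattern.replace('*', '').replace('?', '')
        if not pattern.startswith(('*', '?')):
            literal = '$' + literal
        if not pattern.endswith(('*', '?')):
            literal = literal + '$'
        grams = [literal[i:i + 2] for i in range(len(literal) - 1)]
        # Intersect the posting sets for each bi-gram...
        candidates = None
        for bg in grams:
            words = self._index.get(bg, set())
            candidates = words if candidates is None else candidates & words
        # ...then weed out false matches (words like 'falafle' that
        # merely contain the bi-grams) with a final check.
        return {w for w in (candidates or set())
                if fnmatch.fnmatchcase(w, pattern)}
```

For 'fle*' this produces the bi-grams ['$f', 'fl', 'le'], pulls in 'flexible', 'fleece', and the false match 'falafle' from the index, and the final fnmatch pass discards 'falafle'.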
And since you keep locations of the words, is there proximity searching also possible?
The location in the document is not kept, just the score. There are TextIndex methods, however, for finding the positions of words in a document; this is used to support the 'Near' operator, which is '...'. This operator exists in TextIndexes now (it always has, since I took over the indexing realm). I tested it a few months ago but couldn't get the concept to work. I suspect it's buggy; the code is a holdover from ZTables.
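The general idea behind a proximity operator built on word positions can be sketched like this. This is illustrative only, not the Near-operator code discussed above; the function name and the distance parameter are made up.

```python
def near(positions_a, positions_b, max_distance=5):
    """Given the positions of two terms within one document (the kind
    of data the TextIndex position methods can report), return True if
    any pair of occurrences is within max_distance words of each other."""
    return any(abs(a - b) <= max_distance
               for a in positions_a
               for b in positions_b)

# e.g. one term at positions [2, 40], the other at [4]:
near([2, 40], [4])
```

A real implementation would walk the two sorted position lists in a single pass rather than comparing every pair, but the principle is the same.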
Another question: how do I retrieve a list of unique words from a full-text catalog?
In 2.1, you need to hack the lexicon from Python. In 2.2, you call a Vocabulary object's 'words' method, or you can call the Vocabulary with a pattern: '*' to match all words, or a more restrictive pattern if you only want the unique words that match it, like '*ing' for all the words that end in 'ing'.
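Sketched in Python, the interface described above would be used something like this. VocabularySketch is a made-up stand-in, not the actual 2.2 Vocabulary class; only the 'words' method and the call-with-pattern behavior come from the discussion.

```python
import fnmatch

class VocabularySketch:
    """Toy stand-in for a Vocabulary object: holds the unique words of
    an index and answers either the full list or a pattern match."""

    def __init__(self, words):
        self._words = set(words)

    def words(self):
        # All unique words in the vocabulary.
        return sorted(self._words)

    def __call__(self, pattern):
        # Only the unique words matching a glob pattern like '*ing'.
        return sorted(w for w in self._words
                      if fnmatch.fnmatchcase(w, pattern))

vocab = VocabularySketch(['indexing', 'searching', 'lexicon'])
all_words = vocab.words()     # every unique word
ing_words = vocab('*ing')     # just the words ending in 'ing'
```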
Now, I know there is no standard way, but is it possible at all?
In 2.2 it is standard (and documented in the Interfaces Wiki).
Can I use the items, keys etc interfaces of the text index (perhaps with some python hacking)?
TextIndexes do not store the word; they store an integer that the lexicon maps to a word. This is so text indexes can be language-independent.

-Michel
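The word-to-integer mapping can be sketched as follows. This is a minimal illustration of the idea, not Zope's lexicon code; the class and method names are invented.

```python
class LexiconSketch:
    """Toy word <-> integer mapping. The text index stores only the
    integer ids; the lexicon maps them back to words, which keeps the
    index itself language-independent."""

    def __init__(self):
        self._word_to_id = {}
        self._id_to_word = {}

    def get_id(self, word):
        # Assign a new id the first time a word is seen.
        if word not in self._word_to_id:
            wid = len(self._word_to_id)
            self._word_to_id[word] = wid
            self._id_to_word[wid] = word
        return self._word_to_id[word]

    def get_word(self, wid):
        return self._id_to_word[wid]
```

Swapping in a lexicon for a different language changes the words, but the index's integer ids and data structures stay the same.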