On Thu, 28 Sep 2000, Sin Hang Kin wrote:
> After reading some code of query, I think the regular expression operations which in parse, quotes and parse2 were not safe for utf8 string. So, I

That wouldn't surprise me.

> decide to emulate what they do. However, I do not understand what getlexicon is doing and I would like to learn what q should looks like before it is passed to evaluate. I do not understand that vocabulary seems to store like integer, is getlexicon a step to look up the string to convert them to integer? I am getting lost.
I don't fully understand Lexicon myself, but I've at least spent some time groveling around in the code. I understand there's been a relatively recent checkin of a new version of the text index stuff that at least provides clearer variable names and additional comments; if you aren't working from the cvs version you might want to browse the files on the cvs web interface.

So, here's what I understand:

The lexicon takes words and associates them with integers. It is the integers that are stored in the text index. So in the final stages of the search process, the parsed words are looked up in the lexicon to get the integer, and the integer is then passed to the index to get back the result set (the list of documents containing the word). The result set is itself a list of integers. I think it is in fact pairs (or some more complex data structure); at the least the index stores the document number and the word offset (I think it's a word offset) of the word into the document.

As for what q looks like...well, I haven't groveled through the parse, quotes, parens, and parse2 code much, so I'm guessing a bit here:

I *think* that before it goes into evaluate, q is a list of words or sequences, where each sequence is in turn a list of words or sequences...recursive. The sub-sequences would be the parenthesized expressions from the original string. In the original string, any occurrences of the pair of words 'and not' were replaced by 'andnot'. Any quoted strings (double quotes only, I believe) were replaced by sequences of words separated by the 'near' operator ('...'). parse2 makes sure that every other item in q is an operator, by sticking the default operator, 'or', in between any pairs that aren't separated by an operator.

If I'm right, an expression like:

    This is and not a (good "test of") searching

should end up feeding to evaluate a 'q' like this:

    ('this', 'or', 'is', 'andnot', 'a', 'or',
     ('good', 'or', ('test', '...', 'of')), 'or', 'searching')

I'm least sure of those parens around test...of.

Maybe this will at least give you a clue to enable you to figure out what the code *really* does <grin>.

--RDM
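
To make the guesses above concrete, here is a rough Python sketch of the two ideas: a lexicon that maps words to integers for an index of (document number, word offset) pairs, and the steps that might turn a query string into a nested q. This is not the real text index code. Every name in it (lexicon, index, index_document, lookup, tokenize, phrase, nest, to_q) is made up for illustration, and the behaviour only mirrors the description in this message, which is itself partly guesswork.

```python
import re

# --- Toy lexicon and index (hypothetical names, not the real classes) ---
lexicon = {}   # word -> integer id
index = {}     # integer id -> list of (document number, word offset)

def index_document(doc_id, text):
    """Assign each word an integer id and record where it occurs."""
    for offset, word in enumerate(text.lower().split()):
        word_id = lexicon.setdefault(word, len(lexicon))
        index.setdefault(word_id, []).append((doc_id, offset))

def lookup(word):
    """Word -> integer via the lexicon, then integer -> result set."""
    word_id = lexicon.get(word.lower())
    return index.get(word_id, []) if word_id is not None else []

# --- Toy query parsing (hypothetical, follows the guess in this message) ---
OPERATORS = ('and', 'or', 'andnot', '...')   # '...' is the 'near' operator

def is_op(item):
    return isinstance(item, str) and item in OPERATORS

def tokenize(s):
    """Replace 'and not' by 'andnot', then split into words, parens, phrases."""
    s = re.sub(r'\band\s+not\b', 'andnot', s.lower())
    return re.findall(r'"[^"]*"|[()]|[^\s()"]+', s)

def phrase(text):
    """Turn a quoted phrase into words joined by the 'near' operator."""
    seq = []
    for word in text.split():
        if seq:
            seq.append('...')
        seq.append(word)
    return tuple(seq)

def with_default_operator(seq, default='or'):
    """Insert the default operator between adjacent non-operator items."""
    out = []
    for item in seq:
        if out and not is_op(out[-1]) and not is_op(item):
            out.append(default)
        out.append(item)
    return tuple(out)

def nest(tokens):
    """Build nested tuples from a flat token list, recursing on parens."""
    def build(pos):
        seq = []
        while pos < len(tokens):
            tok = tokens[pos]
            pos += 1
            if tok == '(':
                sub, pos = build(pos)
                seq.append(sub)
            elif tok == ')':
                break
            elif tok.startswith('"'):
                seq.append(phrase(tok.strip('"')))
            else:
                seq.append(tok)
        return with_default_operator(seq), pos
    result, _ = build(0)
    return result

def to_q(s):
    return nest(tokenize(s))

if __name__ == '__main__':
    index_document(1, 'a good test')
    index_document(2, 'test of searching')
    print(lookup('test'))
    print(to_q('This is and not a (good "test of") searching'))
```

With these toy definitions, to_q on the example expression produces the same nested tuple as the guess above, including the inner parens around test...of. That only shows the description is self-consistent, not that the real parse and parse2 behave this way.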