[Zope-dev] ZCatalog with UTF-8 Chinese

Thu, 28 Sep 2000 21:01:27 -0400 (EDT)

On Thu, 28 Sep 2000, Sin Hang Kin wrote:
> After reading some code of query, I think the regular expression operations
> which in parse, quotes and parse2 were not safe for utf8 string. So, I

That wouldn't surprise me.

> decide to emulate what they do. However, I do not understand what getlexicon
> is doing and I would like to learn what  q should looks like before it is
> passed to evaluate. I do not understand that vocabulary seems to store like
> integer, is getlexicon a step to look up the string to convert them to
> integer? I am getting lost.

I don't fully understand Lexicon myself, but I've at least spent some
time grovelling around in the code.  I understand there's been a
relatively recent checkin of a new version of the text index code that
at least provides clearer variable names and additional comments; if
you aren't working from the CVS version you might want to browse the
files on the CVS web interface.

So, here's what I understand:

The lexicon takes words and associates them with integers.  It is the
integers that are stored in the text index.  So in the final stages
of the search process, each parsed word is looked up in the lexicon
to get its integer, and that integer is then passed to the index
to get back the result set (the documents containing the word).
The result set is itself made up of integers.  I think it is in fact
a list of pairs (or some more complex data structure); at the least
the index stores the document number and the position (I think it's a
word offset) of the word within the document.
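
To make that concrete, here is a toy model of the idea in Python.  It
is only a sketch of my understanding, not the real ZCatalog code: the
class and method names (ToyLexicon, ToyTextIndex, get_id, search_word)
are made up for illustration, and splitting on whitespace is an
assumption standing in for whatever the real splitter does.

```python
# Toy model only: hypothetical names, not the actual Lexicon/text index classes.

class ToyLexicon:
    """Maps each word to a small integer id."""

    def __init__(self):
        self._word_to_id = {}

    def get_id(self, word, create=False):
        # Return the integer for a word, or None if it is unknown.
        wid = self._word_to_id.get(word)
        if wid is None and create:
            wid = len(self._word_to_id)
            self._word_to_id[word] = wid
        return wid


class ToyTextIndex:
    """Stores, per word id, a list of (document id, word offset) pairs."""

    def __init__(self, lexicon):
        self.lexicon = lexicon
        self._postings = {}

    def index_document(self, doc_id, text):
        # Assumption: plain whitespace splitting and lowercasing.
        for offset, word in enumerate(text.lower().split()):
            wid = self.lexicon.get_id(word, create=True)
            self._postings.setdefault(wid, []).append((doc_id, offset))

    def search_word(self, word):
        # Word -> integer via the lexicon, then integer -> result set.
        wid = self.lexicon.get_id(word.lower())
        if wid is None:
            return []
        return self._postings.get(wid, [])
```

The point is just the two-step lookup: the index never sees the string,
only the integer the lexicon hands back for it.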

As for what q looks like...well, I haven't grovelled through the
parse, quotes, parens, and parse2 code much, so I'm guessing a bit
here: I *think* that before it goes into evaluate, q is a sequence
whose items are either words or nested sequences of the same kind,
recursively.  The sub-sequences would be the parenthesized
expressions from the original string.  In the original string, any
occurrences of the pair of words 'and not' were replaced by 'andnot'.
Any quoted strings (double quotes only, I believe) were replaced
by sequences of words separated by the 'near' operator ('...').
parse2 makes sure that every other item in q is an operator, by
sticking the default operator, 'or', in between any pairs that
aren't separated by an operator.

If I'm right, an expression like:

This is and not a (good "test of") searching

should end up feeding to evaluate a 'q' like this:

('this', 'or', 'is', 'andnot', 'a', 'or', ('good', 'or', ('test', '...',
   'of')), 'or', 'searching')

I'm least sure of those parens around test...of.
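
Here is a rough Python sketch of the steps as I've guessed at them
above.  It is not the real parse/quotes/parens/parse2 code: the
function names tokenize and build are hypothetical, as is the
lowercasing, and it only reproduces the behaviour I described, which
may well differ from what the code actually does.

```python
import re

# Assumption: this is the set of operators that evaluate understands.
OPERATORS = ('and', 'or', 'andnot', '...')


def tokenize(query):
    """Split a query into words, parentheses, and double-quoted phrases."""
    # Collapse the word pair 'and not' into the single operator 'andnot'.
    query = re.sub(r'\band\s+not\b', 'andnot', query.lower())
    return re.findall(r'"[^"]*"|[()]|[^\s()"]+', query)


def build(tokens):
    """Turn a flat token list into nested tuples (consumes the list)."""
    items = []
    while tokens:
        tok = tokens.pop(0)
        if tok == '(':
            # Parenthesized expression becomes a nested sub-sequence.
            items.append(build(tokens))
        elif tok == ')':
            break
        elif tok.startswith('"'):
            # Quoted phrase: words joined by the 'near' operator.
            phrase = []
            for word in tok.strip('"').split():
                if phrase:
                    phrase.append('...')
                phrase.append(word)
            items.append(tuple(phrase))
        else:
            items.append(tok)

    # Insert the default operator between neighbours that lack one.
    out = []
    for item in items:
        if out and out[-1] not in OPERATORS and item not in OPERATORS:
            out.append('or')
        out.append(item)
    return tuple(out)
```

Calling build(tokenize('This is and not a (good "test of") searching'))
with this sketch gives the same nested tuple as the one shown above,
including the parens around test...of, but that only shows the sketch
matches my guess, not that my guess matches Zope.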

Maybe this will at least give you a clue to enable you to figure
out what the code *really* does <grin>.

--RDM