[Zope-Dev] Some thoughts on splitter (Sin Hang Kin)

Michel Pelletier michel@digicool.com
Fri, 21 Apr 2000 07:18:26 -0700


Christian Wittern wrote:
> 
> After giving it some thought over the past few days, I came up with some
> more things re Splitter and Catalog searching in general. I will first post
> them here and see what feedback people might have and then put them into the
> WIKI.

Excellent.
 
> As was pointed out repeatedly, words, word boundaries and the like do not
> exist in the same way as in Western languages in some Asian languages (or
> writing systems). One way to overcome problems associated with
> word-splitting is to do no word splitting at all and instead split on every
> character.
> 
> As soon as ZCatalog starts using Unicode,

Keep in mind that the ZCatalog will not use Unicode at all.  In fact,
the ZCatalog pretty much works with integers the whole time for
efficiency.  There is nothing language-specific in the ZCatalog.

What is language-specific is the Vocabulary object, which has been
decoupled from its origin, the ZCatalog.  Any Asian language support
will not require changing the catalog at all, just creating a new kind
of vocabulary.  Whether this vocabulary indexes every Chinese character
or whole word patterns deduced from a matching algorithm (or both) is
entirely up to the implementation of the Vocabulary object.

I can explain a little further what this concept means.  In a high level
sense, an index is a mapping from words to documents that contain those
words:

  'foo' -> 13, 22, 42
  'bar' -> 67, 22, 42

The strings are the words, and the integers are the document ids of the
documents that contain those words (think of them like page numbers).
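
In Python terms, a toy version of that mapping might look like this
(the data is made up, and the real catalog structures are BTrees, not
plain dicts):

  # A toy inverted index: words -> ids of documents containing them.
  index = {
      'foo': {13, 22, 42},
      'bar': {67, 22, 42},
  }

  # Looking up a word yields every document that contains it.
  print(index['foo'])   # {13, 22, 42}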

A text index in Zope does this slightly differently: instead of
mapping the word to the document ids, it maps a word id, an integer,
to the document ids:

  34 -> 13, 22, 42
  35 -> 67, 22, 42

The 'words' that 34 and 35 stand for mean nothing to a Zope text
index.  So where do word ids come from?  The Vocabulary.  The
Vocabulary maps words to word ids.  This way, if you query for:

"foo AND bar"

The query is 'turned into' "34 AND 35".  The word ids are looked up
in the Vocabulary, so the Vocabulary contains all of the
language-specific semantics of which words map to which word ids.
This also gives us a handy way to create synonyms, since you can map
more than one word to the same id.
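
To make the indirection concrete, here is a rough sketch of both
mappings and a two-word AND query (the names and numbers are
illustrative, not the real ZCatalog or Vocabulary API):

  # Vocabulary: words -> word ids.  Mapping two words to the same id
  # is how synonyms fall out for free.
  vocabulary = {'foo': 34, 'bar': 35, 'phoo': 34}

  # Text index: word ids -> document ids.
  index = {
      34: {13, 22, 42},
      35: {67, 22, 42},
  }

  def query_and(*words):
      # "foo AND bar" becomes "34 AND 35": look each word up in the
      # vocabulary, then intersect the document sets from the index.
      result = None
      for word in words:
          docs = index.get(vocabulary[word], set())
          result = docs if result is None else result & docs
      return result or set()

  print(query_and('foo', 'bar'))   # {22, 42}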

What words 'are' is determined by the Splitter, which is also provided
by the Vocabulary object.  This is because, just as the words
themselves are very specific to a language, so are the semantics that
define them.
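
As a toy example, a splitter for a Western language might be little
more than a regular expression (this stand-in has nothing to do with
the actual C splitter):

  import re

  class Splitter:
      # What counts as a 'word' is a property of the language, which
      # is why the Vocabulary supplies the splitter.  This toy version
      # just grabs runs of word characters.
      def split(self, text):
          return re.findall(r'\w+', text.lower())

  print(Splitter().split('Foo and Bar'))   # ['foo', 'and', 'bar']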

> this could even be incorporated in
> the default Splitter, which could be told to do word splitting on some
> character ranges and character splitting on others.

The default splitter will probably remain fairly simple, really just a
configurable core splitter in C.  A Unicode splitter could, in the
future, subclass the default splitter and add Unicode splitting
awareness.
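
To sketch what that subclassing might look like (the class name and
the character-range test are illustrative only, building on the toy
Splitter above):

  class UnicodeSplitter(Splitter):
      # Fall back to the base splitter for Western text, but emit one
      # 'word' per character in the CJK Unified Ideographs range.
      def _is_cjk(self, ch):
          return '\u4e00' <= ch <= '\u9fff'

      def split(self, text):
          words, buf = [], []
          for ch in text:
              if self._is_cjk(ch):
                  if buf:
                      words.extend(Splitter.split(self, ''.join(buf)))
                      buf = []
                  words.append(ch)   # each ideograph is its own 'word'
              else:
                  buf.append(ch)
          if buf:
              words.extend(Splitter.split(self, ''.join(buf)))
          return words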
 
> It seems to me, that this is the approach generally used on the Web by Asian
> language search engines.

> To accommodate this, there have to be some changes to the way searches are
> done as well: On most search engines, giving a few search terms separated by
> whitespace means ANDing them for the search, which is fine.

Oh, OK.  I can see how this is not ideal, because it could falsely
match other words that contain your search characters in a different
order.
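
A tiny example of the false match (made-up data): suppose document 1
contains "東京" and document 2 contains "京東".

  index = {
      '東': {1, 2},
      '京': {1, 2},
  }

  # A plain AND query on the two characters returns both documents,
  # even though only document 1 contains them in the searched order.
  print(index['東'] & index['京'])   # {1, 2}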

> If this is not
> desired however, most search engines allow the user to use quotes to
> indicate the terms should be used as a phrase. Unfortunately, Zope does not
> support this yet. I think it is highly desirable!!!

We thought so too, which is why text indexes do support phrase
matching with quotes.  This is holdover code from ZTables, and I did
not write or change it at all, so maybe it is broken?  Have you tested
it?  Just search for "a phrase".
 
> If ZCatalog supported this type of search, it could be used for Asian
> languages: a search for two or more characters would return documents
> where they occur in sequence.
> 
> Does this make any sense?

Yes, I can see how this rather handily gets around needing expensive
up-front parsing into semantic chunks, the equivalent of Asian
'words'.  This would actually not be difficult to implement at all.
What, then, is the benefit of pre-parsing documents into semantically
defined 'words' instead of just indexing sequences of characters?  The
only one I can think of is index space, since the vocabulary and the
number of index references would shrink quite a bit with some smart
up-front processing.
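
For what it is worth, here is a rough sketch of the 'occur in
sequence' idea: index each character against the positions where it
occurs in each document, then require adjacency at query time (these
are hypothetical structures, not ZCatalog's):

  # char -> {document id: positions of that char in the document}.
  # Document 1 contains "東京", document 2 contains "京東".
  positions = {
      '東': {1: [0], 2: [1]},
      '京': {1: [1], 2: [0]},
  }

  def phrase_search(chars):
      # Documents that contain every character at all...
      docs = set(positions[chars[0]])
      for c in chars[1:]:
          docs &= set(positions[c])
      # ...kept only if the characters appear consecutively.
      hits = set()
      for doc in docs:
          for start in positions[chars[0]][doc]:
              if all(start + i in positions[c][doc]
                     for i, c in enumerate(chars)):
                  hits.add(doc)
                  break
      return hits

  print(phrase_search('東京'))   # {1} -- adjacency rules out document 2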

-Michel