Christian Wittern wrote:
After giving it some thought over the past few days, I came up with some more things re Splitter and Catalog searching in general. I will first post them here and see what feedback people might have and then put them into the WIKI.
Excellent.
As was pointed out repeatedly, words, word boundaries and the like do not exist in some Asian languages (or writing systems) in the same way they do in Western languages. One way to overcome the problems associated with word splitting is to do no word splitting at all, and instead split on every character.
As soon as ZCatalog starts using Unicode,
Keep in mind that the ZCatalog will not use Unicode at all. In fact, the ZCatalog works with integers pretty much the whole time, for efficiency. There is nothing language specific in the ZCatalog. What is language specific is the Vocabulary object, which has been de-coupled from whence it came, the ZCatalog. Asian language support will not require changing the catalog at all, just creating a new kind of vocabulary. Whether this vocabulary indexes every Chinese character, or whole word patterns deduced from a matching algorithm (or both), is entirely up to the implementation of the Vocabulary object.

I can explain a little further what this concept means. In a high level sense, an index is a mapping from words to the documents that contain those words:

  'foo' -> 13, 22, 42
  'bar' -> 67, 22, 42

The strings are the words, and the integers are the document ids of the documents that contain those words (think of them like page numbers). A text index in Zope does this slightly differently: instead of mapping the word to the document ids, it maps a word id, an integer, to the document ids:

  34 -> 13, 22, 42
  35 -> 67, 22, 42

The 'words' that 34 and 35 map to mean nothing to a Zope text index. So where do word ids come from? The Vocabulary. The Vocabulary maps words to word ids. This way, if you query for:

  "foo AND bar"

the query is 'turned into' "34 AND 35": the word ids are looked up in the Vocabulary. So the Vocabulary contains all of the language specific semantics of which words map to which word ids. This also gives us a handy way to create synonyms, since you can map more than one word to the same id.

What words 'are' is determined by the Splitter, which is also provided by the Vocabulary object. This is because, just as the words themselves are very specific to a language, so are the semantics which define them.
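The two-level mapping described above can be sketched in a few lines of Python. This is purely illustrative (the class names and methods here are hypothetical, not the actual ZCatalog/Vocabulary API), but it shows how all the word semantics live in the Vocabulary while the index itself only ever sees integers:

```python
class Vocabulary:
    """Maps words to integer word ids; all language semantics live here."""
    def __init__(self):
        self._wids = {}        # word -> word id
        self._next = 0

    def wid(self, word):
        # Assign a fresh id on first sight; synonyms could share an id.
        if word not in self._wids:
            self._wids[word] = self._next
            self._next += 1
        return self._wids[word]

class TextIndex:
    """Maps word ids (never words) to ids of documents containing them."""
    def __init__(self, vocab):
        self._vocab = vocab
        self._index = {}       # word id -> set of document ids

    def index_doc(self, docid, words):
        for w in words:
            self._index.setdefault(self._vocab.wid(w), set()).add(docid)

    def search_and(self, *words):
        # "foo AND bar" becomes "34 AND 35": look the ids up, intersect.
        sets = [self._index.get(self._vocab.wid(w), set()) for w in words]
        return set.intersection(*sets) if sets else set()

vocab = Vocabulary()
idx = TextIndex(vocab)
idx.index_doc(13, ['foo'])
idx.index_doc(22, ['foo', 'bar'])
idx.index_doc(42, ['foo', 'bar'])
idx.index_doc(67, ['bar'])
print(sorted(idx.search_and('foo', 'bar')))   # -> [22, 42]
```

A character-based Asian-language vocabulary would simply hand out one word id per character; `TextIndex` would not change at all.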
this could even be incorporated in the default Splitter, which could be told to do word splitting on some character ranges and character splitting on others.
The default splitter will probably remain fairly simple, really just a configurable core splitter in C. A Unicode splitter could, in the future, subclass the default splitter and add unicode splitting awareness.
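The subclassing idea could look something like the following sketch. The class names are made up for illustration (the real default splitter is in C, and a real Unicode splitter would cover more ranges than the one regex used here); it just shows a splitter that keeps whitespace splitting for Western text but emits one token per character in the CJK range:

```python
import re

class DefaultSplitter:
    """Stand-in for the simple core splitter: split on word characters."""
    def split(self, text):
        return re.findall(r'\w+', text)

class CJKSplitter(DefaultSplitter):
    # CJK Unified Ideographs only; a real splitter would cover more ranges.
    _cjk = re.compile(u'([\u4e00-\u9fff])')

    def split(self, text):
        tokens = []
        # Splitting on a capturing group keeps each CJK char as its own chunk.
        for chunk in self._cjk.split(text):
            if not chunk:
                continue
            if self._cjk.match(chunk):
                tokens.append(chunk)          # one token per CJK character
            else:
                tokens.extend(super().split(chunk))
        return tokens

print(CJKSplitter().split(u'Zope 支持中文 search'))
# -> ['Zope', '支', '持', '中', '文', 'search']
```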
It seems to me, that this is the approach generally used on the Web by Asian language search engines.
To accommodate this, there would have to be some changes to the way searches are done as well: on most search engines, giving a few search terms separated by whitespace means ANDing them for the search, which is fine.
oh ok, I can see how this is not ideal, because it could falsely match other words that contain your search characters in a different order.
If this is not desired however, most search engines allow the user to use quotes to indicate the terms should be used as a phrase. Unfortunately, Zope does not support this yet. I think it is highly desirable!!!
We did too, which is why text indexes do support phrase matching with quotes. This is holdover code from ZTables, and I did not write it or change it at all, so maybe it is broken? Have you tested it? Just search for "a phrase".
If the ZCatalog supported this type of search, it could be used for Asian languages: searches for two or more characters would return documents where those characters occur in sequence.
Does this make any sense?
Yes, I can see how this rather handily gets around the need for expensive up-front parsing into semantic chunks, the equivalent of Asian 'words'. This would actually not be difficult to implement at all. What, then, is the benefit of pre-parsing documents into semantically defined 'words' instead of just indexing sequences of characters? The only one I can think of is index space, since the vocabulary and the number of index references would shrink quite a bit with some up-front smart processing. -Michel
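To make the combination concrete, here is a minimal sketch of the whole idea in this thread: index every character as a 'word' along with its position, then treat a multi-character query as a phrase, so only documents containing the characters in sequence match. (All names are illustrative, and this is not the actual text index code; the example characters are arbitrary.)

```python
def index_chars(docs):
    """Positional character index: char -> {docid: [positions]}."""
    index = {}
    for docid, text in docs.items():
        for pos, ch in enumerate(text):
            index.setdefault(ch, {}).setdefault(docid, []).append(pos)
    return index

def sequence_search(index, query):
    """Match only documents where the query characters appear adjacently."""
    hits = set()
    for docid, positions in index.get(query[0], {}).items():
        for p in positions:
            # The phrase matches if each later char sits at p+1, p+2, ...
            if all(p + i in index.get(ch, {}).get(docid, [])
                   for i, ch in enumerate(query[1:], 1)):
                hits.add(docid)
                break
    return hits

docs = {1: u'禅宗研究', 2: u'研宗'}
idx = index_chars(docs)
print(sorted(sequence_search(idx, u'宗研')))   # -> [1]
```

Note that document 2 contains both query characters but in the wrong order, so it is correctly excluded; this is exactly the false-match problem with plain character ANDing mentioned earlier in the thread. The positional lists are also where the space cost shows up: every character occurrence is recorded, which is the index-size trade-off against smarter up-front word parsing.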