Searching a big collection with simple pattern matching is not good enough, so your scheme would only work at small scale. For a small site, or within a single site, it would do well. When the amount of information is large, simple AND/OR search is not enough, and we then need to learn the word boundaries of ideographic text.

This could be done automatically, by humans, or by a combination of both, and it should not be limited by the system designer, because the technology will change and the most suitable way to find word boundaries will keep changing with it.

I am strongly against the single-character boundary solution, which has been in use for a long time in current implementations. People think search is simple because they found such a system useful at small scale. Applied to the Internet, it is a joke.

Do not go this way: you would build a short-lived system that stops being useful before long. If you want such a system, simply modify the spliter.py and you have what you need; we do not need this built into the next-generation ZCatalog. Please don't.

Rgs,
Kent Sin

----- Original Message -----
From: "Christian Wittern" <chris@ccbs.ntu.edu.tw>
To: "Sin Hang Kin" <kentsin@poboxes.com>; <zope-dev@zope.org>
Sent: Friday, April 21, 2000 2:10 PM
Subject: RE: [Zope-Dev] Some thoughts on splitter (Sin Hang Kin)

After giving it some thought over the past few days, I came up with some more things re Splitter and Catalog searching in general. I will first post them here and see what feedback people might have and then put them into the WIKI.
As was pointed out repeatedly, words, word boundaries and the like do not exist in some Asian languages (or writing systems) in the same way as they do in Western languages. One way to overcome the problems associated with word splitting is to do no word splitting at all and instead split on every character.
As soon as ZCatalog starts using Unicode, this could even be incorporated in the default Splitter, which could be told to do word splitting on some character ranges and character splitting on others.
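To make the idea concrete, here is a minimal sketch of such a mixed splitter. It is not the existing ZCatalog Splitter; the function name `split_mixed` and the choice of character ranges (Hiragana, Katakana and the CJK Unified Ideographs block) are my own assumptions for illustration.

```python
import re

# Assumed ranges for "split on every character":
# Hiragana/Katakana (U+3040-U+30FF) and CJK Unified Ideographs (U+4E00-U+9FFF).
CJK = "\u3040-\u30ff\u4e00-\u9fff"

# Either one single CJK character, or a run of word characters that are not CJK.
TOKEN = re.compile("[" + CJK + "]|[^\\W" + CJK + "]+")


def split_mixed(text):
    """Word-split non-CJK text, character-split CJK text (hypothetical helper)."""
    return [m.group(0).lower() for m in TOKEN.finditer(text)]


print(split_mixed("Zope 搜索 engine"))
```

Western words stay whole, while every ideograph becomes a token of its own. Punctuation and whitespace are dropped in both cases.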
It seems to me that this is the approach generally used on the Web by Asian-language search engines.
To accommodate this, there have to be some changes to the way searches are done as well. On most search engines, giving a few search terms separated by whitespace means ANDing them for the search, which is fine. If this is not desired, however, most search engines allow the user to put the terms in quotes to indicate that they should be treated as a phrase. Unfortunately, Zope does not support this yet. I think it is highly desirable!
If ZCatalog supported this type of search, it could be used for Asian languages: a search for two or more characters would return only the documents in which those characters occur in sequence.
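A rough sketch of what I mean, assuming the index stores the position of every token in each document. None of this is existing ZCatalog code; `build_index` and `phrase_search` are hypothetical names, and the default splitter here is simply `list`, which splits a string into single characters.

```python
from collections import defaultdict


def build_index(docs, split=list):
    """Map token -> doc_id -> list of positions (hypothetical positional index)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(split(text)):
            index[token][doc_id].append(pos)
    return index


def phrase_search(index, tokens):
    """Return ids of documents in which the tokens occur consecutively."""
    if not tokens:
        return set()
    hits = set()
    for doc_id, starts in index.get(tokens[0], {}).items():
        for start in starts:
            # Token i of the phrase must sit at position start + i.
            if all(start + i in index.get(tok, {}).get(doc_id, ())
                   for i, tok in enumerate(tokens)):
                hits.add(doc_id)
                break
    return hits


docs = {1: "中文搜索", 2: "搜中文索", 3: "索搜"}
index = build_index(docs)
print(phrase_search(index, list("搜索")))
```

A plain AND of the two characters would match all three documents here; the phrase search keeps only the one where they are adjacent and in order.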
Does this make any sense?
All the best,
Christian