[Zope-Dev] Re: Some thoughts on splitter

23 Apr 2000

Searching a large collection with pattern matching alone is not good
enough, so your scheme would only work at small scale. For a small site,
or within a single site, it would do well.

When the amount of information is large, simple AND/OR search is not
enough. We then need to learn the word boundaries of ideographic text.
This could be done automatically, by hand, or by a combination of both, and
it should not be restricted by the system designer, because the technology
will change and the most suitable way to find word boundaries will keep
changing with it.
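
A minimal sketch of what "not restricted by the system designer" could mean
in practice: the indexer accepts any callable that turns text into terms, so
the boundary policy can be replaced later. Every name here (Segmenter,
char_segmenter, make_dictionary_segmenter, index_terms) is hypothetical and
not part of ZCatalog, and forward maximum matching against a word list is
only one illustrative strategy.

```python
from typing import Callable, Iterable, List

# Hypothetical type: a segmenter is any callable mapping text to index terms.
Segmenter = Callable[[str], List[str]]


def char_segmenter(text: str) -> List[str]:
    """Baseline policy: every non-space character is its own term."""
    return [ch for ch in text if not ch.isspace()]


def make_dictionary_segmenter(words: Iterable[str]) -> Segmenter:
    """Build a segmenter that does forward maximum matching on a word list."""
    vocab = set(words)
    longest = max([len(w) for w in vocab] + [1])

    def segment(text: str) -> List[str]:
        terms = []
        i = 0
        while i < len(text):
            if text[i].isspace():
                i += 1
                continue
            # Try the longest candidate first; fall back to one character.
            for size in range(min(longest, len(text) - i), 0, -1):
                piece = text[i:i + size]
                if size == 1 or piece in vocab:
                    terms.append(piece)
                    i += size
                    break
        return terms

    return segment


def index_terms(text: str, segmenter: Segmenter = char_segmenter) -> List[str]:
    """The catalog only sees terms; the boundary policy is swappable."""
    return segmenter(text)
```

Swapping in a statistical or hand-corrected segmenter would then require no
change to the code that stores and queries the terms.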

I am strongly against the single-character boundary solution, which has been
in use for a long time in current implementations. People think search is
simple, and they find such a system useful on a small scale. Applied to the
Internet, it is a joke. Do not go this way: you would be building a
short-lived system that will stop being useful before long.

If you want such a system, simply modify spliter.py and you have what you
need. We do not need this built into the next-generation ZCatalog.
Please don't.

Rgs,

Kent Sin
----- Original Message -----
From: "Christian Wittern" <chris@ccbs.ntu.edu.tw>
To: "Sin Hang Kin" <kentsin@poboxes.com>; <zope-dev@zope.org>
Sent: Friday, April 21, 2000 2:10 PM
Subject: RE: [Zope-Dev] Some thoughts on splitter (Sin Hang Kin)
...
After giving it some thought over the past few days, I came up with some
more things re Splitter and Catalog searching in general. I will first post
them here and see what feedback people might have and then put them into the
WIKI.
As was pointed out repeatedly, words, word boundaries and the like do not
exist in some Asian languages (or writing systems) in the same way as in
Western languages. One way to overcome the problems associated with
word splitting is to do no word splitting at all and instead split on every
character.
As soon as ZCatalog starts using Unicode, this could even be incorporated in
the default Splitter, which could be told to do word splitting on some
character ranges and character splitting on others.
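
A rough sketch of such a range-aware splitter, assuming Unicode strings as
input. The ranges below (CJK Unified Ideographs, Hiragana, Katakana) are
chosen only for illustration and are far from complete, and hybrid_split is
a hypothetical name, not the Splitter's actual interface.

```python
import re
from typing import List

# Assumed ranges, for illustration only: CJK Unified Ideographs plus
# Hiragana and Katakana. A real splitter would need a fuller table.
_CJK = "\u4e00-\u9fff\u3040-\u30ff"

# Match either one character from the CJK ranges, or a run of word
# characters that are outside those ranges.
_TOKEN = re.compile("[%s]|[^\\W%s]+" % (_CJK, _CJK))


def hybrid_split(text: str) -> List[str]:
    """Split on every character in the CJK ranges, on words elsewhere."""
    return [m.group(0).lower() for m in _TOKEN.finditer(text)]
```

For example, hybrid_split("Zope 搜索 engine") yields "zope", "搜", "索" and
"engine" as separate terms.
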
It seems to me that this is the approach generally used on the Web by Asian
language search engines.
To accommodate this, there have to be some changes to the way searches are
done as well. On most search engines, giving a few search terms separated by
whitespace means ANDing them for the search, which is fine. If this is not
desired, however, most search engines allow the user to use quotes to
indicate that the terms should be treated as a phrase. Unfortunately, Zope
does not support this yet. I think it is highly desirable!
If ZCatalog supported this type of search, it could be used for Asian
languages: a search for two or more characters would return the documents
in which those characters occur in sequence.
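
To make the difference between ANDing terms and a phrase search concrete,
here is a minimal sketch of an index that records term positions. It is not
how ZCatalog stores its data; PositionalIndex and its methods are
hypothetical names used only for this illustration.

```python
from collections import defaultdict
from typing import Hashable, List, Set


class PositionalIndex:
    """Toy inverted index that remembers where each term occurs."""

    def __init__(self) -> None:
        # term -> doc_id -> list of positions of that term in the document
        self._postings = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id: Hashable, terms: List[str]) -> None:
        for pos, term in enumerate(terms):
            self._postings[term][doc_id].append(pos)

    def search_and(self, terms: List[str]) -> Set[Hashable]:
        """Documents containing every term, in any order."""
        sets = [set(self._postings.get(t, {})) for t in terms]
        return set.intersection(*sets) if sets else set()

    def search_phrase(self, terms: List[str]) -> Set[Hashable]:
        """Documents where the terms occur at consecutive positions."""
        hits = set()
        for doc_id in self.search_and(terms):
            starts = self._postings[terms[0]][doc_id]
            for start in starts:
                if all(start + k in self._postings[terms[k]][doc_id]
                       for k in range(1, len(terms))):
                    hits.add(doc_id)
                    break
        return hits
```

With one document indexed as the characters of "中文搜索引擎" and another as
those of "搜文中索", an AND search for "中" and "文" matches both, while the
phrase search matches only the first.
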
Does this make any sense?
All the best,
Christian