[Zope-Dev] Re: Some thoughts on spliter

Sin Hang Kin iekentsin@infoez.com.mo
Sun, 23 Apr 2000 08:00:07 +0800


Pattern matching across a large collection is not good enough, so your
scheme would only work at small scale. For a small site, or within a single
site, it would do well.

When the amount of information is large, simple AND/OR search is not
enough. We then need to identify the word boundaries in ideographic text.
This could be done automatically, by humans, or by a combination of both,
and it should not be constrained by the system designer, because the
technology will change and the most suitable way to find word boundaries
will keep changing with it.

I am strongly against the single-character boundary solution, which has been
in use for a long time in current implementations. People think search is
simple, and they find such a system useful at small scale. Applied to the
Internet, this technique is a joke. Do not go this way: you will build a
short-lived system that soon stops being useful.

If you want such a system, simply modify spliter.py and you have what you
need. We do not need this built into the next-generation ZCatalog.
Please don't.
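
For reference, a minimal sketch of what such a per-character modification
might look like. This is not the real Zope Splitter code: split_mixed is a
hypothetical name, and the sketch assumes that only the CJK Unified
Ideographs block (U+4E00 to U+9FFF) counts as ideographic text.

```python
import re

# Hypothetical helper, not the actual Zope Splitter API.
# Assumption: only U+4E00-U+9FFF is treated as ideographic; a real
# splitter would have to cover more character ranges.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[^\W\u4e00-\u9fff]+")

def split_mixed(text):
    """Return one token per CJK character and one per run of other word characters."""
    return [token.lower() for token in _TOKEN.findall(text)]
```

Every ideograph becomes its own index term here, which is exactly why the
index and the result lists grow so quickly on a large collection.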

Rgs,

Kent Sin
----- Original Message -----
From: "Christian Wittern" <chris@ccbs.ntu.edu.tw>
To: "Sin Hang Kin" <kentsin@poboxes.com>; <zope-dev@zope.org>
Sent: Friday, April 21, 2000 2:10 PM
Subject: RE: [Zope-Dev] Some thoughts on splitter (Sin Hang Kin)


> After giving it some thought over the past few days, I came up with some
> more things re Splitter and Catalog searching in general. I will first
> post them here and see what feedback people might have and then put them
> into the WIKI.
>
> As was pointed out repeatedly, words, word boundaries and the like do not
> exist in some Asian languages (or writing systems) in the same way as in
> Western languages. One way to overcome the problems associated with
> word splitting is to do no word splitting at all and instead split on
> every character.
>
> As soon as ZCatalog starts using Unicode, this could even be incorporated
> in the default Splitter, which could be told to do word splitting on some
> character ranges and character splitting on others.
>
> It seems to me that this is the approach generally used on the Web by
> Asian language search engines.
>
> To accommodate this, there have to be some changes to the way searches
> are done as well. On most search engines, giving a few search terms
> separated by whitespace means ANDing them for the search, which is fine.
> If this is not desired, however, most search engines allow the user to
> use quotes to indicate that the terms should be used as a phrase.
> Unfortunately, Zope does not support this yet. I think it is highly
> desirable!!!
>
> If ZCatalog supported this type of search, it could be used for Asian
> languages: a search for two or more characters would return the documents
> in which those characters occur in sequence.
>
> Does this make any sense?
>
> All the best,
>
> Christian
>
>
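
The phrase search described in the quoted message could be sketched roughly
as follows. This is only an illustration under stated assumptions, not how
ZCatalog works: build_index and phrase_search are hypothetical names, and
the index is assumed to store the position of every token in every document.

```python
from collections import defaultdict

def build_index(docs):
    """Map token -> {doc_id: [positions]}; docs maps doc_id -> list of tokens."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, token in enumerate(tokens):
            index[token][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return sorted ids of documents in which the phrase tokens occur consecutively."""
    if not phrase:
        return []
    hits = []
    # Try every occurrence of the first token as a possible phrase start.
    for doc_id, starts in index.get(phrase[0], {}).items():
        for start in starts:
            if all(start + offset in index.get(token, {}).get(doc_id, ())
                   for offset, token in enumerate(phrase)):
                hits.append(doc_id)
                break
    return sorted(hits)
```

With per-character tokens, a plain AND of two characters would match a
document that contains them in any order; the position check is what
restricts the result to documents where they appear in sequence.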