Dear Zopistas, dear Michel,

I have some thoughts on splitting text, mainly from my experience with Chinese texts.

1. The handling of text should depend on some language tagging, similar to HTML 4. This ensures that multilingual documents can be treated correctly. For backward compatibility, the default (in the absence of a language tag) could be English, and of course the acquisition methods should provide reasonably easy ways for multilingual sites to define the language once and be done with it.

2. So far, automatic splitting algorithms have reached 70-90% accuracy, depending on the type of text, the vocabulary, and the day of the week. There are some systematic problems that make a 100% solution unlikely. Even the very good ChaSen for splitting Japanese, which is used in JSplitter, cannot split Japanese with 100% accuracy. The bottom line is that a door needs to be left open for human intervention in fine-tuning the splitting. I would therefore think in the direction of a preprocessor to the splitter that inserts spaces at places it considers safe to split. This should be done upon importing text into Zope, or upon the first save when new text is edited. People working with the texts could then fine-tune the splitting to avoid blunders. The Zope splitter could then happily just split at the spaces as usual. Of course, depending on the language settings, these spaces should be filtered out before the text is presented in normal view.

3. A completely different approach would be to forget about splitting and just index every Asian character. Jim Breen employed this method in his Japanese-English dictionary EDICT (). This allows efficient retrieval of arbitrary strings. I am not sure, however, how this would adapt to the kind of dynamic indexing required in Zope.

All the best,
Christian

Dr. Christian Wittern
Chung-Hwa Institute of Buddhist Studies
276, Kuang Ming Road, Peitou 112
Taipei, TAIWAN
Tel. +886-2-2892-6111#65, Email chris@ccbs.ntu.edu.tw
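The preprocessor pipeline described in point 2 could be sketched roughly as follows. This is a toy illustration in Python, not an actual Zope API: the greedy longest-match segmenter, the tiny lexicon, and all function names are assumptions made up for the sketch.

```python
# Hypothetical pre-segmentation pipeline: a preprocessor inserts spaces
# at plausible word boundaries on import (or first save), humans may
# then fix blunders by hand, the ordinary whitespace splitter indexes
# as usual, and the helper spaces are stripped before normal display.

DICTIONARY = {"中文", "分詞", "測試"}  # toy lexicon, illustration only
MAX_WORD = 4  # longest lexicon entry to try at each position

def presegment(text):
    """Greedy longest-match: emit a dictionary word where one matches,
    otherwise fall back to a single character."""
    out, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD, len(text) - i), 0, -1):
            if text[i:i + size] in DICTIONARY or size == 1:
                out.append(text[i:i + size])
                i += size
                break
    return " ".join(out)

def split_for_index(presegmented):
    """The Zope splitter can then simply split on whitespace."""
    return presegmented.split()

def display(presegmented):
    """Strip the helper spaces before presenting the text in normal view."""
    return presegmented.replace(" ", "")
```

The point of the design is that the space-delimited intermediate form is plain text, so a human editor can correct the machine's guesses with any editor before indexing happens.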
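The character-indexing alternative from point 3 might look like this in outline: an inverted index keyed on every character, with arbitrary substrings retrieved by intersecting the per-character postings and verifying the match. Again a hedged sketch with made-up names, not how EDICT or Zope actually implement it.

```python
from collections import defaultdict

# Sketch of point 3: index every character of every document; retrieve
# an arbitrary string by intersecting the posting sets of its characters,
# then verifying the substring actually occurs in each candidate.

def build_index(docs):
    """docs: {doc_id: text} -> {character: set of doc_ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for ch in text:
            index[ch].add(doc_id)
    return index

def search(index, docs, query):
    """Candidates contain every query character; verify the substring."""
    if not query:
        return set()
    candidates = set.intersection(*(index.get(ch, set()) for ch in query))
    return {d for d in candidates if query in docs[d]}
```

This avoids word segmentation entirely, at the cost of a larger index and a verification pass, which is presumably where the question about Zope's dynamic indexing comes in.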