[Zope-dev] Some thoughts on splitter

Christian Wittern chris@ccbs.ntu.edu.tw
Sat, 15 Apr 2000 15:57:38 +0800


>
> However, I would like to suggest not inserting spaces, but instead some
> code that does not alter the display. It seems that Unicode has set aside
> the code point U+200C (&#8204;), which is called the zero width non-joiner.
> The code can then be stored with the text, where it would be available for
> editing and pass through later processing.

This is a good idea, although it might be a bit cumbersome to handle. A
text filled with these entities will be quite hard to read. For that reason,
I thought of using spaces. The display engine could strip out single spaces
and shorten longer runs of spaces by one.
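
The space-based scheme can be sketched roughly as follows. This is only an
illustration in present-day Python 3, not existing Zope code: the function
names are made up, and it assumes that a literal space in the original text
is stored as two spaces so that it survives the display step.

```python
import re

def display_text(stored):
    """Shorten every run of spaces by one.

    A single space (a segmentation marker) disappears; a run of n spaces
    becomes n - 1 spaces, so doubled spaces show up as one real space.
    """
    return re.sub(r" ( *)", r"\1", stored)

def index_terms(stored):
    """Split the stored text into terms at the markers (and real spaces)."""
    return stored.split()

# Hypothetical stored form: one marker between two Han words, then a real
# space (stored doubled) before a Latin word.
stored = "中文 分词  test"
shown = display_text(stored)
terms = index_terms(stored)
```

With the zero width non-joiner instead of spaces, the display step would not
be needed at all, and the terms could be obtained with
stored.split("\u200c").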
>
> The index-every-character approach is not preferred, since it can be
> emulated by the previous method if needed.

I agree.
>
> Also, the catalogue must use Unicode for cross-encoding search. It is well
> known that Han text has many encodings: Big5, GB2312, JIS, etc. It is good
> practice to convert all text to Unicode, normalize it, then perform the
> splitting.

I think the encoding should be the responsibility of the user. After all, he
knows what he wants to do. Sites that use more than one Han encoding could
go with Unicode; other sites might prefer to use the local encoding, since
there are many more tools that can be used with it.
If Zope starts normalizing the text, some of the users might be surprised by
the results.

But of course, it should certainly be possible to use Unicode! For this, we
will have to wait for Python 1.6, though.
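
For illustration, the "convert to Unicode, normalize, then split" idea from
the quoted text might look like this in a present-day Python 3 (not the
Python 1.5/1.6 discussed here). The function name is made up, and the choice
of the NFKC normalization form is an assumption; the thread does not say
which form was meant.

```python
import unicodedata

def to_normalized_unicode(raw, encoding):
    """Decode bytes in a user-declared encoding and normalize the result."""
    text = raw.decode(encoding)
    # NFKC folds compatibility variants, e.g. fullwidth Latin letters,
    # into their ordinary forms. Other forms (NFC, NFD) are possible.
    return unicodedata.normalize("NFKC", text)

# The same word arriving in three different byte encodings.
word = "中文"
from_big5 = to_normalized_unicode(word.encode("big5"), "big5")
from_gb = to_normalized_unicode(word.encode("gb2312"), "gb2312")
from_sjis = to_normalized_unicode(word.encode("shift_jis"), "shift_jis")

# After conversion all three compare equal, so a single index entry can
# serve searches coming from any of the encodings.
same = from_big5 == from_gb == from_sjis
```

Normalization is also where surprises can arise: under NFKC, a fullwidth
"ＺＯＰＥ" is stored as plain "ZOPE".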

>
> The conversion must be based on language and encoding; however, most HTML
> does not declare its language and encoding. I have seen some encoding
> detection code based on checking for frequently used Han characters, which
> gives a good guess of the encoding.

Right, but then the user should declare the language. As I explained, if
this is inherited like acquisition, they might get away with just one
declaration for a whole site.
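
A rough sketch of how the two ideas could fit together, again in present-day
Python 3: a declaration made once near the top of a site is found by walking
up through the containers, and only when nothing is declared does a
frequency-based guess take over. The Node class, the function names and the
tiny table of common characters are all hypothetical; this is not Zope's
acquisition machinery, and a real detector would need a far larger table.

```python
# Small illustrative sample of very common Han characters, in both
# simplified and traditional forms.
COMMON_HAN = set("的一是不了人我在有他这中大来上国个到说们這來國個說們")

def guess_encoding(data, candidates=("big5", "gb2312", "shift_jis")):
    """Return the candidate whose decoding yields the most common characters."""
    best, best_hits = None, 0
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        hits = sum(1 for ch in text if ch in COMMON_HAN)
        if hits > best_hits:
            best, best_hits = enc, hits
    return best  # None if no candidate produced any common character

class Node:
    """Hypothetical site object; an encoding may be declared on any level."""
    def __init__(self, parent=None, encoding=None):
        self.parent = parent
        self.encoding = encoding

def effective_encoding(node, data):
    """Use the nearest declaration on the way up, else fall back to a guess."""
    while node is not None:
        if node.encoding:
            return node.encoding
        node = node.parent
    return guess_encoding(data)

# One declaration at the site root covers every page below it.
site = Node(encoding="big5")
page = Node(parent=Node(parent=site))
declared = effective_encoding(page, b"")

# A page with no declaration anywhere above it is guessed from its bytes.
orphan = Node()
guessed = effective_encoding(orphan, "我们在这个国家有很多人".encode("gb2312"))
```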

All the best,

Christian