[Zope-dev] Some thoughts on splitter
Michel Pelletier
michel@digicool.com
Fri, 14 Apr 2000 07:44:48 -0700
Sin Hang Kin wrote:
>
> Yes! The pre-process approach is heading the right way.
>
> However, I would like to suggest not inserting spaces, but some code that
> does not alter the display. It seems that Unicode has set aside the
> code U+200C (&zwnj;), which is called the zero width non-joiner. The code
> can then be stored with the text, where it would be there for editing and
> would pass through later processing.
I am averse to the idea of ZCatalog inserting information into
documents for its own purposes. I don't think this is good design, and I
doubt it's very portable.
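
For reference, the zero width non-joiner idea quoted above might look
roughly like the following. This is only a sketch in present-day Python,
not anything ZCatalog does; the names mark_boundaries and split_on_zwnj are
made up for illustration, and it assumes the text has already been
segmented into words by some other step.

```python
ZWNJ = "\u200c"  # U+200C ZERO WIDTH NON-JOINER


def mark_boundaries(words):
    """Join pre-segmented words with ZWNJ so the boundaries are stored in the text."""
    return ZWNJ.join(words)


def split_on_zwnj(text):
    """Split on ordinary whitespace first, then on ZWNJ, dropping empty pieces."""
    tokens = []
    for chunk in text.split():
        tokens.extend(piece for piece in chunk.split(ZWNJ) if piece)
    return tokens


stored = mark_boundaries(["中文", "分词"])   # what would be saved in the document
visible = stored.replace(ZWNJ, "")           # what a reader effectively sees
tokens = split_on_zwnj(stored)               # what the splitter would index
```

Note that the stored string is one character longer than the visible one,
which is exactly the "information inserted into the document" at issue.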
> The index-every-character approach is not preferred, since it can be
> emulated by the previous method if needed.
>
> Also, the catalogue must use Unicode for cross-encoding search. It is well
> known that Han text has many encodings, such as Big5, GB2312, JIS, etc. It
> is good practice to convert all text to Unicode, normalize it, then perform
> the splitting.
Sounds like a NormalizingSplitter of sorts.
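
A minimal sketch of what such a splitter could do, assuming present-day
Python and its standard unicodedata module. The function name
normalizing_split is hypothetical, the choice of NFKC is an assumption, and
splitting on whitespace stands in for whatever segmentation is finally
chosen for Han text.

```python
import unicodedata


def normalizing_split(raw, encoding):
    """Decode bytes to Unicode, normalize, lower-case, then split on whitespace."""
    text = raw.decode(encoding)                 # e.g. "big5", "gb2312", "utf-8"
    text = unicodedata.normalize("NFKC", text)  # fold compatibility forms
    return text.lower().split()


# The same text stored in two different encodings yields the same tokens.
sample = "中文 Zope"
from_big5 = normalizing_split(sample.encode("big5"), "big5")
from_gb = normalizing_split(sample.encode("gb2312"), "gb2312")
```

Because the tokens are compared only after decoding and normalization, a
query typed in one encoding can match documents stored in another.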
> The conversion must be based on language and encoding; however, most HTML
> does not declare its language and encoding. I have seen some encoding
> detection code based on checking for frequently used Han characters, which
> gives a good guess of the encoding.
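
The frequency check described above could be sketched along these lines, in
present-day Python. This is not the detection code the poster saw:
guess_encoding, COMMON_HAN, and CANDIDATES are hypothetical names, and the
character sample is a tiny hand-picked set rather than a real frequency
table.

```python
# Hypothetical sample of very common Han characters (traditional and simplified).
COMMON_HAN = set("的一是不了人我在有他這中大來上國国个们这为")

# Candidate encodings to try, in order of preference on ties.
CANDIDATES = ("big5", "gb2312", "shift_jis")


def guess_encoding(raw, candidates=CANDIDATES):
    """Return the candidate whose decoding contains the most common Han characters."""
    best, best_score = None, 0
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        score = sum(1 for ch in text if ch in COMMON_HAN)
        if score > best_score:
            best, best_score = enc, score
    return best  # None when no candidate produced any common character


guess_gb = guess_encoding("我是中国人".encode("gb2312"))
guess_big5 = guess_encoding("我是中國人".encode("big5"))
```

With real documents the sample would need to be far larger, and short or
mostly ASCII input gives no usable signal, so the result is only a guess.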
Can you post these comments on the interfaces Wiki so they do not get
lost?
http://www.zope.org/Members/michel/Projects/Interfaces/Splitter
-Michel