Some thoughts on splitter
Yes! The pre-process approach is heading the right way. However, I would like to suggest not inserting spaces, but some code that does not alter the display. It seems that Unicode has set aside a code point for this, called the zero-width non-joiner. The code can then be stored with the text, survive editing, and go through later processing.

The index-every-char approach is not preferred, since it can be emulated by the previous method if needed.

Also, the catalogue must use Unicode for cross-encoding search. It is well known that Han text has many encodings: Big5, GB2312, JIS, etc. It is good practice to convert all text to Unicode, normalize it, then perform the splitting. The conversion must be based on language and encoding; however, most HTML documents do not declare their language and encoding. I have seen some encoding-detection code based on checking the frequently used Han characters, which gives a good guess of the encoding.

Rgs, Kent Sin
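[As a rough sketch of the combined proposal in Python — the function name and details are my own illustration, not ZCatalog code: decode, normalize to Unicode, then split on whitespace and the zero-width non-joiner (U+200C).]

```python
import re
import unicodedata

ZWNJ = "\u200c"  # zero-width non-joiner: invisible in display, marks a word break

def split_words(text, encoding="utf-8"):
    """Decode to Unicode, normalize, then split on whitespace or ZWNJ."""
    if isinstance(text, bytes):
        text = text.decode(encoding)
    # Normalization ensures the same character always maps to one code sequence.
    text = unicodedata.normalize("NFC", text)
    # Treat runs of whitespace or zero-width non-joiners as break points.
    return [w for w in re.split("[\\s" + ZWNJ + "]+", text) if w]
```

[For example, `split_words("中文\u200c分詞 test")` yields `["中文", "分詞", "test"]` — the invisible non-joiner acts exactly like a space for the splitter.]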
Sin Hang Kin wrote:
Yes! The pre-process approach is heading the right way.
However, I would like to suggest not inserting spaces, but some code that does not alter the display. It seems that Unicode has set aside a code point for this, called the zero-width non-joiner. The code can then be stored with the text, survive editing, and go through later processing.
I'm averse to the idea of ZCatalog inserting information into documents for its own purposes. I don't think this is good design, and I doubt it's very portable.
The index-every-char approach is not preferred, since it can be emulated by the previous method if needed.
Also, the catalogue must use Unicode for cross-encoding search. It is well known that Han text has many encodings: Big5, GB2312, JIS, etc. It is good practice to convert all text to Unicode, normalize it, then perform the splitting.
Sounds like a NormalizingSplitter of sorts.
The conversion must be based on language and encoding; however, most HTML documents do not declare their language and encoding. I have seen some encoding-detection code based on checking the frequently used Han characters, which gives a good guess of the encoding.
Can you post these comments on the interfaces Wiki so they do not get lost? http://www.zope.org/Members/michel/Projects/Interfaces/Splitter -Michel
I'm averse to the idea of ZCatalog inserting information into documents for its own purposes. I don't think this is good design, and I doubt it's very portable.
It is not ZCatalog that makes the insertion; that is a job for the pre-processor. The splitter only needs to recognize the character as a break point. Moreover, it is defined by Unicode as a non-joiner, so I don't see any portability issue here. Just follow what Unicode says.
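[The division of labor Kent describes might look like this — a hedged sketch, where `segment` stands in for a hypothetical language-specific word segmenter supplied by the pre-processor:]

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner; does not alter the display

def preprocess(text, segment):
    """Pre-processor: insert ZWNJ between the words found by a
    language-specific segmenter (`segment` is hypothetical),
    leaving the displayed text unchanged."""
    return ZWNJ.join(segment(text))

def split(text):
    """Splitter: merely recognizes ZWNJ (and whitespace) as break points.
    It never inserts anything itself."""
    return [w for w in re.split("[\\s" + ZWNJ + "]+", text) if w]
```

[With a toy segmenter, `preprocess("中文分詞", lambda s: ["中文", "分詞"])` stores the text with an invisible break, and `split` later recovers exactly the two words.]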
Sounds like a NormalizingSplitter of sorts. http://www.zope.org/Members/michel/Projects/Interfaces/Splitter
I am new to this: I do not see any way to input new info, just a jump text box. Rgs, Kent Sin
Sin Hang Kin wrote:
I'm averse to the idea of ZCatalog inserting information into documents for its own purposes. I don't think this is good design, and I doubt it's very portable.
It is not ZCatalog that makes the insertion; that is a job for the pre-processor. The splitter only needs to recognize the character as a break point. Moreover, it is defined by Unicode as a non-joiner, so I don't see any portability issue here. Just follow what Unicode says.
I mean portability across other objects that may want to 'use' the document object. If the object gets invisibly transformed, and other objects don't expect this, things will break. Also, unless the user specifically wants their text to be transformed, they may be surprised/angered that their text was normalized to Unicode. Absolutely there should be a Splitter that understands Unicode, but there should also be a splitter that does not. This is the idea behind having different vocabulary objects for different languages, because they all have different needs.

I'm still a bit lost on what the non-joiner is meant for. I understand that it is used in a document to divide words for languages that do not have a discrete word-division character (like whitespace), and I understand that if a Unicode-aware Splitter encountered it, it should split on that character. But I don't understand why the Splitter (pre-processor) should actively insert the character. I think that if the character was not there to begin with, then it is a sneaky transformation to insert that character for the purposes of cataloging.
Sounds like a NormalizingSplitter of sorts. http://www.zope.org/Members/michel/Projects/Interfaces/Splitter
I am new to this: I do not see any way to input new info, just a jump text box.
It's simple. Log into Zope.org (as your member account). Go to that page, click on 'Edit this page', edit the page, and click 'Change'. That's it. Just add your comments. Yes, you can wipe the whole thing and cause havoc if you want, but the Wiki is meant to encourage trust. Try it out and put your comments in there; otherwise they won't get captured, and when these issues are worked on your ideas will not be considered. -Michel
However, I would like to suggest not inserting spaces, but some code that does not alter the display. It seems that Unicode has set aside a code point for this, called the zero-width non-joiner. The code can then be stored with the text, survive editing, and go through later processing.
This is a good idea, although it might be a bit cumbersome to handle: a text filled with these entities will be quite hard to read. For that reason, I thought of using spaces. The display engine could strip out single spaces and reduce sequences of multiple spaces by one.
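[Christian's space-stripping rule could be sketched like so — my own illustration, not existing Zope code: a single space is treated as an inserted word break and removed, while a run of two or more spaces loses exactly one.]

```python
import re

def display(text):
    """Strip single spaces (assumed to be inserted word breaks) and
    shorten any run of two or more spaces by one, so that a user's
    intentional spacing survives."""
    return re.sub(" +", lambda m: " " * (len(m.group()) - 1), text)
```

[So `display("中 文 分 詞")` gives `"中文分詞"`, while `display("a  b")` gives `"a b"` — the deliberate double space comes back as a single one.]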
The index-every-char approach is not preferred, since it can be emulated by the previous method if needed.
I agree.
Also, the catalogue must use Unicode for cross-encoding search. It is well known that Han text has many encodings: Big5, GB2312, JIS, etc. It is good practice to convert all text to Unicode, normalize it, then perform the splitting.
I think the encoding should be the responsibility of the user -- after all, he knows what he wants to do. Sites that use more than one Han encoding could go with Unicode; other sites might prefer to use the local encoding, since there are many more tools that can be used. If Zope starts normalizing the text, some users might be surprised by the results. But of course, it should certainly be possible to use Unicode! For this we will have to wait for Python 1.6, though.
The conversion must be based on language and encoding; however, most HTML documents do not declare their language and encoding. I have seen some encoding-detection code based on checking the frequently used Han characters, which gives a good guess of the encoding.
Right, but then the user should declare the language. As I explained, if this is inherited like acquisition, they might get away with just one declaration for a whole site. All the best, Christian
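[The frequency-based detection Kent mentions could be sketched roughly as follows — a hedged illustration: the ten-character sample and the candidate list are mine, and real detectors use much larger frequency tables.]

```python
def guess_encoding(data, candidates=("utf-8", "big5", "gb2312")):
    """Guess an encoding by scoring how many very common Han characters
    each candidate decoding yields; return the best-scoring candidate,
    or None if nothing decodes."""
    common = set("的一是不了在人有我他")  # illustrative high-frequency characters
    best, best_score = None, 0
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # this encoding cannot even decode the bytes
        score = sum(1 for ch in text if ch in common)
        if score > best_score:
            best, best_score = enc, score
    return best
```

[The wrong encoding either fails to decode or produces characters that almost never land in the high-frequency set, so the right one wins the score.]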
participants (3)
- Christian Wittern
- Michel Pelletier
- Sin Hang Kin