Re: [Zope-Dev] Some thoughts on splitter (Sin Hang Kin)
I mean portability across other objects that may want to 'use' the document object. If the object gets invisibly transformed, and other objects don't expect this, things will break. Also, unless the user specifically wants their text to be transformed, they may be surprised or angered that their text was normalized to Unicode.
There were two things: 1. inserting the non-joiner to mark the break points of words; 2. the normalization process.

Step 1 really does change the document. But it is still not something ZCatalog is doing: it is up to the content manager to decide whether to do that or not. If he decides to do so, he should prepare the content as required, or make a pre-processor to do it. Only the splitter recognizes the non-joiner as a word break point, just as the splitter recognizes space and tab as word break points. Zope is not making any decision that nobody wants.

Step 2 is performed when making the index, just as you would capitalize the index terms. Nothing changes the original content; when ZCatalog makes the index, it converts the various encodings to Unicode, performs normalization, and optionally does more changes like stemming, synonym combination, etc. But none of this will change the content.

Rgs, Kent Sin
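A minimal sketch of these two steps in modern Python, assuming UTF-8 input and NFKC as the normalization form (both are assumptions for illustration; the thread specifies neither):

    import unicodedata

    ZWNJ = "\u200c"  # the zero-width non-joiner used as the step-1 break-point marker

    def normalize_for_index(raw, encoding="utf-8"):
        # Step 2: decode to Unicode and normalize; the stored document is untouched.
        text = raw.decode(encoding) if isinstance(raw, bytes) else raw
        return unicodedata.normalize("NFKC", text)

    def split_words(text):
        # The splitter treats the non-joiner like space and tab: a word break point.
        return [w for w in text.replace(ZWNJ, " ").split() if w]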
Sin Hang Kin wrote:
There were two things: 1. inserting the non-joiner to mark the break points of words; 2. the normalization process.
Step 1 really does change the document. But it is still not something ZCatalog is doing: it is up to the content manager to decide whether to do that or not. If he decides to do so, he should prepare the content as required, or make a pre-processor to do it. Only the splitter recognizes the non-joiner as a word break point, just as the splitter recognizes space and tab as word break points. Zope is not making any decision that nobody wants.
Oh, I see; in this case you would want a UnicodeSplitter. Keep in mind that the Splitter is an attribute of a Lexicon object, and any number of Lexicons (Vocabularies in the Zope management interface) can be created. To split documents formatted in the way you describe, you would index them with a ZCatalog that used a UnicodeSplitter to split on the non-joiner. I understand what you are getting at now.
Step 2 is performed when making the index, just as you would capitalize the index terms. Nothing changes the original content; when ZCatalog makes the index, it converts the various encodings to Unicode, performs normalization, and optionally does more changes like stemming, synonym combination, etc. But none of this will change the content.
Actually, the index code wouldn't need to change at all. Indexes map 'words' to documents that contain those words, but the indexes themselves don't know anything about the words; they map 'word ids' (integers) to the documents. The object that reverse-maps these word ids to words is, once again, the Lexicon. So your UnicodeLexicon could do the normalization you speak of, *and* provide the UnicodeSplitter that does #1. The class hierarchy would look like this:

    UnicodeVocabulary
          ^
          |
    UnicodeLexicon (provides)--> UnicodeSplitter

Other than coming up with the Splitter and the normalization code, no changes at all would need to be made to the 2.2 ZCatalog to do what you want. This could be shipped as a clean third-party product.

-Michel
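A hypothetical sketch of that hierarchy in Python; the class and method names are illustrative only, not the actual Zope 2.2 API:

    import unicodedata

    class UnicodeSplitter:
        NON_JOINER = "\u200c"

        def split(self, text):
            # Treat the non-joiner and whitespace alike as word break points.
            return text.replace(self.NON_JOINER, " ").split()

    class UnicodeLexicon:
        # Maps words to integer word ids; provides the splitter (#1)
        # and performs the normalization (#2) before assigning ids.
        def __init__(self):
            self._splitter = UnicodeSplitter()
            self._word_to_id = {}

        def getSplitter(self):
            return self._splitter

        def getWordId(self, word):
            word = unicodedata.normalize("NFKC", word)  # NFKC is an assumed choice
            return self._word_to_id.setdefault(word, len(self._word_to_id))

    class UnicodeVocabulary(UnicodeLexicon):
        # The management-interface face of the lexicon.
        pass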
After giving it some thought over the past few days, I came up with some more things re Splitter and Catalog searching in general. I will first post them here and see what feedback people might have, and then put them into the WIKI.

As was pointed out repeatedly, words, word boundaries and the like do not exist in some Asian languages (or writing systems) in the same way as in Western languages. One way to overcome the problems associated with word splitting is to do no word splitting at all and instead split on every character. As soon as ZCatalog starts using Unicode, this could even be incorporated in the default Splitter, which could be told to do word splitting on some character ranges and character splitting on others. (A sketch of this follows the message.) It seems to me that this is the approach generally used on the Web by Asian-language search engines.

To accommodate this, there would have to be some changes to the way searches are done as well. On most search engines, giving a few search terms separated by whitespace means ANDing them for the search, which is fine. If this is not desired, however, most search engines allow the user to use quotes to indicate that the terms should be used as a phrase. Unfortunately, Zope does not support this yet. I think it is highly desirable!!!

If ZCatalog supported this type of search, it could be used for Asian languages: searches for two or more characters would return documents where they occur in sequence.

Does this make any sense?

All the best, Christian
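A sketch of the mixed strategy described above, using the CJK Unified Ideographs block as the character-split range (the range, like everything else here, is an assumption for illustration):

    def is_cjk(ch):
        # CJK Unified Ideographs; a real splitter would cover more ranges.
        return "\u4e00" <= ch <= "\u9fff"

    def split(text):
        tokens, word = [], []
        for ch in text:
            if is_cjk(ch):
                if word:                       # flush any pending Western word
                    tokens.append("".join(word))
                    word = []
                tokens.append(ch)              # each CJK character is its own token
            elif ch.isspace():
                if word:
                    tokens.append("".join(word))
                    word = []
            else:
                word.append(ch)
        if word:
            tokens.append("".join(word))
        return tokens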
Christian Wittern wrote:
After giving it some thought over the past few days, I came up with some more things re Splitter and Catalog searching in general. I will first post them here and see what feedback people might have and then put them into the WIKI.
Excellent.
As was pointed out repeatedly, words, word boundaries and the like do not exist in some Asian languages (or writing systems) in the same way as in Western languages. One way to overcome the problems associated with word splitting is to do no word splitting at all and instead split on every character.
As soon as ZCatalog starts using Unicode,
Keep in mind that the ZCatalog will not use Unicode at all. In fact, the ZCatalog pretty much works with integers the whole time, for efficiency. There is nothing language-specific in ZCatalog. What is language-specific is the Vocabulary object, which has been de-coupled from whence it came, the ZCatalog. Any Asian-language support will not require changing the catalog at all, just creating a new kind of vocabulary. Whether this vocabulary indexes every Chinese character, or whole word patterns deduced from a matching algorithm (or both), is entirely up to the implementation of the Vocabulary object.

I can explain a little further what this concept means. In a high-level sense, an index is a mapping from words to documents that contain those words:

    'foo' -> 13, 22, 42
    'bar' -> 67, 22, 42

The strings are the words, and the integers are the document ids of the documents that contain those words (think of them like page numbers). A text index in Zope does this slightly differently: instead of mapping the word to the document ids, it maps a word id, an integer, to the document ids:

    34 -> 13, 22, 42
    35 -> 67, 22, 42

The 'words' that 34 and 35 map to mean nothing to a Zope text index. So where do word ids come from? The Vocabulary. The Vocabulary maps words to word ids. This way, if you query for:

    "foo AND bar"

the query is 'turned into' "34 AND 35". The word ids are looked up in the Vocabulary, so the Vocabulary contains all of the language-specific semantics of which words map to which word ids. This also gives us a handy way to create synonyms, since you can map more than one word to the same id.

What words 'are' is determined by the Splitter, which is also provided by the Vocabulary object. This is because, like the words themselves being very specific to a language, so are the semantics which define them.
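A toy model of that mapping, reusing the illustrative words and numbers from above:

    # The Vocabulary: word -> word id. Synonyms fall out naturally,
    # since two words may map to the same id.
    vocabulary = {"foo": 34, "bar": 35}

    # The text index: word id -> document ids. It never sees a word.
    index = {34: {13, 22, 42}, 35: {67, 22, 42}}

    def query_and(*words):
        # "foo AND bar" is turned into "34 AND 35" before the index is consulted.
        word_ids = [vocabulary[w] for w in words]
        result = set(index[word_ids[0]])
        for wid in word_ids[1:]:
            result &= index[wid]
        return result

    # query_and("foo", "bar") -> {22, 42}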
this could even be incorporated in the default Splitter, which could be told to do word splitting on some character ranges and character splitting on others.
The default splitter will probably remain fairly simple, really just a configurable core splitter in C. A Unicode splitter could, in the future, subclass the default splitter and add Unicode-splitting awareness.
It seems to me that this is the approach generally used on the Web by Asian-language search engines.
To accommodate this, there would have to be some changes to the way searches are done as well. On most search engines, giving a few search terms separated by whitespace means ANDing them for the search, which is fine.
Oh, OK. I can see how this is not ideal, because it could falsely match other words that contain your search characters in a different order.
If this is not desired, however, most search engines allow the user to use quotes to indicate that the terms should be used as a phrase. Unfortunately, Zope does not support this yet. I think it is highly desirable!!!
We did too, which is why text indexes do support phrase matching with quotes. This is holdover code from ZTables and I did not write it or change it at all, so maybe it is broken? Have you tested it? Just search for "a phrase".
If ZCatalog supported this type of search, it could be used for Asian languages: searches for two or more characters would return documents where they occur in sequence.
Does this make any sense?
Yes, I can see how this rather handily gets around needing an expensive up-front parsing into semantic chunks, the equivalent of Asian 'words'. This would actually not be difficult to implement at all. What, then, is the benefit of pre-parsing documents into semantically defined 'words' instead of just indexing sequences of characters? The only one I can think of is index space, since the vocabulary and the number of index references would go down quite a bit with some up-front smart processing.

-Michel
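A sketch of how phrase search over character tokens could find those in-sequence occurrences, assuming the index keeps token positions per document (whether the 2.2 index stores positions is not established in this thread):

    def phrase_match(positions, tokens):
        # positions: {doc_id: {token: set of positions in that document}}
        # A document matches only where the tokens occur consecutively.
        hits = set()
        for doc, posmap in positions.items():
            for start in posmap.get(tokens[0], set()):
                if all(start + i in posmap.get(tok, set())
                       for i, tok in enumerate(tokens[1:], 1)):
                    hits.add(doc)
                    break
        return hits

    # Characters indexed at positions 4 and 5 of a document match the
    # two-character phrase; the same characters at positions 4 and 9 do not.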
Michel Pelletier wrote:
Christian Wittern wrote:
As soon as ZCatalog starts using Unicode,
Keep in mind that the ZCatalog will not use Unicode at all. In fact, the ZCatalog pretty much works with integers the whole time, for efficiency. There is nothing language-specific in ZCatalog.
[very interesting explanation deleted ... ]
What words 'are' is determined by the Splitter, which is also provided by the Vocabulary object. This is because, like the words themselves being very specific to a language, so are the semantics which define them.
I see much more clearly now. I started looking at the source in CVS, but it somehow differs from what I see in 2.1.6 on Windows. Have there been changes in this area? What files should I look at?
To accommodate this, there would have to be some changes to the way searches are done as well. On most search engines, giving a few search terms separated by whitespace means ANDing them for the search, which is fine.
Oh, OK. I can see how this is not ideal, because it could falsely match other words that contain your search characters in a different order.
Right, but it is not that much of a problem; I think the interface could take care of that.
If this is not desired, however, most search engines allow the user to use quotes to indicate that the terms should be used as a phrase. Unfortunately, Zope does not support this yet. I think it is highly desirable!!!
We did too, which is why text indexes do support phrase matching with quotes. This is holdover code from ZTables and I did not write it or change it at all, so maybe it is broken? Have you tested it? Just search for "a phrase".
I don't think "a phrase" would work, because 'a' is a stopword. On 2.1.6 I created a text index as described in the ZCatalog Howto, but this type of search does not seem to work, even in cases where no stopwords are involved.
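A sketch of the suspected failure mode: if stopwords are dropped at index time, a quoted phrase containing one can never match (the stopword list is assumed for illustration):

    STOPWORDS = {"a", "the", "of", "is"}  # an assumed stopword list

    def index_tokens(text):
        # Stopwords never reach the index.
        return [w for w in text.lower().split() if w not in STOPWORDS]

    tokens = index_tokens("This is a phrase worth finding")
    # tokens == ['this', 'phrase', 'worth', 'finding']: the sequence
    # ('a', 'phrase') no longer exists anywhere, so the quoted search
    # "a phrase" cannot match even though the source text contains it.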
If ZCatalog supported this type of search, it could be used for Asian languages: searches for two or more characters would return documents where they occur in sequence.
Does this make any sense?
Yes, I can see how this rather handily gets around needing an expensive up-front parsing into semantic chunks, the equivalent of Asian 'words'. This would actually not be difficult to implement at all. What, then, is the benefit of pre-parsing documents into semantically defined 'words' instead of just indexing sequences of characters? The only one I can think of is index space, since the vocabulary and the number of index references would go down quite a bit with some up-front smart processing.
Yes, the index space would go down, but the word list in the splitter would have to be pretty big, and it would be very domain-specific. The only problem with the approach described above is that you could find things that are not actually words: when you look for two characters, you may get a result where they are merely a substring of a longer word. -- This could be regarded as a bug or a feature :-)

All the best, Christian