Re: future searching of Chinese text... (was: that JPython thing)
Moved to zope-dev. chas wrote:
3. support of unicode
Ah, as soon as Python does, we do. Also, there will soon be a Japanese Vocabulary to support Japanese searching of text, and after that we are going to try Chinese. In Zope 2.2, using these as examples, you can create a Vocabulary object for the language of your preference.
Just on this subject:
a) You may find the Perl Mandarin text-splitter at www.mandarintools.com
Yes I found those recently.
very useful. I rewrote it in Python once, but my Perl sucks, so you may wish to do this yourself. Otherwise, e-mail me for a copy... it's so short I'm sure you'd rather not trust mine, though.
I don't know Perl at all, so I'm probably not going to investigate this unless the algorithm is pretty simple. Also, the morphological dictionary looks like it can get big; I'd like to use a more advanced data structure than just a flat-file dictionary. The splitter is used very often, and constantly referring to a flat file would be horrible. I was probably going to use their dictionary to start with.
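[A sketch of the kind of "more advanced data structure" mentioned above: a trie (prefix tree) held in memory. One walk from a given position yields every dictionary word starting there, without rescanning a flat file. This is an illustrative sketch, not code from the thread.]

```python
# Sketch of a trie (prefix tree) as an in-memory segmentation
# dictionary: one walk from a position in the text yields every
# dictionary word that begins there.

class Trie:
    def __init__(self):
        self._root = {}

    def add(self, word):
        """Insert a word, one character per trie level."""
        node = self._root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker

    def prefixes(self, text, start=0):
        """Return all dictionary words beginning at text[start]."""
        matches = []
        node = self._root
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if "$" in node:
                matches.append(text[start:i + 1])
        return matches
```

A splitter then asks the trie for all words starting at the current position and picks one (e.g. the longest), instead of probing a flat file repeatedly.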
b) What algorithm do you use in your searching of text? Is it just a simple frequency tally?
I'm not sure what a simple frequency tally is. A text index is a mapping from a word id to a sequence of (document_id, score) tuples:

o The word id (an integer) is mapped to a word by the Lexicon,

o the document_id (also an integer) is mapped to a 'document' (a Zope object, really) by the Catalog,

o the score is the number of times the word that maps to the word id was found in the document that maps to the document_id.

Text indexes do not make any assumption about what a 'word' is, or how it came to be a word. All language-specific information, such as how to split a document in a certain language into words, is handled by the Lexicon. The splitting of a 'document' into words is done by the Splitter object provided by the Lexicon. The Splitter object is what would need to implement the morphological analysis algorithm for whatever language you are going for. In the case of English, this is very simple and involves splitting a document up at whitespace. The algorithm for Chinese or Japanese is much harder, as you're aware.

-Michel
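[The structure Michel describes can be sketched as follows. Class and method names here are illustrative only, not the actual ZCatalog API.]

```python
# Minimal sketch of the text-index structure described above:
# word_id -> sequence of (document_id, score) pairs, where score
# is the word's frequency in that document.

class Lexicon:
    """Maps words to integer word ids."""
    def __init__(self):
        self._words = {}

    def word_id(self, word):
        # Assign the next integer id the first time a word is seen.
        return self._words.setdefault(word, len(self._words))

class TextIndex:
    def __init__(self, lexicon):
        self._lexicon = lexicon
        self._index = {}  # word_id -> {document_id: score}

    def index_document(self, document_id, words):
        for word in words:
            wid = self._lexicon.word_id(word)
            scores = self._index.setdefault(wid, {})
            scores[document_id] = scores.get(document_id, 0) + 1

    def search(self, word):
        wid = self._lexicon._words.get(word)
        if wid is None:
            return []
        # Highest-scoring documents first.
        return sorted(self._index[wid].items(),
                      key=lambda item: item[1], reverse=True)
```

Note that the index itself is language-neutral: the `words` list handed to `index_document` is whatever the Splitter produced, which is where all the language-specific work lives.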
Hi Michel,
Moved to zope-dev.
Oops, I'm not on zope-dev, so thanks for cc'ing.
very useful. I rewrote it in Python once, but my Perl sucks, so you may wish to do this yourself. Otherwise, e-mail me for a copy... it's so short I'm sure you'd rather not trust mine, though.
I don't know Perl at all so I'm probably not going to investigate this, unless the algorithm is pretty simple.
It is. It's really just a lot of if/elses. (Btw, I also suspect there's a flaw in the logic of one of the loops, but it seems to produce accurate results. I say 'seems' because I don't read much Chinese and had to rely on people telling me whether the results were correct - scary, I know, but several MNCs didn't complain.)
Also, the morphological dictionary looks like it can get big; I'd like to use a more advanced data structure than just a flat-file dictionary. The splitter is used very often, and constantly referring to a flat file would be horrible. I was probably going to use their dictionary to start with.
You're right about the dictionary. My Python scripts took 3-4 seconds to read in the dictionary and create the Python lists/dictionaries. Once it was read in, the splitting was fast. This was acceptable for the indexer/spider but not for the search field. I tried to use a dbf file, then a pickle, but that didn't help much... although I suspect that's because I screwed up somewhere along the line. This is also why one of my first posts to the Zope list (months ago) asked whether it was possible to read this into memory once and once only - i.e., whether External Methods are persistent.
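[The "read once and once only" pattern can be sketched with a module-level cache: since an External Method's module stays loaded between calls, a dictionary read in at import time effectively persists. The file name and one-word-per-line format below are assumptions for illustration, not the actual dictionary format.]

```python
# Sketch: load the segmentation dictionary once, then reuse the
# cached copy on every later call. In Zope, an External Method's
# module stays loaded between requests, so this cache persists.
# File name and format (one word per line) are illustrative.

_WORDS = None  # module-level cache, filled on first use

def _load_dictionary(path):
    words = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            word = line.strip()
            if word:
                words.add(word)
    return words

def get_dictionary(path="segmentation_words.txt"):
    global _WORDS
    if _WORDS is None:          # the 3-4 second cost is paid once
        _WORDS = _load_dictionary(path)
    return _WORDS
```

Every call after the first returns the same in-memory set, so the per-query splitter never touches the flat file again.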
b) What algorithm do you use in your searching of text? Is it just a simple frequency tally?
I'm not sure what a simple frequency tally is.
I made the term up :) I just meant exactly what you wrote below - the score for a document being the frequency of the word in the document. If I were happy with this mechanism, then I'd just be using the ZCatalog for searching. Unfortunately, I need something a little better and, inspired by an article I read about Google, have been researching data mining and other mechanisms for improving accuracy... but I'm starting to realize it's a bit beyond me. :(
Text indexes do not make any assumption about what a 'word' is, or how it came to be a word. All language-specific information, such as how to split a document in a certain language into words, is handled by the Lexicon.
The splitting of a 'document' into words is done by the Splitter object provided by the Lexicon. The Splitter object is what would need to implement the morphological analysis algorithm for whatever language you are going for. In the case of English, this is very simple and involves splitting a document up at whitespace. The algorithm for Chinese or Japanese is much harder, as you're aware.
Yes, and I don't think either of us really wants to rebuild that algorithm. There are several works on the web (mandarintools.com is just one of them) that have dealt with this; this might be of interest: http://casper.beckman.uiuc.edu/~c-tsai4/chinese/wordseg/mmseg.html#Abstract (it's actually moved but, being in China, I can't access geocities/xoom etc.)

chas
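[The baseline behind most of those splitters, including the MMSEG page linked above, is greedy maximum matching: at each position, take the longest dictionary word that starts there. A minimal forward-matching sketch with a toy dictionary follows; real splitters layer disambiguation rules on top, precisely because this greedy version has well-known failure cases.]

```python
# Greedy forward maximum-matching segmentation: at each position,
# take the longest dictionary word starting there; fall back to a
# single character if nothing matches. Toy dictionary only.

def max_match(text, dictionary, max_word_len=4):
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, down to one character.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Classic failure case of pure forward matching:
# max_match("研究生命起源", {"研究", "研究生", "生命", "起源"})
# -> ['研究生', '命', '起源']  (greedily takes 研究生 over 研究 + 生命)
```

This is the "lot of if/elses" flavor of algorithm chas describes; the hard part is not the loop but the disambiguation heuristics around it.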
participants (2)
- chas
- Michel Pelletier