How about some feedback on this? We need some folks outside of the US to flex this muscle a little. ;)
-----Original Message-----
From: Martijn Faassen [mailto:m.faassen@vet.uu.nl]
Sent: Tuesday, September 14, 1999 7:33 AM
Cc: zope-dev@zope.org
Subject: Re: [Zope-dev] Re: [Zope] Need a list of words not indexed by Catalog
Rik Hoekstra wrote:
Terrel Shumway wrote:
near the end of lib/python/SearchIndex/TextIndex.py is a list called 'stop_words'
[Zope Dev] It would be good to move this out of the .py file into an editable, internationalizable resource file.
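A minimal sketch of what such a resource file could look like in use. It assumes a hypothetical plain-text format with one stop word per line and `#` comments; the names `parse_stop_words`, `remove_stop_words`, and the file name `stop_words.nl` are illustrative only and are not part of TextIndex.py.

```python
import io


def parse_stop_words(stream):
    """Read one stop word per line; blank lines and '#' comments are skipped."""
    words = set()
    for line in stream:
        word = line.split("#", 1)[0].strip().lower()
        if word:
            words.add(word)
    return words


def remove_stop_words(tokens, stop_words):
    """Drop tokens that appear in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


# Hypothetical content of a Dutch resource file, e.g. stop_words.nl
dutch_resource = io.StringIO("# Dutch stop words\nde\nhet\neen\n")
dutch_stop_words = parse_stop_words(dutch_resource)
filtered = remove_stop_words(["De", "kat", "en", "het", "huis"], dutch_stop_words)
```

With one file per language, a site that mixes Dutch and English could load both sets and choose between them per document or per index.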
Agreed! And then there's the *multi* lingual issue too. What if I have Dutch and English on my site? [snip] It seems like you run into a _lot_ of complexities with multilingual issues, and still these are real issues for many of us.
Yes, very real issues. Suddenly ZCatalog isn't the almost-ready tool to add searchability to the website I'm building anymore.. Now I need to do quite a bit of extra work, I imagine..
I am thinking heavily about this very problem as we speak. You all correctly pointed out some of the toughest of the problems. Here are my ideas so far:
Have 'vocabulary objects' store the stopwords, synonyms, stemming rules, and lexicon (collection of uniquely indexed words) in a drop-in object for ZCatalog. This way, a 'French', 'Dutch' etc. vocabulary object could be developed by a third party.
TextIndexes can then reference (or acquire) a vocabulary object through which they can stop, syn, stem and store words in its lexicon. There are many other issues, like sharing lexicons between similar language indexes, and having multiple back-end 'index/vocabularies' that all look like one index, so you can search a 'document source' for either 'community' or 'communauté' or 'Gemeinschaft' and get only documents relevant to that language (my apologies if these words are wrong, I'm using babelfish). I think this problem could be intractable though: if you searched for 'walking' in English, the word would stem down into 'walk'; if you search for 'marche' en francais, should it stem down to 'promenade'?
Anyways, there is some good news. For those of you tracking CVS, we have added the ability to set your locale in Zope. This means that, for example, the splitter/stemmer in the catalog will recognize all of those umlauts and accented letters and whatnots that English doesn't have. We would like a few people all over the place to try this out. If your locale has a different language or monetary system than the US (just about everywhere except some of Canada) this might make the catalog and other parts of Zope more useful for you.
Locale can be activated from the z2.py command line with the '-L' option. "-L ''" (an empty string) will cause locale to try to autodetect your locale from your environment variables (you must set the env variables yourself, see 'man 7 locale'). Alternatively, you can say "-L de" and set your locale to German. Please folks, test this out for us.
We don't really have the means to do it here. -Michel _______________________________________________ Zope-Dev maillist - Zope-Dev@zope.org http://www.zope.org/mailman/listinfo/zope-dev (To receive general Zope announcements, see: http://www.zope.org/mailman/listinfo/zope-announce For non-developer, user-level issues, zope@zope.org, http://www.zope.org/mailman/listinfo/zope )
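The 'vocabulary object' idea above can be sketched roughly as follows. This is an illustration under stated assumptions, not ZCatalog code: the class name `Vocabulary`, its methods, and the naive suffix-stripping rule are all hypothetical, chosen only to show the stop, syn, stem, store order Michel describes.

```python
class Vocabulary:
    """Hypothetical per-language bundle: stop words, synonyms, stemming rules, lexicon."""

    def __init__(self, language, stop_words=(), synonyms=None, suffixes=()):
        self.language = language
        self.stop_words = set(stop_words)
        self.synonyms = dict(synonyms or {})
        # Try the longest suffix first so 'ing' is checked before 's'.
        self.suffixes = sorted(suffixes, key=len, reverse=True)
        self.lexicon = {}  # normalized word -> integer word id

    def stem(self, word):
        """Naive suffix stripping; a real stemmer would be language-specific."""
        for suffix in self.suffixes:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def normalize(self, word):
        """Return the stored form of a word, or None if it is a stop word."""
        word = word.lower()
        if word in self.stop_words:
            return None
        word = self.synonyms.get(word, word)
        return self.stem(word)

    def word_id(self, word):
        """Stop, syn, stem, then store the word in the lexicon."""
        normalized = self.normalize(word)
        if normalized is None:
            return None
        return self.lexicon.setdefault(normalized, len(self.lexicon))


english = Vocabulary(
    "en",
    stop_words={"the", "a"},
    synonyms={"stroll": "walk"},
    suffixes=("ing", "s"),
)
ids = [english.word_id(w) for w in ["The", "walking", "walks", "stroll"]]
```

A 'French' or 'Dutch' vocabulary would then be another instance with its own word lists, which is what makes third-party, drop-in development plausible.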
On Wed, 15 Sep 1999, Michel Pelletier wrote:
TextIndexes can then reference (or acquire) a vocabulary object through which they can stop, syn, stem and store words in its lexicon. There are many other issues, like sharing lexicons between similar language indexes, and having multiple back-end 'index/vocabularies' that all look like one index, so you can search a 'document source' for either 'community' or 'communauté' or 'Gemeinschaft' and get only documents relevant to that language (my apologies if these words are wrong, I'm using babelfish). I think this problem could be intractable though: if you searched for 'walking' in English, the word would stem down into 'walk'; if you search for 'marche' en francais, should it stem down to 'promenade'?
I'm not sure the example given here is correct. If you wanted to search for the equivalent of 'walking' in French, it would be 'marchant', the stem of which is 'march'. The correct conjugation of the verb is: marcher (to walk) je marche (I walk) tu marches (you walk) il/elle marche (he/she/it walks) nous marchons (we walk) vous marchez (you walk)
So, at least for this verb, there is a common stem. 'Promenade' is actually a noun, not a verb AFAIK. Off the top of my head I can't think of a verb that would break this pattern, but it's been a while since I've studied French.
Nick Garcia | ngarcia@codeit.com CodeIt Computing | http://codeit.com
On Wed, 15 Sep 1999, Michel Pelletier wrote:
TextIndexes can then reference (or acquire) a vocabulary object through which they can stop, syn, stem and store words in its lexicon. There are many other issues, like sharing lexicons between similar language indexes, and having multiple back-end 'index/vocabularies' that all look like one index, so you can search a 'document source' for either 'community' or 'communauté' or 'Gemeinschaft' and get only documents relevant to that language (my apologies if these words are wrong, I'm using babelfish). I think this problem could be intractable though: if you searched for 'walking' in English, the word would stem down into 'walk'; if you search for 'marche' en francais, should it stem down to 'promenade'?
I'm not sure the example given here is correct. If you wanted to search for the equivalent of 'walking' in French, it would be 'marchant', the stem of which is 'march'. The correct conjugation of the verb is:
marcher (to walk) je marche (I walk) tu marches (you walk) il/elle marche (he/she/it walks) nous marchons (we walk) vous marchez (you walk)
Just to add fuel to the flame: Italian has a set of irregular verbs, so you get this type of irregularity:
andare (to go) Io vado (I go) Tu vai (you go) Lui/Lei va (he/she/it goes) Noi andiamo (we go) Voi andate (you (plural) go) Loro vanno (they go)
So the stem is not exactly clear-cut ('va'? 'and'?). I believe Spanish has this kind of verb as well. Of course, there aren't many such words compared to the rest of the verb structure in the Italian language.
dody italy
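One common way to cope with irregular verbs like this is an exception table that is consulted before any suffix rule. The sketch below uses the standard present-tense forms of andare (vado, vai, va, andiamo, andate, vanno); the names `IRREGULAR_FORMS`, `REGULAR_SUFFIXES`, and `lemma_key` are hypothetical, and the suffix list is deliberately tiny rather than a real Italian stemmer.

```python
# Exception table for irregular forms, consulted before any suffix rule.
IRREGULAR_FORMS = {
    "vado": "andare",
    "vai": "andare",
    "va": "andare",
    "vanno": "andare",
}

# Illustrative endings only; not a complete set of Italian verb suffixes.
REGULAR_SUFFIXES = ("iamo", "ate", "are")


def lemma_key(word):
    """Map a verb form to an index key: exceptions first, then suffix stripping."""
    word = word.lower()
    if word in IRREGULAR_FORMS:
        word = IRREGULAR_FORMS[word]
    for suffix in REGULAR_SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


forms = ["vado", "vai", "va", "andiamo", "andate", "vanno"]
keys = {form: lemma_key(form) for form in forms}
```

Because the irregular set is small compared to the regular verbs, such a table would fit naturally among the stemming rules of a per-language vocabulary.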
So, at least for this verb, there is a common stem. 'Promenade' is actually a noun, not a verb AFAIK.
Off the top of my head I can't think of a verb that would break this pattern, but it's been a while since I've studied French.
Nick Garcia | ngarcia@codeit.com CodeIt Computing | http://codeit.com
Michel Pelletier wrote:
<snip><snip>
I am thinking heavily about this very problem as we speak. You all correctly pointed out some of the toughest of the problems. Here are my ideas so far:
Have 'vocabulary objects' store the stopwords, synonyms, stemming rules, and lexicon (collection of uniquely indexed words) in a drop-in object for ZCatalog. This way, a 'French', 'Dutch' etc. vocabulary object could be developed by a third party.
Since this is a ZCatalog issue, and ZCatalog looks like it might be used a lot in large sites, what is the feasibility of using the Wordnet project's data and data model to enable "Smart Searching" for "future" multilingual searching? The stopwords would be up to Zope, and if the EuroWordnet data was available then that language could be searched just like English.
The EuroWordnet project is creating wordnet databases in something like 21+ languages. (Their main site is not working...)
The Wordnet project's site is: http://www.cogsci.princeton.edu/~wn/
There is a Python API for Wordnet available at: http://www.cs.brandeis.edu/~steele/sources/wordnet-python.html
Just wanted to get the idea out and see what you think. (Since we're all asleep, I might not get any responses :)
David, tone..
<snip>
David Kankiewicz wrote:
Michel Pelletier wrote:
<snip><snip>
I am thinking heavily about this very problem as we speak. You all correctly pointed out some of the toughest of the problems. Here are my ideas so far:
Have 'vocabulary objects' store the stopwords, synonyms, stemming rules, and lexicon (collection of uniquely indexed words) in a drop-in object for ZCatalog. This way, a 'French', 'Dutch' etc. vocabulary object could be developed by a third party.
Since this is a ZCatalog issue, and ZCatalog looks like it might be used a lot in large sites, what is the feasibility of using the Wordnet project's data and data model to enable "Smart Searching" for "future" multilingual searching? The stopwords would be up to Zope, and if the EuroWordnet data was available then that language could be searched just like English.
The EuroWordnet project is creating wordnet databases in something like 21+ languages. (Their main site is not working...)
The Wordnet project's site is: http://www.cogsci.princeton.edu/~wn/
There is a Python API for Wordnet available at: http://www.cs.brandeis.edu/~steele/sources/wordnet-python.html
Just wanted to get the idea out and see what you think. (Since we're all asleep, I might not get any responses :)
It's huge.. wow.. great stuff. One thing bugs me: how do you tell a piece of software which language 'a word' belongs to? An additional syntax is probably needed in the search string for a large site that contains multi-language documents. Something like "via Real Madrid[english]", which would only search for the words in the "english" context. Then again, how do you give the content of your site a language context (documents especially; for HTML files, probably through meta tags), so that the search only needs to be performed on the content that matches the language context?
dody
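A rough sketch of the query syntax suggested above, assuming an optional "[language]" tag at the end of the search string and documents that already carry a language label (for instance taken from a meta tag). The function names and the tuple layout are hypothetical, and the matching is a plain whole-word test rather than anything ZCatalog does.

```python
import re

# Hypothetical syntax: an optional "[language]" tag at the end of the query.
_LANGUAGE_TAG = re.compile(r"^(?P<terms>.*?)\s*\[(?P<language>[A-Za-z-]+)\]\s*$")


def parse_query(query, default_language=None):
    """Split a query into (terms, language); language falls back to the default."""
    match = _LANGUAGE_TAG.match(query)
    if match:
        return match.group("terms").strip(), match.group("language").lower()
    return query.strip(), default_language


def search(documents, query, default_language=None):
    """documents: iterable of (doc_id, language, text) tuples."""
    terms, language = parse_query(query, default_language)
    words = terms.lower().split()
    return [
        doc_id
        for doc_id, doc_language, text in documents
        if (language is None or doc_language == language)
        and all(word in text.lower().split() for word in words)
    ]


docs = [
    (1, "english", "Real Madrid won via a late goal"),
    (2, "italian", "Il Real Madrid gioca in via Roma"),
]
hits = search(docs, "via Real Madrid[english]")
```

Without the tag, both documents match, because 'via' is a word in Italian as well as English; that ambiguity is exactly what the tag is meant to resolve.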
David, tone..
<snip>
participants (4): David Kankiewicz, Dody Gunawinata, Michel Pelletier, Nick Garcia