What catalog/index to use ...
Hi!
Currently there are at least three options for doing full text indexing with ZCatalog:
- good old TextIndex
- ZCTextIndex
- TextIndexNG
TextIndex basically works fine for me and handles German umlauts well (if you use the right locale settings in the Zope start script), but ZCTextIndex is generally better, except that it does not handle umlauts correctly as far as I can see. So without a bug fix ZCTextIndex is good for US, but not for us ;-)
Then there is Andreas Jung's TextIndexNG, which seems to be really impressive.
What are the plans for Zope 2.6.x/2.7? Will ZCTextIndex be replaced by TextIndexNG?
Does it make sense to get ZCTextIndex fixed (there seems to be a patch in the collector already) or should I go with TextIndexNG? If yes, is it ready for production environments?
Cheers
Joachim
_________________________
Joachim Werner
iuveno AG Wittelsbacherstraße 23b 90475 Nürnberg
Tel. +49 (0) 911 / 988398-4 Fax +49 (0) 911 / 988398-5
Mail: joachim.werner@iuveno.de WWW: http://www.iuveno.de
--On Friday, 8 November 2002 01:44 +0100 Joachim Werner <joe@iuveno-net.de> wrote:
What are the plans for Zope 2.6.x/2.7? Will ZCTextIndex be replaced by TextIndexNG?
No, they will coexist. ZCTextIndex is maintained by Zope Corp, TextIndexNG is maintained by myself.
Does it make sense to get ZCTextIndex fixed (there seems to be a patch in the collector already) or should I go with TextIndexNG? If yes, is it ready for production environments?
Depends on your needs. ZCTextIndex is very easy to use and supports relevance ranking, TextIndexNG is supposed to be some kind of eier-legende-wollmilch-sau [German idiom: an all-in-one solution that does everything]. Compare the features and make your choice.
-aj
Depends on your needs. ZCTextIndex is very easy to use and supports relevance ranking, TextIndexNG is supposed to be some kind of eier-legende-wollmilch-sau. Compare the features and make your choice.
-aj
isn't TextIndexNG much better with international character encodings and that stuff? and it has a lot more stemmers for various languages.
jens
In the original design of ZCTextIndex we (PythonLabs mostly) considered stemming and found that many information theorists regard it as having dubious value. (The fact that Google does no stemming was also a factor in the decision.) So we decided to leave it out entirely.
ZCTextIndex is extensible, and third parties can add additional text processing facilities (called pipeline elements) to the system without modifying ZCTextIndex. This could be a way to add stemming and any other conceivable feature involving preprocessing the index source and query text.
Granted, that feature could use better(!) documentation... (I should just add that to my email sig ;^)
-Casey
On Friday 08 November 2002 08:19 am, Jens Vagelpohl wrote:
Depends on your needs. ZCTextIndex is very easy to use and supports relevance ranking, TextIndexNG is supposed to be some kind of eier-legende-wollmilch-sau. Compare the features and make your choice.
-aj
isn't TextIndexNG much better with international character encodings and that stuff? and it has a lot more stemmers for various languages.
jens
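A minimal sketch of the pipeline idea Casey describes, in plain Python with no Zope imports. The class names, the `process` method, and the toy suffix-stripping rule are illustrative assumptions, not ZCTextIndex's actual code. The point is only that the same chain of elements is applied to both the indexed text and the query text, so an added preprocessing step such as stemming affects both sides consistently.

```python
class WhitespaceSplitter:
    """Hypothetical element: turns each input string into whitespace-separated words."""

    def process(self, items):
        words = []
        for item in items:
            words.extend(item.split())
        return words


class CaseNormalizer:
    """Hypothetical element: lowercases every word."""

    def process(self, items):
        return [word.lower() for word in items]


class ToySuffixStripper:
    """Hypothetical stand-in for a stemmer: strips a few English suffixes.

    This is NOT a real stemming algorithm, only a placeholder third-party element.
    """

    SUFFIXES = ("ing", "es", "s")

    def process(self, items):
        return [self._strip(word) for word in items]

    def _strip(self, word):
        for suffix in self.SUFFIXES:
            # Only strip when a reasonably long stem remains.
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word


def run_pipeline(elements, text):
    """Feed the text through every element in order."""
    items = [text]
    for element in elements:
        items = element.process(items)
    return items


PIPELINE = [WhitespaceSplitter(), CaseNormalizer(), ToySuffixStripper()]

# The same pipeline is used at indexing time and at query time.
indexed_terms = run_pipeline(PIPELINE, "Indexing Catalogs")
query_terms = run_pipeline(PIPELINE, "catalog index")
print(indexed_terms, query_terms)
```

Because both sides pass through the same elements, a query for "catalog index" yields the same terms that were stored for "Indexing Catalogs".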
ZCTextIndex is to become the full replacement for the old TextIndex. There are a couple of outstanding patches for making the ZCTextIndex splitter, etc. locale friendly. Whether those solve your problem I don't know.
We are happy to improve ZCTextIndex for international use; however, we at Zope Corp are not authorities in the matter, so we will require some help from those who are. If you can create a collector issue that illustrates the problems you have experienced (perhaps posting some sample content), that would be a great start.
-Casey
On Thursday 07 November 2002 07:44 pm, Joachim Werner wrote:
Hi!
Currently there are at least three options for doing full text indexing with ZCatalog:
- good old TextIndex
- ZCTextIndex
- TextIndexNG
TextIndex basically works fine for me and handles German umlauts well (if you use the right locale settings in the Zope start script), but ZCTextIndex is generally better, except that it does not handle umlauts correctly as far as I can see. So without a bug fix ZCTextIndex is good for US, but not for us ;-)
Then there is Andreas Jung's TextIndexNG, which seems to be really impressive.
What are the plans for Zope 2.6.x/2.7? Will ZCTextIndex be replaced by TextIndexNG?
Does it make sense to get ZCTextIndex fixed (there seems to be a patch in the collector already) or should I go with TextIndexNG? If yes, is it ready for production environments?
Cheers
Joachim
_________________________
Joachim Werner
iuveno AG Wittelsbacherstraße 23b 90475 Nürnberg
Tel. +49 (0) 911 / 988398-4 Fax +49 (0) 911 / 988398-5
Mail: joachim.werner@iuveno.de WWW: http://www.iuveno.de
_______________________________________________ Zope-Dev maillist - Zope-Dev@zope.org http://lists.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://lists.zope.org/mailman/listinfo/zope-announce http://lists.zope.org/mailman/listinfo/zope )
Joachim Werner wrote:
Does it make sense to get ZCTextIndex fixed (there seems to be a patch in the collector already) or should I go with TextIndexNG? If yes, is it ready for production environments?
hi,
I've submitted a patch for locale-support for ZCTextIndex:
http://collector.zope.org/Zope/597
I run this patch on a production site [most content is german] without any problems. I think, if all test-cases for ZCTextIndex succeed with this patch, it should be merged into the next official release so all european zopers can use ZCTextIndex without patching it... in my opinion: high priority for this one done :-)
cheers, maik
The main reason I have not merged this already is that I lack a sample to make a new test with. If someone can provide me with some content samples that break now but work with the patch, I will make a new test and check in the fix for 2.7, and perhaps 2.6.1 if desired.
-Casey
On Friday 08 November 2002 11:27 am, Maik Jablonski wrote:
Joachim Werner wrote:
Does it make sense to get ZCTextIndex fixed (there seems to be a patch in the collector already) or should I go with TextIndexNG? If yes, is it ready for production environments?
hi,
I've submitted a patch for locale-support for ZCTextIndex:
http://collector.zope.org/Zope/597
I run this patch on a production site [most content is german] without any problems. I think, if all test-cases for ZCTextIndex succeed with this patch, it should be merged into the next official release so all european zopers can use ZCTextIndex without patching it... in my opinion: high priority for this one done :-)
cheers, maik
Casey Duncan wrote:
The main reason I have not merged this already is that I lack a sample to make a new test with. If someone can provide me with some content samples that break now, but work with the patch, I will make a new test and checkin the fix for 2.7 perhaps 2.6.1 if desired.
-Casey
hi Casey,
try some words with German umlauts, things like:
mülltonne waschbär behörde überflieger
the last one will work without the patch. explanation: the first character is split off [non-ascii-character] [both for storing the word in the Lexicon and resolving the query-words through the queryparser]. so it will in both cases end up as
berflieger
and searching for 'überflieger' will give you correct results. this is the reason why some people think that ZCTextIndex works with German umlauts, but it does not...;-)
hope this is helpful.
cheers, maik
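The effect Maik describes can be reproduced with a small sketch. It assumes a splitter that treats only ASCII letters, digits, and underscore as word characters; the pattern and the function name `split_ascii` are stand-ins for illustration, not the actual ZCTextIndex splitter code. Python 3's `re.ASCII` flag is used here to get that ASCII-only behaviour.

```python
import re

# Assumption: an ASCII-only notion of "word character", standing in for the
# unpatched splitter's behaviour described in the message above.
ASCII_WORD = re.compile(r"\w+", re.ASCII)


def split_ascii(text):
    """Split text into words, treating every non-ASCII character as a separator."""
    return ASCII_WORD.findall(text.lower())


for word in ["mülltonne", "waschbär", "behörde", "überflieger"]:
    # Words with an umlaut in the middle break into two fragments;
    # a leading umlaut is simply dropped, leaving one fragment.
    print(word, "->", split_ascii(word))
```

A word with a leading umlaut loses only its first character, so the indexed term and the query term are mangled in the same way and still match each other. Words with an umlaut in the middle are broken into two separate terms.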
Hi!
Some additional remarks: While making the splitting dependent on the locale settings (as done in the old TextIndex) helps with most use cases, I'm not sure if that is the right thing to do in the long run. Locale settings are good for client software, i.e. if you want to have a program behave German for a German user etc.
But a web server might be located in the U.S., but frequented by German-speaking and Spanish-speaking users. Or even users from China or Japan. In these cases only Unicode will help, I think. After all, you cannot have more than one locale at a time. But honestly I still don't understand the Unicode thing well enough. My main concern is whether a Unicode-enabled site will still work with older browsers and for all platforms ...
Casey Duncan wrote:
The main reason I have not merged this already is that I lack a sample to make a new test with. If someone can provide me with some content samples that break now, but work with the patch, I will make a new test and checkin the fix for 2.7 perhaps 2.6.1 if desired.
-Casey
hi Casey,
try some words with "german umlaute". things like:
mülltonne waschbär behörde überflieger
the last one will work without the patch. explanation: the first character is split off [non-ascii-character] [both for storing the word in the Lexicon and resolving the query-words through the queryparser]. so it will in both cases end up in
berflieger
searching for 'überflieger' will give you correct results. this is the reason, why some people think, that ZCTextIndex works with german 'umlaute', but it does not...;-)
This patch will probably not hurt anybody. And it would make ZCTextIndex behave like TextIndex.
OT: I don't want to be too pedantic about that, but usually I'd expect a replacement to really replace all of the functionality of the thing it replaces. TextIndex was locale-aware (and this was even documented somewhere), so switching to ZCTextIndex should not break anything, at least not in a Zope final.
But that's what I've told you all the time: Why do you make things final releases before they are really tested? 2.6.0 has a really bad bug with the DateTime module (Lennart Regebro has provided a fix for it: http://www.zope.org/Members/regebro/datetime_260_fix) that was introduced after 2.6.0b1. This just shouldn't be possible ...
Anybody listening? ;-)
Cheers
Joachim
--On Friday, 8 November 2002 19:49 +0100 Joachim Werner <joe@iuveno-net.de> wrote:
Hi!
Some additional remarks: While making the splitting dependent on the locale settings (as done in the old TextIndex) helps with most use cases, I'm not sure if that is the right thing to do in the long run. Locale settings are good for client software, i.e. if you want to have a program behave German for a German user etc.
But a web server might be located in the U.S., but frequented by German-speaking and Spanish-speaking users. Or even users from China or Japan. In these cases only Unicode will help, I think. After all, you cannot have more than one locale at a time. But honestly I still don't understand the Unicode thing well enough. My main concern is whether a Unicode-enabled site will still work with older browsers and for all platforms ...
Please note that earlier Zope versions already include a dedicated Unicode-aware splitter that is usable with the old TextIndex and maybe with ZCTextIndex. TextIndexNG resolves all these issues by converting the data to Unicode and doing all internal processing on it. Every single processing step handles only Unicode data.
Most older browsers should be able to handle at least UTF-8 as a character set. This is sufficient for most cases.
=aj
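The "convert to Unicode first, then process only Unicode" approach can be sketched as follows. This is a generic Python 3 illustration, not TextIndexNG's code; the function names and sample strings are made up for the example. Input arrives as bytes in some known encoding, is decoded once at the boundary, and every later step works on `str`, so no locale setting is involved.

```python
import re

# In Python 3, \w in a str pattern is Unicode-aware by default.
WORD = re.compile(r"\w+")


def to_unicode(raw, encoding):
    """Decode incoming bytes once, at the boundary of the indexing code."""
    return raw.decode(encoding)


def index_terms(raw, encoding):
    """Return lowercase index terms; everything after decoding handles only str."""
    text = to_unicode(raw, encoding)
    return [word.lower() for word in WORD.findall(text)]


# Two documents in different encodings and languages on the same server.
german = "Überflieger und Mülltonne".encode("latin-1")
spanish = "El niño pequeño".encode("utf-8")

print(index_terms(german, "latin-1"))
print(index_terms(spanish, "utf-8"))
```

Because the encoding is supplied per document rather than taken from a process-wide locale, German and Spanish content can be indexed side by side.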
Hi!
Please note that former Zope versions already include a dedicated unicode-aware splitter that is already usable with the old TextIndex and maybe with ZCTextIndex. TextIndexNG resolves all these issues by doing the complete internal processing by converting the data into unicode. Every single processing step only handles unicode data.
Most older browsers should be able to handle at least UTF-8 as character set. This is sufficient for most cases.
The problem seems to be that ZCTextIndex indeed does not do the splitting "right" if German Umlauts are used. There is no option for "Unicode-aware splitter". Instead of a Vocabulary it uses a Lexicon, which just offers two options: "HTML aware splitter" and "Whitespace splitter". I haven't tested the whitespace splitter yet, but the HTML aware splitter did not do the Umlaut thing right without the patch, i.e. it used umlauts as splitting characters ...
So there is a bug ...
Joachim
The problem seems to be that ZCTextIndex indeed does not do the splitting "right" if German Umlauts are used. There is no option for "Unicode-aware splitter". Instead of a Vocabulary it uses a Lexicon, which just offers two options: "HTML aware splitter" and "Whitespace splitter". I haven't tested the whitespace splitter yet, but the HTML aware splitter did not do the Umlaut thing right without the patch, i.e. it used umlauts as splitting characters ...
That's just what the default ZMI interface for ZCTextIndex offers. It's easy to add your own splitter by writing a few lines of Python code. RTSL.
--Guido van Rossum (home page: http://www.python.org/~guido/)
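As a rough idea of what "a few lines of Python" for a custom splitter might look like, here is a standalone sketch. The class name is hypothetical, and the `process` / `processGlob` method names are an assumption about the shape of a splitter element, so check the ZCTextIndex source before relying on them. Registering the class so that it shows up in the ZMI is not shown.

```python
import re


class UnicodeSplitter:
    """Hypothetical splitter element that keeps non-ASCII letters inside words."""

    # For str patterns in Python 3, \w matches Unicode word characters.
    _word = re.compile(r"\w+")
    # Variant for query text that keeps the wildcard characters * and ?.
    _word_glob = re.compile(r"[\w*?]+")

    def process(self, items):
        """Split each string in items into words."""
        result = []
        for text in items:
            result.extend(self._word.findall(text))
        return result

    def processGlob(self, items):
        """Split query strings into words, preserving wildcards."""
        result = []
        for text in items:
            result.extend(self._word_glob.findall(text))
        return result


splitter = UnicodeSplitter()
print(splitter.process(["Die Behörde prüft den Waschbär."]))
print(splitter.processGlob(["müll*"]))
```

Unlike a per-language splitter, this one relies on Unicode character classes, so the same element covers German, French, and other scripts as long as the text is already decoded.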
Guido van Rossum wrote:
The problem seems to be that ZCTextIndex indeed does not do the splitting "right" if German Umlauts are used. There is no option for "Unicode-aware splitter". Instead of a Vocabulary it uses a Lexicon, which just offers two options: "HTML aware splitter" and "Whitespace splitter". I haven't tested the whitespace splitter yet, but the HTML aware splitter did not do the Umlaut thing right without the patch, i.e. it used umlauts as splitting characters ...
That's just what the default ZMI interface for ZCTextIndex offers. It's easy to add your own splitter by writing a few lines of Python code. RTSL.
of course everyone can write his own Splitter... one for German, one for French, etc. but what is the problem with the patch? isn't Python's regexp flag (?L) intended for exactly this simple way of "localizing" software?
and think of the european market:
no one will "buy" Zope if it does not work with their native language out of the box. and that's what the patch is for...
cheers, maik
-- Maik Jablonski __o www.zfl.uni-bielefeld.de _ \<_ Deutsche Zope User Group Bielefeld, Germany (_)/(_) www.dzug.org
The problem seems to be that ZCTextIndex indeed does not do the splitting "right" if German Umlauts are used. There is no option for "Unicode-aware splitter". Instead of a Vocabulary it uses a Lexicon, which just offers two options: "HTML aware splitter" and "Whitespace splitter". I haven't tested the whitespace splitter yet, but the HTML aware splitter did not do the Umlaut thing right without the patch, i.e. it used umlauts as splitting characters ...
That's just what the default ZMI interface for ZCTextIndex offers. It's easy to add your own splitter by writing a few lines of Python code. RTSL.
of course everyone can write his own Splitter... one for German, one for French, etc. but what is the problem with the patch? isn't Python's regexp flag (?L) intended for exactly this simple way of "localizing" software?
and think of the european market:
no one will "buy" Zope if it does not work with their native language out of the box. and that's what the patch is for...
I must've missed the start of this thread (I only just signed up for this list). I didn't see any patch -- I thought it was just a gripe about ZCTextIndex. Of course patches are welcome -- where can I find this particular patch?
--Guido van Rossum (home page: http://www.python.org/~guido/)
I must've missed the start of this thread (I only just signed up for this list). I didn't see any patch -- I thought it was just a gripe about ZCTextIndex. Of course patches are welcome -- where can I find this particular patch?
Hi Guido!
I don't know where you would expect a patch to be found, but in this particular case the Zope Collector is a good place to look:
http://collector.zope.org/Zope/597
Use the collector, Luke! ;-)
Joachim
P.S.: I guess most of the people on the zope-dev list have some clue on how to write their own splitters, but the message of my "gripe" was that something worked o.k. (for the dumb end user) with the old TextIndex and doesn't with the thing that is advertised on the Add form as the replacement, and that just isn't cool.
I don't know where you would expect a patch to be found, but in this particular case the Zope Collector is a good place to look:
http://collector.zope.org/Zope/597
Use the collector, Luke! ;-)
Um, that's not a patch. Can you attach a context or unified diff to the collector item?
Joachim
P.S.: I guess most of the people on the zope-dev list have some clue on how to write their own splitters, but the message of my "gripe" was that something worked o.k. (for the dumb end user) with the old TextIndex and doesn't with the thing that is advertised on the Add form as the replacement, and that just isn't cool.
Indeed.
--Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
I don't know where you would expect a patch to be found, but in this particular case the Zope Collector is a good place to look:
http://collector.zope.org/Zope/597
Use the collector, Luke! ;-)
Um, that's not a patch. Can you attach a context or unified diff to the collector item?
sorry for that. but the only difference between old and new code is the (?L)-flag in the reg-exps.
cheers, maik
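To illustrate what the `(?L)` flag changes, here is a hedged sketch for current Python 3, where `(?L)` is only allowed on bytes patterns. The pattern is a simplified stand-in, not the regular expression from the patch, and the locale name `de_DE.ISO8859-1` is an assumption that may not exist on a given system. With `(?L)`, the meaning of `\w` follows the process's `LC_CTYPE` locale at matching time.

```python
import locale
import re

# (?L) makes \w locale-dependent; in Python 3 this requires a bytes pattern.
WORD = re.compile(rb"(?L)\w+")


def split_words(data, locale_name):
    """Split Latin-1 encoded bytes into words under the given LC_CTYPE locale."""
    previous = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
        return WORD.findall(data)
    finally:
        # Always restore the locale, since it is process-wide state.
        locale.setlocale(locale.LC_CTYPE, previous)


text = "überflieger".encode("latin-1")

# In the "C" locale only ASCII letters count as word characters.
print(split_words(text, "C"))

# Assumed locale name; skipped gracefully if it is not installed.
try:
    print(split_words(text, "de_DE.ISO8859-1"))
except locale.Error:
    print("German Latin-1 locale not installed on this system")
```

The save-and-restore around `setlocale` also shows the limitation raised earlier in the thread: the locale is a single process-wide setting, so only one can be active at a time.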
participants (6)
- Andreas Jung
- Casey Duncan
- Guido van Rossum
- Jens Vagelpohl
- Joachim Werner
- Maik Jablonski