Modifying Splitter.c to search on '+' & '#', and single letter words
I have two problems with getting ZCatalog to search for what I need:
1) Need to be able to search for words like 'J++' and 'C#' - this is relatively simple to do by editing Splitter.c a little and recompiling
2) Need to be able to search for single-letter words like 'C' - this is easy to modify Splitter.c to accommodate, but causes errors in GlobbingLexicon.py, even though the vocabulary is standard
So far I have solved problem (1) by changing the contents of Splitter.c, but that's a bit messy. Currently I don't know of an alternative though.
I have modified Splitter.c so it indexes the extra characters, and reduced the minimum word length to 1, which works fine when indexing, and I can see all the symbol-inclusive words and single-letter words in the vocabulary. Unfortunately, any search on a single-letter word gives an IndexError, "String out of range".
I am stuck on problem (2) and don't know how to avoid the errors arising in GlobbingLexicon.py without editing in some kind of hack to get around it. I don't even know why GlobbingLexicon is getting involved in the search process since I am not trying to use wildcards and haven't elected to use a globbing vocabulary (AFAIK).
Can anyone explain why GlobbingLexicon is involved? Better yet, has anyone faced this problem (2) before, or come up with a more elegant solution to (1)?
Thanks for your help :)
Harry
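For readers who do not want to dig through the C source, the splitting behaviour described above can be approximated in a few lines of Python. This is only a sketch under stated assumptions, not Splitter.c itself: the names `split_words`, `EXTRA_CHARS` and `MIN_WORD_LEN` are hypothetical, lowercasing is assumed, and stop-word handling is left out.

```python
import re

# Hypothetical settings mirroring the two changes described above:
# extra characters allowed inside a word, and a minimum word length of 1.
EXTRA_CHARS = "+#"
MIN_WORD_LEN = 1

# A word is a run of letters, digits, or one of the extra characters.
_WORD_RE = re.compile("[A-Za-z0-9" + re.escape(EXTRA_CHARS) + "]+")


def split_words(text, min_len=MIN_WORD_LEN):
    """Return the lowercased words of at least min_len characters."""
    return [w.lower() for w in _WORD_RE.findall(text) if len(w) >= min_len]


print(split_words("We use C, C# and J++ daily"))
```

Raising `min_len` to 2 drops single-letter words such as 'C' again, which is the behaviour the minimum word length change is meant to avoid.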
Oh yes, I forgot the traceback of the error, so here it is:

Traceback (innermost last):
  File /stuff/harry/Zope-2.3.2-src/lib/python/ZPublisher/Publish.py, line 223, in publish_module
  File /stuff/harry/Zope-2.3.2-src/lib/python/ZPublisher/Publish.py, line 187, in publish
  File /stuff/harry/Zope-2.3.2-src/lib/python/Zope/__init__.py, line 221, in zpublisher_exception_hook
    (Object: Traversable)
  File /stuff/harry/Zope-2.3.2-src/lib/python/ZPublisher/Publish.py, line 171, in publish
  File /stuff/harry/Zope-2.3.2-src/lib/python/ZPublisher/mapply.py, line 160, in mapply
    (Object: TSrep)
  File /stuff/harry/Zope-2.3.2-src/lib/python/ZPublisher/Publish.py, line 112, in call_object
    (Object: TSrep)
  File /stuff/harry/Zope-2.3.2-src/lib/python/OFS/DTMLMethod.py, line 189, in __call__
    (Object: TSrep)
  File /stuff/harry/Zope-2.3.2-src/lib/python/DocumentTemplate/DT_String.py, line 538, in __call__
    (Object: TSrep)
  File /stuff/harry/Zope-2.3.2-src/lib/python/DocumentTemplate/DT_In.py, line 484, in renderwb
    (Object: TestSplitter)
  File /stuff/harry/Zope-2.3.2-src/lib/python/Products/ZCatalog/ZCatalog.py, line 535, in searchResults
    (Object: Traversable)
  File /stuff/harry/Zope-2.3.2-src/lib/python/Products/ZCatalog/Catalog.py, line 657, in searchResults
  File /stuff/harry/Zope-2.3.2-src/lib/python/Products/ZCatalog/Catalog.py, line 542, in _indexedSearch
  File /stuff/harry/Zope-2.3.2-src/lib/python/SearchIndex/UnTextIndex.py, line 513, in _apply_index
  File /stuff/harry/Zope-2.3.2-src/lib/python/SearchIndex/UnTextIndex.py, line 576, in query
  File /stuff/harry/Zope-2.3.2-src/lib/python/SearchIndex/UnTextIndex.py, line 616, in evaluate
  File /stuff/harry/Zope-2.3.2-src/lib/python/SearchIndex/UnTextIndex.py, line 446, in __getitem__
  File /stuff/harry/Zope-2.3.2-src/lib/python/SearchIndex/GlobbingLexicon.py, line 224, in get
IndexError: (see above)

Harry
_______________________________________________ Zope-Dev maillist - Zope-Dev@zope.org http://lists.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://lists.zope.org/mailman/listinfo/zope-announce http://lists.zope.org/mailman/listinfo/zope )
Harry Wilkinson wrote:
2) Need to be able to search for single-letter words like 'C' - this is easy to modify Splitter.c to accommodate, but causes errors in GlobbingLexicon.py, even though the vocabulary is standard
You probably don't want to accommodate all single letter words -- just a couple of them. Why not provide an interface to your application where you pre-process stuff to be cataloged, replacing "C" with "CTheLanguage", and doing the same for search criteria? Of course, "C" would be more like "(^|[^A-Za-z0-9])C($|[^A-Za-z0-9])" or something like that.
--
Steve Alexander
Software Engineer
Cat-Box limited
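The pre-processing idea can be sketched roughly as follows. This is an illustration only, not anything Zope provides: `preprocess` and the `ALIASES` table are hypothetical, only "CTheLanguage" comes from the message above, and lookaround assertions are used in place of the consuming groups in the suggested pattern so that the neighbouring characters are left untouched.

```python
import re

# Hypothetical mapping from short language names to index-safe tokens.
# "CTheLanguage" comes from the suggestion above; the others are made up.
ALIASES = {
    "C": "CTheLanguage",
    "C#": "CSharpTheLanguage",
    "J++": "JPlusPlusTheLanguage",
}

# Longest names are tried first so "C#" is not read as "C" followed by "#".
_NAMES = sorted(ALIASES, key=len, reverse=True)

# Match a name only when it is not glued to a letter, digit, '+' or '#'.
_ALIAS_RE = re.compile(
    r"(?<![A-Za-z0-9+#])("
    + "|".join(re.escape(name) for name in _NAMES)
    + r")(?![A-Za-z0-9+#])"
)


def preprocess(text):
    """Replace standalone language names with their index-safe tokens."""
    return _ALIAS_RE.sub(lambda m: ALIASES[m.group(1)], text)


print(preprocess("C and C# and J++, not CAD or C++"))
```

The same function would have to be applied both to text being cataloged and to search criteria, otherwise the two sides would no longer agree on the token. The match is case-sensitive, so a lowercase "c" is left alone.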
Harry Wilkinson wrote:
2) Need to be able to search for single-letter words like 'C' - this is easy to modify Splitter.c to accommodate, but causes errors in GlobbingLexicon.py, even though the vocabulary is standard
You probably don't want to accommodate all single letter words -- just a couple of them.
For simplicity's sake, we do ;-)
Why not provide an interface to your application where you pre-process stuff to be cataloged, replacing "C" with "CTheLanguage", and doing the same for search criteria?
Of course, "C" would be more like "(^|[^A-Za-z0-9])C($|[^A-Za-z0-9])" or something like that.
That sounds mighty hacky :-S
cheers,
Chris
On Wed, 25 Jul 2001 22:52:52 +0100, Steve Alexander <steve@cat-box.net> wrote:
You probably don't want to accommodate all single letter words -- just a couple of them.
Well, it's not like there are very many of them......
Toby Dickenson
tdickenson@geminidataloggers.com
Toby Dickenson wrote:
On Wed, 25 Jul 2001 22:52:52 +0100, Steve Alexander <steve@cat-box.net> wrote:
You probably don't want to accommodate all single letter words -- just a couple of them.
Well, it's not like there are very many of them......
26, well, in the current alphabet.
cheers,
Chris
----- Original Message -----
From: "Chris Withers" <chrisw@nipltd.com>
To: <tdickenson@geminidataloggers.com>
Cc: "Steve Alexander" <steve@cat-box.net>; "Harry Wilkinson" <harryw@nipltd.com>; <zope-dev@zope.org>
Sent: Thursday, 26 July 2001 06:45
Subject: Re: [Zope-dev] Modifying Splitter.c to search on '+' & '#', and single letter words
Toby Dickenson wrote:
On Wed, 25 Jul 2001 22:52:52 +0100, Steve Alexander <steve@cat-box.net> wrote:
You probably don't want to accommodate all single letter words -- just a couple of them.
Well, it's not like there are very many of them......
26, well, in the current alphabet.
*sigh* Germans need some more letters: üöäß :-)
Andreas
Andreas Jung wrote:
26, well, in the current alphabet.
*sigh* Germans need some more letters: üöäß :-)
...and no doubt they'll be unicode strings that you can't use as attribute names ;-)
*double sigh* (that's a 16-bit sigh, fyi ;-)
Chris
----- Original Message -----
From: "Chris Withers" <chrisw@nipltd.com>
To: "Andreas Jung" <andreas@andreas-jung.com>
Cc: <tdickenson@geminidataloggers.com>; "Steve Alexander" <steve@cat-box.net>; "Harry Wilkinson" <harryw@nipltd.com>; <zope-dev@zope.org>
Sent: Thursday, 26 July 2001 07:04
Subject: Re: [Zope-dev] Modifying Splitter.c to search on '+' & '#', and single letter words
Andreas Jung wrote:
26, well, in the current alphabet.
*sigh* Germans need some more letters: üöäß :-)
...and no doubt they'll be unicode strings that you can't use as attribute names ;-)
As described earlier, Python attributes are ASCII only, and Zope uses setattr() for storing object properties. To allow unicode properties, Zope would have to be modified to store the (key, value) pairs inside e.g. a dictionary. I am not sure if we could do that in a backward-compatible way.
Andreas
Andreas Jung wrote:
As described earlier Python attributes are ASCII only.
It's more strict than that, they appear to have to be 7 bit ASCII, hence any 8-bit characters like the German ones you mentioned cause barfage when used as attribute names :-(
cheers,
Chris
----- Original Message -----
From: "Chris Withers" <chrisw@nipltd.com>
To: "Andreas Jung" <andreas@andreas-jung.com>
Cc: <tdickenson@geminidataloggers.com>; "Steve Alexander" <steve@cat-box.net>; "Harry Wilkinson" <harryw@nipltd.com>; <zope-dev@zope.org>
Sent: Thursday, 26 July 2001 09:37
Subject: Re: [Zope-dev] Modifying Splitter.c to search on '+' & '#', and single letter words
Andreas Jung wrote:
As described earlier Python attributes are ASCII only.
It's more strict than that, they appear to have to be 7 bit ASCII, hence any 8-bit characters like the German ones you mentioned cause barfage when used as attribute names :-(
I always thought ASCII is 7 bit. 8 bit ASCII is more ISO-8859-X :-)
Andreas
http://www.jimprice.com/jim-asc.htm
Everything you ever wanted to know about ASCII and probably more ;) I was looking at it the other day.
Harry Wilkinson wrote:
I have two problems with getting ZCatalog to search for what I need:
1) Need to be able to search for words like 'J++' and 'C#' - this is relatively simple to do by editing Splitter.c a little and recompiling
2) Need to be able to search for single-letter words like 'C' - this is easy to modify Splitter.c to accommodate, but causes errors in GlobbingLexicon.py, even though the vocabulary is standard
So far I have solved problem (1) by changing the contents of Splitter.c, but that's a bit messy. Currently I don't know of an alternative though.
I have modified Splitter.c so it indexes the extra characters, and reduced the minimum word length to 1, which works fine when indexing, and I can see all the symbol-inclusive words and single-letter words in the vocabulary. Unfortunately, any search on a single-letter word gives an IndexError, "String out of range".
This is because the GlobbingLexicon never anticipated single letter patterns. This is a bug. Try this (untested) quick patch:

Index: GlobbingLexicon.py
===================================================================
RCS file: /cvs-repository/Zope2/lib/python/SearchIndex/GlobbingLexicon.py,v
retrieving revision 1.9
diff -c -r1.9 GlobbingLexicon.py
*** GlobbingLexicon.py  2001/04/02 18:19:45     1.9
--- GlobbingLexicon.py  2001/07/26 05:21:48
***************
*** 221,226 ****
--- 221,229 ----
              if i == 0:
                  digrams.insert(i, (self.eow + pattern[i]) )
+                 if len(pattern) == 1:
+                     digrams.append( (pattern[i] + self.eow) )
+                     break
                  digrams.append((pattern[i] + pattern[i+1]))
              else:
                  try:

I am stuck on problem (2) and don't know how to avoid the errors arising in GlobbingLexicon.py without editing in some kind of hack to get around it.
That's exactly what this patch does.
I don't even know why GlobbingLexicon is getting involved in the search process since I am not trying to use wildcards and haven't elected to use a globbing vocabulary (AFAIK).
You must have somehow, GlobbingLexicon is never the default.
-Michel
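To see why a one-character pattern trips the lookup, here is a simplified, standalone model of the digram loop that the patch touches. It is a sketch under assumptions, not the code in GlobbingLexicon.py: wildcard handling is omitted, the `EOW` marker value and the function name `digrams` are hypothetical, and the `patched` flag exists only to compare the two behaviours.

```python
EOW = "$"  # assumed end-of-word marker; the real value is defined in GlobbingLexicon.py


def digrams(pattern, patched=True):
    """Return the two-character pieces of pattern, padded with EOW at both ends."""
    result = []
    for i in range(len(pattern)):
        if i == 0:
            result.insert(i, EOW + pattern[i])
            if patched and len(pattern) == 1:
                # The patch: a one-character word has no second character,
                # so close it with the end-of-word marker and stop.
                result.append(pattern[i] + EOW)
                break
            # Without the patch this line indexes past the end of a
            # one-character pattern and raises IndexError.
            result.append(pattern[i] + pattern[i + 1])
        elif i + 1 < len(pattern):
            result.append(pattern[i] + pattern[i + 1])
        else:
            result.append(pattern[i] + EOW)
    return result


print(digrams("cat"))
print(digrams("c"))
```

Calling `digrams("c", patched=False)` raises an IndexError in this model, which matches the kind of failure reported at the bottom of the traceback earlier in the thread.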
This seems to work perfectly, thanks a lot :D
I am pretty sure I'm not using a globbing vocabulary: I've tried deleting the test ZCatalog I was using, creating a new one, and using the vocabulary it gives me. Is it meant to use GlobbingLexicon.py for all vocabularies?
Well thanks again :)
Harry
Participants (6): Andreas Jung, Chris Withers, Harry Wilkinson, Michel Pelletier, Steve Alexander, Toby Dickenson