[Zope-dev] [Petition] Kludge for Splitter.c (long)

LEE, Kwan Soo kslee@plaza1.snu.ac.kr
Mon, 17 Jan 2000 11:14:45 +0900


Hi, I've been trying to (mis)understand why ZCatalog does not support Full Text Search for non-Latin text. Being ignorant of C and C++ (and even of regular expressions), I took a pure hodgepodge trial-and-error approach, sigh ...

I then came up with the Splitter.py module (at the end of this message), which seemed to work (with obvious limitations) on the Zope 2.0.1 Windows binary installation: it did Full Text Search for Korean texts.

But IT FAILED with the Zope 2.1.2 Linux binary installation. So now I have 2 petitions (please correct me if they are based on my daydreams).

1. How about a dumb kludge Splitter.c that treats the characters in a user-specifiable/configurable list as white space, treats all other characters up to chr(255) as meaningful characters, and splits the text accordingly?
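A minimal sketch of what such a kludge splitter might do, written in modern Python 3 rather than C. The names `make_table` and `split_words` and the default delimiter list are assumptions made for illustration; nothing here is part of Zope. Because every listed delimiter is mapped to a space before splitting, the space itself always acts as a delimiter in this sketch.

```python
def make_table(delimiters):
    """Build a 256-entry table that maps each delimiter byte to a space.

    Every byte not listed, including all bytes from 128 to 255, maps to itself.
    """
    return bytes.maketrans(delimiters, b" " * len(delimiters))


def split_words(text, delimiters=b" \t\r\n.,;:!?()\"'"):
    """Split a byte string on a caller-supplied list of delimiter bytes.

    Hypothetical helper: the default delimiter list is only an example.
    """
    translated = text.translate(make_table(delimiters))
    # Split on single spaces and drop the empty strings left by runs of delimiters.
    return [word for word in translated.split(b" ") if word]


# Korean text in EUC-KR: both bytes of each syllable are above 0xA0,
# so they never collide with the ASCII delimiters.
words = split_words("한국어 검색, 테스트.".encode("euc_kr"), b" ,.")
print([w.decode("euc_kr") for w in words])
```

Working on bytes instead of decoded text mirrors the "characters up to chr(255)" idea: the splitter needs no knowledge of the encoding as long as the delimiter bytes never occur inside a multi-byte character.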

In Korean, the current approach based on 'stem' words and 'stop' words will simply not work, for we have quite a different writing convention. I guess many other (small) languages have similar problems. Still, Full Text Search capabilities are too valuable to live without.

Furthermore, what if a Zope site contains documents in many languages? I guess an approach based on _ONE_ locale will not work well. Does one need several personalities of Splitter?
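One way to picture "several personalities" is a small registry that picks a splitting function per document language. The sketch below is plain Python 3 and purely illustrative: the registry `SPLITTERS`, the language codes, and both splitter functions are assumptions, not an existing Zope or ZCatalog interface.

```python
import re


def split_whitespace(text):
    """Language-neutral fallback: split on whitespace only."""
    return text.split()


def split_latin(text):
    """Example splitter for Latin-script text: lowercase runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Hypothetical registry mapping a language code to a splitter function.
SPLITTERS = {
    "en": split_latin,
    "ko": split_whitespace,
}


def split_for(text, lang):
    """Pick the splitter registered for `lang`, falling back to whitespace splitting."""
    return SPLITTERS.get(lang, split_whitespace)(text)


print(split_for("Hello, World!", "en"))
print(split_for("한국어 전문 검색", "ko"))
```

With such a table, each document would be indexed by the splitter matching its own language instead of by one site-wide locale.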

Until the "Full I18N/Localization Support" (I'm not sure what that means ...) of Python & Zope arrives, a (maybe unsupported or community-supported) kludge Splitter module with an adequate warning may ease the lives of lots of non-English/non-European-language Zopistas.

2. Can anyone explain (or give a clue about) the differences in SearchIndex/ZCatalog between Zope 2.0.x and 2.1.x? Especially the role of subindex in TextIndex.py and UnTextIndex.py? My Splitter.py gets errors whenever subindex is involved.

TIA

LEE, Kwan Soo

#####
#Splitter.py
#####

import string

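# Translation table: bytes below '0' (control characters, space, most
# punctuation) and ':' through '?' become spaces; every other byte,
# including everything from 128 to 255, is kept as-is.
t_t=['',]*256
for i in range(256):
        if i < 48 or 57 < i < 64: t_t[i] = ' '
        else: t_t[i]= chr(i)
tt=string.join(t_t,'')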

class Splitter:
        """w/o stop word list support"""

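        def __init__(self, isrc, *kw):
                # Map delimiter characters to spaces, then split the
                # translated text (not the raw input) and drop empty strings.
                self.isrc = string.translate(isrc, tt)
                tempsrc = string.split(self.isrc, ' ')
                xx=[]
                for x in tempsrc:
                        if x: xx.append(x)
                self.src = xx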

        def __getslice__(self, a, b):
                return self.src[a:b]

        def __getitem__(self, a):
                return self.src[a]

        def __len__(self):
                return len(self.src)

        def indexes(self, a):
                res =[]
                for i in range(len(self)):
                        if self[i]==a:
                                res.append(i)
                return res

        def pos(self, a):
                i = int(a/2)
                x = string.find(self.isrc, self[i])
                return (x, x+len(self[i]) )