GlobbingLexicon and Support for single-letter terms

2 Aug 2001

I figured I would post this to the list for anyone who might find it
useful...

I needed ZCatalog searches to support single-letter words in Globbing
text indexes for an online classified ad system.  A search like
'C programmer' only works if the letter C is itself indexed.

I had to edit Splitter.c so that single-letter words are indexed instead of
being treated as stop-words (I also added support for numbers, by changing
isalpha to isalnum).  On top of that, though, I had to make a change to
GlobbingLexicon.get() as well.  If you search on a single-letter word, get()
tries to build a second digram from pattern[i] and pattern[i+1]; index i+1
doesn't exist in a one-character pattern, so an 'index out of range'
exception occurs.

If you have compiled Splitter.c to support words of length 1, the following
change to GlobbingLexicon.py will let you search on them without error.  My
line number is based on the 2.3.2 source release and may not match current
releases and/or CVS:

Line 224 was previously:
                digrams.append((pattern[i] + pattern[i+1]))
Replace line 224 with:
                if (len(pattern) != 1):
                   digrams.append((pattern[i] + pattern[i+1]))
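
For anyone who wants to see the failure without digging through the Zope
source, here is a minimal standalone sketch.  It is not the real
GlobbingLexicon.get(); the function name, the guard flag and the '$'
end-of-word marker are made up for illustration.  It only mimics the step
that pairs the first character with the one after it.

```python
EOW = '$'  # hypothetical end-of-word marker, not taken from Zope


def first_char_digrams(pattern, guard=True):
    """Digrams contributed by the first character of a pattern (sketch)."""
    i = 0
    digrams = [EOW + pattern[i]]
    # The guard mirrors the len(pattern) != 1 check from the fix above.
    if not guard or len(pattern) != 1:
        digrams.append(pattern[i] + pattern[i + 1])
    return digrams


print(first_char_digrams('C'))     # ['$C']
print(first_char_digrams('tech'))  # ['$t', 'te']

try:
    first_char_digrams('C', guard=False)
except IndexError as exc:
    print('unguarded:', exc)       # unguarded: string index out of range
```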

Also, if anyone is interested, I parse and extend queries to auto-add
wildcards before passing the query to Catalog.  This allows me to add
wildcards to the end of words appropriately according to their length.  I
want 'tech' to match 'biotechnology', but I don't want 'C' (as in programmer)
to match every word with the letter C in it, or even every word with the
letter C as its first character, so I have to be careful about the length.
I have it set up so that:

- terms with len > 3         --> *[term]*
- terms with len of 2 or 3   --> [term]?
- terms with len of 1        --> [term] (not rewritten)
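
Taken literally, the three rules above amount to something like the sketch
below.  The function name add_wildcards is hypothetical, and fnmatchcase()
from the standard library is only a stand-in for the lexicon's own globbing,
used here to show why the length matters.

```python
from fnmatch import fnmatchcase


def add_wildcards(term):
    """Apply the three length rules listed above to one search term."""
    if len(term) > 3:
        return '*' + term + '*'
    if len(term) > 1:
        return term + '?'
    return term  # single letters are left alone


for term in ('tech', 'sql', 'C'):
    print(term, '-->', add_wildcards(term))

# fnmatchcase() is an assumed stand-in, not the Zope matcher.
print(fnmatchcase('biotechnology', add_wildcards('tech')))  # True
print(fnmatchcase('COBOL', add_wildcards('C')))             # False
```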

        # Requires "import re" at module level.
        def queryExtender(self, query):
                """
                Takes, as input, a query for a Text index of ZCatalog, and
                makes it more intelligent by parsing it and rewriting it
                to include wildcards at the end of words so that we can
                search sub-words; in other words, a search for something
                like "engineer" should yield results for "engineer*" so
                that terms like "engineers" and "engineering" are also
                matched.  Obviously, we have to be careful not to
                incorrectly parse the query, and we don't want to mess
                with words that already have wildcards at the end, because
                you don't want to end up with something like "engineer**".
                """

                ### Define Character Patterns to Strip Out and Split Upon
                everythingButSearchTerms = '[^A-Za-z0-9*]+' #Regex Pattern

                ### Create the word list
                result = re.split(everythingButSearchTerms, query)

                ### Get rid of empty string elements in the word list
                result = [word for word in result if word]

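                ### Get rid of boolean operators.  booleanops is assumed
                ### here to match and/or/andnot as whole words; adjust it
                ### to your own operator set.  Filter into a new list,
                ### because popping while iterating skips items.
                booleanops = '^(and|or|andnot)$'
                result = [item for item in result
                          if not re.search(booleanops, item.lower())]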

                ### Now, result is a list of just the words that are
                ### meaningful to the search, but we need to eliminate
                ### any entries that have wildcards in them, because
                ### they are likely more specific than our rewrite here
                asteriskinterm = '^[*]|[*]$'
                                 #asterisk at start or end of term

                ### Filter into a new list; popping while iterating
                ### skips the item that follows each removed one
                result = [item for item in result
                          if not re.search(asteriskinterm, item)]

                ### Now, the list of words in the query we need to modify is
                ### final, so we can start modifying the queries, one word
                ### at a time...
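                for item in result:
                       # Match the term only as a whole word that is not
                       # already followed by a wildcard, so 'engineer'
                       # does not hit the inside of 'engineering*';
                       # re.escape() guards against an asterisk in the
                       # middle of a term being read as regex syntax.
                       wholeword = ('(?<![A-Za-z0-9*?])' + re.escape(item)
                                    + '(?![A-Za-z0-9*?])')
                       #query = re.sub(wholeword, '*'+item+'*', query, count=1)
                       if (len(item) > 3):
                          query = re.sub(wholeword, item+'*', query, count=1)
                       elif (len(item) != 1):
                          query = re.sub(wholeword, item+'?', query, count=1)
                return query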
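
For trying the rewriting outside Zope, here is a condensed standalone sketch
of the same idea.  It is not a drop-in replacement for the method above:
extend_query and BOOLEAN_OPS are hypothetical names, the operator list is an
assumption, and any term containing an asterisk is simply left alone.  Like
the code above, it appends only a trailing wildcard.

```python
import re

# Assumed operator list; adjust to whatever your index accepts.
BOOLEAN_OPS = ('and', 'or', 'andnot')


def extend_query(query):
    """Append '*' or '?' to plain search terms, depending on length."""
    terms = [t for t in re.split('[^A-Za-z0-9*]+', query) if t]
    terms = [t for t in terms
             if t.lower() not in BOOLEAN_OPS and '*' not in t]
    for term in terms:
        # Whole-word match that skips terms already followed by a wildcard.
        whole = ('(?<![A-Za-z0-9*?])' + re.escape(term)
                 + '(?![A-Za-z0-9*?])')
        if len(term) > 3:
            query = re.sub(whole, term + '*', query, count=1)
        elif len(term) > 1:
            query = re.sub(whole, term + '?', query, count=1)
    return query


print(extend_query('C programmer'))              # C programmer*
print(extend_query('engineering and engineer'))  # engineering* and engineer*
print(extend_query('java* or sql'))              # java* or sql?
```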

Sean

=========================
Sean Upton
Senior Programmer/Analyst
SignOnSanDiego.com
The San Diego Union-Tribune
619.718.5241
sean.upton@uniontrib.com
=========================
