GlobbingLexicon and Support for single-letter terms
I figured I would post this to the list in case it is useful to anyone.

I needed ZCatalog searches to support single-letter words in globbing text indexes for an online classified ad system. Searches like 'C programmer' need the letter C to be indexed. To get that, I had to edit Splitter.c so that single-letter words are not treated as stop words (I also added support for numbers by changing isalpha to isalnum).

On top of that, though, I had to change GlobbingLexicon.get() as well. If you search on a single-letter word, it tries to create a second digram from pattern[i+1], which doesn't exist, and an 'index out of range' exception occurs. If you have compiled Splitter.c to support words of length 1, the following change to GlobbingLexicon.py will let you search without error. My line number is based on the 2.3.2 source release and may not match current releases and/or CVS.

Previously, line 224 was:

    digrams.append((pattern[i] + pattern[i+1]))

Replace line 224 with:

    if (len(pattern) != 1):
        digrams.append((pattern[i] + pattern[i+1]))

Also, if anyone is interested, I parse and extend queries to auto-add wildcards before passing the query to the Catalog. This allows me to add wildcards to the end of words according to their length. I want 'tech' to match 'biotechnology', but I don't want 'C' (as in programmer) to match every word containing the letter C, or even every word with C as its first character, so I have to be careful about the length. I have it set up so that:

- terms with len > 3 --> *[term]*
- terms with len of 2 or 3 --> [term]?
- terms with len of 1 --> [term] (don't get re-written)

    import re

    # Assumption: booleanops was defined elsewhere in the original
    # module; this whole-word and/or/not pattern is a stand-in.
    booleanops = '^(and|or|not)$'

    def queryExtender(self, query):
        """
        Takes, as input, a query for a Text index of ZCatalog and
        makes it more intelligent by parsing it and rewriting it to
        include wildcards at the end of words so that we can search
        sub-words; in other words, a search for something like
        "engineer" should yield results for "engineer*" so that terms
        like "engineers" and "engineering" are also matched.

        Obviously, we have to be careful not to incorrectly parse the
        query, and we don't want to mess with words that already have
        wildcards at the end, because you don't want to end up with
        something like "engineer**"
        """
        ### Define Character Patterns to Strip Out and Split Upon
        everythingButSearchTerms = '[^A-Za-z0-9*]+'  #Regex Pattern
        asteriskinterm = '^[*]|[*]$'  #asterisk at start or end of term

        ### Create the word list, leaving out empty strings, boolean
        ### operators, and any entries that have wildcards in them,
        ### because they are likely more specific than our rewrite here.
        ### A new list is built instead of popping while iterating,
        ### which would skip entries and remove the wrong ones.
        result = []
        for item in re.split(everythingButSearchTerms, query):
            if not item:
                continue
            if re.search(booleanops, item, re.I):
                continue
            if re.search(asteriskinterm, item):
                continue
            result.append(item)

        ### Now, the list of words in the query we need to modify is
        ### final, so we can start modifying the query, one word at a
        ### time. The term is escaped and anchored so that only a whole
        ### word with no wildcard after it is rewritten.
        for item in result:
            pattern = r'\b' + re.escape(item) + r'\b(?![*?])'
            #query = re.sub(pattern, '*'+item+'*', query, count=1)
            if len(item) > 3:
                query = re.sub(pattern, item+'*', query, count=1)
            elif len(item) != 1:
                query = re.sub(pattern, item+'?', query, count=1)
        return query

Sean

=========================
Sean Upton
Senior Programmer/Analyst
SignOnSanDiego.com
The San Diego Union-Tribune
619.718.5241
sean.upton@uniontrib.com
=========================
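
The index error in GlobbingLexicon.get() described above can be illustrated with a small standalone sketch. This is not the Zope source: build_digrams, the guard flag, and the '$' end-of-word marker are hypothetical names chosen for illustration, and the loop is only assumed to resemble the real one. It shows why a one-character pattern fails when the first iteration reads pattern[i+1], and how the length check avoids that.

```python
EOW = "$"  # hypothetical end-of-word marker, chosen for this sketch


def build_digrams(pattern, guard=True):
    """Build two-character digrams for a plain (wildcard-free) pattern.

    Simplified stand-in for the digram loop discussed in the post.
    With guard=False the first iteration always reads pattern[i + 1],
    which fails for a one-character pattern.
    """
    digrams = []
    for i in range(len(pattern)):
        if i == 0:
            # Leading digram: end-of-word marker plus first character.
            digrams.append(EOW + pattern[i])
            # The length check from the post: skip the second digram
            # when there is no second character to pair with.
            if not guard or len(pattern) != 1:
                digrams.append(pattern[i] + pattern[i + 1])
        elif i + 1 < len(pattern):
            digrams.append(pattern[i] + pattern[i + 1])
        else:
            # Trailing digram: last character plus end-of-word marker.
            digrams.append(pattern[i] + EOW)
    return digrams


print(build_digrams("C"))
print(build_digrams("tech"))
```

With the guard in place, build_digrams("C") returns only the leading digram; calling build_digrams("C", guard=False) raises IndexError, which corresponds to the failure described for single-letter searches.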