[Zope-CVS] CVS: Products/ZCTextIndex - QueryParser.py:1.1.2.15

Guido van Rossum guido@python.org
Mon, 13 May 2002 00:53:04 -0400


Update of /cvs-repository/Products/ZCTextIndex
In directory cvs.zope.org:/tmp/cvs-serv17022

Modified Files:
      Tag: TextIndexDS9-branch
	QueryParser.py 
Log Message:
Change the query syntax to be more Google-like.

The AND, OR and NOT operators and parentheses are still recognized,
but in addition:

- a sequence of words without operators implies AND, e.g. ``foo bar''
- double-quoted text implies phrase search, e.g. ``"foo bar"''
- words connected by punctuation imply phrase search, e.g. ``foo-bar''
- a leading hyphen implies NOT, e.g. ``foo -bar''
- these can be combined, e.g. ``foo -"foo bar"'' or ``foo -foo-bar''
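
As a rough illustration of how these rules show up at the lexical
level, the sketch below runs the tokenizer regex from the diff that
follows over a few of the example queries.  It is written for
Python 3 and is not part of the checkin; the function name tokenize
and the constant name TOKENIZER_REGEX are hypothetical, and only the
regex itself is copied from QueryParser.py.

```python
import re

# Regex copied verbatim from _tokenizer_regex in the diff.
# The constant and function names here are assumptions for illustration.
TOKENIZER_REGEX = r'[()]|[^()\s"]*(?:"[^"]*"[^()\s"]*)+|[^()\s"]+'


def tokenize(query):
    """Split a query string into parenthesis tokens and atom tokens."""
    return re.findall(TOKENIZER_REGEX, query)


if __name__ == "__main__":
    for q in ['foo bar', '"foo bar"', 'foo-bar', 'foo -"foo bar"',
              '(foo OR bar) AND NOT baz']:
        print(repr(q), "->", tokenize(q))
```

A quoted phrase and a punctuation-joined word each stay together as a
single token, leading hyphen included, so the hyphen and the word
boundaries can be examined later, when the parser builds the tree.
The previous pattern, r"[()]|\w+", would have split ``foo-bar'' into
two separate tokens.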



=== Products/ZCTextIndex/QueryParser.py 1.1.2.14 => 1.1.2.15 ===
 Term = '(' OrExpr ')' | ATOM+
 
-An ATOM is a string not containing whitespace or parentheses, and not
-equal to one of the key words 'AND', 'OR', 'NOT'.  The key words are
-recognized in any mixture of case.  Multiple consecutive ATOMs are
-accepted at the leaf level; these are reported as an OR combination of
-the individual ATOMs.
+An ATOM is a string not containing whitespace or parentheses or double
+quotes, and not equal to one of the key words 'AND', 'OR', 'NOT'.  An
+ATOM can contain whitespace, parentheses and key words enclosed in
+double quotes.  The key words are recognized in any mixture of case.
+When multiple consecutive ATOMs are found at the leaf level, they are
+connected by an implied AND operator, and an unquoted leading hyphen
+is interpreted as a NOT operator.  When an ATOM contains multiple
+words (where a word is a string of letters, digits and underscore), it
+specifies a phrase search.
+
+Summarizing the default operator rules:
+
+- a sequence of words without operators implies AND, e.g. ``foo bar''
+- double-quoted text implies phrase search, e.g. ``"foo bar"''
+- words connected by punctuation imply phrase search, e.g. ``foo-bar''
+- a leading hyphen implies NOT, e.g. ``foo -bar''
+- these can be combined, e.g. ``foo -"foo bar"'' or ``foo -foo-bar''
+
 """
 
 import re
@@ -51,6 +64,9 @@
     _RPAREN:    _RPAREN,
 }
 
+# Magical regex to tokenize.  A beauty, ain't it. :-)
+_tokenizer_regex = r'[()]|[^()\s"]*(?:"[^"]*"[^()\s"]*)+|[^()\s"]+'
+
 class QueryParser:
 
     def __init__(self):
@@ -58,7 +74,7 @@
 
     def parseQuery(self, query):
         # Lexical analysis.
-        tokens = re.findall(r"[()]|\w+", query)
+        tokens = re.findall(_tokenizer_regex, query)
         self.__tokens = tokens
         # classify tokens
         self.__tokentypes = [_keywords.get(token.upper(), _ATOM)
@@ -130,8 +146,28 @@
             atoms = [self._get(_ATOM)]
             while self._peek(_ATOM):
                 atoms.append(self._get(_ATOM))
-            if len(atoms) == 1:
-                tree = ParseTree.AtomNode(atoms[0])
+            nodes = []
+            nots = []
+            for a in atoms:
+                words = re.findall(r"\w+", a)
+                if not words:
+                    continue
+                if len(words) == 1:
+                    n = ParseTree.AtomNode(words[0])
+                else:
+                    n = ParseTree.PhraseNode(" ".join(words))
+                if a[0] == "-":
+                    n = ParseTree.NotNode(n)
+                    nots.append(n)
+                else:
+                    nodes.append(n)
+            if not nodes:
+                text = " ".join(atoms)
+                msg = "At least one positive term required: %r" % text
+                raise ParseTree.ParseError, msg
+            nodes.extend(nots)
+            if len(nodes) == 1:
+                tree = nodes[0]
             else:
-                tree = ParseTree.PhraseNode(" ".join(atoms))
+                tree = ParseTree.AndNode(nodes)
         return tree
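
The atom-handling loop added above can be tried in isolation with a
small stand-in.  The sketch below is an approximation for Python 3,
not the checked-in code: plain tuples take the place of the ParseTree
node classes, ValueError takes the place of ParseTree.ParseError, and
the function name classify_atoms is hypothetical.

```python
import re


def classify_atoms(atoms):
    """Combine consecutive atom tokens the way the new Term branch does.

    Tuples such as ("ATOM", word) stand in for ParseTree nodes; this is
    an assumption made so the sketch runs without ZCTextIndex installed.
    """
    nodes = []  # positive terms
    nots = []   # negated terms, appended after the positive ones
    for a in atoms:
        words = re.findall(r"\w+", a)
        if not words:
            continue  # token was pure punctuation, nothing to search for
        if len(words) == 1:
            n = ("ATOM", words[0])
        else:
            n = ("PHRASE", " ".join(words))
        if a[0] == "-":
            # Only an unquoted leading hyphen negates: a token that starts
            # with a double quote has '"' as its first character.
            nots.append(("NOT", n))
        else:
            nodes.append(n)
    if not nodes:
        # Stand-in for ParseTree.ParseError in the original.
        raise ValueError("At least one positive term required: %r"
                         % " ".join(atoms))
    nodes.extend(nots)
    if len(nodes) == 1:
        return nodes[0]
    return ("AND", nodes)


if __name__ == "__main__":
    print(classify_atoms(['foo', '-"foo bar"']))
    print(classify_atoms(['foo-bar']))
```

With these stand-ins, the tokens for ``foo -"foo bar"'' come out as an
AND of the atom foo and a negated phrase, while a query consisting
only of negated terms, such as ``-bar'', is rejected.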