[Zope-CVS] CVS: Products/ZCTextIndex - QueryParser.py:1.1.2.15
Guido van Rossum
guido@python.org
Mon, 13 May 2002 00:53:04 -0400
Update of /cvs-repository/Products/ZCTextIndex
In directory cvs.zope.org:/tmp/cvs-serv17022
Modified Files:
Tag: TextIndexDS9-branch
QueryParser.py
Log Message:
Change the query syntax to be more Google-like.
The AND, OR and NOT operators and parentheses are still recognized,
but in addition:
- a sequence of words without operators implies AND, e.g. ``foo bar''
- double-quoted text implies phrase search, e.g. ``"foo bar"''
- words connected by punctuation imply phrase search, e.g. ``foo-bar''
- a leading hyphen implies NOT, e.g. ``foo -bar''
- these can be combined, e.g. ``foo -"foo bar"'' or ``foo -foo-bar''
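
As a rough illustration of the rules above, the sketch below applies the
tokenizer regex from this checkin to a few queries. Only the regex is
taken from the diff; the helper name tokenize is hypothetical and is not
part of QueryParser.py.

```python
import re

# Regex copied from the diff below; everything else is illustrative.
_tokenizer_regex = r'[()]|[^()\s"]*(?:"[^"]*"[^()\s"]*)+|[^()\s"]+'

def tokenize(query):
    """Hypothetical helper: split a query into parentheses and atoms."""
    return re.findall(_tokenizer_regex, query)

print(tokenize('foo -"foo bar"'))     # quoted phrase with leading hyphen
print(tokenize('(foo OR bar) -baz'))  # parentheses become their own tokens
print(tokenize('foo-bar'))            # punctuation keeps the words together
```

A quoted phrase stays in one token together with any leading hyphen, so
the parser can decide later whether that atom is a NOT term.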
=== Products/ZCTextIndex/QueryParser.py 1.1.2.14 => 1.1.2.15 ===
Term = '(' OrExpr ')' | ATOM+
-An ATOM is a string not containing whitespace or parentheses, and not
-equal to one of the key words 'AND', 'OR', 'NOT'. The key words are
-recognized in any mixture of case. Multiple consecutive ATOMs are
-accepted at the leaf level; these are reported as an OR combination of
-the individual ATOMs.
+An ATOM is a string not containing whitespace or parentheses or double
+quotes, and not equal to one of the key words 'AND', 'OR', 'NOT'. An
+ATOM can contain whitespace, parentheses and key words enclosed in
+double quotes. The key words are recognized in any mixture of case.
+When multiple consecutive ATOMs are found at the leaf level, they are
+connected by an implied AND operator, and an unquoted leading hyphen
+is interpreted as a NOT operator. When an ATOM contains multiple
+words (where a word is a string of letters, digits and underscore), it
+specifies a phrase search.
+
+Summarizing the default operator rules:
+
+- a sequence of words without operators implies AND, e.g. ``foo bar''
+- double-quoted text implies phrase search, e.g. ``"foo bar"''
+- words connected by punctuation imply phrase search, e.g. ``foo-bar''
+- a leading hyphen implies NOT, e.g. ``foo -bar''
+- these can be combined, e.g. ``foo -"foo bar"'' or ``foo -foo-bar''
+
"""
import re
@@ -51,6 +64,9 @@
_RPAREN: _RPAREN,
}
+# Magical regex to tokenize. A beauty, ain't it. :-)
+_tokenizer_regex = r'[()]|[^()\s"]*(?:"[^"]*"[^()\s"]*)+|[^()\s"]+'
+
class QueryParser:
def __init__(self):
@@ -58,7 +74,7 @@
def parseQuery(self, query):
# Lexical analysis.
- tokens = re.findall(r"[()]|\w+", query)
+ tokens = re.findall(_tokenizer_regex, query)
self.__tokens = tokens
# classify tokens
self.__tokentypes = [_keywords.get(token.upper(), _ATOM)
@@ -130,8 +146,28 @@
atoms = [self._get(_ATOM)]
while self._peek(_ATOM):
atoms.append(self._get(_ATOM))
- if len(atoms) == 1:
- tree = ParseTree.AtomNode(atoms[0])
+ nodes = []
+ nots = []
+ for a in atoms:
+ words = re.findall(r"\w+", a)
+ if not words:
+ continue
+ if len(words) == 1:
+ n = ParseTree.AtomNode(words[0])
+ else:
+ n = ParseTree.PhraseNode(" ".join(words))
+ if a[0] == "-":
+ n = ParseTree.NotNode(n)
+ nots.append(n)
+ else:
+ nodes.append(n)
+ if not nodes:
+ text = " ".join(atoms)
+ msg = "At least one positive term required: %r" % text
+ raise ParseTree.ParseError, msg
+ nodes.extend(nots)
+ if len(nodes) == 1:
+ tree = nodes[0]
else:
- tree = ParseTree.PhraseNode(" ".join(atoms))
+ tree = ParseTree.AndNode(nodes)
return tree
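
The leaf-level logic added in the last hunk can be sketched in isolation
as follows. This is a simplified stand-in, not the checked-in code: the
function name build_leaf is hypothetical, plain tuples replace the
ParseTree node classes, and ValueError replaces ParseTree.ParseError.

```python
import re

def build_leaf(atoms):
    """Hypothetical sketch: combine consecutive atoms into one tree."""
    positives = []
    negatives = []
    for atom in atoms:
        words = re.findall(r"\w+", atom)
        if not words:
            continue  # atom was pure punctuation; nothing to search for
        if len(words) == 1:
            node = ("ATOM", words[0])
        else:
            node = ("PHRASE", " ".join(words))
        if atom[0] == "-":
            # Only an unquoted leading hyphen negates the term.
            negatives.append(("NOT", node))
        else:
            positives.append(node)
    if not positives:
        text = " ".join(atoms)
        raise ValueError("At least one positive term required: %r" % text)
    nodes = positives + negatives  # NOT terms go after the positive ones
    if len(nodes) == 1:
        return nodes[0]
    return ("AND", nodes)

print(build_leaf(['foo', '-"foo bar"']))
print(build_leaf(['foo-bar']))
```

Placing the NOT nodes last mirrors the nodes.extend(nots) step in the
diff, and a query made up only of negated terms is rejected.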