[Checkins] SVN: topia.termextract/trunk/s - Rename KeywordExtractor to TermExtractor.

Sat May 30 11:42:14 EDT 2009

Log message for revision 100556:
  - Rename KeywordExtractor to TermExtractor.
  
  - Fixed up documentation to talk about terms instead of keywords.
  
  - Added test for modal verb rule.
  

Changed:
  U   topia.termextract/trunk/setup.py
  U   topia.termextract/trunk/src/topia/termextract/README.txt
  U   topia.termextract/trunk/src/topia/termextract/example.txt
  U   topia.termextract/trunk/src/topia/termextract/extract.py
  U   topia.termextract/trunk/src/topia/termextract/interfaces.py

-=-
Modified: topia.termextract/trunk/setup.py
===================================================================

--- topia.termextract/trunk/setup.py	2009-05-30 15:26:12 UTC (rev 100555)
+++ topia.termextract/trunk/setup.py	2009-05-30 15:42:14 UTC (rev 100556)
@@ -38,7 +38,7 @@
         read('CHANGES.txt')
         ),
     license = "ZPL 2.1",
-    keywords = "pos taggerlinguistics",
+    keywords = "content term extract pos tagger linguistics",
     classifiers = [
         'Development Status :: 4 - Beta',
         'Environment :: Web Environment',

Modified: topia.termextract/trunk/src/topia/termextract/README.txt
===================================================================
--- topia.termextract/trunk/src/topia/termextract/README.txt	2009-05-30 15:26:12 UTC (rev 100555)
+++ topia.termextract/trunk/src/topia/termextract/README.txt	2009-05-30 15:42:14 UTC (rev 100556)
@@ -1,8 +1,8 @@
-==================
-Keyword Extraction
-==================
+===============
+Term Extraction
+===============
 
-This package implements text keyword extraction by making use of a simple
+This package implements text term extraction by making use of a simple
 Parts-Of-Speech (POS) tagging algorithm.
 
 http://bioie.ldc.upenn.edu/wiki/index.php/Part-of-Speech
@@ -41,7 +41,7 @@
   ['This', 'is', 'a', 'simple', 'example', '.']
 
 While most tokenizers ignore punctuation, it is important for us to keep it,
-since we need it later for the keyword extraction. Let's now look at some more
+since we need it later for the term extraction. Let's now look at some more
 complex cases:
 
 - Quoted Text
@@ -172,6 +172,26 @@
     >>> tagger('. Stephan')
     [['.', '.', '.'], ['Stephan', 'NNP', 'Stephan']]
 
+- Determine Verb after Modal Verb
+
+    >>> tagger('The fox can jump')
+    [['The', 'DT', 'The'],
+     ['fox', 'NN', 'fox'],
+     ['can', 'MD', 'can'],
+     ['jump', 'VB', 'jump']]
+    >>> tagger("The fox can't jump")
+    [['The', 'DT', 'The'],
+     ['fox', 'NN', 'fox'],
+     ['can', 'MD', 'can'],
+     ["'t", 'RB', "'t"],
+     ['jump', 'VB', 'jump']]
+    >>> tagger('The fox can really jump')
+    [['The', 'DT', 'The'],
+     ['fox', 'NN', 'fox'],
+     ['can', 'MD', 'can'],
+     ['really', 'RB', 'really'],
+     ['jump', 'VB', 'jump']]
+
 - Normalize Plural Forms
 
     >>> tagger('examples')
@@ -189,15 +209,15 @@
     [['feet', 'NNS', 'feet']]
 
 
-Keywordword Extraction
-----------------------
+Term Extraction
+---------------
 
-Now that we can tag a text, let's have a look at the keyword extractions.
+Now that we can tag a text, let's have a look at the term extractions.
 
   >>> from topia.termextract import extract
-  >>> extractor = extract.KeywordExtractor()
+  >>> extractor = extract.TermExtractor()
   >>> extractor
-  <KeywordExtractor using <Tagger for english>>
+  <TermExtractor using <Tagger for english>>
 
 As you can see, the extractor maintains a tagger:
 
@@ -207,17 +227,17 @@
 When creating an extractor, you can also pass in a tagger to avoid frequent
 tagger initialization:
 
-  >>> extractor = extract.KeywordExtractor(tagger)
+  >>> extractor = extract.TermExtractor(tagger)
   >>> extractor.tagger is tagger
   True
 
-Let's get the keywords for a simple text.
+Let's get the terms for a simple text.
 
   >>> extractor("The fox can't jump over the fox's tail.")
   []
 
-We got no keywords. That's because by default at least 3 occurences of a
-keyword must be detected, if the keyword consists of a single word.
+We got no terms. That's because by default at least 3 occurrences of a
+term must be detected, if the term consists of a single word.
 
 The extractor maintains a filter component. Let's register the trivial
 permissive filter, which simply return everything that the extractor suggests:
@@ -233,12 +253,12 @@
   >>> extractor("The fox can't jump over the fox's tail.")
   [('fox', 2, 1)]
 
-Let's now have a look at multi-word keywords. Oftentimes multi-word nouns and
+Let's now have a look at multi-word terms. Oftentimes multi-word nouns and
 proper names occur only once or twice in a text. But they are often great
-keywords! To handle this scenario, the concept of "strength" was
+terms! To handle this scenario, the concept of "strength" was
 introduced. Currently the strength is simply the amount of words in the
-keyword/term. By default, all keywords with a strength larger than 1 are
-selected regardless of the number of occurances.
+term. By default, all terms with a strength larger than 1 are selected
+regardless of the number of occurrences.
 
   >>> extractor('The German consul of Boston resides in Newton.')
   [('German consul', 1, 2)]
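
Note on the selection rule described in the README hunk above: the following is a minimal standalone sketch of the filter predicate, not the package's own code. The function name `default_filter` and both default values are assumptions. The value 3 comes from the README's "at least 3" occurrences for single-word terms, and 2 from "strength larger than 1". The shape of the condition follows the `singleStrengthMinOccur` / `noLimitStrength` expression visible in the extract.py hunk.

```python
def default_filter(word, occur, strength,
                   single_strength_min_occur=3, no_limit_strength=2):
    """Return True if a candidate term should be kept.

    Hypothetical stand-in for the package's default filter; the default
    values are inferred from the README text, not read from the source.
    """
    return ((strength == 1 and occur >= single_strength_min_occur) or
            (strength >= no_limit_strength))


# Candidates use the (term, occurrences, strength) layout from the README.
candidates = [('fox', 2, 1), ('German consul', 1, 2), ('term', 3, 1)]
kept = [c for c in candidates if default_filter(*c)]
print(kept)  # [('German consul', 1, 2), ('term', 3, 1)]
```

Under these assumed defaults, 'fox' is dropped because a single word seen twice falls below the threshold, while 'German consul' passes on strength alone.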

Modified: topia.termextract/trunk/src/topia/termextract/example.txt
===================================================================
--- topia.termextract/trunk/src/topia/termextract/example.txt	2009-05-30 15:26:12 UTC (rev 100555)
+++ topia.termextract/trunk/src/topia/termextract/example.txt	2009-05-30 15:42:14 UTC (rev 100556)
@@ -2,8 +2,8 @@
 A News Article
 ==============
 
-This document provides a simple example of extracting the keywords of a BBC
-article from May 29, 2009. We will use several keyword extraction tools to
+This document provides a simple example of extracting the terms of a BBC
+article from May 29, 2009. We will use several term extraction tools to
 compare the outcome.
 
   >>> text ='''
@@ -52,7 +52,7 @@
 Yahoo Keyword Extractor
 -----------------------
 
-Yahoo provides a service that extracts keywords from a piece of content using
+Yahoo provides a service that extracts terms from a piece of content using
 its immense search database.
 
 http://developer.yahoo.com/search/content/V1/termExtraction.html
@@ -358,7 +358,7 @@
 calculation,
 
   >>> from topia.termextract import extract
-  >>> extractor = extract.KeywordExtractor()
+  >>> extractor = extract.TermExtractor()
 
 Let's look at the result of the tagger first:
 

Modified: topia.termextract/trunk/src/topia/termextract/extract.py
===================================================================
--- topia.termextract/trunk/src/topia/termextract/extract.py	2009-05-30 15:26:12 UTC (rev 100555)
+++ topia.termextract/trunk/src/topia/termextract/extract.py	2009-05-30 15:42:14 UTC (rev 100556)
@@ -11,7 +11,7 @@
 # FOR A PARTICULAR PURPOSE.
 #
 ##############################################################################
-"""POS Tagger
+"""Term Extractor
 
 $Id$
 """
@@ -35,13 +35,13 @@
         return ((strength == 1 and occur >= self.singleStrengthMinOccur) or
                 (strength >= self.noLimitStrength))
 
-def _add(term, norm, keyword, keywords):
-    keyword.append((term, norm))
-    keywords.setdefault(norm, 0)
-    keywords[norm] += 1
+def _add(term, norm, multiterm, terms):
+    multiterm.append((term, norm))
+    terms.setdefault(norm, 0)
+    terms[norm] += 1
 
-class KeywordExtractor(object):
-    zope.interface.implements(interfaces.IKeywordExtractor)
+class TermExtractor(object):
+    zope.interface.implements(interfaces.ITermExtractor)
 
     def __init__(self, tagger=None, filter=None):
         if tagger is None:
@@ -52,41 +52,41 @@
             filter = DefaultFilter()
         self.filter = filter
 
-    def extract(self, terms):
-        """See interfaces.IKeywordExtractor"""
-        keywords = {}
+    def extract(self, taggedTerms):
+        """See interfaces.ITermExtractor"""
+        terms = {}
         # Phase 1: A little state machine is used to build simple and
-        # composite keywords.
-        keyword = []
+        # composite terms.
+        multiterm = []
         state = SEARCH
-        while terms:
-            term, tag, norm = terms.pop(0)
+        while taggedTerms:
+            term, tag, norm = taggedTerms.pop(0)
             if state == SEARCH and tag.startswith('N'):
                 state = NOUN
-                _add(term, norm, keyword, keywords)
+                _add(term, norm, multiterm, terms)
             elif state == SEARCH and tag == 'JJ' and term[0].isupper():
                 state = NOUN
-                _add(term, norm, keyword, keywords)
+                _add(term, norm, multiterm, terms)
             elif state == NOUN and tag.startswith('N'):
-                _add(term, norm, keyword, keywords)
+                _add(term, norm, multiterm, terms)
             elif state == NOUN and tag == 'JJ' and term[0].isupper():
-                _add(term, norm, keyword, keywords)
+                _add(term, norm, multiterm, terms)
             elif state == NOUN and not tag.startswith('N'):
                 state = SEARCH
-                if len(keyword) > 1:
-                    word = ' '.join([word for word, norm in keyword])
-                    keywords.setdefault(word, 0)
-                    keywords[word] += 1
-                keyword = []
-        # Phase 2: Only select the keywords that fulfill the filter criteria.
-        # Also create the keyword strength.
+                if len(multiterm) > 1:
+                    word = ' '.join([word for word, norm in multiterm])
+                    terms.setdefault(word, 0)
+                    terms[word] += 1
+                multiterm = []
+        # Phase 2: Only select the terms that fulfill the filter criteria.
+        # Also create the term strength.
         return [
             (word, occur, len(word.split()))
-            for word, occur in keywords.items()
+            for word, occur in terms.items()
             if self.filter(word, occur, len(word.split()))]
 
     def __call__(self, text):
-        """See interfaces.IKeywordExtractor"""
+        """See interfaces.ITermExtractor"""
         terms = self.tagger(text)
         return self.extract(terms)
 

Modified: topia.termextract/trunk/src/topia/termextract/interfaces.py
===================================================================
--- topia.termextract/trunk/src/topia/termextract/interfaces.py	2009-05-30 15:26:12 UTC (rev 100555)
+++ topia.termextract/trunk/src/topia/termextract/interfaces.py	2009-05-30 15:42:14 UTC (rev 100556)
@@ -44,9 +44,9 @@
         """Get a tagged list of words."""
 
 
-class IKeywordExtractor(zope.interface.Interface):
-    """Extract important keywords from a given text."""
+class ITermExtractor(zope.interface.Interface):
+    """Extract important terms from a given text."""
 
     def __call__(text):
-        """Returns a list of extracted keywords, the amount of occurences and
+        """Returns a list of extracted terms, the amount of occurences and
         their search strength."""