[Checkins] SVN: topia.termextract/trunk/s - Rename KeywordExtractor to TermExtractor.
Stephan Richter
srichter at gmail.com
Sat May 30 11:42:14 EDT 2009
Log message for revision 100556:
- Rename KeywordExtractor to TermExtractor.
- Fixed up documentation to talk about terms instead of keywords.
- Added test for modal verb rule.
Changed:
U topia.termextract/trunk/setup.py
U topia.termextract/trunk/src/topia/termextract/README.txt
U topia.termextract/trunk/src/topia/termextract/example.txt
U topia.termextract/trunk/src/topia/termextract/extract.py
U topia.termextract/trunk/src/topia/termextract/interfaces.py
-=-
Modified: topia.termextract/trunk/setup.py
===================================================================
--- topia.termextract/trunk/setup.py 2009-05-30 15:26:12 UTC (rev 100555)
+++ topia.termextract/trunk/setup.py 2009-05-30 15:42:14 UTC (rev 100556)
@@ -38,7 +38,7 @@
read('CHANGES.txt')
),
license = "ZPL 2.1",
- keywords = "pos taggerlinguistics",
+ keywords = "content term extract pos tagger linguistics",
classifiers = [
'Development Status :: 4 - Beta',
'Environment :: Web Environment',
Modified: topia.termextract/trunk/src/topia/termextract/README.txt
===================================================================
--- topia.termextract/trunk/src/topia/termextract/README.txt 2009-05-30 15:26:12 UTC (rev 100555)
+++ topia.termextract/trunk/src/topia/termextract/README.txt 2009-05-30 15:42:14 UTC (rev 100556)
@@ -1,8 +1,8 @@
-==================
-Keyword Extraction
-==================
+===============
+Term Extraction
+===============
-This package implements text keyword extraction by making use of a simple
+This package implements text term extraction by making use of a simple
Parts-Of-Speech (POS) tagging algorithm.
http://bioie.ldc.upenn.edu/wiki/index.php/Part-of-Speech
@@ -41,7 +41,7 @@
['This', 'is', 'a', 'simple', 'example', '.']
While most tokenizers ignore punctuation, it is important for us to keep it,
-since we need it later for the keyword extraction. Let's now look at some more
+since we need it later for the term extraction. Let's now look at some more
complex cases:
- Quoted Text
@@ -172,6 +172,26 @@
>>> tagger('. Stephan')
[['.', '.', '.'], ['Stephan', 'NNP', 'Stephan']]
+- Determine Verb after Modal Verb
+
+ >>> tagger('The fox can jump')
+ [['The', 'DT', 'The'],
+ ['fox', 'NN', 'fox'],
+ ['can', 'MD', 'can'],
+ ['jump', 'VB', 'jump']]
+ >>> tagger("The fox can't jump")
+ [['The', 'DT', 'The'],
+ ['fox', 'NN', 'fox'],
+ ['can', 'MD', 'can'],
+ ["'t", 'RB', "'t"],
+ ['jump', 'VB', 'jump']]
+ >>> tagger('The fox can really jump')
+ [['The', 'DT', 'The'],
+ ['fox', 'NN', 'fox'],
+ ['can', 'MD', 'can'],
+ ['really', 'RB', 'really'],
+ ['jump', 'VB', 'jump']]
+
- Normalize Plural Forms
>>> tagger('examples')
@@ -189,15 +209,15 @@
[['feet', 'NNS', 'feet']]
-Keywordword Extraction
-----------------------
+Term Extraction
+---------------
-Now that we can tag a text, let's have a look at the keyword extractions.
+Now that we can tag a text, let's have a look at the term extractions.
>>> from topia.termextract import extract
- >>> extractor = extract.KeywordExtractor()
+ >>> extractor = extract.TermExtractor()
>>> extractor
- <KeywordExtractor using <Tagger for english>>
+ <TermExtractor using <Tagger for english>>
As you can see, the extractor maintains a tagger:
@@ -207,17 +227,17 @@
When creating an extractor, you can also pass in a tagger to avoid frequent
tagger initialization:
- >>> extractor = extract.KeywordExtractor(tagger)
+ >>> extractor = extract.TermExtractor(tagger)
>>> extractor.tagger is tagger
True
-Let's get the keywords for a simple text.
+Let's get the terms for a simple text.
>>> extractor("The fox can't jump over the fox's tail.")
[]
-We got no keywords. That's because by default at least 3 occurences of a
-keyword must be detected, if the keyword consists of a single word.
+We got no terms. That's because by default at least 3 occurences of a
+term must be detected, if the term consists of a single word.
The extractor maintains a filter component. Let's register the trivial
permissive filter, which simply return everything that the extractor suggests:
@@ -233,12 +253,12 @@
>>> extractor("The fox can't jump over the fox's tail.")
[('fox', 2, 1)]
-Let's now have a look at multi-word keywords. Oftentimes multi-word nouns and
+Let's now have a look at multi-word terms. Oftentimes multi-word nouns and
proper names occur only once or twice in a text. But they are often great
-keywords! To handle this scenario, the concept of "strength" was
+terms! To handle this scenario, the concept of "strength" was
introduced. Currently the strength is simply the amount of words in the
-keyword/term. By default, all keywords with a strength larger than 1 are
-selected regardless of the number of occurances.
+term. By default, all terms with a strength larger than 1 are selected
+regardless of the number of occurances.
>>> extractor('The German consul of Boston resides in Newton.')
[('German consul', 1, 2)]
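
Note on the README hunk above: the selection rule it describes (a single-word term needs at least 3 occurrences, anything with a strength above 1 always passes) can be sketched as a small callable. This is only an illustration. The class name, parameter names and default values are assumptions inferred from the README wording and from the `return` expression visible in the extract.py hunk, not a copy of the package's filter.

```python
class DefaultFilterSketch:
    """Illustrative stand-in for the default filter (hypothetical name)."""

    def __init__(self, single_strength_min_occur=3, no_limit_strength=2):
        # Assumed defaults: the README says single-word terms need at least
        # 3 occurrences and terms with strength > 1 are always selected.
        self.single_strength_min_occur = single_strength_min_occur
        self.no_limit_strength = no_limit_strength

    def __call__(self, word, occur, strength):
        # Same boolean shape as the return expression in extract.py.
        return ((strength == 1 and occur >= self.single_strength_min_occur)
                or strength >= self.no_limit_strength)


term_filter = DefaultFilterSketch()
print(term_filter('fox', 2, 1))            # single word, too few occurrences
print(term_filter('German consul', 1, 2))  # multi-word term, always kept
```

Under these assumptions 'fox' with two occurrences is rejected and 'German consul' is accepted, which is consistent with the two README outputs quoted in the hunk.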
Modified: topia.termextract/trunk/src/topia/termextract/example.txt
===================================================================
--- topia.termextract/trunk/src/topia/termextract/example.txt 2009-05-30 15:26:12 UTC (rev 100555)
+++ topia.termextract/trunk/src/topia/termextract/example.txt 2009-05-30 15:42:14 UTC (rev 100556)
@@ -2,8 +2,8 @@
A News Article
==============
-This document provides a simple example of extracting the keywords of a BBC
-article from May 29, 2009. We will use several keyword extraction tools to
+This document provides a simple example of extracting the terms of a BBC
+article from May 29, 2009. We will use several term extraction tools to
compare the outcome.
>>> text ='''
@@ -52,7 +52,7 @@
Yahoo Keyword Extractor
-----------------------
-Yahoo provides a service that extracts keywords from a piece of content using
+Yahoo provides a service that extracts terms from a piece of content using
its immense search database.
http://developer.yahoo.com/search/content/V1/termExtraction.html
@@ -358,7 +358,7 @@
calculation,
>>> from topia.termextract import extract
- >>> extractor = extract.KeywordExtractor()
+ >>> extractor = extract.TermExtractor()
Let's look at the result of the tagger first:
Modified: topia.termextract/trunk/src/topia/termextract/extract.py
===================================================================
--- topia.termextract/trunk/src/topia/termextract/extract.py 2009-05-30 15:26:12 UTC (rev 100555)
+++ topia.termextract/trunk/src/topia/termextract/extract.py 2009-05-30 15:42:14 UTC (rev 100556)
@@ -11,7 +11,7 @@
# FOR A PARTICULAR PURPOSE.
#
##############################################################################
-"""POS Tagger
+"""Term Extractor
$Id$
"""
@@ -35,13 +35,13 @@
return ((strength == 1 and occur >= self.singleStrengthMinOccur) or
(strength >= self.noLimitStrength))
-def _add(term, norm, keyword, keywords):
- keyword.append((term, norm))
- keywords.setdefault(norm, 0)
- keywords[norm] += 1
+def _add(term, norm, multiterm, terms):
+ multiterm.append((term, norm))
+ terms.setdefault(norm, 0)
+ terms[norm] += 1
-class KeywordExtractor(object):
- zope.interface.implements(interfaces.IKeywordExtractor)
+class TermExtractor(object):
+ zope.interface.implements(interfaces.ITermExtractor)
def __init__(self, tagger=None, filter=None):
if tagger is None:
@@ -52,41 +52,41 @@
filter = DefaultFilter()
self.filter = filter
- def extract(self, terms):
- """See interfaces.IKeywordExtractor"""
- keywords = {}
+ def extract(self, taggedTerms):
+ """See interfaces.ITermExtractor"""
+ terms = {}
# Phase 1: A little state machine is used to build simple and
- # composite keywords.
- keyword = []
+ # composite terms.
+ multiterm = []
state = SEARCH
- while terms:
- term, tag, norm = terms.pop(0)
+ while taggedTerms:
+ term, tag, norm = taggedTerms.pop(0)
if state == SEARCH and tag.startswith('N'):
state = NOUN
- _add(term, norm, keyword, keywords)
+ _add(term, norm, multiterm, terms)
elif state == SEARCH and tag == 'JJ' and term[0].isupper():
state = NOUN
- _add(term, norm, keyword, keywords)
+ _add(term, norm, multiterm, terms)
elif state == NOUN and tag.startswith('N'):
- _add(term, norm, keyword, keywords)
+ _add(term, norm, multiterm, terms)
elif state == NOUN and tag == 'JJ' and term[0].isupper():
- _add(term, norm, keyword, keywords)
+ _add(term, norm, multiterm, terms)
elif state == NOUN and not tag.startswith('N'):
state = SEARCH
- if len(keyword) > 1:
- word = ' '.join([word for word, norm in keyword])
- keywords.setdefault(word, 0)
- keywords[word] += 1
- keyword = []
- # Phase 2: Only select the keywords that fulfill the filter criteria.
- # Also create the keyword strength.
+ if len(multiterm) > 1:
+ word = ' '.join([word for word, norm in multiterm])
+ terms.setdefault(word, 0)
+ terms[word] += 1
+ multiterm = []
+ # Phase 2: Only select the terms that fulfill the filter criteria.
+ # Also create the term strength.
return [
(word, occur, len(word.split()))
- for word, occur in keywords.items()
+ for word, occur in terms.items()
if self.filter(word, occur, len(word.split()))]
def __call__(self, text):
- """See interfaces.IKeywordExtractor"""
+ """See interfaces.ITermExtractor"""
terms = self.tagger(text)
return self.extract(terms)
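
Note on the extract.py hunk above: the two-phase loop in `extract()` may be easier to follow as a standalone function. The sketch below is a simplified Python 3 rewrite under stated assumptions, not the committed code: `extract_terms`, `keep` and the hand-written tag list are hypothetical, and the four "add" branches of the original are folded into one test. Like the original loop, it only records a multi-word term when a non-noun token ends the run.

```python
from collections import Counter

SEARCH, NOUN = 0, 1


def extract_terms(tagged_terms, keep=lambda word, occur, strength: True):
    """Sketch of the extraction loop; tagged_terms holds (term, tag, norm)."""
    counts = Counter()
    multiterm = []
    state = SEARCH
    # Phase 1: collect runs of nouns and capitalized adjectives.
    for term, tag, norm in tagged_terms:
        noun_like = tag.startswith('N') or (tag == 'JJ' and term[0].isupper())
        if noun_like:
            state = NOUN
            multiterm.append(term)
            counts[norm] += 1
        elif state == NOUN:
            # A non-noun token ends the run; runs longer than one word
            # are also counted as a composite term.
            state = SEARCH
            if len(multiterm) > 1:
                counts[' '.join(multiterm)] += 1
            multiterm = []
    # Phase 2: strength is the number of words; apply the filter.
    return [(word, occur, len(word.split()))
            for word, occur in counts.items()
            if keep(word, occur, len(word.split()))]


# Hand-tagged input (assumed tags, not produced by the package's tagger).
tagged = [
    ('The', 'DT', 'The'), ('German', 'JJ', 'German'),
    ('consul', 'NN', 'consul'), ('of', 'IN', 'of'),
    ('Boston', 'NNP', 'Boston'), ('resides', 'VBZ', 'resides'),
    ('in', 'IN', 'in'), ('Newton', 'NNP', 'Newton'), ('.', '.', '.'),
]
strict = lambda word, occur, strength: (
    (strength == 1 and occur >= 3) or strength >= 2)
print(extract_terms(tagged, keep=strict))
```

With the assumed tags and the strict filter, only the composite term 'German consul' survives, in line with the README example for the same sentence.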
Modified: topia.termextract/trunk/src/topia/termextract/interfaces.py
===================================================================
--- topia.termextract/trunk/src/topia/termextract/interfaces.py 2009-05-30 15:26:12 UTC (rev 100555)
+++ topia.termextract/trunk/src/topia/termextract/interfaces.py 2009-05-30 15:42:14 UTC (rev 100556)
@@ -44,9 +44,9 @@
"""Get a tagged list of words."""
-class IKeywordExtractor(zope.interface.Interface):
- """Extract important keywords from a given text."""
+class ITermExtractor(zope.interface.Interface):
+ """Extract important terms from a given text."""
def __call__(text):
- """Returns a list of extracted keywords, the amount of occurences and
+ """Returns a list of extracted terms, the amount of occurences and
their search strength."""
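
Note on the interface above: per the docstring, each result is a (term, occurrences, strength) triple. A caller who wants the strongest terms first could order the list as in the sketch below. `rank_terms` is a hypothetical helper, not part of the package, and the sample triples are taken from the README outputs quoted earlier in this diff.

```python
def rank_terms(results):
    """Order (term, occurrences, strength) triples, strongest first."""
    # Sort by strength, then by occurrences (both descending), then by
    # the term itself so that ties come out in a stable, readable order.
    return sorted(results, key=lambda item: (-item[2], -item[1], item[0]))


sample = [('fox', 2, 1), ('German consul', 1, 2)]
for term, occurrences, strength in rank_terms(sample):
    print(term, occurrences, strength)
```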