[Zope-CVS] CVS: Products/ZCTextIndex - Index.py:1.1.2.10
Jeremy Hylton
jeremy@zope.com
Thu, 2 May 2002 22:59:59 -0400
Update of /cvs-repository/Products/ZCTextIndex
In directory cvs.zope.org:/tmp/cvs-serv8005
Modified Files:
Tag: TextIndexDS9-branch
Index.py
Log Message:
Correct implementation of the cosine() function from Managing Gigabytes.
Store results as scaled ints instead of floats, using a function
scaled_int() that returns int(f * 256). I'm guessing this is
sufficient precision for ranking, because we're going to return a
percentage relevance based on the cosine.
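For reference, here's the whole scaling scheme spelled out; everything
besides scaled_int() itself is just illustration.  A scaled int keeps
about 8 bits of fraction, i.e. a resolution of 1/256 (~0.4%), which
should be plenty for a percentage:

    import math

    def scaled_int(f):
        # store a float as an int with 1/256 resolution
        return int(f * 256)

    w = 1 + math.log(3)     # w(d, t) for a term appearing 3 times
    s = scaled_int(w)       # 537
    print(s / 256.0)        # 2.09765625, vs. the exact 2.0986...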
The comment block explains the calculations in modest detail.
# Most of the computation of a relevance score for a
# document occurs in the search() method.  The code currently
# implements the cosine similarity function described in Managing
# Gigabytes, eq. 4.3, p. 187.  The index_doc() method
# precomputes some values that are independent of the particular
# query.
# The equation is:
#     cosine(d, q) = 1/(W(d) * W(q)) * sum(for t in Q^D: w(d,t) * w(q,t))
# where
#     w(d, t) = 1 + log f(d, t)
#     w(q, t) = log(1 + N/f(t))
#     W(d) = sqrt(sum(for t in D: w(d, t) ** 2))
#     W(q) = sqrt(sum(for t in Q: w(q, t) ** 2))
The function doc_term_weight() computes w(d, t).
The function query_term_weight() computes w(q, t).
The method query_weight() computes W(q).
The attribute self._docweight[d] stores W(d),
which is computed by _get_frequencies().
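To make the mapping concrete, here is the same rule written out in
plain floating point -- an illustrative sketch, not the committed
code.  f_dt maps each term of D to f(d, t), f_t maps each query term
to f(t), and N is the number of documents:

    import math

    def cosine(f_dt, f_t, N):
        # w(d, t) = 1 + log f(d, t), accumulated into W(d) ** 2
        w_d = {}
        Wd_sq = 0.0
        for t, count in f_dt.items():
            w = 1 + math.log(count)
            w_d[t] = w
            Wd_sq += w ** 2
        # w(q, t) = log(1 + N/f(t)); dot product over t in Q^D
        Wq_sq = 0.0
        dot = 0.0
        for t, ft in f_t.items():
            w = math.log(1 + float(N) / ft)
            Wq_sq += w ** 2
            if t in w_d:
                dot += w_d[t] * w
        return dot / math.sqrt(Wd_sq * Wq_sq)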
Add a bunch of helper methods _get_xxx() to help test the
implementation. One potential risk is that these _get_xxx() methods
need to stay in sync with the computations that occur in the actual
search and indexing methods, which don't call the _get_xxx() methods.
I think this is an acceptable risk because the _get_xxx() methods only
test intermediate steps in the computation of the cosine. The unit
tests also test the final cosine; the _get_xxx() methods are primarily
helpful for investigating the details if the cosine is wrong. It's
unlikely that the code will fail in a way where the cosine is still
computed correctly <wink>.
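For example, a test along these lines would catch drift between
_get_wt() and the weight search() actually uses (hypothetical test
code, not part of this checkin; it assumes the test module imports
query_term_weight from Index.py and that self.index and self.terms
come from the test fixture):

    def test_get_wt_matches_query_term_weight(self):
        N = len(self.index._docweight)
        for term in self.terms:
            ft = self.index._get_ft(term)
            self.assertEqual(self.index._get_wt(term),
                             query_term_weight(ft, N))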
XXX The code could use more comments.
=== Products/ZCTextIndex/Index.py 1.1.2.9 => 1.1.2.10 ===
from Products.ZCTextIndex.IIndex import IIndex
+def scaled_int(f):
+    return int(f * 256)

class Index:
    __implements__ = IIndex
@@ -40,6 +42,20 @@
        # used for un-indexing
        self._docwords = IOBTree()
+    # Most of the computation of a relevance score for a
+    # document occurs in the search() method.  The code currently
+    # implements the cosine similarity function described in Managing
+    # Gigabytes, eq. 4.3, p. 187.  The index_doc() method
+    # precomputes some values that are independent of the particular
+    # query.
+
+    # The equation is:
+    #     cosine(d, q) = 1/(W(d) * W(q)) * sum(for t in Q^D: w(d,t) * w(q,t))
+    # where
+    #     w(d, t) = 1 + log f(d, t)
+    #     w(q, t) = log(1 + N/f(t))
+    #     W(d) = sqrt(sum(for t in D: w(d, t) ** 2))
+    #     W(q) = sqrt(sum(for t in Q: w(q, t) ** 2))
+
    def index_doc(self, docid, text, threshold=None):
        wids = self._lexicon.sourceToWordIds(text)
        freqs, docweight = self._get_frequencies(wids)
@@ -63,26 +79,33 @@
        N = len(self._docweight)
        for wid in wids:
            map = self._wordinfo[wid]
-            inv = invfreq(N, len(map))
-            for docid, f in map.items():
-                w = f * inv / self._docweight[docid]
-                result[docid] = result.get(docid, 0) + w
+            idf = query_term_weight(len(map), N)
+            for docid, tf in map.items():
+                w = tf * idf / self._docweight[docid]
+                result[docid] = int(result.get(docid, 0) + w)
        return result
    def query_weight(self, terms):
        wids = []
        for term in terms:
            wids += self._lexicon.termToWordIds(term)
-        return self._get_frequencies(wids)[1]
+        N = len(self._docweight)
+        sum = 0
+        for wid in wids:
+            wt = math.log(1 + N / float(len(self._wordinfo[wid])))
+            sum += wt ** 2
+        return scaled_int(math.sqrt(sum))
    def _get_frequencies(self, wids):
+        """Return individual doc-term weights and docweight."""
+        # computes w(d, t) for each term and W(d)
        d = {}
        for wid in wids:
            d[wid] = d.get(wid, 0) + 1
        Wsquares = 0
        freqs = []
        for wid, count in d.items():
-            f = frequency(count)
+            f = doc_term_weight(count)
            Wsquares += f ** 2
            freqs.append((wid, f))
        return freqs, int(math.sqrt(Wsquares))
@@ -104,9 +127,35 @@
            del map[docid]
            self._wordinfo[wid] = map
-def frequency(count):
-    # XXX the logarithm calculations need to be added here
-    return count
+    # some helper methods included to support unit tests
+    def _get_wdt(self, d, t):
+        wid, = self._lexicon.termToWordIds(t)
+        map = self._wordinfo[wid]
+        return map.get(d, 0)
+
+    def _get_Wd(self, d):
+        return self._docweight[d]
+
+    def _get_ft(self, t):
+        wid, = self._lexicon.termToWordIds(t)
+        return len(self._wordinfo[wid])
+
+    def _get_wt(self, t):
+        wid, = self._lexicon.termToWordIds(t)
+        map = self._wordinfo[wid]
+        return scaled_int(math.log(1 + len(self._docweight) / float(len(map))))
+
+def doc_term_weight(count):
+    """Return the doc-term weight for a term that appears count times."""
+    # implements w(d, t) = 1 + log f(d, t)
+    return scaled_int(1 + math.log(count))
+
+def query_term_weight(term_count, num_terms):
+    """Return the query-term weight for a term.
+
+    The term appears in term_count of the num_terms documents in the
+    collection.
+    """
+    # implements w(q, t) = log(1 + N/f(t))
+    return scaled_int(math.log(1 + float(num_terms) / term_count))
-def invfreq(N, ft):
-    return 1 + (N / ft)
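For concreteness, the scaled weights come out like this for small
inputs (a quick interactive check of the two functions above):

    >>> import math
    >>> int((1 + math.log(3)) * 256)           # doc_term_weight(3)
    537
    >>> int(math.log(1 + 100 / 10.0) * 256)    # query_term_weight(10, 100)
    613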