[Zope-CVS] CVS: Products/ZCTextIndex - Index.py:1.1.2.10

Jeremy Hylton jeremy@zope.com
Thu, 2 May 2002 22:59:59 -0400


Update of /cvs-repository/Products/ZCTextIndex
In directory cvs.zope.org:/tmp/cvs-serv8005

Modified Files:
      Tag: TextIndexDS9-branch
	Index.py 
Log Message:
Correct implementation of the cosine() function from Managing Gigabytes.

Store results as scaled ints instead of floats, using a function
scaled_int() that multiplies a float by 256 and truncates to an int.
I'm guessing this is sufficient precision for ranking, because we're
going to return a percentage relevance based on the cosine.
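
For a rough sense of the precision: one scaled unit is 1/256 of a raw
weight, so a typical weight around 2.0 is stored to roughly 0.2%,
comfortably finer than a whole-percent relevance score.  A quick
sketch, using the same scaled_int() as in the diff below:

    import math

    def scaled_int(f):
        return int(f * 256)

    w = 1 + math.log(3)          # w(d, t) for a term with f(d, t) = 3
    assert scaled_int(w) == 537  # stored as 537 rather than 2.0986...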

The comment block explains the calculations in modest detail.

     # Most of the computation for computing a relevance score for the
     # document occurs in the search() method.  The code currently
     # implements the cosine similarity function described in Managing
     # Gigabytes, eq. 4.3, p. 187.  The index_object() method
     # precomputes some values that are independent of the particular
     # query.
 
     # The equation is:
     # cosine(q, d) = 1/(W(d) * W(q)) * sum(for t in Q^D: w(d,t) * w(q,t))
     # where w(d, t) = 1 + log f(d, t)
     #       w(q, t) = log(1 + N/f(t))
     #       W(d) = sqrt(sum(for t in D: w(d, t) ** 2))
     #       W(q) = sqrt(sum(for t in Q: w(q, t) ** 2))

The function doc_term_weight() computes w(d, t).
The function query_term_weight() computes w(q, t).
The method query_weight() computes W(q).
The attribute self._docweight[d] stores W(d),
    which is computed by _get_frequencies().
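
To make those pieces concrete, here's a tiny standalone sketch (not
part of the patch; all the numbers are made up) that runs the same
equation on a toy collection where document d contains exactly the
two query terms:

    import math

    def w_dt(f_dt):    # w(d, t) = 1 + log f(d, t)
        return 1 + math.log(f_dt)

    def w_qt(N, f_t):  # w(q, t) = log(1 + N/f(t))
        return math.log(1 + float(N) / f_t)

    # N = 10 documents; each pair is (f(d, t), f(t)) for one term.
    N = 10
    terms = [(3, 2), (1, 5)]

    Wd_sq = Wq_sq = dot = 0.0
    for f_dt, f_t in terms:
        Wd_sq += w_dt(f_dt) ** 2
        Wq_sq += w_qt(N, f_t) ** 2
        dot += w_dt(f_dt) * w_qt(N, f_t)
    cosine = dot / (math.sqrt(Wd_sq) * math.sqrt(Wq_sq))  # ~0.99 here

The index does the same arithmetic, only with scaled ints and with
W(d) precomputed at indexing time rather than at search time.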

Add a bunch of helper methods _get_xxx() to help test the
implementation.  One potential risk is that these _get_xxx() methods
need to stay in sync with the computations that occur in the actual
search and indexing methods, which don't call the _get_xxx() methods.

I think this is an acceptable risk because the _get_xxx() methods only
test intermediate steps in the computation of the cosine.  The unit
tests also test the final cosine; the _get_xxx() methods are primarily
helpful for investigating the details if the cosine is wrong.  It's
unlikely that the code will fail in such a way that the cosine is
still computed correctly <wink>.
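
For example, a test along these lines (hypothetical, and assuming a
setUp() that builds self.index and indexes a few documents containing
the word 'cosine') could cross-check _get_wt() against the w(q, t)
formula:

    import math
    import unittest

    from Products.ZCTextIndex.Index import scaled_int

    class WeightTest(unittest.TestCase):
        # setUp building and populating self.index is omitted here

        def test_wt_matches_formula(self):
            N = len(self.index._docweight)
            ft = self.index._get_ft('cosine')
            expected = scaled_int(math.log(1 + float(N) / ft))
            self.assertEqual(self.index._get_wt('cosine'), expected)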

XXX The code could use more comments.




=== Products/ZCTextIndex/Index.py 1.1.2.9 => 1.1.2.10 ===
 from Products.ZCTextIndex.IIndex import IIndex
 
+def scaled_int(f):
+    return int(f * 256)
 
 class Index:
     __implements__ = IIndex
@@ -40,6 +42,20 @@
         # used for un-indexing
         self._docwords = IOBTree()
 
+    # Most of the computation for computing a relevance score for the
+    # document occurs in the search() method.  The code currently
+    # implements the cosine similarity function described in Managing
+    # Gigabytes, eq. 4.3, p. 187.  The index_object() method
+    # precomputes some values that are independent of the particular
+    # query.
+
+    # The equation is:
+    # cosine(q, d) = 1/(W(d) * W(q)) * sum(for t in Q^D: w(d,t) * w(q,t))
+    # where w(d, t) = 1 + log f(d, t)
+    #       w(q, t) = log(1 + N/f(t))
+    #       W(d) = sqrt(sum(for t in D: w(d, t) ** 2))
+    #       W(q) = sqrt(sum(for t in Q: w(q, t) ** 2))
+
     def index_doc(self, docid, text, threshold=None):
         wids = self._lexicon.sourceToWordIds(text)
         freqs, docweight = self._get_frequencies(wids)
@@ -63,26 +79,33 @@
         N = len(self._docweight)
         for wid in wids:
             map = self._wordinfo[wid]
-            inv = invfreq(N, len(map))
-            for docid, f in map.items():
-                w = f * inv / self._docweight[docid]
-                result[docid] = result.get(docid, 0) + w
+            idf = query_term_weight(len(map), N)  # scaled w(q, t)
+            for docid, tf in map.items():  # tf is the scaled w(d, t)
+                w = tf * idf / self._docweight[docid]
+                result[docid] = int(result.get(docid, 0) + w)
         return result
 
     def query_weight(self, terms):
         wids = []
         for term in terms:
             wids += self._lexicon.termToWordIds(term)
-        return self._get_frequencies(wids)[1]
+        N = len(self._docweight)
+        sum = 0
+        for wid in wids:
+            wt = math.log(1 + float(N) / len(self._wordinfo[wid]))
+            sum += wt ** 2
+        return scaled_int(math.sqrt(sum))
 
     def _get_frequencies(self, wids):
+        """Return individual doc-term weights and docweight."""
+        # computes w(d, t) for each term and W(d)
         d = {}
         for wid in wids:
             d[wid] = d.get(wid, 0) + 1
         Wsquares = 0
         freqs = []
         for wid, count in d.items():
-            f = frequency(count)
+            f = doc_term_weight(count)
             Wsquares += f ** 2
             freqs.append((wid, f))
         return freqs, int(math.sqrt(Wsquares))
@@ -104,9 +127,35 @@
         del map[docid]
         self._wordinfo[wid] = map
 
-def frequency(count):
-    # XXX the logarithm calculations need to be added here
-    return count
+    # some helper methods included to support unit tests
+    def _get_wdt(self, d, t):
+        wid, = self._lexicon.termToWordIds(t)
+        map = self._wordinfo[wid]
+        return map.get(d, 0)
+
+    def _get_Wd(self, d):
+        return self._docweight[d]
+
+    def _get_ft(self, t):
+        wid, = self._lexicon.termToWordIds(t)
+        return len(self._wordinfo[wid])
+
+    def _get_wt(self, t):
+        wid, = self._lexicon.termToWordIds(t)
+        map = self._wordinfo[wid]
+        return scaled_int(math.log(1 + len(self._docweight) / float(len(map))))
+
+def doc_term_weight(count):
+    """Return the doc-term weight for a term that appears count times."""
+    # implements w(d, t) = 1 + log f(d, t)
+    return scaled_int(1 + math.log(count))
+
+def query_term_weight(term_count, num_terms):
+    """Return the query-term weight for a term,
+
+    that appears term_count times in a collection with num_terms
+    unique terms.
+    """
+    # implements w(q, t) = log(1 + N/f(t))
+    return scaled_int(math.log(1 + float(num_terms) / term_count))
 
-def invfreq(N, ft):
-    return 1 + (N / ft)