Re: [Zope] attribute used to index PDFs?

12 Dec 2005

On closer inspection, the Word docs aren't actually being indexed
properly either.  When I browse the vocabulary for these indexed Word
docs, I see some of the same text that shows up when I cat the
document to stdout.  The vocabulary also includes other strings that
are certainly not content.  I guess they're string representations of
binary content.
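
To illustrate why indexing the raw "data" attribute could produce this
kind of vocabulary, here is a small sketch that pulls runs of printable
ASCII out of binary data, much like the Unix strings tool.  The
function name and the sample bytes are made up for illustration; this
is not what TextIndexNG does internally, only a way to see what
"text" survives in an unconverted file.

```python
import re


def printable_runs(data, min_len=4):
    """Return runs of at least min_len printable ASCII bytes found in data."""
    # Printable ASCII is the range from space (0x20) to tilde (0x7e).
    pattern = re.compile(b"[\x20-\x7e]{" + str(min_len).encode("ascii") + b",}")
    return [run.decode("ascii") for run in pattern.findall(data)]


# Hypothetical sample: a PDF-like header, binary noise, and one readable phrase.
sample = b"%PDF-1.4\n\x00\x01\xff stream \x9c\xfe Hello world \x00\x02 endobj"

for run in printable_runs(sample):
    print(repr(run))
```

Running this against a real Word doc or PDF gives a rough idea of which
terms would end up in the vocabulary if no converter is ever called.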

Here are some other things I noticed; they may not amount to anything:

- When I watch the processes with top during indexing, I never see
wvWare or pdftotext appear.  Maybe they wouldn't show up anyway.

- I also inserted a couple of LOG.warn calls in
src/textindexng/content.py around line 130
(  if d.has_key('mimetype'):  ).  That test always fails, so
conversion is skipped.

- Digging further in this file, "mimetype" is only set when
extract_content() in content.py calls "icc.addBinary(...)".  That only
happens when the indexed object provides a txng_get() hook (or, I
suppose, when an adapter exists).  That whole block (around lines 81 -
93) is never hit for my PDFs or Word docs during indexing.  When I
index a large number of PDFs, I get a number of TypeErrors raised
around line 110, where extract_content() notices that the data isn't a
[unicode] string.
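
The flow described in the bullets above can be modelled roughly as
follows.  This is a self-contained sketch, not the TextIndexNG source:
only the names txng_get, addBinary, extract_content and the 'mimetype'
key come from content.py as described above.  FakeCollector, PlainFile,
HookedFile, needs_conversion and every signature here are assumptions
made up for illustration.

```python
class FakeCollector(object):
    """Hypothetical stand-in for the collector that content.py calls 'icc'."""

    def __init__(self):
        self.entries = []

    def addBinary(self, field, data, mimetype):
        # Assumed signature; the point is only that 'mimetype' gets recorded.
        self.entries.append({"field": field, "content": data, "mimetype": mimetype})


class PlainFile(object):
    """Mimics a File-like object that exposes only a raw 'data' attribute."""

    def __init__(self, data, content_type):
        self.data = data
        self.content_type = content_type


class HookedFile(PlainFile):
    """Same object, plus a txng_get() hook (assumed signature)."""

    def txng_get(self, fields):
        icc = FakeCollector()
        for field in fields:
            icc.addBinary(field, self.data, self.content_type)
        return icc


def extract_content(obj, fields):
    """Model of the two code paths: hook present versus plain attribute lookup."""
    hook = getattr(obj, "txng_get", None)
    if hook is not None:
        return hook(fields).entries
    entries = []
    for field in fields:
        value = getattr(obj, field, None)
        if value is None:
            continue
        if not isinstance(value, str):
            # Corresponds to the TypeErrors seen around line 110.
            raise TypeError("%r is not a text string" % field)
        entries.append({"field": field, "content": value})
    return entries


def needs_conversion(d):
    """The test around line 130: convert only when a mimetype was recorded."""
    return "mimetype" in d
```

In this model a PlainFile never reaches addBinary, so needs_conversion()
is always false for it and no converter would run, while a HookedFile
carries its mimetype through to the conversion test.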

Is the standard Zope File object supposed to expose a txng_get hook?

On 12/12/05, Garth B. <garthb@gmail.com> wrote:
...
Hi Andreas,
Neither PrincipiaSearchSource nor SearchableText does anything for
these File-type objects.  I guess nothing for SearchableText is
expected since these are not CMF or Plone-derived objects.  The only
way I've managed to get *anything* indexed for these File-type objects
is by specifying the "data" attribute.
A couple of related postings that I've found through a bit of Googling
have also noted having to use "data" when indexing these kinds of
files, for example:
http://mail.zope.org/pipermail/zope/2003-August/139702.html
So, I should be able to use PrincipiaSearchSource?  I've only used
that for text-oriented objects like Page Templates.  I'll keep digging
around, but I welcome any suggestions for what the problem could be or
how I can debug this further.
Garth
On 12/12/05, Andreas Jung <lists@andreas-jung.com> wrote:
...
--On 12 December 2005 11:33:13 -0500 "Garth B." <garthb@gmail.com> wrote:
...
TextIndexNG 3.1.1
Zope 2.8.0
Python 2.3.5
What attribute should be specified when indexing PDFs?  I've been
using "data".  Word docs are indexed properly, but the PDFs aren't.
The PDFs are still found with the rest of the files, but the indexed
content is not what I expected.
Depends on the content-type. PrincipiaSearchSource for core Zope types such
as File and DTML, and SearchableText for any CMF or Plone content-type.
-aj