[Zope] attribute used to index PDFs?
Garth B.
garthb at gmail.com
Mon Dec 12 14:54:09 EST 2005
On closer inspection, the Word docs aren't being indexed properly
either. When I browse the vocabulary for these indexed Word docs, I
see the same textual content that shows up when I cat the document to
stdout. The vocabulary also includes other strings that are certainly
not content. I guess they're string representations of binary
content.
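To illustrate how raw binary data could end up in the vocabulary, here
is a minimal sketch in modern Python. The tokenizer (naive_tokens) and
the sample bytes are made up for illustration and are not TextIndexNG's
splitter or a real Word file. It only shows that splitting unconverted
bytes into word-like runs yields both real words and junk.

```python
import re


def naive_tokens(raw):
    """Hypothetical word splitter: treats raw bytes as text without conversion."""
    # latin-1 maps every byte to a character, so decoding never fails,
    # which is roughly what happens when binary data is indexed as-is.
    text = raw.decode("latin-1")
    # Keep runs of three or more ASCII letters or digits as "words".
    return re.findall(r"[A-Za-z0-9]{3,}", text)


# Invented sample: some binary header bytes, a readable phrase, more binary.
sample = b"\xd0\xcf\x11\xe0Quarterly report\x00\x01xK9q\xff\xfeZZt4\x00"

tokens = naive_tokens(sample)
print(tokens)  # ['Quarterly', 'report', 'xK9q', 'ZZt4']
```

The readable words survive, but so do fragments like "xK9q" that are
not content, which matches what I see in the vocabulary.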
Here are some other things I noticed; maybe they won't amount to anything:
- When I watch the processes with top during indexing, I don't see
wvWare or pdftotext appear. Maybe they wouldn't show up there anyway.
- I also inserted a couple of LOG.warn() calls in
src/textindexng/content.py around line 130
( if d.has_key('mimetype'): ). This test always fails, so conversion
is skipped.
- Digging further in this file, "mimetype" is only defined when
extract_content() in content.py calls "icc.addBinary(...)". This only
happens when the indexed object provides a txng_get() hook (or, I
suppose, if an adapter exists). That whole block (around lines 81-93)
is never reached with my PDFs or Word docs during indexing. When I
index a large number of PDFs, I get a number of TypeErrors raised
around line 110, where extract_content() notices that the data isn't a
[unicode] string.
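To make the control flow in the list above concrete, here is a
simplified, self-contained model of it. This is not the TextIndexNG
source. The names FakeCollector, extract_content_sketch, PlainFile,
HookedFile and needs_conversion, and the txng_get() signature, are all
assumptions for illustration. It is written in modern Python, where
str plays the role of unicode.

```python
class FakeCollector:
    """Hypothetical stand-in for the content collector ("icc" above)."""

    def __init__(self):
        self.entries = []

    def add_binary(self, field, data, mimetype):
        # Only this path records a mimetype, so only this path could
        # later trigger a converter such as pdftotext or wvWare.
        self.entries.append({"field": field, "data": data, "mimetype": mimetype})

    def add_text(self, field, text):
        self.entries.append({"field": field, "data": text})


def extract_content_sketch(obj, field):
    """Model of the dispatch: hook branch first, plain attribute otherwise."""
    collector = FakeCollector()
    hook = getattr(obj, "txng_get", None)
    if hook is not None:
        # Assumed hook contract: returns (data, mimetype) for the field.
        data, mimetype = hook(field)
        collector.add_binary(field, data, mimetype)
    else:
        value = getattr(obj, field)
        if not isinstance(value, str):
            # Mirrors the TypeError seen when the data is not a text string.
            raise TypeError("%r is not a text string" % field)
        collector.add_text(field, value)
    return collector


def needs_conversion(collector):
    """Entries that would pass a has_key('mimetype') style test."""
    return [e for e in collector.entries if "mimetype" in e]


class PlainFile:
    """Object with only a data attribute, like my File objects appear to be."""

    def __init__(self, data):
        self.data = data


class HookedFile(PlainFile):
    """Same object, plus a hypothetical txng_get hook."""

    def txng_get(self, field):
        return getattr(self, field), "application/pdf"
```

In this model, HookedFile(b"%PDF-1.4") produces one entry with a
mimetype, PlainFile("some text") produces an entry with no mimetype (so
conversion is skipped), and PlainFile(b"%PDF-1.4") raises TypeError.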
Is the standard Zope File object supposed to expose a txng_get hook?
On 12/12/05, Garth B. <garthb at gmail.com> wrote:
> Hi Andreas,
>
> Neither PrincipiaSearchSource nor SearchableText does anything for
> these File-type objects. I guess nothing for SearchableText is
> expected since these are not CMF or Plone-derived objects. The only
> way I've managed to get *anything* indexed for these File-type objects
> is by specifying the "data" attribute.
>
> A couple of related postings that I've found through a bit of Googling
> have also noted having to use "data" when indexing these kinds of
> files, for example:
> http://mail.zope.org/pipermail/zope/2003-August/139702.html
>
> So, I should be able to use PrincipiaSearchSource? I've only used
> that for text-oriented objects like Page Templates. I'll keep digging
> around, but I welcome any suggestions for what the problem could be or
> how I can debug this further.
>
> Garth
>
> On 12/12/05, Andreas Jung <lists at andreas-jung.com> wrote:
> >
> >
> > --On 12. Dezember 2005 11:33:13 -0500 "Garth B." <garthb at gmail.com> wrote:
> >
> > > TextIndexNG 3.1.1
> > > Zope 2.8.0
> > > Python 2.3.5
> > >
> > > What attribute should be specified when indexing PDFs? I've been
> > > using "data". Word docs are indexed properly, but the PDFs aren't.
> > > The PDFs are still found with the rest of the files, but the indexed
> > > content is not what I expected.
> >
> > Depends on the content-type. PrincipiaSearchSource for core Zope types such as
> > File and DTML, and SearchableText for any CMF or Plone content-type.
> >
> > -aj
> >
> >
>