[Zope] indexing pdf files

Terry Kerr terry@adroit.net
Fri, 01 Sep 2000 05:04:14 +1100


I just answered my own question.

The program is pdftotext, part of the xpdf package, which is available for
Unix machines.

It is very cool and very fast.
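
For anyone wanting to pipe to it from Python, here is a minimal sketch of one
way it might be done. The function names (pdftotext_command,
extract_pdf_text) and the injectable `run` parameter are made up for
illustration, not part of xpdf or Zope. It assumes pdftotext is on the PATH,
that giving "-" as the output file sends the text to stdout, and that the
build in use accepts "-enc UTF-8". The returned string is what you would then
store on the object for indexing.

```python
import os
import subprocess
import tempfile


def pdftotext_command(pdf_path):
    """Build the pdftotext argument list (hypothetical helper).

    Assumption: "-" as the output file name means "write to stdout",
    and "-enc UTF-8" selects UTF-8 output.
    """
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_pdf_text(pdf_bytes, run=subprocess.run):
    """Return the text of a PDF given its raw bytes.

    `run` defaults to subprocess.run; it is a parameter only so the
    function can be tried out on a machine without pdftotext installed.
    """
    # pdftotext reads its input from a file, so spool the upload to disk.
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(pdf_bytes)
        # check=True raises CalledProcessError if pdftotext exits non-zero.
        result = run(pdftotext_command(path), stdout=subprocess.PIPE, check=True)
        return result.stdout.decode("utf-8", errors="replace")
    finally:
        # Always clean up the temporary copy, even on failure.
        os.remove(path)
```

Calling extract_pdf_text(data) with the uploaded file's bytes should give back
plain text; a failed conversion surfaces as an exception rather than an empty
string, so a bad PDF does not get indexed silently.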

terry


Kapil Thangavelu wrote:

> Terry Kerr wrote:
> >
> > Hi,
> >
> > I need to be able to index the text within pdf files.  I assume I will
> > somehow use PrincipiaSearchSource, but I need to know how to get the
> > text out of the pdf when it is uploaded to the ZODB.  Has anyone done
> > this before?  Are there any packages around that I can use that run in
> > python or at least on a linux box that I can pipe to and from?
> >
> > terry
> >
>
> Going from XML to PDF, there are a multitude of ways in Python.
>
> XSLT - check out the XML zone at ibm.com/developer; they have an article
> in the education library on transforming XML to PDF.
>
> platypus packages from
> http://www.reportlab.com/
>
> They might give you some help in going the other way.
>
> as for implementation...
>
> Looking at a PDF in a text viewer, it appears to be formatting text and
> encoded display strings.
>
> You could write a subclass of File which reads its content upon upload,
> strips the formatting strings, decodes the display strings, and stores
> the result as a property to be indexed.
>
> Kapil

--
Terry Kerr (terry@adroit.net)
Adroit Internet Solutions Pty Ltd (www.adroit.net)
Phone:   +613 9563 4461
Fax:     +613 9563 3856
Mobile:  +61 414 938 124
ICQ:     79303381