[Zope] indexing pdf files
Kapil Thangavelu
kthangavelu@earthlink.net
Thu, 31 Aug 2000 17:51:24 -0700
Terry Kerr wrote:
>
> Hi,
>
> I need to be able to index the text within pdf files. I assume I will
> somehow use PrincipiaSearchSource, but I need to know how to get the
> text out of the pdf when it is uploaded to the ZODB. Has anyone done
> this before? Are there any packages around that I can use that run in
> python or at least on a linux box that I can pipe to and from?
>
> terry
>
For xml2pdf, there are a multitude of ways in Python:
XSLT - check out the ibm.com/developer xmlzone; they have an article in
the education library on transforming XML to PDF.
The platypus packages from
http://www.reportlab.com/
These might give you some help in going the other way (PDF to text).
As for implementation:
looking at a PDF in a text viewer, it appears to consist of formatting
text and encoded display strings.
You could write a subclass of File which reads its content upon upload,
strips the formatting strings, decodes the display strings, and stores
the result as a property to be indexed.
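
A rough sketch of the stripping/decoding step, in plain Python with only
the standard library. The function name extract_text and the helper names
are made up for illustration. It assumes the text sits in uncompressed
content streams as literal "(...)" strings between BT and ET operators.
Many real PDFs compress their streams (FlateDecode) or use custom font
encodings, and this sketch will find little or nothing in those. Wiring
the result into a File subclass and the catalog is left out.

```python
import re

# Text objects in a PDF content stream sit between the BT and ET operators.
_TEXT_BLOCK = re.compile(rb"\bBT\b(.*?)\bET\b", re.S)
# A literal string "(...)" with backslash escapes; nested bare parens are not handled.
_LITERAL = re.compile(rb"\(((?:\\.|[^\\()])*)\)", re.S)
# A backslash followed by an octal code (1-3 digits) or by any single byte.
_ESCAPE = re.compile(rb"\\(?:([0-7]{1,3})|(.))", re.S)
# Named escapes; backslash-newline is a line continuation and yields nothing.
_SIMPLE = {b"n": b"\n", b"r": b"\r", b"t": b"\t",
           b"b": b"\b", b"f": b"\f", b"\n": b""}


def _unescape(raw):
    """Decode backslash escapes inside one PDF literal string."""
    def repl(match):
        octal, char = match.groups()
        if octal is not None:
            return bytes([int(octal, 8) & 0xFF])
        # Unknown escapes (including \( \) and \\) map to the byte itself.
        return _SIMPLE.get(char, char)
    return _ESCAPE.sub(repl, raw)


def extract_text(pdf_bytes):
    """Return the display strings found in uncompressed text objects.

    Hypothetical helper: pieces are joined with single spaces, so words
    that a PDF splits across several strings come out with extra spaces.
    """
    chunks = []
    for block in _TEXT_BLOCK.finditer(pdf_bytes):
        for literal in _LITERAL.finditer(block.group(1)):
            chunks.append(_unescape(literal.group(1)))
    return b" ".join(chunks).decode("latin-1")


if __name__ == "__main__":
    sample = b"BT /F1 12 Tf (Hello \\(Zope\\)) Tj ET\nBT (second line) Tj ET"
    print(extract_text(sample))
```

The string this returns is what you would store on the object as the
property to be indexed.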
Kapil