Hi,
I need to be able to index the text within PDF files. I assume I will somehow use PrincipiaSearchSource, but I need to know how to get the text out of the PDF when it is uploaded to the ZODB. Has anyone done this before? Are there any packages around that I can use that run in Python, or at least on a Linux box, that I can pipe to and from?
terry
-- Terry Kerr (terry@adroit.net) Adroit Internet Solutions Pty Ltd (www.adroit.net) Phone: +613 9563 4461 Fax: +613 9563 3856 Mobile: +61 414 938 124 ICQ: 79303381
For going from XML to PDF there are a multitude of ways in Python:
XSLT: check out the ibm.com/developer xmlzone; they have an article in the education library on transforming XML to PDF.
The platypus packages from http://www.reportlab.com/
They might give you some help in going the other way.
As for implementation: looking at a PDF in a text viewer, it appears to consist of formatting text and encoded display strings. You could write a subclass of File that reads its content upon upload, strips the formatting strings, decodes the display strings, and stores the result as a property to be indexed.
Kapil
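The "extract on upload, store as an indexed property" idea can be sketched roughly as below. This is only an illustration in plain Python, not Zope code: the class IndexedFile, its upload method and the search_text attribute are hypothetical names, and the class stands in for a real Zope File subclass. The only name taken from the thread is PrincipiaSearchSource, which is assumed here to be the method the catalog's text index calls. The extractor is passed in as a callable so that any way of getting text out of a PDF can be plugged in.

```python
class IndexedFile:
    """Hypothetical stand-in for a Zope File-like object (not the real class)."""

    def __init__(self, extract_text):
        # extract_text: any callable that takes raw bytes and returns a str.
        self._extract_text = extract_text
        self.data = b""
        self.search_text = ""

    def upload(self, data):
        """Store the uploaded bytes and cache the extracted text."""
        self.data = data
        # Extract once at upload time so indexing never re-parses the file.
        self.search_text = self._extract_text(data)

    def PrincipiaSearchSource(self):
        """Return the cached text for a text index to read."""
        return self.search_text
```

Doing the extraction at upload time means the cost is paid once per upload rather than every time the object is reindexed.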
I just answered my own question. The program is pdftotext, part of the xpdf package available for Unix machines. It is very cool and very fast.
terry
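For anyone wanting to call pdftotext from Python, here is a minimal sketch. It assumes pdftotext is on the PATH and that giving "-" as the output file name makes it write the text to stdout. The function name pdf_to_text and the command parameter are made up for this example, and the output encoding depends on how pdftotext is built and configured, so the UTF-8 decode here is a guess that replaces undecodable bytes instead of failing.

```python
import os
import subprocess
import tempfile


def pdf_to_text(data, command=("pdftotext",)):
    """Return the text of a PDF given as bytes, via an external converter."""
    # The converter wants a file name, so spool the upload to a temp file.
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        # "-" as the output name asks the converter to write to stdout.
        result = subprocess.run(
            [*command, path, "-"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            check=True,  # raise CalledProcessError on a non-zero exit
        )
    finally:
        os.remove(path)
    # Assumed encoding; undecodable bytes are replaced rather than raising.
    return result.stdout.decode("utf-8", errors="replace")
```

The returned string could then be stored on the object at upload time and handed back from PrincipiaSearchSource for indexing.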
participants (2): Kapil Thangavelu, Terry Kerr