Indexing and plaintext display gives PDF errors
I'm using Zope 2.3.2 with Python 1.5.2 running on Redhat. I don't use Python, I work in DTML. I'm cataloging technical documents. I do not use Document Library or the CMF, in part because of compatibility restrictions. (The site must support NetPositive, a non-javascript, non-CSS compatible browser.) The documents I'm indexing are html, text, Word, PowerPoint, and PDF files.

I have the CMF and the Document Library product installed; I also had installed wvWare, though I'm not sure I installed it correctly. (The instructions were vague.)

This is my problem. When I update my Catalog, I get a number of errors on the linux box that runs my Zope installation, related to PDF files:

Error (0): PDF file is damaged - attempting to construct xref table ...
Error: Top level pages is wrong type (null)
Error: Couldn't read page catalog
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table

These repeat a few times, giving me two screens worth, before the index updating is complete. I can think of at least one problem that might be going on here: I think some PDF documents were added as type "DocumentFile", which is related to the DocumentLibrary stuff.

Anyway, I'm trying to get rid of the errors, and be able to index the text of PDF and Word files. Suggestions? I'm forwarding this to the DocumentLibrary product engineer, too.

Leigh Ann

--
Leigh Ann Hildebrand
leighann@onebox.com - email
(650) 223-2199 x2231 - voicemail/fax
Leigh Ann Hildebrand wrote:
I'm using Zope 2.3.2 with Python 1.5.2 running on Redhat. I don't use Python, I work in DTML. I'm cataloging technical documents. I do not use Document Library or the CMF, in part because of compatibility restrictions. (The site must support NetPositive, a non-javascript, non-CSS compatible browser.) The documents I'm indexing are html, text, Word, PowerPoint, and PDF files.
There isn't that much JavaScript in the DocumentLibrary Product, and it gracefully handles non-JS browsers (The only part that doesn't work is the index chooser, which uses Javascript to pass values between windows). It would not be difficult to remove the JS in the DTML methods provided by default, if necessary.
I have the CMF and the Document Library product installed; I also had installed wvWare, though I'm not sure I installed it correctly. (The instructions were vague.)
This is my problem. When I update my Catalog, I get a number of errors on the linux box that runs my Zope installation, related to PDF files:
Error (0): PDF file is damaged - attempting to construct xref table ...
Error: Top level pages is wrong type (null)
Error: Couldn't read page catalog
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table
These look like errors coming from the PDF converter (pdftotext). Try running the converter manually on one of these files at the command line, e.g.:

    % pdftotext some.pdf some.txt

and see whether you get the same errors. If so, perhaps your version of XPDF needs updating, or it is not compatible with the files you are providing for some reason.
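If many PDFs are involved, the manual check above can be scripted to find out which files trigger the errors. The following is only a sketch under assumptions: it is meant to run outside Zope with a current Python 3 (not the Python 1.5.2 that Zope 2.3.2 uses), it assumes pdftotext is on the PATH, and the function name find_failing_pdfs and its converter parameter are made up for illustration. It passes "-" as the output file so that pdftotext writes the text to stdout instead of creating files.

```python
import subprocess


def find_failing_pdfs(paths, converter=("pdftotext",)):
    """Run the converter on each path and return {path: stderr text}
    for every file that exits non-zero or prints anything to stderr.

    `converter` is a hypothetical hook: the command (as a sequence of
    arguments) to run before the input path. It defaults to pdftotext.
    """
    failures = {}
    for path in paths:
        # "-" asks pdftotext to send the extracted text to stdout,
        # which is discarded here; only the error output is kept.
        result = subprocess.run(
            list(converter) + [path, "-"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
        # A damaged file may still convert with exit status 0 while
        # printing "Error: ..." lines, so stderr is checked as well.
        if result.returncode != 0 or result.stderr.strip():
            failures[path] = result.stderr.strip()
    return failures
```

Calling find_failing_pdfs(["some.pdf", "other.pdf"]) returns a dictionary that maps each problem file to the messages the converter printed for it. Files that appear there are the ones to re-export, or to retry after updating the converter.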
These repeat a few times, giving me two screens worth, before the index updating is complete. I can think of at least one problem that might be going on here: I think some PDF documents were added as type "DocumentFile", which is related to the DocumentLibrary stuff.
DocumentFile objects are like Files, except they support the conversion of PDF to text (among others) for indexing.
Anyway, I'm trying to get rid of the errors, and be able to index the text of PDF and Word files. Suggestions? I'm forwarding this to the DocumentLibrary product engineer, too.
Leigh Ann
Try testing out pdftotext and see what happens. Let me know what you find out.

--
| Casey Duncan
| Kaivo, Inc.
| cduncan@kaivo.com
`------------------>