[Zope-dev] cataloging binary files (pdf's, word docs...)

Martijn Pieters mj@digicool.com
Fri, 18 Feb 2000 04:06:06 -0500


From: Roman Milner [mailto:roman@speeder.com]
> 
> I'm trying to come up with a way to catalog PDFs and Word docs.  It
> is easy to write python methods to pull the text out of these.  The
> problem is that we already have tons of them in our ZODB as file
> objects.
> 
> The only thing I can think of is to make a zclass class for each type
> (i.e. a PDFFile type) that has a method that knows how to extract the
> text from the pdf and have zcatalog catalog that property.  But this
> means re-creating all the binary files currently in our ZODB.
> 
> Can anyone offer any better suggestions?  I could write a python
> method that extracted the text based on mime type but I can't go back
> and add that method to each file object.
> 
> Thanks for any help.
> 

You could write an External Method and let the File objects acquire it
from a folder above them. Let's call it FileToText:

  def FileToText(self):
      # self is the File object that acquired this method.
      # Do whatever you want with it and return some text.
      text = ''  # placeholder: put the extracted text here
      return text

Then add a TextIndex on FileToText, and you can start cataloguing your
binary File objects.
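
As a rough sketch of how FileToText could pick an extractor by MIME
type, something like the following might do. It assumes the File object
exposes content_type and data attributes and that data can be turned
into bytes; the EXTRACTORS table and _plain_text helper are made up for
illustration, and the PDF and Word extractors are left as hypothetical
entries because the original question only says they are easy to write.

```python
def _plain_text(raw):
    # Hypothetical helper: decode raw bytes leniently so indexing never
    # fails on odd characters.
    return raw.decode("latin-1")


# Hypothetical registry mapping MIME types to extractor functions.
# Real PDF/Word extractors would be registered here, for example:
#   "application/pdf": pdf_to_text,
#   "application/msword": word_to_text,
EXTRACTORS = {
    "text/plain": _plain_text,
}


def FileToText(self):
    # self is the File object that acquired this method.
    # Assumption: it has content_type and data attributes.
    content_type = getattr(self, "content_type", "")
    extractor = EXTRACTORS.get(content_type)
    if extractor is None:
        # Unknown type: return an empty string so there is nothing
        # to index for this object.
        return ""
    # Assumption: data converts cleanly to bytes.
    return extractor(bytes(self.data))
```

Returning an empty string for unknown types keeps the catalog from
choking on objects you have no extractor for.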

-- 
Martijn Pieters, Software Engineer 
| Digital Creations http://www.digicool.com 
| Creators of Zope      http://www.zope.org 
| mailto:mj@digicool.com       ICQ: 4532236
| PGP: http://wwwkeys.nl.pgp.net:11371/pks/lookup?op=get&search=0xA8A32149
-------------------------------------------