RE: [Zope] Advice on searching/indexing Word documents?
I really like the idea of extending OFS:File to support different file types, but what I would like to see is something that is format/filter/library agnostic. That is to say, perhaps the way we ought to go about this is to create an API framework that, upon upload, filters the file with a specified filter for its mime-type. Perhaps we could create a generic base class that implements a generic API for filtering a file, and extend it by inheriting more specific classes for files of particular types or groups (fine-grained to mime-type, or grouped by category, e.g. "Illustration").

Having such a generic framework would enable Zope to be an excellent platform for digital asset management. Suppose you had a class for all files for a particular purpose, and those files would always be of a particular set of mime-types, like Illustrator, PDF, or PostScript. For example, if someone working at a newspaper creates a new file class instance called "DisplayAd," which is used for PostScript files with embedded fonts, containing specific text, a filter set up as part of the extended class for DisplayAd files would detect the type of file, determine it was PostScript, and filter out the text and the face names of the embedded fonts. If the file was a PDF or an AI file, it would run the appropriate filter for that type instead.

It might also be nice to have an extended class (inherited from File) that works for all types and keeps some sort of configurable plugin registry, so that we can create plugin classes for specific mime-types but only have to use one class for the objects themselves. This might be more practical.
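To make the registry idea above a bit more concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: FileFilter, PlainTextFilter, FilterRegistry and FilteredFile are hypothetical names made up for this example, none of them are part of Zope or OFS, and a real version would subclass Zope's File class and hook into its upload handling.

```python
from typing import Dict, Optional, Tuple


class FileFilter:
    """Hypothetical base class: turns raw file bytes into indexable text."""

    # Mime-types this filter claims to handle (assumed convention).
    mime_types: Tuple[str, ...] = ()

    def extract_text(self, data: bytes) -> str:
        raise NotImplementedError


class PlainTextFilter(FileFilter):
    """Trivial example plugin; a Word or PDF plugin would wrap a real library."""

    mime_types = ("text/plain",)

    def extract_text(self, data: bytes) -> str:
        return data.decode("utf-8", errors="replace")


class FilterRegistry:
    """Maps mime-types to filter plugins so one file class can serve all types."""

    def __init__(self) -> None:
        self._filters: Dict[str, FileFilter] = {}

    def register(self, file_filter: FileFilter) -> None:
        for mime_type in file_filter.mime_types:
            self._filters[mime_type.lower()] = file_filter

    def lookup(self, mime_type: str) -> Optional[FileFilter]:
        return self._filters.get(mime_type.lower())


class FilteredFile:
    """Stand-in for an extended File object (not Zope's actual File class)."""

    def __init__(self, registry: FilterRegistry, data: bytes, content_type: str) -> None:
        self.data = data
        self.content_type = content_type
        # "Upload" step: run the filter registered for this mime-type, if any.
        file_filter = registry.lookup(content_type)
        self.searchable_text = file_filter.extract_text(data) if file_filter else ""


registry = FilterRegistry()
registry.register(PlainTextFilter())
cv = FilteredFile(registry, b"Zope and Python experience", "text/plain")
print(cv.searchable_text)
```

With this shape, supporting a new format only means registering another FileFilter subclass; files whose mime-type has no registered filter simply end up with empty searchable text rather than failing on upload.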
One thing that seems important: creating an API like this could allow us to write filter "plugins" in a variety of Zope-supported configurations, such as completely in Python, a Python class extending a C shared library, something written in a combination of C/Lex, or the Python-based Plex scanner that was mentioned earlier. For that matter, even proprietary user-space binaries called via Python code might be fair game...

I really think that this idea has potential as a project, and would be willing to contribute.

Sean

-----Original Message-----
From: Bjorn Stabell [mailto:bjorn@exoweb.net]
Sent: Tuesday, January 02, 2001 10:07 PM
To: zope@zope.org
Subject: RE: [Zope] Advice on searching/indexing Word documents?

This is something I've been longing for for a long time. wvWare is cool, and it should also be able to access properties of many Windows (OLE) documents, not just Word documents.

I've been thinking about extending the File class so that it becomes aware of the different file types and allows access to (read/write) meta data and indexing of the files' content. If we can set up a nice framework for it, I'm sure a lot of people could contribute code for specific file formats.

Bye,
--
Bjorn

-----Original Message-----
From: Jens Vagelpohl [mailto:jens@digicool.com]
Posted At: Wednesday, January 03, 2001 11:28
Posted To: Zope List
Conversation: [Zope] Advice on searching/indexing Word documents?
Subject: Re: [Zope] Advice on searching/indexing Word documents?

if you're on linux check out WVWare: http://www.wvware.com

it's a C library that handles all word doc formats since 6.0 or so

jens

On Tue, 02 Jan 2001, Bowyer, Alex wrote:
Our company has a repository of staff CVs (Resumes) as Word Documents and I am about to embark on creating a new feature for our Zope Intranet to allow project managers to search those documents for keywords such as particular skills or projects.
I am thinking about several possibilities such as a skills/CVs database linked in via ODBC, or some task that converts the Word documents to text files which can then be searched by Zope (I think Zope can do this, and I assume it can't search Word format directly?).
Has anyone ever approached a similar problem? Does anyone have any tips on how to index/search a load of documents in Zope?
Any tips/suggestions/comments would be most welcome.
Thanks,
Alex
_______________________________________________
Zope maillist - Zope@zope.org
http://lists.zope.org/mailman/listinfo/zope
** No cross posts or HTML encoding! **
(Related lists -
http://lists.zope.org/mailman/listinfo/zope-announce
http://lists.zope.org/mailman/listinfo/zope-dev )
This sounds pretty exciting. Sounds like someone should set up a proposal on dev.zope.org. I'm afraid I wouldn't be able to contribute much development right now but I'd be willing to help test and participate in discussions.

--jfarr

----- Original Message -----
From: <sean.upton@uniontrib.com>
To: <bjorn@exoweb.net>; <zope@zope.org>
Sent: Wednesday, January 03, 2001 8:25 AM
Subject: RE: [Zope] Advice on searching/indexing Word documents?
I really like the idea of extending OFS:File to support different file types, but what I would like to see is something that is format/filter/library agnostic. That is to say, that perhaps the way we ought to go about this is to create an API framework that upon upload filters the file with a specified filter for its mime-type.
[snip]
participants (2)
- Jonothan Farr
- sean.upton@uniontrib.com