Indexing and Searching through XML files
Hi,
I have lots of XML files (about 100,000) in a directory tree on my file system. I want to publish those files using Zope/Plone.
I need to be able to index them in their native format without having to upload them into the ZODB, let users search their contents through Zope, and have the results displayed.
Is it possible? Has anybody ever done this? Any suggestions?
I'm running zope-2.6.2 and Plone-1.1.
Thanks,
Fab.
Under Zope it is definitely possible; under Plone I don't know, as I don't have direct experience with it. I have done this (and much more ... ;-)) from Zope. Your mileage may vary, but this is the route I followed, with _very_ satisfactory results.

1. Use LocalFS or similar to map your XML document base from the filesystem into the ZODB.
2. Install the standard XML Python plumbing, i.e. PyXML and/or 4Suite.
3. Install the Zope XML plumbing of your choice. I use the ZopeXMLMethods product. This gives you an easy and very reliable way to perform XSLT (and much more).
4. Depending on the structure of your XML files, you may find it useful to write an import routine that splits the XML files into chunks and creates a structure of custom Zope classes. This is not strictly necessary, but I think it is a best practice because it is performance-friendly. You will need at least one container class derived from Folder and one content class derived from SimpleItem, and both need to be catalog-aware. This will definitely help you with indexing and with building a navigation path through the HTML produced by the XSLT. Obviously, having one or more DTDs describing your XML content is very advisable.
5. For indexing the XML content you need some XML-stripping code which extracts the content as unicode and feeds the text indexes you need (I use the TextIndexNG product instead of the standard Zope TextIndex).

All this gives you a tremendous amount of flexibility and a very scalable infrastructure.

Lessons I learned developing all this:

- Use XSLT only when really necessary, i.e. only for rendering HTML (or other output formats).
- Import the XML into the ZODB using custom classes.
- When importing, transform some of the high-level structure present in the XML content into Python properties (for example chapter titles, section headings, etc.).
- Remember that DTML/TAL is a faster templating system than plain XSLT, as XML parsing has a significant performance overhead. This is really the old "separate logic from presentation" mantra: if you need to apply logic to your content, parse once from XML to native Python structures and use Python methods to do whatever you need. In this respect you may find two remarkable Python modules useful, elementtree and pyRXP: both give you an easy path from XML to native Python structures. Remember also that it is very easy (and fun) to create XML streams from Python lists/tuples.
- Pay _extreme_ attention to unicode-related issues: this means converting XML strings to unicode types as soon as you read the XML content into Python.
- Use ZCatalog as much as you can (but this should be standard Zope practice).
- Put everything behind Apache and you will have a wonderful three-level caching system: level 0 = XSLT caching done by ZopeXMLMethods, level 1 = the standard Zope cache system, level 2 = Apache.
- _Always_ use absolute URLs!!!

All this seems complicated, but in reality it isn't, thanks to the standard services Python/Zope gives you and to the remarkable products developed by the bright folks on this list!!!

Hope this helps,
__peppo
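The "parse once into native Python structures" advice and the XML-stripping step can be sketched roughly as follows. This is only an illustration written for current Python, where ElementTree ships in the standard library as xml.etree.ElementTree; it is not the code used in the setup described above. The element names (book, chapter, title, para) and the helpers parse_document and build_xml are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are assumptions.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<book>
  <chapter><title>Caf\xc3\xa9 culture</title><para>First <b>bold</b> text.</para></chapter>
  <chapter><title>Second</title><para>More text.</para></chapter>
</book>"""


def parse_document(xml_bytes):
    """Parse once and return plain Python data: titles plus stripped text."""
    # fromstring() decodes the bytes according to the XML declaration,
    # so everything below is already unicode (str).
    root = ET.fromstring(xml_bytes)
    titles = [t.text or "" for t in root.iter("title")]
    # itertext() yields every text node in document order; joining and
    # re-splitting collapses the whitespace left over from the markup.
    text = " ".join(" ".join(root.itertext()).split())
    return {"titles": titles, "text": text}


def build_xml(rows):
    """Build a small XML stream from a list of (path, title) tuples."""
    root = ET.Element("results")
    for path, title in rows:
        item = ET.SubElement(root, "item", path=path)
        item.text = title
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    doc = parse_document(SAMPLE)
    print(doc["titles"])
    print(doc["text"])
    print(build_xml([("a.xml", "One")]))
```

The dictionary returned by parse_document is the kind of data that could be copied onto properties of a custom content class at import time, while the "text" value is what would be handed to a text index.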
-----Original Message-----
From: zope-bounces@zope.org [mailto:zope-bounces@zope.org] On Behalf Of FNk
Sent: Thursday, 11 September 2003 10:24
To: Zope
Subject: [Zope] Indexing and Searching through XML files
Hi,
I got lots of XML-files(about 100.000) in a directory tree on my file system. I want to publish those files using Zope/Plone.
I need to be able to index them in their native format without having to upload them in the ZODB, let users search their contents through zope, and have the result displayed.
Is it possible? Did anybody ever do this? Any suggestions?
I'm running zope-2.6.2 and Plone-1.1.
Thanks,
Fab.
FNk wrote at 2003-9-11 10:23 +0200:
I got lots of XML-files(about 100.000) in a directory tree on my file system. I want to publish those files using Zope/Plone.
I need to be able to index them in their native format without having to upload them in the ZODB, let users search their contents through zope, and have the result displayed.
Is it possible? Did anybody ever do this? Any suggestions?
You can use an external search engine and interface it with Zope. Chris Withers used JPE (Java Python Environment) and a standard Open Source Java search engine (from the Apache project) to do something like this.

Alternatively, search Zope.org for a HowTo about ZCatalog indexing of objects that are not inside Zope, called something like "ZCatalog Everything".

I would go for the "external search engine" approach, especially if your files are large.

Dieter
Dieter Maurer wrote:
You use an external search engine and interface it with Zope. Chris Withers used JPE (Java Python Environment) and a standard Open Source Java search engine (from the Apache project) to do something like this.
Indeed, Lucene was the engine in question and is apparently very scalable. I think Stephan Richter has implemented a better interface to Lucene for a Zope 3 project, might be good to ask him...
You search Zope.org for a HowTo about ZCatalog indexing of objects not inside Zope, something like "ZCatalog Everything".
...or do this. ZCTextIndex isn't at all bad nowadays...
I would go for the "external search engine" approach, especially when your files were large.
Yes, and remember, you need to understand what bits of the file you want indexed. If the files contain a mixture of metadata and content, you'll have fun ;-)

Chris
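To make the "which bits do you want indexed" point concrete, here is a minimal sketch that keeps metadata elements apart from the body text, so that each can go to a different index. It assumes, purely for illustration, that metadata lives in elements named author and date; split_fields is a hypothetical helper and the real files may be structured quite differently.

```python
import xml.etree.ElementTree as ET


def split_fields(xml_text, metadata_tags=("author", "date")):
    """Return (metadata dict, body text) for one XML document."""
    root = ET.fromstring(xml_text)
    metadata = {}
    body_parts = []

    def walk(elem):
        if elem.tag in metadata_tags:
            # Record the value and stop: metadata must not leak into the body.
            metadata[elem.tag] = "".join(elem.itertext()).strip()
            return
        if elem.text:
            body_parts.append(elem.text)
        for child in elem:
            walk(child)
            # Text following a child belongs to the parent's content.
            if child.tail:
                body_parts.append(child.tail)

    walk(root)
    body = " ".join(" ".join(body_parts).split())
    return metadata, body


if __name__ == "__main__":
    sample = ("<doc><author>Fab</author><date>2003-09-11</date>"
              "<body>Hello <i>XML</i> world</body></doc>")
    print(split_fields(sample))
```

Whichever engine does the indexing, deciding on a mapping like this up front avoids full-text hits on values that were only ever meant to be field data.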
Dieter Maurer wrote:
FNk wrote at 2003-9-11 10:23 +0200:
I got lots of XML-files(about 100.000) in a directory tree on my file system. I want to publish those files using Zope/Plone.
I need to be able to index them in their native format without having to upload them in the ZODB, let users search their contents through zope, and have the result displayed.
Is it possible? Did anybody ever do this? Any suggestions?
You use an external search engine and interface it with Zope. Chris Withers used JPE (Java Python Environment) and a standard Open Source Java search engine (from the Apache project) to do something like this.
You search Zope.org for a HowTo about ZCatalog indexing of objects not inside Zope, something like "ZCatalog Everything".
I would go for the "external search engine" approach, especially when your files were large.
I bet http://www.zope.org/Members/rbickers/cataloganything is an easier approach for this. An external method, which is simple to set up, or a script which ZEO-mounts the ZODB and does the cataloguing, should do the trick. I don't believe external Java monster apps will run much faster ;)

Regards
Tino
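The general shape of such a cataloguing script (walk the directory tree, index each file under its path, leave the file itself on disk) can be sketched as below. This is not the code from the HowTo and it does not use ZCatalog: PathIndex is a hypothetical in-memory stand-in so that the example runs on its own, and catalog_file and index_tree are made-up names. In a real Zope setup the stand-in would presumably be replaced by calls to ZCatalog's catalog_object, with the file path as the uid.

```python
import os
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


class PathIndex:
    """Tiny inverted index: word -> set of file paths (stand-in, not ZCatalog)."""

    def __init__(self):
        self._postings = defaultdict(set)

    def catalog_file(self, path):
        """Index the text content of one XML file under its path."""
        root = ET.parse(path).getroot()
        text = " ".join(root.itertext()).lower()
        for word in re.findall(r"\w+", text):
            self._postings[word].add(path)

    def search(self, word):
        """Return the sorted paths of all files containing the word."""
        return sorted(self._postings.get(word.lower(), ()))


def index_tree(top, index):
    """Walk the tree below top, index every .xml file, return the file count."""
    count = 0
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            if name.endswith(".xml"):
                index.catalog_file(os.path.join(dirpath, name))
                count += 1
    return count
```

Only the words and the paths end up in the index; displaying a hit means reading the file from disk again, which matches the requirement of not uploading the documents into the ZODB.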
participants (5)
- Chris Withers
- Dieter Maurer
- FNk
- Giuseppe Bonelli
- Tino Wildenhain