I have an older Zope install and I want to enable searching of Page Templates and PDFs.

Because it's an older Zope version (2.8.5), I have had to go back a few revisions of TextIndexNG3 (3.1.16) and Five (1.2.6).

The install seems fine, including setup of the extension modules.

I created a TextIndexNG3 index called PrincipiaSearchSource. The Converters page shows that pdftotext is available and that HTML to ASCII is 'always' available.

I found all Page Templates and PDFs and cataloged them. They do show up in the Catalog, but the Page Templates have all their HTML tags included in the catalog (I thought they would be stripped automagically) and the PDFs have no words cataloged at all.

Any suggestions appreciated.

Thanks,
Erik Myllymaki

Zope Version: (Zope 2.8.5-final, python 2.3.5, linux2)
Python Version: 2.3.5 (#1, Jan 3 2006, 23:22:48) [GCC 3.2.3 20030502 (Red Hat Linux 3.2.3-52)]
System Platform: linux2
SOFTWARE_HOME: /usr/local/Zope-2.8.5/lib/python
ZOPE_HOME: /usr/local/Zope-2.8.5
Five: (Installed product Five (Five 1.2.6))
TextIndexNG3: (Installed product TextIndexNG3 (3.1.16))
--On 17 February 2008 22:51:47 -0800 Erik Myllymaki <erik.myllymaki@aviawest.com> wrote:
I have an older Zope install and I want to enable searching of Page Templates and PDFs.
Because it's an older Zope version (2.8.5), I have had to go back a few revisions of TextIndexNG3 (3.1.16) and Five (1.2.6).
The install seems fine, including setup of the extension modules.
I created a TextIndexNG3 index called PrincipiaSearchSource. The Converters page shows that pdftotext is available and that HTML to ASCII is 'always' available.
I found all Page Templates and PDFs and cataloged them. They do show up in the Catalog, but the Page Templates have all their HTML tags included in the catalog (I thought they would be stripped automagically)
Your expectations are wrong. If an object does not provide IIndexableContent or if there is no adapter for this then TXNG3 will default to the "old" Zope 2 indexing behaviour and index the string representation of the content as it is.
and the PDFs have no words cataloged at all.
If you have the external converters installed, and they are in the $PATH and available to the Python interpreter process, then I have strong doubts about that. Triple-check that. If necessary, use the debugger to check the calls to the external converters.
-aj
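One way to check the $PATH condition is to resolve the converter from the environment the Zope process actually runs in. The sketch below is not part of TextIndexNG3: the helper name find_on_path is hypothetical, and it only assumes the converter binary is called pdftotext, as listed on the Converters page. It avoids newer library calls so that it should also run on an old interpreter such as Python 2.3.

```python
import os


def find_on_path(program, path=None):
    """Return the full path of `program` if an executable file with that
    name exists in one of the PATH directories, otherwise None.

    `path` defaults to the PATH of the current process, which is the
    value that matters for converters started from inside Zope.
    """
    if path is None:
        path = os.environ.get('PATH', '')
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == '__main__':
    location = find_on_path('pdftotext')
    if location is None:
        print('pdftotext is not visible on PATH for this process')
    else:
        print('pdftotext found at %s' % location)
```

The result is only meaningful when the check runs with the same environment as the Zope server (for example from a debug session started the same way as the server), since an interactive shell may have a different PATH than the daemon.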
Andreas Jung wrote:
I found all Page Templates and PDFs and cataloged them. They do show up in the Catalog, but the Page Templates have all their HTML tags included in the catalog (I thought they would be stripped automagically)
Your expectations are wrong. If an object does not provide IIndexableContent or if there is no adapter for this then TXNG3 will default to the "old" Zope 2 indexing behaviour and index the string representation of the content as it is.
I had thought that, since ZPT has a content type of text/html, TextIndexNG3 would look up and use the converter listed on the Converters page for the "text/html" mimetype: "Converter HTML to ASCII" "always"
How do I enable this behaviour?
--On 18 February 2008 10:17:11 -0800 Erik Myllymaki <erik.myllymaki@aviawest.com> wrote:
Andreas Jung wrote:
I found all Page Templates and PDFs and cataloged them. They do show up in the Catalog, but the Page Templates have all their HTML tags included in the catalog (I thought they would be stripped automagically)
Your expectations are wrong. If an object does not provide IIndexableContent or if there is no adapter for this then TXNG3 will default to the "old" Zope 2 indexing behaviour and index the string representation of the content as it is.
I had thought that, since ZPT has a content type of text/html, TextIndexNG3 would look up and use the converter listed on the Converters page for the "text/html" mimetype: "Converter HTML to ASCII" "always"
How do I enable this behaviour?
Please read what I wrote. This is not supported and won't be supported out-of-the-box. If you need this feature, you have to fulfill the requirements I mentioned, which are documented in the TXNG3 README.txt file.
-aj
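The fallback described in this reply can be pictured with a small toy model. It is only a sketch under stated assumptions and does not use TextIndexNG3 or zope.interface: the class names, the indexable_content method and the strip_tags helper are all hypothetical, and plain attribute lookup stands in for the real IIndexableContent interface/adapter lookup documented in the README.

```python
import re


def strip_tags(html):
    """Crude tag stripper, for illustration only (not a real converter)."""
    return ' '.join(re.sub(r'<[^>]+>', ' ', html).split())


class PlainObject(object):
    """Stand-in for content with no indexing hook, e.g. a Page Template."""

    def __init__(self, source):
        self.source = source

    def __str__(self):
        return self.source


class IndexableObject(PlainObject):
    """Stand-in for content that supplies its own indexable text."""

    def indexable_content(self):  # hypothetical hook name
        return strip_tags(self.source)


def text_to_index(obj):
    """Use the object's own hook if it has one; otherwise fall back to
    the string representation, markup included."""
    hook = getattr(obj, 'indexable_content', None)
    if hook is not None:
        return hook()
    return str(obj)
```

In this model a PlainObject holding '<p>Hello</p>' is indexed with its tags, while the same markup in an IndexableObject is indexed as plain words, which mirrors the difference between the default behaviour and content that fulfils the documented requirements.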
Andreas Jung wrote:
and the PDFs have no words cataloged at all.
If you have the external converters installed, and they are in the $PATH and available to the Python interpreter process, then I have strong doubts about that. Triple-check that. If necessary, use the debugger to check the calls to the external converters.
I'm pretty sure the converters are installed properly, but the issue is prior to the calling of any converters. Stepping through textindexng/content.py - extract_content():

        for f in icc.getFields():

    ->      d = icc.getFieldData(f)

            # check if we need to convert
            if d.has_key('mimetype'):

d never has a key of 'mimetype'.
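To make the consequence of that check concrete, here is a toy model of a conversion step that is gated on a 'mimetype' key. It is not the TextIndexNG3 source: the dictionary layout (a 'content' key next to an optional 'mimetype' key), the CONVERTERS registry and the extract function are assumptions made for illustration only.

```python
import re


def strip_tags(html):
    """Crude tag stripper standing in for an HTML-to-text converter."""
    return ' '.join(re.sub(r'<[^>]+>', ' ', html).split())


# Hypothetical registry mapping a mimetype to a converter function.
CONVERTERS = {'text/html': strip_tags}


def extract(field_data):
    """Convert the content only when the field data declares a mimetype
    with a registered converter; otherwise return it unchanged."""
    if 'mimetype' in field_data:
        converter = CONVERTERS.get(field_data['mimetype'])
        if converter is not None:
            return converter(field_data['content'])
    return field_data['content']
```

In this model, field data without a 'mimetype' key never reaches a converter, whichever converters are registered, which would match both symptoms reported in the thread.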