[Zope-CMF] Re: ZCSearchPatch
Casey Duncan
casey@zope.com
Wed, 14 May 2003 10:47:09 -0400
Actually, this should be very easy to fix; see the inline comment below:
On Wednesday 14 May 2003 10:36 am, Eric Dunn wrote:
> ZCatalog issue:
> Have code to strip out HTML tags so that the ZCatalog
> does not pick up the HTML code when cataloging.
> Works great... almost too good.
> Our users are only copy-n-paste managers.
> I found that stripping the "&nbsp;" (HTML non-breaking space entity)
> makes the catalog concatenate text, i.e.
> 1234&nbsp;1234&nbsp;1234&nbsp;1234 becomes '1234123412341234' in the
> catalog.
>
> Question: How can I tell the SearchPatch.py file to
> ignore the space tag or treat it as a space?
>
>
> import re
> from SearchIndex.UnTextIndex import UnTextIndex
> from string import find
>
> # HTML regex to substitute tags and entities
> html_re = re.compile(r'<[^\s0-9].*?>|&[a-zA-Z]*?;', re.DOTALL)
>
> class FauxDocument:
>     """Proxy document to store munged source text"""
>     def __init__(self, name, value):
>         setattr(self, name, value)
>
> # Get a reference to the original index_object method
> # so we can head patch it
> original_index_object = UnTextIndex.index_object
>
> def index_object(self, documentId, obj, threshold=None):
>     # sniff the object for our 'id', the 'document source' of the
>     # index is this attribute. If it smells callable, call it.
>     try:
>         source = getattr(obj, self.id)
>         if callable(source):
>             source = str(source())
>         else:
>             source = str(source)
>     except (AttributeError, TypeError):
>         return 0
>
>     if find(source, '<') != -1:
>         # Strip HTML tags and comments from source
>         source = html_re.sub('', source)

Change the above line to:

    source = html_re.sub(' ', source)

(Insert a space between the single quotes)

>         # Create faux document with stripped source content
>         obj = FauxDocument(self.id, source)
>
>     # Call original index method
>     return original_index_object(self, documentId, obj, threshold)
>
> # Patch UnTextIndex class
> UnTextIndex.index_object = index_object
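
For a quick check outside Zope, here is a minimal sketch of the difference between the two substitutions. It reuses the html_re pattern from the patch; the sample string and the names joined and spaced are made up for illustration and are not part of SearchPatch.py.

```python
import re

# Same pattern as in the patch: matches HTML tags and named entities
html_re = re.compile(r'<[^\s0-9].*?>|&[a-zA-Z]*?;', re.DOTALL)

# Hypothetical sample: words separated only by an entity or a tag
sample = '1234&nbsp;1234<br>1234'

# Replacing with the empty string glues neighbouring words together
joined = html_re.sub('', sample)

# Replacing with a single space keeps them as separate words
spaced = html_re.sub(' ', sample)

print(repr(joined))
print(repr(spaced))
```

Any doubled spaces this leaves behind should be harmless, assuming the index's splitter breaks words on whitespace.
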
Hope that helps,
-Casey