[Zope] Strip all HTML
ken at practical.org
ken@practical.org
Tue Aug 5 16:47:39 EDT 2003
>On Tue, 2003-08-05 at 07:49, Paul Winkler wrote:
>>On Tue, Aug 05, 2003 at 06:39:11AM -0700, Dylan Reinhardt wrote:
>>> So you can try something like:
>>>
>>> -----
>>>
>>> import re
>>>
>>> style = re.compile('<style.*?>.*?</style>', re.I | re.S)
>>> script = re.compile('<script.*?>.*?</script>', re.I | re.S)
>>> tags = re.compile('<.*?>', re.S)
>>>
>>> return tags.sub('', script.sub('', style.sub('', text)))
>>
>>hmm... doesn't the tags pattern make the other two redundant?
>
>
>Not that I can see. You may be reading the last expression
>left-to-right, but it's *evaluated* inside (right) to outside (left).
>
>In some cases (<B>, for example) you want to remove only the tags. In
>other cases (<script>, <style>) you want to remove the enclosed contents
>too. I'm sure it's possible to come up with a single, highly-clever
>regex that does it all, but that's not what I was trying to demonstrate.
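To see why the order of the three substitutions matters, here is a small sketch run outside Zope. The patterns are the ones quoted above; the function names and the sample string are made up for illustration.

```python
import re

# The three patterns from the quoted message.
style = re.compile('<style.*?>.*?</style>', re.I | re.S)
script = re.compile('<script.*?>.*?</script>', re.I | re.S)
tags = re.compile('<.*?>', re.S)

def strip_html(text):
    # The innermost call runs first: style blocks, then script blocks,
    # and only then the remaining bare tags.
    return tags.sub('', script.sub('', style.sub('', text)))

def strip_tags_only(text):
    # Hypothetical comparison: apply only the generic tag pattern.
    return tags.sub('', text)

sample = '<b>Hello</b><script type="text/javascript">alert("x")</script> world'
print(strip_html(sample))       # script body is removed along with its tags
print(strip_tags_only(sample))  # script body survives as plain text
```

With the tags pattern alone, the script's contents are left behind as text, so the other two patterns are not redundant.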
Thanks for your suggestions. I'm surprised that this has not been a problem for others.
My solution (for now) is ugly, but it works (until I run into something like '<scRipt>'...):
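import re
# Body of a Zope Script (Python); 'data' is assumed to be the script's parameter.
# Turn opening <style>/<script> tags into comment openers
# (only the all-uppercase and all-lowercase spellings are covered).
text = re.sub('<STYLE.*?>', '<!--', data)
text = re.sub('<SCRIPT.*?>', '<!--', text)
text = re.sub('<style.*?>', '<!--', text)
text = re.sub('<script.*?>', '<!--', text)
# Turn the matching closing tags into comment closers.
text = re.sub('</STYLE>', '-->', text)
text = re.sub('</SCRIPT>', '-->', text)
text = re.sub('</style>', '-->', text)
text = re.sub('</script>', '-->', text)
# Drop everything that now sits inside a comment.
text = re.sub('<!--.*?-->', '', text)
# Replace each remaining tag with a space.
text = re.sub('<.*?>', ' ', text)
return text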
I was not able to get
text = re.sub('<script.*?>.*?</script>', '', text)
to work, hence the subterfuge above.
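One possible explanation, and it is only a guess since the input is snipped: by default '.' does not match a newline, so a script block that spans several lines never matches. re.sub in Python 2.1 takes no flags argument, but the flags can be written inside the pattern as (?is). A minimal sketch, tried in plain Python rather than inside a Zope Script (Python); the function name no_html and the sample string are invented for the example:

```python
import re

def no_html(data):
    # (?i) ignores case, so '<scRipt>' is covered;
    # (?s) lets '.' match newlines, so multi-line blocks are covered.
    text = re.sub(r'(?is)<style.*?>.*?</style>', '', data)
    text = re.sub(r'(?is)<script.*?>.*?</script>', '', text)
    # Remove HTML comments, then replace each remaining tag with a space.
    text = re.sub(r'(?s)<!--.*?-->', '', text)
    text = re.sub(r'(?s)<.*?>', ' ', text)
    return text

sample = '<p>Hi</p>\n<scRipt>\nvar x = 1;\n</SCRIPT>\n<b>there</b>'
print(repr(no_html(sample)))
```

This avoids re.compile altogether. Whether it also avoids the Unauthorized error described below is untested.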
Dylan: The re.compile lines did not work for me.
I'm using Zope 2.5.1 with Python 2.1.3 (Zope binary version) on FreeBSD 4.3. I remember having to specifically allow importing the re module, but perhaps re.compile needs something else? I get a 401 and the following traceback:
<!--
Traceback (innermost last):
File /usr/ken/www/zope/lib/python/ZPublisher/Publish.py, line 150, in publish_module
File /usr/ken/www/zope/lib/python/ZPublisher/Publish.py, line 114, in publish
File /usr/ken/www/zope/lib/python/Zope/__init__.py, line 159, in zpublisher_exception_hook
(Object: pages)
File /usr/ken/www/zope/lib/python/ZPublisher/Publish.py, line 98, in publish
File /usr/ken/www/zope/lib/python/ZPublisher/mapply.py, line 88, in mapply
(Object: viewPage)
File /usr/ken/www/zope/lib/python/ZPublisher/Publish.py, line 39, in call_object
(Object: viewPage)
File /usr/ken/www/zope/lib/python/OFS/DTMLMethod.py, line 127, in __call__
(Object: viewPage)
File /usr/ken/www/zope/lib/python/DocumentTemplate/DT_String.py, line 473, in __call__
(Object: viewPage)
File /usr/ken/www/zope/lib/python/DocumentTemplate/DT_With.py, line 76, in render
(Object: pageTest)
File /usr/ken/www/zope/lib/python/DocumentTemplate/DT_In.py, line 695, in renderwob
(Object: content)
File /usr/ken/www/zope/lib/python/DocumentTemplate/DT_Util.py, line 159, in eval
(Object: _.string.replace(noHTML(_['sequence-item'][1]), '. ', '<br><br>'))
(Info: noHTML)
File <string>, line 2, in f
File /usr/ken/www/zope/lib/python/Shared/DC/Scripts/Bindings.py, line 252, in __call__
(Object: noHTML)
File /usr/ken/www/zope/lib/python/Shared/DC/Scripts/Bindings.py, line 283, in _bindAndExec
(Object: noHTML)
File /usr/ken/www/zope/lib/python/Products/PythonScripts/PythonScript.py, line 302, in _exec
(Object: noHTML)
(Info: ({'script': <PythonScript instance at a4951d0>, 'context': <MyPage instance at a6c3c98>, 'container': <Folder instance at acfdb30>, 'traverse_subpath': []}, ('\r\n\r\n<!DOCTYPE HTML PUBLIC [snip]</html>\r\n\r\n',), {}, None))
File Script (Python), line 8, in noHTML
File /usr/ken/www/zope/lib/python/AccessControl/ZopeGuards.py, line 60, in guarded_getattr
File /usr/ken/www/zope/lib/python/AccessControl/SecurityManager.py, line 83, in validate
File /usr/ken/www/zope/lib/python/AccessControl/ZopeSecurityPolicy.py, line 145, in validate
Unauthorized: (see above)
-->
I hope for a more elegant solution which could be incorporated into the popular HTML converters.
Thanks again,
Ken