[Zope] Strip all HTML

ken at practical.org ken@practical.org
Tue Aug 5 16:47:39 EDT 2003


>On Tue, 2003-08-05 at 07:49, Paul Winkler wrote:
>>On Tue, Aug 05, 2003 at 06:39:11AM -0700, Dylan Reinhardt wrote:
>>> So you can try something like:
>>> 
>>> -----
>>> 
>>> import re
>>> 
>>> style = re.compile('<style.*?>.*?</style>', re.I | re.S)
>>> script = re.compile('<script.*?>.*?</script>', re.I | re.S)
>>> tags = re.compile('<.*?>', re.S)
>>> 
>>> return tags.sub('', script.sub('', style.sub('', text)))
>>
>>hmm... doesn't the tags pattern make the other two redundant?
>
>
>Not that I can see. You may be reading the last expression
>left-to-right, but it's *evaluated* inside (right) to outside (left).
>
>In some cases (<B>, for example) you want to remove only the tags. In
>other cases (<script>, <style>) you want to remove the enclosed contents
>too. I'm sure it's possible to come up with a single, highly-clever
>regex that does it all, but that's not what I was trying to demonstrate.

Thanks for your suggestions. I'm surprised that this has not been a problem for others.
My solution (for now) is ugly, but it works (until I run into something like '<scRipt>'):

import re

# Turn opening <style>/<script> tags into comment openers (one pass per case variant)
text = re.sub('<STYLE.*?>', '<!--', data)
text = re.sub('<SCRIPT.*?>', '<!--', text)
text = re.sub('<style.*?>', '<!--', text)
text = re.sub('<script.*?>', '<!--', text)
# Turn the matching closing tags into comment closers
text = re.sub('</STYLE>', '-->', text)
text = re.sub('</SCRIPT>', '-->', text)
text = re.sub('</style>', '-->', text)
text = re.sub('</script>', '-->', text)
# Drop the comments (including the former style/script contents), then all remaining tags
text = re.sub('<!--.*?-->', '', text)
text = re.sub('<.*?>', ' ', text)
return text

I was not able to get
text = re.sub('<script.*?>.*?</script>', '', text)
to work, hence the subterfuge above.
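A possible reason the single-pattern form fails: by default the match is case-sensitive and '.' does not match a newline, so '<SCRIPT>' or a script block spanning several lines slips through. The sketch below is one way around that, assuming inline flags are acceptable: '(?is)' at the start of the pattern switches on re.I and re.S without calling re.compile. The function name strip_html and the parameter name data are only illustrative.

```python
import re

def strip_html(data):
    # (?i) ignores case, (?s) lets '.' match newlines as well
    text = re.sub(r'(?is)<style.*?>.*?</style>', '', data)
    text = re.sub(r'(?is)<script.*?>.*?</script>', '', text)
    # Remove HTML comments, which may also span several lines
    text = re.sub(r'(?s)<!--.*?-->', '', text)
    # Replace every remaining tag with a space so adjacent words stay separated
    text = re.sub(r'(?s)<.*?>', ' ', text)
    return text
```

Because only the module-level re.sub is used, no method is ever called on a compiled pattern object. Inside a Script (Python) the function body could be used directly, with data as the script's parameter.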

Dylan: The re.compile lines did not work for me.
I'm using Zope 2.5.1 with Python 2.1.3 (Zope binary version) on FreeBSD 4.3. I remember having to specifically allow importing the re module, but perhaps re.compile needs something else? I get a 401 and the following traceback:

<!--
Traceback (innermost last):
  File /usr/ken/www/zope/lib/python/ZPublisher/Publish.py, line 150, in publish_module
  File /usr/ken/www/zope/lib/python/ZPublisher/Publish.py, line 114, in publish
  File /usr/ken/www/zope/lib/python/Zope/__init__.py, line 159, in zpublisher_exception_hook
    (Object: pages)
  File /usr/ken/www/zope/lib/python/ZPublisher/Publish.py, line 98, in publish
  File /usr/ken/www/zope/lib/python/ZPublisher/mapply.py, line 88, in mapply
    (Object: viewPage)
  File /usr/ken/www/zope/lib/python/ZPublisher/Publish.py, line 39, in call_object
    (Object: viewPage)
  File /usr/ken/www/zope/lib/python/OFS/DTMLMethod.py, line 127, in __call__
    (Object: viewPage)
  File /usr/ken/www/zope/lib/python/DocumentTemplate/DT_String.py, line 473, in __call__
    (Object: viewPage)
  File /usr/ken/www/zope/lib/python/DocumentTemplate/DT_With.py, line 76, in render
    (Object: pageTest)
  File /usr/ken/www/zope/lib/python/DocumentTemplate/DT_In.py, line 695, in renderwob
    (Object: content)
  File /usr/ken/www/zope/lib/python/DocumentTemplate/DT_Util.py, line 159, in eval
    (Object: _.string.replace(noHTML(_['sequence-item'][1]), '. ', '<br><br>'))
    (Info: noHTML)
  File <string>, line 2, in f
  File /usr/ken/www/zope/lib/python/Shared/DC/Scripts/Bindings.py, line 252, in __call__
    (Object: noHTML)
  File /usr/ken/www/zope/lib/python/Shared/DC/Scripts/Bindings.py, line 283, in _bindAndExec
    (Object: noHTML)
  File /usr/ken/www/zope/lib/python/Products/PythonScripts/PythonScript.py, line 302, in _exec
    (Object: noHTML)
    (Info: ({'script': <PythonScript instance at a4951d0>, 'context': <MyPage instance at a6c3c98>, 'container': <Folder instance at acfdb30>, 'traverse_subpath': []}, ('\r\n\r\n<!DOCTYPE HTML PUBLIC [snip]</html>\r\n\r\n',), {}, None))
  File Script (Python), line 8, in noHTML
  File /usr/ken/www/zope/lib/python/AccessControl/ZopeGuards.py, line 60, in guarded_getattr
  File /usr/ken/www/zope/lib/python/AccessControl/SecurityManager.py, line 83, in validate
  File /usr/ken/www/zope/lib/python/AccessControl/ZopeSecurityPolicy.py, line 145, in validate
Unauthorized: (see above)

-->
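The traceback ends in guarded_getattr, which suggests the security machinery refused an attribute lookup inside the script: either a name on the re module itself (compile, I, S) or the sub method of the compiled pattern object. A hedged sketch of the usual remedy follows, assuming the installed AccessControl provides allow_module and allow_type (both are used in the module_access_examples.py file shipped with the PythonScripts product in some Zope 2 releases; this has not been checked against 2.5.1). It would go in the __init__.py of a filesystem product; which product to use is left open here.

```python
# __init__.py of a filesystem product (trusted code, not a Script (Python))
import re
from AccessControl import allow_module, allow_type

# Let restricted code import re and use its module-level names
allow_module('re')

# Let restricted code call methods on compiled pattern objects ...
allow_type(type(re.compile('')))
# ... and on match objects, in case the scripts use match() or search()
allow_type(type(re.match('x', 'x')))
```

Zope has to be restarted before the declarations take effect. They open these types to every Python Script on the server, so this is a site-wide decision rather than a per-script one.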

I hope for a more elegant solution which could be incorporated into the popular HTML converters.

Thanks again,

Ken