On Tue, 2003-08-05 at 07:49, Paul Winkler wrote:
On Tue, Aug 05, 2003 at 06:39:11AM -0700, Dylan Reinhardt wrote:
So you can try something like:
-----
import re
style = re.compile('<style.*?>.*?</style>', re.I | re.S) script = re.compile('<script.*?>.*?</script>', re.I | re.S) tags = re.compile('<.*?>', re.S)
return tags.sub('', script.sub('', style.sub('', text)))
hmm... doesn't the tags pattern make the other two redundant?
Not that I can see. You may be reading the last expression left-to-right, but it's *evaluated* inside (right) to outside (left).
In some cases (<B>, for example) you want to remove only the tags. In other cases (<script>, <style>) you want to remove the enclosed contents too. I'm sure it's possible to come up with a single, highly-clever regex that does it all, but that's not what I was trying to demonstrate.
Thanks for your suggestions. I'm surprised that this has not been a problem for others. My solution (for now) is ugly, but it works (until I run into something like '<scRipt>'...: import string import re text = re.sub('<STYLE.*?>', '<!--', data) text = re.sub('<SCRIPT.*?>', '<!--', text) text = re.sub('<style.*?>', '<!--', text) text = re.sub('<script.*?>', '<!--', text) text = re.sub('</STYLE>', '-->', text) text = re.sub('</SCRIPT>', '-->', text) text = re.sub('</style>', '-->', text) text = re.sub('</script>', '-->', text) text = re.sub('<!--.*?-->', '', text) text = re.sub('<.*?>', ' ', text) return text I was not able to get text = re.sub('<script.*?>.*?</script>', '', text) to work, hence the subterfuge above. Dylan: The re.compile lines did not work for me. I'm using Zope 2.5.1 with Python 2.1.3 (Zope binary version) on FreeBSD4.3. I remember having to specifically allow importation of the re module, but perhaps re.compile needs something else?? I get a 401 and the following traceback: <!-- Traceback (innermost last): File /usr/ken/www/zope/lib/python/ZPublisher/Publish.py, line 150, in publish_module File /usr/ken/www/zope/lib/python/ZPublisher/Publish.py, line 114, in publish File /usr/ken/www/zope/lib/python/Zope/__init__.py, line 159, in zpublisher_exception_hook (Object: pages) File /usr/ken/www/zope/lib/python/ZPublisher/Publish.py, line 98, in publish File /usr/ken/www/zope/lib/python/ZPublisher/mapply.py, line 88, in mapply (Object: viewPage) File /usr/ken/www/zope/lib/python/ZPublisher/Publish.py, line 39, in call_object (Object: viewPage) File /usr/ken/www/zope/lib/python/OFS/DTMLMethod.py, line 127, in __call__ (Object: viewPage) File /usr/ken/www/zope/lib/python/DocumentTemplate/DT_String.py, line 473, in __call__ (Object: viewPage) File /usr/ken/www/zope/lib/python/DocumentTemplate/DT_With.py, line 76, in render (Object: pageTest) File /usr/ken/www/zope/lib/python/DocumentTemplate/DT_In.py, line 695, in renderwob (Object: content) File /usr/ken/www/zope/lib/python/DocumentTemplate/DT_Util.py, line 159, in eval (Object: _.string.replace(noHTML(_['sequence-item'][1]), '. ', '<br><br>')) (Info: noHTML) File <string>, line 2, in f File /usr/ken/www/zope/lib/python/Shared/DC/Scripts/Bindings.py, line 252, in __call__ (Object: noHTML) File /usr/ken/www/zope/lib/python/Shared/DC/Scripts/Bindings.py, line 283, in _bindAndExec (Object: noHTML) File /usr/ken/www/zope/lib/python/Products/PythonScripts/PythonScript.py, line 302, in _exec (Object: noHTML) (Info: ({'script': <PythonScript instance at a4951d0>, 'context': <MyPage instance at a6c3c98>, 'container': <Folder instance at acfdb30>, 'traverse_subpath': []}, ('\r\n\r\n<!DOCTYPE HTML PUBLIC [snip]</html>\r\n\r\n',), {}, None)) File Script (Python), line 8, in noHTML File /usr/ken/www/zope/lib/python/AccessControl/ZopeGuards.py, line 60, in guarded_getattr File /usr/ken/www/zope/lib/python/AccessControl/SecurityManager.py, line 83, in validate File /usr/ken/www/zope/lib/python/AccessControl/ZopeSecurityPolicy.py, line 145, in validate Unauthorized: (see above) --> I hope for a more elegant solution which could be incorporated into the popular HTML converters. Thanks again, Ken