On Tue, 2003-08-05 at 07:49, Paul Winkler wrote:
On Tue, Aug 05, 2003 at 06:39:11AM -0700, Dylan Reinhardt wrote:
So you can try something like:
-----
import re
style = re.compile('<style.*?>.*?</style>', re.I | re.S)
script = re.compile('<script.*?>.*?</script>', re.I | re.S)
tags = re.compile('<.*?>', re.S)
return tags.sub('', script.sub('', style.sub('', text)))
hmm... doesn't the tags pattern make the other two redundant?
Not that I can see. You may be reading the last expression left-to-right, but it's *evaluated* inside (right) to outside (left).
In some cases (<B>, for example) you want to remove only the tags. In other cases (<script>, <style>) you want to remove the enclosed contents too. I'm sure it's possible to come up with a single, highly-clever regex that does it all, but that's not what I was trying to demonstrate.
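The difference can be checked directly. The sketch below wraps the three patterns from the snippet earlier in the thread in a function and compares it with running the tags pattern alone. The function name `strip_html` and the sample string are made up for illustration.

```python
import re

# The three patterns from the snippet earlier in the thread.
style = re.compile('<style.*?>.*?</style>', re.I | re.S)
script = re.compile('<script.*?>.*?</script>', re.I | re.S)
tags = re.compile('<.*?>', re.S)

def strip_html(text):
    # Hypothetical wrapper name. The innermost call runs first:
    # style blocks go, then script blocks, then any remaining tags.
    return tags.sub('', script.sub('', style.sub('', text)))

# Made-up sample input with mixed-case script tags.
sample = '<B>Hello</B><scRipt type="text/javascript">\nalert("x")\n</SCRIPT> world'

tags_only = tags.sub('', sample)   # tags pattern alone
all_three = strip_html(sample)     # style, script, then tags

print(repr(tags_only))  # the script body is still in the text
print(repr(all_three))  # the script body is gone
```

With the tags pattern alone, only the markup disappears and the script body stays in the output, which is why the other two patterns are not redundant.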
Thanks for your suggestions. I'm surprised that this has not been a problem for others. My solution (for now) is ugly, but it works (until I run into something like '<scRipt>'...):

import string
import re
text = re.sub('<STYLE.*?>', '<!--', data)
text = re.sub('<SCRIPT.*?>', '<!--', text)
text = re.sub('<style.*?>', '<!--', text)
text = re.sub('<script.*?>', '<!--', text)
text = re.sub('</STYLE>', '-->', text)
text = re.sub('</SCRIPT>', '-->', text)
text = re.sub('</style>', '-->', text)
text = re.sub('</script>', '-->', text)
text = re.sub('<!--.*?-->', '', text)
text = re.sub('<.*?>', ' ', text)
return text

I was not able to get

text = re.sub('<script.*?>.*?</script>', '', text)

to work, hence the subterfuge above.

Dylan: The re.compile lines did not work for me. I'm using Zope 2.5.1 with Python 2.1.3 (Zope binary version) on FreeBSD4.3. I remember having to specifically allow importation of the re module, but perhaps re.compile needs something else?? I get a 401 and the following traceback:

<!--
Traceback (innermost last):
  File /usr/ken/www/zope/lib/python/ZPublisher/Publish.py, line 150, in publish_module
  File /usr/ken/www/zope/lib/python/ZPublisher/Publish.py, line 114, in publish
  File /usr/ken/www/zope/lib/python/Zope/__init__.py, line 159, in zpublisher_exception_hook
    (Object: pages)
  File /usr/ken/www/zope/lib/python/ZPublisher/Publish.py, line 98, in publish
  File /usr/ken/www/zope/lib/python/ZPublisher/mapply.py, line 88, in mapply
    (Object: viewPage)
  File /usr/ken/www/zope/lib/python/ZPublisher/Publish.py, line 39, in call_object
    (Object: viewPage)
  File /usr/ken/www/zope/lib/python/OFS/DTMLMethod.py, line 127, in __call__
    (Object: viewPage)
  File /usr/ken/www/zope/lib/python/DocumentTemplate/DT_String.py, line 473, in __call__
    (Object: viewPage)
  File /usr/ken/www/zope/lib/python/DocumentTemplate/DT_With.py, line 76, in render
    (Object: pageTest)
  File /usr/ken/www/zope/lib/python/DocumentTemplate/DT_In.py, line 695, in renderwob
    (Object: content)
  File /usr/ken/www/zope/lib/python/DocumentTemplate/DT_Util.py, line 159, in eval
    (Object: _.string.replace(noHTML(_['sequence-item'][1]), '. ', '<br><br>'))
    (Info: noHTML)
  File <string>, line 2, in f
  File /usr/ken/www/zope/lib/python/Shared/DC/Scripts/Bindings.py, line 252, in __call__
    (Object: noHTML)
  File /usr/ken/www/zope/lib/python/Shared/DC/Scripts/Bindings.py, line 283, in _bindAndExec
    (Object: noHTML)
  File /usr/ken/www/zope/lib/python/Products/PythonScripts/PythonScript.py, line 302, in _exec
    (Object: noHTML)
    (Info: ({'script': <PythonScript instance at a4951d0>, 'context': <MyPage instance at a6c3c98>, 'container': <Folder instance at acfdb30>, 'traverse_subpath': []}, ('\r\n\r\n<!DOCTYPE HTML PUBLIC [snip]</html>\r\n\r\n',), {}, None))
  File Script (Python), line 8, in noHTML
  File /usr/ken/www/zope/lib/python/AccessControl/ZopeGuards.py, line 60, in guarded_getattr
  File /usr/ken/www/zope/lib/python/AccessControl/SecurityManager.py, line 83, in validate
  File /usr/ken/www/zope/lib/python/AccessControl/ZopeSecurityPolicy.py, line 145, in validate
Unauthorized: (see above)
-->

I hope for a more elegant solution which could be incorporated into the popular HTML converters.

Thanks again,
Ken
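One likely reason the single-pattern re.sub call matched nothing: without the DOTALL flag, '.' does not match a newline, so a script block that spans several lines is never matched. Flags can be written inside the pattern string itself as (?i) and (?s), which avoids calling re.compile. The following is a minimal sketch only. Whether a restricted Python Script on this Zope setup accepts it has not been verified, and the function name `no_html` and the sample data are made up.

```python
import re

def no_html(data):
    # (?i) ignores case, so <scRipt> and <STYLE> are covered too;
    # (?s) lets '.' match newlines, so multi-line blocks match.
    text = re.sub(r'(?is)<style.*?>.*?</style>', '', data)
    text = re.sub(r'(?is)<script.*?>.*?</script>', '', text)
    # Drop HTML comments, including ones that span lines.
    text = re.sub(r'(?s)<!--.*?-->', '', text)
    # Replace each remaining tag with a space, as in the message above.
    return re.sub(r'(?s)<.*?>', ' ', text)

# Made-up sample with a mixed-case, multi-line script block.
data = '<P>one</P><scRipt>\nvar a = 1;\n</SCRIPT><!-- note\n --><p>two</p>'
result = no_html(data)
print(repr(result))
```

This collapses the eight case-specific substitutions into two and no longer depends on rewriting script and style blocks as comments first.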
It seems as if letting things like <scRipt> through makes the whole exercise pointless. Surely people who play these games think of things like that.

You might consider using one of the existing HTML strippers out there, like stripogram (available in the squishdot distribution) or SafeHTML. Unfortunately, neither of them deals correctly with this example:

<A HREF="http://example.com/comment.cgi? mycomment=<SCRIPT>malicious code</SCRIPT>">malicious code"> Click here</A>

which is mentioned in the CERT advisory at http://www.cert.org/advisories/CA-2000-02.html

HTH.
Alex.
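For comparison, here is what the regex approach from earlier in this thread does with that example. This is only a sketch of the script and tags patterns applied in sequence. It says nothing about how stripogram or SafeHTML behave.

```python
import re

script = re.compile('<script.*?>.*?</script>', re.I | re.S)
tags = re.compile('<.*?>', re.S)

# The example from the CERT advisory, as quoted in the message above.
example = ('<A HREF="http://example.com/comment.cgi? '
           'mycomment=<SCRIPT>malicious code</SCRIPT>">malicious code"> '
           'Click here</A>')

# Script blocks are removed first, then any remaining tags.
stripped = tags.sub('', script.sub('', example))
print(repr(stripped))  # 'malicious code"> Click here'
```

The script block inside the attribute value is removed, but the non-greedy tags pattern ends the opening tag at the first '>', so a stray '">' fragment is left in the output.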
Alex Coventry wrote:
<A HREF="http://example.com/comment.cgi? mycomment=<SCRIPT>malicious code</SCRIPT>">malicious code"> Click here</A>
What would you expect to have happen here? cheers, Chris
On Tue, Aug 05, 2003 at 05:47:39PM +0200, ken@practical.org wrote:
text = re.sub('</SCRIPT>', '-->', text)
text = re.sub('</style>', '-->', text)
text = re.sub('</script>', '-->', text)
text = re.sub('</STYLE>', '-->', text)
Note that for the simplest of these expressions, e.g. re.sub('</STYLE>', '-->', text), it's equivalent (but faster) to do text.replace('</STYLE>', '-->'). But that's a quibble.
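A quick check of that equivalence, using a made-up input string. It only holds when the pattern contains no regex metacharacters, as is the case for the closing tags here.

```python
import re

# Made-up sample input.
text = '<STYLE>p { color: red }</STYLE>'

via_re = re.sub('</STYLE>', '-->', text)      # regex substitution
via_str = text.replace('</STYLE>', '-->')     # plain string method

print(via_re == via_str)  # True
```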
Dylan: The re.compile lines did not work for me. I'm using Zope 2.5.1 with Python 2.1.3 (Zope binary version) on FreeBSD4.3. I remember having to specifically allow importation of the re module, but perhaps re.compile needs something else?? I get a 401 and the following traceback:
(snip)

You might be right. I'd suggest doing it as an External Method, then you can use all of the re module without restrictions. The re.compile version should be cleaner and more reliable than explicitly checking different cases.

--
Paul Winkler
http://www.slinkp.com
Look! Up in the sky! It's THE SEISMIC GIRL!
(random hero from isometric.spaceninja.com)
On Tue, 2003-08-05 at 08:47, ken@practical.org wrote:
Dylan: The re.compile lines did not work for me.
Ah, yes. I suppose I should have seen that coming. By default, Python Scripts restrict what you can do. You'll either need to run that code in an external method or configure Zope to allow re (and creation of re objects, IIRC) in Python scripts. HTH, Dylan
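A minimal sketch of the External Method route, assuming a module file in the Zope instance's Extensions directory that is registered as an External Method. The file name noHTML.py is hypothetical, and the function name simply mirrors the script name in the traceback.

```python
# Extensions/noHTML.py  (hypothetical file name)
import re

# Compiled when the module is loaded. External Method code runs as
# ordinary Python, so re.compile is not blocked by the Python Script
# restrictions.
_style = re.compile('<style.*?>.*?</style>', re.I | re.S)
_script = re.compile('<script.*?>.*?</script>', re.I | re.S)
_tags = re.compile('<.*?>', re.S)

def noHTML(text):
    """Remove style and script blocks, then any remaining tags."""
    return _tags.sub('', _script.sub('', _style.sub('', text)))
```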
Dylan Reinhardt wrote:
By default, Python Scripts restrict what you can do. You'll either need to run that code in an external method or configure Zope to allow re (and creation of re objects, IIRC) in Python scripts.
My memory tells me that I spent a fair bit of time making the stripogram functions work from restricted Python space, so I would suggest that as the way to go. I'm not sure exactly what Ken is trying to achieve, or how he tried stripogram, but I'd be surprised if it didn't support what he wanted to do... cheers, Chris
participants (5)
- Alex Coventry
- Chris Withers
- Dylan Reinhardt
- ken@practical.org
- Paul Winkler