[Zope] Malicious HTML
Duncan Booth
duncan@rcp.co.uk
Wed, 8 Mar 2000 17:31:59 +0000
> Graham Chiu wrote:
> >
> > I wish to allow users to enter comments into my database which are then
> > viewable thru the browser.
> >
> > Is there a Zope function that I can pass their text thru to remove all
> > HTML?
>
> Nope. There may be some standard python library module that I don't
> know about, however. Otherwise, you will have to write your own.
>
> -Michel
>
Try htmllib.
The following bit of Python strips all formatting from some HTML. It
replaces each anchor with a footnote-style reference and each image with
its alt text. If you want something a bit fancier, you could add
methods to the MyParser class to pass particular tags through (see
the commented-out methods for an example). It shouldn't be too hard
to wrap something like this up in an external method; as presented, it
is a complete runnable program that retrieves a URL and displays
the text.
--------- File strip.py --------------
# Strip all HTML formatting.
import sys,formatter,StringIO,htmllib,string
from urllib import urlretrieve,urlcleanup

class MyParser(htmllib.HTMLParser):
    def __init__(self):
        self.bodytext = StringIO.StringIO()
        writer = formatter.DumbWriter(self.bodytext)
        htmllib.HTMLParser.__init__(self,
            formatter.AbstractFormatter(writer))

    def gettext(self):
        return self.bodytext.getvalue()

    # Uncomment these to pass through bold tags.
    # def start_b(self, attrs):
    #     self.formatter.add_flowing_data('<b>')
    #
    # def end_b(self):
    #     self.formatter.add_flowing_data('</b>')

def GetPage(url):
    try:
        fn, h = urlretrieve(url)
        text = open(fn, "r").read()
    finally:
        urlcleanup()
    return text

if __name__=='__main__':
    data = GetPage(sys.argv[1])
    p = MyParser()
    p.feed(data)
    p.close()
    text = string.replace(p.gettext(), '\xa0', ' ')
    print text
    anchors = p.anchorlist
    for i in range(len(anchors)):
        print "[%d]: %s" % (i+1, anchors[i])
--------- end of strip.py --------------
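
A note for anyone trying this on a later Python: strip.py is Python 2
only, since htmllib does not exist in Python 3 and formatter has since
been removed too. A rough equivalent built on the standard html.parser
module might look like the sketch below. It is not a drop-in port: the
names TextExtractor and strip_html are made up for illustration, it does
none of DumbWriter's line wrapping, and dropping the contents of
script and style elements is an added assumption rather than something
strip.py does explicitly.

```python
# Sketch only: a Python 3 approximation of strip.py using html.parser.
# TextExtractor and strip_html are illustrative names, not an existing API.
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text, number anchors footnote-style, keep image alt text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []          # pieces of output text, joined at the end
        self.anchorlist = []     # href values, in order of appearance
        self._skip = 0           # nesting depth inside <script>/<style>
        self._in_anchor = False  # True between <a href=...> and </a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a" and attrs.get("href"):
            self.anchorlist.append(attrs["href"])
            self._in_anchor = True
        elif tag == "img":
            # Replace the image with its alt text (empty if there is none).
            self.parts.append(attrs.get("alt") or "")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a" and self._in_anchor:
            # Footnote-style marker, numbered from 1.
            self.parts.append("[%d]" % len(self.anchorlist))
            self._in_anchor = False

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_html(text):
    """Return (plain_text, anchors) for a piece of HTML."""
    parser = TextExtractor()
    parser.feed(text)
    parser.close()
    # Same non-breaking-space cleanup as in strip.py.
    plain = "".join(parser.parts).replace("\xa0", " ")
    return plain, parser.anchorlist


if __name__ == "__main__":
    plain, anchors = strip_html(
        '<p>See <a href="http://example.com/">this page</a>.</p>')
    print(plain)
    for i, href in enumerate(anchors):
        print("[%d]: %s" % (i + 1, href))
```

The result is plain text with entities already decoded, so a comment
containing &lt;b&gt; comes back as a literal <b>. It still needs to be
HTML-quoted when it is rendered in a page (for instance with html_quote
in DTML).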
--
Duncan Booth duncan@dales.rmplc.co.uk
int month(char *p){return(124864/((p[0]+p[1]-p[2]&0x1f)+1)%12)["\5\x8\3"
"\6\7\xb\1\x9\xa\2\0\4"];} // Who said my code was obscure?
http://dales.rmplc.co.uk/Duncan