[ZPT] OT (and probably a bit long ;-) HTML Filtering
Chris Withers
chrisw@nipltd.com
Wed, 16 May 2001 12:25:37 +0100
Hi :-)
I'm now onto my fourth or fifth mailing list here but I think this is finally
the right list, even if this may seem a bit off topic :-)
I have a Python module called Strip-O-Gram
(http://www.zope.org/Members/chrisw/StripOGram) which is supposed to take dodgy
HTML and filter it, closing any open tags, removing JavaScript, etc.
This was originally written for Squishdot (http://www.squishdot.org) but other
people seem to be finding it useful now so I'm trying to make it work like it
should :-)
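To make "filter it, closing any open tags, removing JavaScript" concrete, here is a minimal sketch of that kind of whitelist filter. It is not Strip-O-Gram's actual code: it assumes a current Python, where sgmllib is no longer available, and uses the standard library's html.parser in its place. The names WhitelistFilter and filter_html are made up for illustration.

```python
from html import escape
from html.parser import HTMLParser


class WhitelistFilter(HTMLParser):
    """Hypothetical sketch: keep whitelisted tags only, close what is left open."""

    VOID_TAGS = {"br", "hr"}             # tags that never get a closing tag
    DROP_CONTENT = {"script", "style"}   # tags whose text content is discarded

    def __init__(self, valid_tags):
        super().__init__(convert_charrefs=True)
        self.valid_tags = {t.lower() for t in valid_tags}
        self.out = []          # pieces of the filtered output
        self.open_tags = []    # stack of whitelisted tags still open
        self.skip = 0          # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag names before calling this method.
        if tag in self.DROP_CONTENT:
            self.skip += 1
            return
        if tag not in self.valid_tags:
            return
        self.out.append("<%s>" % tag)    # attributes are dropped entirely
        if tag not in self.VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.DROP_CONTENT:
            self.skip = max(0, self.skip - 1)
            return
        if tag not in self.open_tags:
            return                       # stray or non-whitelisted closing tag
        # Close everything opened since the matching start tag.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        if not self.skip:
            # Re-escape text so leftover markup cannot reappear as a tag.
            self.out.append(escape(data, quote=False))

    def result(self):
        self.close()                     # flush any buffered trailing text
        while self.open_tags:
            self.out.append("</%s>" % self.open_tags.pop())
        return "".join(self.out)


def filter_html(text, valid_tags):
    parser = WhitelistFilter(valid_tags)
    parser.feed(text)
    return parser.result()
```

With this sketch, filter_html('Roses <b>are</B> red,<br/>violets <i>are blue', ['b', 'i', 'br']) should give 'Roses <b>are</b> red,<br>violets <i>are blue</i>'. The design choice is that the output is rebuilt from scratch: every tag in it is written by the filter from the whitelist, and all text is escaped, so the result does not rely on the parser tokenising broken markup correctly.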
Anyway, J M Cerqueira Esteves <jmce@artenumerica.com> found some problems with
it and reported them on the Zope list:
> > html2safehtml ('Roses <b>are</B> red,<br/<blink>QUACK<//blink> violets '
> > '<i>are</i> blue',
> > valid_tags=['b','i','br'])
> >
> > successfully smuggling a <blink>...</blink> inside the result:
> >
> > 'Roses <b>are</b> red,<br><blink>QUACK</blink> violets <i>are</i> blue'
> >
> > (Notice that the closing '</i>' is now OK again, and that I had to use
> > '<//blink>' in order to get '</blink>'.)
The problem here seems to be with the parser in sgmllib.py:
> When parsing the following HTML:
>
> 'Roses <b>are</B> red,<br/>violets <i>are</i> blue'
>
> ...with the following class:
>
> import sgmllib
>
> class HTML2SafeHTML(sgmllib.SGMLParser):
>     # Python 2 code: print each parser callback as it fires
>
>     def handle_data(self, data):
>         print "***data***"
>         print data
>
>     def unknown_starttag(self, tag, attrs):
>         print "***start**"
>         print tag
>         print (attrs)
>
>     def unknown_endtag(self, tag):
>         print "***end**"
>         print tag
>
> I get the following output, which isn't right :-S
>
> ***data***
> Roses
> ***start**
> b
> []
> ***data***
> are
> ***end**
> b
> ***data***
> red,
> ***start**
> br
> []
> ***data***
> >violets <i>are<
> ***end**
> br
> ***data***
> i> blue
(sorry for that being so long...)
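For comparison, the same kind of callback trace can be collected into a list instead of printed, which makes it easier to check against what you expect. This is only a sketch: it assumes a current Python and uses html.parser rather than sgmllib, so it says nothing about how sgmllib itself behaves, and EventTracer and trace are made-up names.

```python
from html.parser import HTMLParser


class EventTracer(HTMLParser):
    """Hypothetical helper: record parser callbacks for inspection."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def trace(text):
    tracer = EventTracer()
    tracer.feed(text)
    tracer.close()   # flush any buffered trailing text
    return tracer.events
```

For the well-formed sample above, the trace I would expect is start and end events for b, br and i, with the text chunks in between, rather than '<i>' leaking into the data.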
Anyway, Ethan pointed out that you guys have probably got quite good at this
sort of thing while developing ZPT...
So, how should I be approaching this problem?
many thanks for any help,
Chris