[Zope] Strip all HTML
Dylan Reinhardt
zope@dylanreinhardt.com
Tue Aug 5 16:02:28 EDT 2003
On Tue, 2003-08-05 at 07:49, Paul Winkler wrote:
> On Tue, Aug 05, 2003 at 06:39:11AM -0700, Dylan Reinhardt wrote:
> > So you can try something like:
> >
> > -----
> >
> > import re
> >
> > style = re.compile('<style.*?>.*?</style>', re.I | re.S)
> > script = re.compile('<script.*?>.*?</script>', re.I | re.S)
> > tags = re.compile('<.*?>', re.S)
> >
> > return tags.sub('', script.sub('', style.sub('', text)))
>
> hmm... doesn't the tags pattern make the other two redundant?
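Not that I can see. You may be reading the last expression
left-to-right, but it's *evaluated* from the inside (right) to the
outside (left): style.sub() runs first, then script.sub(), and
tags.sub() runs last on whatever is left.
In some cases (<B>, for example) you want to remove only the tags. In
other cases (<script>, <style>) you want to remove the enclosed contents
too. The tags pattern alone would strip the <script> and </script> tags
but leave the code between them in the output. I'm sure it's possible to
come up with a single, highly clever regex that does it all, but that's
not what I was trying to demonstrate.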
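
To make the difference concrete, here is a minimal sketch using the same
three patterns. The helper names strip_tags_only and strip_all and the
sample string are made up for illustration; only the patterns come from
the snippet quoted above.

```python
import re

# Same three patterns as in the quoted snippet.
style = re.compile(r'<style.*?>.*?</style>', re.I | re.S)
script = re.compile(r'<script.*?>.*?</script>', re.I | re.S)
tags = re.compile(r'<.*?>', re.S)

def strip_tags_only(text):
    # Hypothetical helper: removes the tags but keeps whatever they enclose.
    return tags.sub('', text)

def strip_all(text):
    # Hypothetical helper: evaluated inside-out, so style blocks go first,
    # then script blocks, then any remaining tags.
    return tags.sub('', script.sub('', style.sub('', text)))

# Illustrative input, not from the original thread.
sample = '<b>Hello</b><script>alert("x")</script> world'

print(strip_tags_only(sample))  # script body survives as plain text
print(strip_all(sample))        # script body is removed along with its tags
```

With only the tags pattern, the script body ends up in the output as
plain text, which is why the other two substitutions are not redundant.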
Dylan