[Zope] Strip all HTML

Dylan Reinhardt zope@dylanreinhardt.com
Tue Aug 5 16:02:28 EDT 2003


On Tue, 2003-08-05 at 07:49, Paul Winkler wrote:
> On Tue, Aug 05, 2003 at 06:39:11AM -0700, Dylan Reinhardt wrote:
> > So you can try something like:
> > 
> > -----
> > 
> > import re
> > 
> > style = re.compile('<style.*?>.*?</style>', re.I | re.S)
> > script = re.compile('<script.*?>.*?</script>', re.I | re.S)
> > tags = re.compile('<.*?>', re.S)
> > 
> > return tags.sub('', script.sub('', style.sub('', text)))
> 
> hmm... doesn't the tags pattern make the other two redundant?

Not that I can see.  You may be reading the last expression
left-to-right, but it's *evaluated* from the innermost call outward:
style.sub() runs first, then script.sub(), and tags.sub() runs last.

In some cases (<B>, for example) you want to remove only the tags.  In
other cases (<script>, <style>) you want to remove the enclosed contents
too.  The tags pattern alone would strip the <script> and <style> tags
but leave the code and CSS between them behind as plain text.  I'm sure
it's possible to come up with a single, highly clever regex that does
it all, but that's not what I was trying to demonstrate.
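
To make the difference concrete, here is a minimal sketch that runs
outside Zope.  The strip_html() wrapper and the sample string are made
up for illustration; only the three patterns come from the script
quoted above.

```python
import re

# The three patterns from the quoted script.
style = re.compile('<style.*?>.*?</style>', re.I | re.S)
script = re.compile('<script.*?>.*?</script>', re.I | re.S)
tags = re.compile('<.*?>', re.S)

def strip_html(text):
    # Hypothetical wrapper: same nesting as the one-liner, unrolled so
    # the evaluation order (innermost call first) is explicit.
    text = style.sub('', text)    # 1. drop <style> blocks and their contents
    text = script.sub('', text)   # 2. drop <script> blocks and their contents
    return tags.sub('', text)     # 3. drop whatever tags are left

# Made-up sample input.
sample = ('<p>Hi <b>there</b></p>'
          '<script>alert("x")</script>'
          '<style>p {color: red}</style>')

print(strip_html(sample))    # Hi there
print(tags.sub('', sample))  # Hi therealert("x")p {color: red}
```

With the tags pattern alone, the JavaScript and CSS survive as text,
which is why the other two patterns have to run first.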

Dylan




