On Tue, 2003-08-05 at 07:49, Paul Winkler wrote:
On Tue, Aug 05, 2003 at 06:39:11AM -0700, Dylan Reinhardt wrote:
So you can try something like:
-----
import re
style = re.compile('<style.*?>.*?</style>', re.I | re.S)
script = re.compile('<script.*?>.*?</script>', re.I | re.S)
tags = re.compile('<.*?>', re.S)
return tags.sub('', script.sub('', style.sub('', text)))
hmm... doesn't the tags pattern make the other two redundant?
Not that I can see. You may be reading the last expression left-to-right, but it's *evaluated* inside (right) to outside (left).

In some cases (<B>, for example) you want to remove only the tags. In other cases (<script>, <style>) you want to remove the enclosed contents too.

I'm sure it's possible to come up with a single, highly clever regex that does it all, but that's not what I was trying to demonstrate.

Dylan
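
To make the difference concrete, here is a minimal sketch that runs the same three patterns against a small sample. The function names (strip_tags_only, strip_html) and the sample string are made up for illustration and are not part of the original snippet; the patterns themselves are unchanged.

```python
import re

# Same three patterns as in the snippet above.
style = re.compile('<style.*?>.*?</style>', re.I | re.S)
script = re.compile('<script.*?>.*?</script>', re.I | re.S)
tags = re.compile('<.*?>', re.S)


def strip_tags_only(text):
    # Hypothetical helper: applies only the generic tags pattern.
    # The markup goes away, but script and style *contents* stay.
    return tags.sub('', text)


def strip_html(text):
    # Hypothetical helper wrapping the original expression.
    # Evaluated inside-out: style blocks first, then script blocks,
    # and only then whatever tags are left.
    return tags.sub('', script.sub('', style.sub('', text)))


# Made-up sample input, for illustration only.
sample = ('<p>Hello <b>world</b></p>'
          '<script>alert("x")</script>'
          '<style>p {color: red}</style>')

print(strip_tags_only(sample))  # Hello worldalert("x")p {color: red}
print(strip_html(sample))       # Hello world
```

With the tags pattern alone, the JavaScript and CSS text leaks into the output, which is why the other two patterns have to run first.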