You didn't mention what problem you're having... but it would appear that case-sensitive matching is one of them.

re.sub (sadly) doesn't support flags like I (ignore case) or S (dot matches newline character). I have no idea why not.

However, not all is lost. re.compile supports flags and will give you an object that has a sub method. Go figure. So you can try something like:

-----
import re

style = re.compile('<style.*?>.*?</style>', re.I | re.S)
script = re.compile('<script.*?>.*?</script>', re.I | re.S)
tags = re.compile('<.*?>', re.S)

return tags.sub('', script.sub('', style.sub('', text)))
-----

Note that in this case there is no need to check for comments separately... they'll be matched by the tags pattern.

Once that works, you may want to do some other things like replace <br> with line breaks, etc. But this should be enough to make progress with.

HTH,

Dylan

On Tue, 2003-08-05 at 05:26, ken@practical.org wrote:
Hi all,
I want to display a text-only version of a web page captured with the DocumentLibrary product (no longer supported).
This product uses the 'Catalog Support' HTML converter available here:
http://www.dieter.handshake.de/pyprojects/zope/CatalogSupport.html
However, this converter, like the others I have tried (Strip-o-Gram, as well as an external method based on striphtml.py), seems unable to remove the content of <style></style> or <script></script> tags. So I get plenty of hits with a search for 'children' or 'window' or 'background'...
Has anyone else confronted this problem?
I have also made feeble attempts such as the following Script (Python), without success:
import string
import re

text = re.sub('<STYLE.*?>.*?</STYLE>', '', data)
text = re.sub('<STYLE.*?>.*?</STYLE>', '', text)
text = re.sub('<style.*?>.*?</style>', '', text)
text = re.sub('<script.*?>.*?</script>', '', text)
text = re.sub('<!--.*?-->', '', text)
text = re.sub('<.*?>', ' ', text)
return text
I sure would appreciate some help on this...
Thanks,
Ken
_______________________________________________
Zope maillist - Zope@zope.org
http://mail.zope.org/mailman/listinfo/zope
** No cross posts or HTML encoding! **
(Related lists -
 http://mail.zope.org/mailman/listinfo/zope-announce
 http://mail.zope.org/mailman/listinfo/zope-dev )
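
For anyone who wants to try the compiled-pattern approach from the reply at the top of this thread outside of a Script (Python), here is a minimal standalone sketch. It is only an illustration: the names strip_html, STYLE, SCRIPT, BREAK, TAGS and sample are made up for this example, and the <br> pattern is an assumption about what "replace <br> with line breaks" might look like, not something taken from the reply. The sample page deliberately uses an upper-case, multi-line <STYLE> block, which seems to be the case the flag-less patterns in the original attempt would miss.

```python
import re

# Hypothetical names, chosen for this sketch only.
# re.I lets <style> also match <STYLE>; re.S lets '.' span newlines,
# so a block that runs over several lines is matched as a whole.
STYLE = re.compile(r'<style.*?>.*?</style>', re.I | re.S)
SCRIPT = re.compile(r'<script.*?>.*?</script>', re.I | re.S)
BREAK = re.compile(r'<br\s*/?>', re.I)  # assumption: plain <br> or <br/>
TAGS = re.compile(r'<.*?>', re.S)


def strip_html(text):
    """Remove style/script blocks, turn <br> into newlines, drop other tags."""
    text = STYLE.sub('', text)
    text = SCRIPT.sub('', text)
    text = BREAK.sub('\n', text)  # must run before TAGS, which would eat <br>
    return TAGS.sub('', text)


# Made-up sample page with an upper-case, multi-line style block.
sample = """<html><head>
<STYLE type="text/css">
body { background: white }
</STYLE>
<script>window.open('x')</script>
</head><body>Hello<br>world<!-- note --></body></html>"""

print(strip_html(sample).strip())
```

With this sample, the words 'background' and 'window' no longer appear in the result, and the comment is removed by the generic tag pattern, as the reply suggests. The order of the substitutions matters: the style and script blocks have to go first, otherwise the tag pattern strips only the tags and leaves their contents behind.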