[Zope] Strip all HTML
ken at practical.org
ken@practical.org
Tue Aug 5 13:26:54 EDT 2003
Hi all,
I want to display a text-only version of a web page captured with the DocumentLibrary product (no longer supported).
This product uses the 'Catalog Support' HTML converter available here:
http://www.dieter.handshake.de/pyprojects/zope/CatalogSupport.html
However, this converter, like the others I have tried (Strip-o-Gram, as well as an external method based on striphtml.py), seems unable to remove the content of <style></style> or <script></script> tags. So a search for 'children', 'window' or 'background' returns plenty of spurious hits...
Has anyone else confronted this problem?
I have also made feeble attempts such as the following Script (Python), without success:
import re
text = re.sub('<STYLE.*?>.*?</STYLE>', '', data)
text = re.sub('<style.*?>.*?</style>', '', text)
text = re.sub('<script.*?>.*?</script>', '', text)
text = re.sub('<!--.*?-->', '', text)
text = re.sub('<.*?>', ' ', text)
return text
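One likely reason the attempt above leaves style and script content behind: in Python's re module, '.' does not match a newline unless re.DOTALL is set, and these blocks usually span several lines. Matching is also case-sensitive unless re.IGNORECASE is given. The following is only a minimal sketch under those assumptions, not a tested fix for DocumentLibrary; the function name strip_html is made up for illustration. It also assumes the code runs somewhere the re module may be imported (for example an External Method), since a Script (Python) may restrict that import.

```python
import re

# DOTALL lets '.' span newlines; IGNORECASE covers <STYLE>, <Script>, etc.
FLAGS = re.IGNORECASE | re.DOTALL

# <style>...</style> or <script>...</script>, including everything inside.
BLOCK_RE = re.compile(r'<(style|script)\b[^>]*>.*?</\1\s*>', FLAGS)
# HTML comments, which may also span several lines.
COMMENT_RE = re.compile(r'<!--.*?-->', FLAGS)
# Any remaining tag; [^>] matches newlines, so no flag is needed here.
TAG_RE = re.compile(r'<[^>]*>')


def strip_html(data):
    """Return data without style/script blocks, comments, or tags.

    Hypothetical helper: a regex-based approximation, not a real HTML parser.
    """
    text = BLOCK_RE.sub(' ', data)
    text = COMMENT_RE.sub(' ', text)
    text = TAG_RE.sub(' ', text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return ' '.join(text.split())
```

Called as strip_html(data), this would stand in for the chain of re.sub calls above. Regular expressions cannot handle every malformed page (a '>' inside an attribute value, an unclosed script tag), so a parser-based converter remains the more robust route.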
I sure would appreciate some help on this...
Thanks,
Ken