Chris Withers wrote:
> Are there any other tags where the content should be removed?

AFAICT, the HTML elements that need to be removed together with their content are: style, script, noscript and noframes. At least those are the most common non-proprietary ones.

My strategy was to transform the opening tag into '<!--' and the closing one into '-->', and then get rid of '<!--.*?-->', but there must be a cleverer way.

I would love to have a fix for Dieter's CatalogSupport.py, since that module was intended for my first use case: to prevent indexing of irrelevant markup. It is already used by the DocumentLibrary product. My other use case, the display of a text-only version of a web page, also requires removal of all markup and markup-related content.

Is there a reason for any of the HTML conversion modules *not* to incorporate this addition? I am just surprised that no one has reported it as a problem.

Thanks to those who are contributing to this thread!

Ken
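
P.S. For what it's worth, here is a rough sketch of doing the removal in one pass instead of going through the '<!--' / '-->' conversion described above. It assumes plain Python with the standard re module. The names CONTENT_TAGS and strip_content_tags are made up for illustration and are not part of CatalogSupport.py or any existing conversion module. One reason to skip the comment trick is that a script body which already contains '-->' would end the match early. This sketch is regex-based, so it will not cope with malformed markup such as an unclosed tag or a '>' inside an attribute value.

```python
import re

# Elements to drop together with their content (the list given in this message).
CONTENT_TAGS = ("style", "script", "noscript", "noframes")

# <tag ...> ... </tag> for any listed tag. The backreference \1 ties the
# closing tag to the opening one; DOTALL lets '.' span line breaks.
_CONTENT_RE = re.compile(
    r"<(%s)\b[^>]*>.*?</\1\s*>" % "|".join(CONTENT_TAGS),
    re.IGNORECASE | re.DOTALL,
)


def strip_content_tags(html):
    """Return html with the listed elements and everything inside them removed."""
    return _CONTENT_RE.sub("", html)
```

The remaining markup would still need to go through whatever tag stripping is already in place. This step only takes care of the elements whose content should not survive as text.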