Re: [Zope] Strip all HTML
Chris Withers wrote:
Are there any other tags where the content should be removed?
AFAICT, the HTML elements which need to be removed together with their content are: style, script, noscript and noframes. At least those are the most common non-proprietary ones.

My strategy was to transform the opening tag into '<!--' and the closing one into '-->', and then get rid of '<!--.*?-->', but there must be a more clever way.

I would love to have a fix for Dieter's CatalogSupport.py, since that module was intended for my first use case: to prevent indexing of irrelevant markup; it is already used by the DocumentLibrary product. My other use case, the display of a text-only version of a web page, also requires removal of all markup and markup-related content.

Is there a reason for any of the HTML conversion modules *not* to incorporate this addition? I am just surprised that no one has reported it as a problem.

Thanks to those who are contributing to this thread!

Ken
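For readers following along, the comment-rewriting strategy described above could look roughly like the sketch below. It is only an illustration in plain Python with the standard `re` module; the names `strip_html` and `DROP_WITH_CONTENT` are made up here, and this is not the code from CatalogSupport.py or Strip-O-Gram.

```python
import re

# Elements to remove together with their content (the list from the post).
DROP_WITH_CONTENT = ("style", "script", "noscript", "noframes")
_NAMES = "|".join(DROP_WITH_CONTENT)

def strip_html(html):
    """Hypothetical helper: return the text of `html` without any markup."""
    # Step 1: opening tags of the listed elements become '<!--'.
    text = re.sub(r"<(?:%s)\b[^>]*>" % _NAMES, "<!--", html, flags=re.I)
    # Step 2: their closing tags become '-->'.
    text = re.sub(r"</(?:%s)\s*>" % _NAMES, "-->", text, flags=re.I)
    # Step 3: drop every comment, including the ones created in steps 1 and 2.
    text = re.sub(r"<!--.*?-->", "", text, flags=re.S)
    # Step 4: drop the remaining tags but keep the text between them.
    text = re.sub(r"<[^>]+>", "", text)
    # Collapse runs of whitespace left behind by the removed markup.
    return " ".join(text.split())
```

One known weak spot of this approach: a script body wrapped in the old comment-hiding idiom (`<script><!-- ... //--></script>`) ends the rewritten comment early and leaves a stray '-->' in the output, which is probably part of why a more clever way seems desirable.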
On Thu, 7 Aug 2003 14:22:26 +0200 (CEST), ken@practical.org asked the Zope mailing list about the following:
Chris Withers wrote:
Are there any other tags where the content should be removed?
AFAICT, the HTML elements which need to be removed together with their content are: style, script, noscript and noframes. At least those are the most common non-proprietary ones.
Wouldn't removing the contents of noscript and noframes be off-target? They exist solely to provide alternative content to what would be rendered by a script or in a frame. It seems practical to me to keep this alternative content, both in a text-only version and in the indexes of a catalog (unless you use them in the non-recommended, HTML 3.2 way, e.g. "you need frames to see this content", that is...) :)

-- Geir Bækholt
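To make the two positions concrete, here is a rough sketch in which the set of elements dropped with their content is a parameter, so noscript and noframes text can be kept or discarded per use case. It assumes a modern Python 3 standard library (`html.parser`); `TextExtractor` and `to_text` are hypothetical names and not part of any Zope product mentioned in this thread.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text, skipping the content of elements listed in `drop`."""

    def __init__(self, drop=("style", "script")):
        super().__init__(convert_charrefs=True)
        self.drop = frozenset(drop)
        self.depth = 0    # greater than 0 while inside a dropped element
        self.parts = []   # text fragments that are kept

    def handle_starttag(self, tag, attrs):
        if tag in self.drop:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.drop and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth:
            self.parts.append(data)

def to_text(html, drop=("style", "script")):
    """Hypothetical helper: markup-free text of `html`."""
    parser = TextExtractor(drop)
    parser.feed(html)
    parser.close()
    # Fragments are joined with a space, so adjacent inline elements end up
    # as separate words; that is usually acceptable for indexing.
    return " ".join(" ".join(parser.parts).split())
```

With the default `drop`, `to_text('<noscript>Enable JS</noscript><script>var a = 1;</script><p>Hi</p>')` gives `'Enable JS Hi'`, keeping the alternative content; passing `drop=("style", "script", "noscript", "noframes")` gives `'Hi'` instead.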
ken@practical.org wrote:
Chris Withers wrote:
Are there any other tags where the content should be removed?
AFAICT, the HTML elements which need to be removed together with their content are: style, script, noscript and noframes. At least those are the most common non-proprietary ones.
Hmmm, I'll have to look at adding something for this in the next release of Strip-O-Gram...

cheers,

Chris
participants (3): Chris Withers, Geir Bækholt, ken@practical.org