Hi all,

I want to display a text-only version of a web page captured with the DocumentLibrary product (no longer supported). This product uses the 'Catalog Support' HTML converter available here:

http://www.dieter.handshake.de/pyprojects/zope/CatalogSupport.html

However, this converter, like the others I have tried (Strip-o-Gram, as well as an external method based on striphtml.py), seems unable to remove the content of <style></style> or <script></script> tags. So I get plenty of hits with a search for 'children' or 'window' or 'background'...

Has anyone else confronted this problem?

I have also made feeble attempts such as the following Script (Python), without success:

    import string
    import re
    text = re.sub('<STYLE.*?>.*?</STYLE>', '', data)
    text = re.sub('<STYLE.*?>.*?</STYLE>', '', text)
    text = re.sub('<style.*?>.*?</style>', '', text)
    text = re.sub('<script.*?>.*?</script>', '', text)
    text = re.sub('<!--.*?-->', '', text)
    text = re.sub('<.*?>', ' ', text)
    return text

I sure would appreciate some help on this...

Thanks,
Ken
You didn't mention what problem you're having... but it would appear that case-sensitive matching is one of them.

re.sub (sadly) doesn't support flags like I (ignore case) or S (dot matches newline character). I have no idea why not.

However, not all is lost. re.compile supports flags and will give you an object that has a sub method. Go figure.

So you can try something like:

-----
import re
style = re.compile('<style.*?>.*?</style>', re.I | re.S)
script = re.compile('<script.*?>.*?</script>', re.I | re.S)
tags = re.compile('<.*?>', re.S)
return tags.sub('', script.sub('', style.sub('', text)))
-----

Note that in this case there is no need to check for comments separately... they'll be matched by the tags pattern.

Once that works, you may want to do some other things like replace <br> with line breaks, etc. But this should be enough to make progress with.

HTH,

Dylan

On Tue, 2003-08-05 at 05:26, ken@practical.org wrote:
Hi all,
I want to display a text-only version of a web page captured with the DocumentLibrary product (no longer supported).
This product uses the 'Catalog Support' HTML converter available here:
http://www.dieter.handshake.de/pyprojects/zope/CatalogSupport.html
However this converter, like the others I have tried (Strip-o-Gram, as well as an external method based on striphtml.py), seem unable to remove the content of <style></style> or <script></script> tags. So I get plenty of hits with a search for 'children' or 'window' or 'background'...
Has anyone else confronted this problem?
I have also made feeble attempts such as the following Script (Python), without success:
    import string
    import re
    text = re.sub('<STYLE.*?>.*?</STYLE>', '', data)
    text = re.sub('<STYLE.*?>.*?</STYLE>', '', text)
    text = re.sub('<style.*?>.*?</style>', '', text)
    text = re.sub('<script.*?>.*?</script>', '', text)
    text = re.sub('<!--.*?-->', '', text)
    text = re.sub('<.*?>', ' ', text)
    return text
I sure would appreciate some help on this...
Thanks,
Ken
_______________________________________________ Zope maillist - Zope@zope.org http://mail.zope.org/mailman/listinfo/zope ** No cross posts or HTML encoding! ** (Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope-dev )
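For anyone reading this later, here is a minimal standalone sketch of the compiled-pattern approach from Dylan's reply, written as a plain function rather than a Script (Python) body. The function name strip_html and the sample string are assumptions made up for illustration. On Python 2.7 and 3.1 onward, re.sub also accepts a flags keyword argument, but precompiled patterns work on old and new versions alike.

```python
import re

# Patterns from the reply above. re.I ignores case and re.S lets "." match
# newlines, so multi-line <STYLE> and <script> blocks are caught too.
STYLE_RE = re.compile(r'<style.*?>.*?</style>', re.I | re.S)
SCRIPT_RE = re.compile(r'<script.*?>.*?</script>', re.I | re.S)
TAG_RE = re.compile(r'<.*?>', re.S)


def strip_html(text):
    """Drop style/script blocks (tags and contents), then any remaining tags."""
    text = STYLE_RE.sub('', text)
    text = SCRIPT_RE.sub('', text)
    return TAG_RE.sub('', text)


# Hypothetical sample input, not taken from the original page.
sample = (
    "<P>Hello <B>world</B></P>\n"
    "<STYLE type='text/css'>\nbody { background: red }\n</STYLE>\n"
    "<script>window.alert('hi')</script>"
)
print(strip_html(sample))
```

The style and script substitutions have to run before the generic tag pattern; otherwise only the tags go and words such as 'background' and 'window' stay in the text.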
On Tue, Aug 05, 2003 at 06:39:11AM -0700, Dylan Reinhardt wrote:
So you can try something like:
-----
import re
style = re.compile('<style.*?>.*?</style>', re.I | re.S)
script = re.compile('<script.*?>.*?</script>', re.I | re.S)
tags = re.compile('<.*?>', re.S)
return tags.sub('', script.sub('', style.sub('', text)))
hmm... doesn't the tags pattern make the other two redundant?

One problem with this approach is that it removes any XML or SGML markup from inside a <pre> block, which may not be what you want.

Processing HTML with regular expressions is notoriously frustrating. I'd look into using htmllib from the standard library, or fixing Strip-O-Gram to do what you want.

--
Paul Winkler
http://www.slinkp.com
Look! Up in the sky! It's THE UNWORTHY SEEKER!
(random hero from isometric.spaceninja.com)
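As a rough illustration of the parser-based route Paul suggests: htmllib no longer exists in Python 3, so this sketch swaps in html.parser.HTMLParser from the current standard library instead. The class name TextExtractor and the helper html_to_text are hypothetical names chosen for this example, and the code is not how Strip-o-Gram or the Catalog Support converter work.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping everything inside <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <STYLE> arrives as "style".
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Hypothetical sample input for illustration.
sample = (
    "<p>Hi <b>there</b></p>"
    "<STYLE>b { color: red }</STYLE>"
    "<script>var s = '<b>';</script><!-- note -->"
)
print(html_to_text(sample))
```

Comments are dropped because the default handle_comment does nothing, and markup-like text inside a script block is treated as script data rather than as tags.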
On Tue, 2003-08-05 at 07:49, Paul Winkler wrote:
On Tue, Aug 05, 2003 at 06:39:11AM -0700, Dylan Reinhardt wrote:
So you can try something like:
-----
import re
style = re.compile('<style.*?>.*?</style>', re.I | re.S)
script = re.compile('<script.*?>.*?</script>', re.I | re.S)
tags = re.compile('<.*?>', re.S)
return tags.sub('', script.sub('', style.sub('', text)))
hmm... doesn't the tags pattern make the other two redundant?
Not that I can see. You may be reading the last expression left-to-right, but it's *evaluated* inside (right) to outside (left).

In some cases (<B>, for example) you want to remove only the tags. In other cases (<script>, <style>) you want to remove the enclosed contents too.

I'm sure it's possible to come up with a single, highly-clever regex that does it all, but that's not what I was trying to demonstrate.

Dylan
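A small sketch of the difference described above, using a made-up input string: the tags pattern on its own strips the markup but leaves the script body behind, while running the script pattern first (the innermost call) removes the body as well.

```python
import re

script = re.compile(r'<script.*?>.*?</script>', re.I | re.S)
tags = re.compile(r'<.*?>', re.S)

# Hypothetical input for illustration.
html = "<b>bold</b><script>window.open()</script>"

# Tags pattern alone: the markup goes, the script body stays.
only_tags = tags.sub('', html)
print(only_tags)  # boldwindow.open()

# Nested calls run inside-out: script.sub first, then tags.sub.
both = tags.sub('', script.sub('', html))
print(both)  # bold
```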
On Tue, Aug 05, 2003 at 08:02:28AM -0700, Dylan Reinhardt wrote:
On Tue, 2003-08-05 at 07:49, Paul Winkler wrote:
hmm... doesn't the tags pattern make the other two redundant?
Not that I can see. You may be reading the last expression left-to-right, but it's *evaluated* inside (right) to outside (left).
no, i was just not really paying attention to what the different expressions would match. sorry!
In some cases (<B>, for example) you want to remove only the tags. In other cases (<script>, <style>) you want to remove the enclosed contents too.
right, i don't know how this simple fact escaped me.
I'm sure it's possible to come up with a single, highly-clever regex that does it all,
god, I hope not :-\

--
Paul Winkler
http://www.slinkp.com
Look! Up in the sky! It's BLITHE SHOOPSTERSTOOPER ESTATE!
(random hero from isometric.spaceninja.com)
ken@practical.org wrote:
However this converter, like the others I have tried (Strip-o-Gram, as well as an external method based on striphtml.py), seem unable to remove the content of <style></style> or <script></script> tags. So I get plenty of hits with a search for 'children' or 'window' or 'background'...
I beg to differ:

Python 2.2.2 (#37, Oct 14 2002, 17:02:34) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from stripogram import html2text
>>> html = "seem unable to remove the content of <style>stuff</style> or <script>more stuff</script>"
>>> html2text(html)
'seem unable to remove the content of stuff or more stuff'

How are you using stripogram?

cheers,

Chris
-----Original Message-----
From: zope-admin@zope.org [mailto:zope-admin@zope.org] On behalf of Chris Withers
Sent: 6 August 2003 13:30
ken@practical.org wrote:
However this converter, like the others I have tried (Strip-o-Gram, as well as an external method based on striphtml.py), seem unable to remove the content of <style></style> or <script></script> tags. So I get plenty of hits with a search for 'children' or 'window' or 'background'...
I beg to differ:
Python 2.2.2 (#37, Oct 14 2002, 17:02:34) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from stripogram import html2text
>>> html = "seem unable to remove the content of <style>stuff</style> or <script>more stuff</script>"
>>> html2text(html)
'seem unable to remove the content of stuff or more stuff'
How are you using stripogram?
Your own example shows that stripogram does NOT remove the content between <style>...</style> and <script>...</script>.

What Ken wants is for the result (of your example) to look like this:

'seem unable to remove the content of or'

- Carsten
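To make the desired result concrete, here is a hedged sketch that produces exactly that string from the example input using the regex approach discussed earlier in the thread. The function name strip_with_content and its content_tags parameter are assumptions for illustration; this is not stripogram's API. Whitespace is collapsed at the end, since removing the blocks otherwise leaves a double space and a trailing space.

```python
import re


def strip_with_content(html, content_tags=("style", "script")):
    """Remove listed elements with their contents, then all other tags."""
    for name in content_tags:
        block = re.compile(
            r'<%s\b.*?>.*?</%s\s*>' % (re.escape(name), re.escape(name)),
            re.I | re.S,
        )
        html = block.sub('', html)
    html = re.compile(r'<.*?>', re.S).sub('', html)
    # Collapse the runs of whitespace left behind by the removed blocks.
    return ' '.join(html.split())


example = ("seem unable to remove the content of <style>stuff</style> "
           "or <script>more stuff</script>")
print(strip_with_content(example))
```

Passing a different tuple as content_tags is one way to experiment with other elements whose contents should be dropped.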
Carsten Gehling wrote:
Your own example shows that stripogram does NOT remove the content between <style>...</style> and <script>...</script>.
Ah, the key there is content ;-)

Hmmm... have to have a think about that one...

Are there any other tags where the content should be removed?

cheers,

Chris
participants (5)
- Carsten Gehling
- Chris Withers
- Dylan Reinhardt
- ken@practical.org
- Paul Winkler