[Zope-CMF] link checking a CMF site
Tres Seaver
tseaver@zope.com
Sat, 2 Feb 2002 14:43:54 -0500 (EST)
On Thu, 31 Jan 2002, Ian Clatworthy wrote:
> We have a CMF-based portal with lots of content (both HTML and
> Structured Text) that I want to link check as soon as possible.
> The 1.2 documentation mentions a link checker as something to
> expect post 1.2. Has this development started yet? If so, is
> the code 1.2-specific or can I use it on a 1.1-based site?
I spent a fair amount of time looking for an existing Python
link-ripper (the first part of a checker) before I found that
it was so simple in Python that nobody had packaged it.
#! /usr/bin/python

import re
import urlparse

class LinkRipper:
    """
        Package utilities for ripping HTML and STX links from a
        string or a file.
    """
    href = re.compile( r'href="(.*?)"', re.IGNORECASE )
    a_href = re.compile( r'<a.*?href="(.*?)"', re.IGNORECASE )
    img_src = re.compile( r'<img.*?src="(.*?)"', re.IGNORECASE )
    link_href = re.compile( r'<link.*?href="(.*?)"', re.IGNORECASE )

    def _rip( self, text, pattern ):
        if type( text ) != type( '' ): # then assume a file
            text = text.read()
        return pattern.findall( text )

    def rip_href( self, text ):
        """
            Extract all 'href=""' targets from text.
        """
        return self._rip( text, self.href )

    def rip_a_href( self, text ):
        """
            Extract all '<a href=""' targets from text.
        """
        return self._rip( text, self.a_href )

    def rip_img_src( self, text ):
        """
            Extract all '<img src=""' targets from text.
        """
        return self._rip( text, self.img_src )

    def rip_link_href( self, text ):
        """
            Extract all '<link href=""' targets from text.
        """
        return self._rip( text, self.link_href )

_ripper = None

def rip_links( text ):
    global _ripper
    if _ripper is None:
        _ripper = LinkRipper()
    links = _ripper.rip_a_href( text )
    parsed = map( urlparse.urlparse, links )
    return map( urlparse.urlunparse, parsed )

if __name__ == '__main__':
    import sys, urlparse
    _ripper = LinkRipper()
    for link in _ripper.rip_a_href( sys.stdin ):
        print urlparse.urlparse( link )
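
For anyone reading this on Python 3, where `urlparse` lives in
`urllib.parse` and `print` is a function, a minimal sketch of the same
idea might look like the following. The pattern is the `a_href` one from
the class above; the `absolute_links` helper and its `base_url` argument
are assumptions added for illustration, since a checker will usually need
relative links resolved before it can fetch them.

```python
import re
from urllib.parse import urljoin, urlparse

# Same pattern as LinkRipper.a_href above. Like the original, it only
# matches double-quoted attribute values that sit on the same line as
# the opening '<a'.
A_HREF = re.compile(r'<a.*?href="(.*?)"', re.IGNORECASE)


def rip_a_href(text):
    """Return every '<a href="...">' target found in text.

    text may be a str or a file-like object with a read() method.
    """
    if not isinstance(text, str):  # then assume a file-like object
        text = text.read()
    return A_HREF.findall(text)


def absolute_links(text, base_url):
    """Resolve each ripped link against base_url (hypothetical helper)."""
    return [urljoin(base_url, link) for link in rip_a_href(text)]


if __name__ == '__main__':
    sample = ('<p><a href="/docs">Docs</a> '
              '<A HREF="http://www.zope.org/">Zope</A></p>')
    for link in absolute_links(sample, 'http://example.com/portal/'):
        print(urlparse(link))
```

A regular expression is kept here only to mirror the script above; it
will miss single-quoted or unquoted attributes, so a real checker may
prefer an HTML parser.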
> If not, is there some generic Zope link checking code I can
> leverage? Likewise, if anyone has any design advice on the
> best way to do this or things to watch out for, I'd appreciate
> it. If I need to write something, it would be good if it was
> general enough for others to use.
I hope that is enough for a start; I am unlikely to be able to
push it very far myself in the near term, but would be glad to
add a contributed 'portal_links' tool to the core.
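
As a rough idea of the second half (actually checking what was ripped),
a Python 3 sketch might look like this. The names `fetch_status`,
`check_links`, and `broken` are hypothetical, the use of HEAD requests is
an assumption (some servers reject them), and nothing here is
CMF-specific or an existing 'portal_links' API.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status for url, or an error string on failure."""
    request = urllib.request.Request(url, method='HEAD')
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code   # the server answered with an error, e.g. 404
    except (OSError, ValueError) as exc:
        return str(exc)   # DNS failure, refused connection, timeout, bad URL


def check_links(links, fetch=fetch_status):
    """Map each distinct link to whatever fetch returns for it."""
    return {link: fetch(link) for link in set(links)}


def broken(results):
    """Return, sorted, the links whose result is not a 2xx/3xx status."""
    return sorted(
        link for link, status in results.items()
        if not (isinstance(status, int) and 200 <= status < 400)
    )
```

Passing `fetch` as a parameter is a design choice for this sketch: it
lets the checking logic be exercised with a stub instead of the network,
and would let a site-internal checker resolve local links by traversal
rather than over HTTP.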
Tres.
--
===============================================================
Tres Seaver tseaver@zope.com
Zope Corporation "Zope Dealers" http://www.zope.org