HTML parsers and Wget like function
I am considering Zope/Python for a project and would like some pointers to see whether it is a reasonable fit. I need to get a URL from the web, parse the HTML, extract some data from the page, rewrite the <a href> tags, and display the result on the website. I found the HTML parser in the library (http://www.python.org/doc/current/lib/markup.html) and BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/, which is down now but was up a couple of days ago). Does anyone have other suggestions for manipulating HTML in Zope/Python? For getting the page from a URL, is there something like Wget (the Unix program) in Zope? I searched the manual but did not see anything.
Thanks, Grant
Python's library provides urllib and htmllib. You'll probably want to write your method as either an External Method or a Product.
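A minimal sketch of the parse-and-rewrite step, assuming Python 3, where htmllib no longer exists and html.parser.HTMLParser is the standard-library parser. The names LinkRewriter and rewrite_links are made up for illustration. The sketch only makes <a href> values absolute against a base URL and re-emits everything else as parsed; it is not a full HTML serializer.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkRewriter(HTMLParser):
    """Re-emit HTML, making every <a href> absolute (illustrative only)."""

    def __init__(self, base_url):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out = []

    def _render(self, tag, attrs, closing=""):
        parts = [tag]
        for name, value in attrs:
            if tag == "a" and name == "href" and value is not None:
                value = urljoin(self.base_url, value)
            if value is None:
                parts.append(name)  # bare attribute such as "disabled"
            else:
                # The parser unescapes attribute values, so escape them again.
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (" ".join(parts), closing))

    def handle_starttag(self, tag, attrs):
        self._render(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._render(tag, attrs, " /")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def rewrite_links(html_text, base_url):
    """Return html_text with relative <a href> values resolved against base_url."""
    parser = LinkRewriter(base_url)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Calling rewrite_links(page_text, 'http://www.python.org/') on fetched page text would turn href="/doc/" into href="http://www.python.org/doc/". Pointing the links somewhere else (for example back through your own site) would mean changing the urljoin line.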
_______________________________________________
Zope maillist - Zope@zope.org
http://mail.zope.org/mailman/listinfo/zope
** No cross posts or HTML encoding! **
(Related lists -
 http://mail.zope.org/mailman/listinfo/zope-announce
 http://mail.zope.org/mailman/listinfo/zope-dev )
On Thu, 01 Jul 2004 20:02:02 +0900, Grant Morganryuuguu <grant@ryuuguu.com> wrote:
I am considering Zope/python for a project and would like to get some pointers to see if this is a reasonable fit. I need to get a URL from the web, parse the HTML, extract some data from the page, rewrite the <a href> tags and display it on the website. I found the HTML parser in library http://www.python.org/doc/current/lib/markup.html and http://www.crummy.com/software/BeautifulSoup/ (which is down now but was up a couple of days ago)
BeautifulSoup, ClientCookie and ClientForm together make a very, very nice web-scraping package.
There's also KebasData, although I don't think it does much in the way of rewriting the retrieved content.

A warning, though: with any of these solutions, you will want to test what happens when the remote resource is unavailable, e.g. very slow to respond or blocked by a firewall.

For example, I had an external method using urllib2 to retrieve data from another server and embed it in a Zope page. This worked fine until something went wonky on the network and requests to the remote page never yielded any response. The result was that requests to my Zope page hung forever. urllib2 apparently blocks while waiting for a response, so once a few requests had come in for this page, all my worker threads were blocked there and Zope was effectively dead. I used the "Debug spinning zope" recipe to confirm that all the threads were waiting in urllib2.

I changed this to use LocalFS pointing at copies of the data on the hard drive, which are updated periodically via cron and wget. A quick hack, but it fixed the symptom.

This was all Zope 2.6.2 / Python 2.1.3. In Python 2.3 you can set timeouts via socket.setdefaulttimeout(), and this should (I hope) affect urllib2, but I have not tested it.

On Thu, Jul 01, 2004 at 11:21:07PM +1000, Anthony Baxter wrote:
BeautifulSoup, ClientCookie and ClientForm together make a very very nice webscraping package.
-- Paul Winkler http://www.slinkp.com
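A rough sketch of guarding a fetch against a hung remote server, assuming Python 3, where urllib2.urlopen lives on as urllib.request.urlopen and takes a timeout argument. The function name fetch_or_fallback and its opener parameter are invented for illustration; the fallback value stands in for something like a locally cached copy of the data.

```python
import urllib.request


def fetch_or_fallback(url, fallback=b"", timeout=5.0,
                      opener=urllib.request.urlopen):
    """Return the body at url, or fallback if the fetch fails or times out.

    opener is a hypothetical hook so the function can be exercised
    without touching the network; by default it is urlopen.
    """
    try:
        response = opener(url, timeout=timeout)
        try:
            return response.read()
        finally:
            response.close()
    except OSError:
        # urllib.error.URLError and socket.timeout are both OSError
        # subclasses in Python 3, so this covers refused connections,
        # DNS failures and timeouts.
        return fallback
```

Note that the timeout applies to each blocking socket operation rather than to the request as a whole, so a server that trickles data slowly can still hold a thread for longer than the nominal timeout.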
Thanks for all the fast replies. I am starting with urllib and urllib2 and will put some timeout guards in. I started with a script straight from the examples:

    import urllib2
    f = urllib2.urlopen('http://www.python.org/')
    print f.read(100)

which works fine when run from the command line directly with python, but when I test the script in Zope I get:

    Error Type: Unauthorized
    Error Value: You are not allowed to run 'urlopen' in this context

I am logged in as the manager. Zope is running under the same user that can run the script in python.

Are there some permissions I have to add to the manager to run arbitrary Python libraries?

Thanks, Grant
Found the answer: I need to use an external method to import urllib. Sorry for the noise on the list. Grant
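For anyone hitting the same Unauthorized error: Script (Python) objects run in Zope's restricted environment, while an External Method is ordinary filesystem code with no such restriction. A minimal sketch of such a module, assuming Python 3 (urllib.request in place of urllib2); the file name fetch_page.py, the function name and the defaults are made up for illustration.

```python
# Hypothetical file in the instance's Extensions directory: fetch_page.py
import urllib.request


def fetch_page(self, url="http://www.python.org/", nbytes=100, timeout=5.0):
    """Return the first nbytes bytes of the page at url.

    self is the conventional first argument of an External Method;
    it is unused in this sketch.
    """
    response = urllib.request.urlopen(url, timeout=timeout)
    try:
        return response.read(nbytes)
    finally:
        response.close()
```

The module would then be registered by adding an External Method object in the ZMI with module name fetch_page and function name fetch_page.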
There's KebasData (http://www.zope.org/Members/kedai/KebasData). It can scrape pages and parse for whatever you need, but the regex may be a bit of a head spinner, so a regex tool would help (KDE has one, and there's one for bash, IIRC).

Rewriting URLs can be done in the render_method. That's a bit tricky, since the original can change at any time. It's not great code, but it works for me.

Cookies are not there yet, and neither is use of Python's own socket timeouts. KebasData was written a while back, when there was no timeout support in the Python core, so I used timeoutsocket to ..er.. timeout .. :P Soon, methinks.
-- NSTP (M) BHD
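To give a feel for the regex style of scraping, here is a small sketch using Python's standard re module. It does not use KebasData or its pattern syntax; the pattern, the name TITLE_RE and the function extract_title are made up for illustration, and a regex like this is only reasonable for simple, predictable markup.

```python
import re

# Non-greedy match between <title> and </title>, ignoring case and newlines.
TITLE_RE = re.compile(r"<title[^>]*>(?P<title>.*?)</title>",
                      re.IGNORECASE | re.DOTALL)


def extract_title(html_text):
    """Return the page title with whitespace collapsed, or None if absent."""
    match = TITLE_RE.search(html_text)
    if match is None:
        return None
    return " ".join(match.group("title").split())
```

The named group keeps the pattern readable when more fields are added, and returning None makes the "page layout changed" case easy to detect.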
participants (5)
- Anthony Baxter
- Bakhtiar A Hamid
- Dennis Allison
- Grant Morganryuuguu
- Paul Winkler