HTML parsers and Wget like function


Grant Morganryuuguu

1 Jul 2004

11:02 a.m.

I am considering Zope/Python for a project and would like some pointers on whether it is a reasonable fit. I need to fetch a URL from the web, parse the HTML, extract some data from the page, rewrite the <a href> tags, and display the result on the website.

I found the HTML parsers in the standard library (http://www.python.org/doc/current/lib/markup.html) and BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/, which is down now but was up a couple of days ago). Does anyone have other suggestions for manipulating HTML in Zope/Python?

For fetching the page from a URL, is there something like Wget (the Unix program) in Zope? I searched the manual but did not see anything.

Thanks, Grant


Dennis Allison

1 Jul

12:11 p.m.

New subject: [Zope] HTML parsers and Wget like function

Python's standard library provides urllib and htmllib. You'll probably want to write your method either as an External Method or as a Product.

On Thu, 1 Jul 2004, Grant Morganryuuguu wrote:

...

_______________________________________________
Zope maillist - Zope@zope.org
http://mail.zope.org/mailman/listinfo/zope
** No cross posts or HTML encoding! **
(Related lists -
http://mail.zope.org/mailman/listinfo/zope-announce
http://mail.zope.org/mailman/listinfo/zope-dev )

Anthony Baxter

1:21 p.m.


On Thu, 01 Jul 2004 20:02:02 +0900, Grant Morganryuuguu <grant@ryuuguu.com> wrote:

...


BeautifulSoup, ClientCookie and ClientForm together make a very, very nice web-scraping package.

Paul Winkler

3:11 p.m.


There's also KebasData, although I don't think it does much in the way of rewriting the retrieved content.

A warning, though: with any of these solutions, you will want to test what happens when the remote resource is unavailable, e.g. very slow to respond, or blocked by a firewall.

For example, I had an external method using urllib2 to retrieve data from another server and embed it in a Zope page. This worked fine until something went wonky on the network and requests to the remote page never yielded any response. The result was that requests to my Zope page would hang forever. urllib2 apparently blocks while waiting for a response, so once there were a few requests to this page, all my worker threads were blocked there. Zope was effectively dead. I used the "Debug spinning zope" recipe to confirm that all the threads were waiting in urllib2.

I changed this to use LocalFS pointing at copies of the data on the hard drive, which are updated periodically via cron and wget. A quick hack, but it fixed the symptom.

This was all Zope 2.6.2 / Python 2.1.3. In Python 2.3 you can set timeouts via socket.setdefaulttimeout(), and this should (I hope) affect urllib2, but I have not tested it.

On Thu, Jul 01, 2004 at 11:21:07PM +1000, Anthony Baxter wrote:

...


-- Paul Winkler http://www.slinkp.com
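
The timeout guard Paul describes can be sketched roughly as follows. This is a minimal illustration in current Python 3, where the urllib2 API from the thread lives in urllib.request; it is not the code anyone in the thread ran. The function name fetch_or_fallback and its opener parameter are hypothetical, and it passes an explicit per-call timeout to urlopen() instead of calling socket.setdefaulttimeout().

```python
import urllib.request  # Python 3 home of the urllib2 API used in the thread


def fetch_or_fallback(url, timeout=5.0, fallback="",
                      opener=urllib.request.urlopen):
    """Return the page at `url` as text, or `fallback` if the fetch fails.

    `opener` is a hypothetical injection point so the guard can be
    exercised without a network; by default it is urllib's urlopen().
    """
    try:
        # The timeout bounds each blocking socket operation, so a remote
        # server that never answers cannot tie up a worker thread forever.
        response = opener(url, timeout=timeout)
        try:
            return response.read().decode("utf-8", "replace")
        finally:
            response.close()
    except OSError:
        # URLError and socket timeouts are both OSError subclasses in
        # Python 3; serve the fallback (e.g. a cached copy) instead.
        return fallback
```

Note that the timeout applies to individual socket operations, not to the total download time, so a server that trickles data slowly can still hold a thread for longer than `timeout` seconds. Serving a locally cached copy, as in the cron and wget workaround above, avoids that case entirely.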

Grant Morganryuuguu

2 Jul

1:49 a.m.


Thanks for all the fast replies. I am starting with urllib and urllib2 and will put some timeout guards in. I started with a script straight from the examples:

    import urllib2
    f = urllib2.urlopen('http://www.python.org/')
    print f.read(100)

This works fine when run from the command line directly with python, but when I test the script in Zope I get:

    Error Type: Unauthorized
    Error Value: You are [not] allowed to run 'urlopen' in this context

I am logged in as the manager, and Zope is running under the same user that can run the script in python. Are there some permissions I have to add to the manager to run arbitrary Python libraries?

Thanks, Grant

Grant Morganryuuguu

2:07 a.m.


Found the answer: I need to use an External Method to import urllib. Sorry for the noise on the list.

Grant
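
For readers who land here with the same Unauthorized error, an External Method version of Grant's script might look roughly like the sketch below. The file name, function name, and parameters are hypothetical, and the code is written for current Python 3 (urllib.request) instead of the urllib2 of the original thread. The module would live in the Zope instance's Extensions directory and be registered through an External Method object, where it runs as ordinary unrestricted Python.

```python
# Hypothetical file: Extensions/fetch_head.py in the Zope instance home.
import urllib.request  # urllib2 in the Python 2 of the original thread


def fetch_head(self, url="http://www.python.org/", nbytes=100, timeout=10):
    """Return the first `nbytes` bytes fetched from `url`.

    `self` is the object the External Method is called on; it is unused
    here. The timeout keeps a dead remote server from blocking the
    calling thread indefinitely.
    """
    response = urllib.request.urlopen(url, timeout=timeout)
    try:
        return response.read(nbytes)
    finally:
        response.close()
```

The same import fails inside a through-the-web Python Script because those scripts run in Zope's restricted environment, regardless of which user is logged in.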


Bakhtiar A Hamid

1 Jul

3:36 p.m.


On Thu, 01 Jul 2004 20:02:02 +0900, Grant Morganryuuguu wrote

...


There's KebasData (http://www.zope.org/Members/kedai/KebasData). It can scrape pages and parse for whatever you need, but the regex may be a bit of a head spinner, so a regex tool would help (KDE has one, and there's one for bash, iirc).

Rewriting URLs can be done in the render_method. It's a bit tricky, since the original can change at any time. It's not great code, but it works for me.

Cookies are not there yet. Neither is using Python's own socket timeout support: KebasData was written a while back, when there was no timeout support in the Python core, so I used timeoutsocket to ..er.. timeout .. :P

Soon, methinks.
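
As an alternative to regex-based rewriting, the <a href> rewrite Grant asked about can be sketched with the standard library parser. This is only an illustration under stated assumptions: it uses html.parser from current Python 3 (htmllib, mentioned earlier in the thread, no longer exists there), the names HrefRewriter and rewrite_hrefs are hypothetical, and "rewriting" is assumed to mean resolving each link against the URL the page was fetched from. It is not how KebasData works.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefRewriter(HTMLParser):
    """Re-emit the parsed HTML, making every <a href> absolute."""

    def __init__(self, base_url):
        # Keep entity and character references as written in the source.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out = []

    def _emit_tag(self, tag, attrs, closing=""):
        parts = [tag]
        for name, value in attrs:
            if tag == "a" and name == "href" and value is not None:
                value = urljoin(self.base_url, value)
            if value is None:
                parts.append(name)  # bare attribute such as "disabled"
            else:
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (" ".join(parts), closing))

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " /")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def rewrite_hrefs(html, base_url):
    """Return `html` with each <a href> resolved against `base_url`."""
    parser = HrefRewriter(base_url)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    page = '<p>See <a href="/doc/lib.html">the docs</a></p>'
    print(rewrite_hrefs(page, "http://www.python.org/"))
```

Because the output is rebuilt from parser events, whitespace and quoting inside tags are normalized and not preserved byte for byte. As noted above, the remote page can change at any time, so whatever extraction logic sits on top of this should tolerate missing elements.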


-- NSTP (M) BHD
