A new python object which analyse HTML files and... - Zope - Zope lists


Frederic QUIN

29 May 2000

5:17 p.m.

Hello everybody,

I would like to create a Python object which:

* analyses traditional HTML files
* indexes "IMG" tags and "A" tags
* replaces "IMG" tags with appropriate DTML tags
* replaces "A" tags with appropriate DTML tags
* creates all the resulting objects

Has anyone ever done that? Anyway, if someone with more experience than me has some advice, I'd be glad to hear it.

Thanks,
Frederic


Andrew Wilcox

29 May

4:40 p.m.

New subject: [Zope] A new python object which analyse HTML files and...

At 06:17 PM 5/29/00 +0100, Frederic QUIN wrote:

Hello everybody,

I would like to create a Python object which:

* analyses traditional HTML files
* indexes "IMG" tags and "A" tags
* replaces "IMG" tags with appropriate DTML tags
* replaces "A" tags with appropriate DTML tags
* creates all the resulting objects

Has anyone ever done that? Anyway, if someone with more experience than me has some advice, I'd be glad to hear it.

Thanks Frederic

You might try sgmllib, an SGML parser from the Python library. (There's also an HTML parser called htmllib, derived from sgmllib, but it tries to understand all the common HTML tags such as H1, etc. If you just want to pass through all tags except a couple of specific ones, I found it easier to use sgmllib.)

sgmllib parses its input but doesn't produce output, so the hooks you'd add would be ones to recreate your HTML files from the results of the parsing. Something like the following:

    from sgmllib import SGMLParser

    class MyParser(SGMLParser):

        def __init__(self):
            SGMLParser.__init__(self)
            self._result = ''

        def _write(self, data):
            # Accumulate the regenerated HTML.
            self._result = self._result + data

        def getResult(self):
            return self._result

        def unknown_starttag(self, tag, attributes):
            # Rebuild any tag that has no specific handler.
            r = '<' + tag
            for attribute in attributes:
                (name, value) = attribute
                r = r + ' ' + name + '="' + value + '"'
            r = r + '>'
            self._write(r)

        def unknown_endtag(self, tag):
            self._write('</' + tag + '>')

        def handle_data(self, data):
            self._write(data)

        def handle_charref(self, ref):
            self._write('&#' + ref + ';')

        def handle_entityref(self, ref):
            self._write('&' + ref + ';')

        def handle_comment(self, comment):
            # Re-emit the comment with its markup.
            self._write('<!--' + comment + '-->')

Then you can add specific handlers to the class to do special things with particular tags:

        def do_img(self, attributes):
            ...write out special DTML code here...
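For anyone reading this thread on Python 3, where sgmllib is no longer in the standard library, the same pass-through idea can be sketched with html.parser.HTMLParser. This is only an illustration under stated assumptions: the class name DTMLRewriter, the `<dtml-var ...>` output, and the rule that turns an image path into an object id are all hypothetical choices, not anything taken from the posts above. "A" tags are only indexed here, not rewritten.

```python
from html import escape
from html.parser import HTMLParser
import posixpath


class DTMLRewriter(HTMLParser):
    """Echo HTML back out, indexing A tags and replacing IMG tags."""

    def __init__(self):
        # convert_charrefs=False keeps &amp; and &#38; as separate events,
        # so they can be written back exactly as they appeared.
        super().__init__(convert_charrefs=False)
        self._parts = []
        self.images = []  # every IMG src seen
        self.links = []   # every A href seen

    def get_result(self):
        return ''.join(self._parts)

    @staticmethod
    def _object_id(src):
        # Hypothetical naming rule: "pics/logo.gif" -> "logo_gif".
        return posixpath.basename(src).replace('.', '_')

    @staticmethod
    def _format_attrs(attrs):
        out = ''
        for name, value in attrs:
            if value is None:          # bare attribute, e.g. <td nowrap>
                out += ' ' + name
            else:                      # values arrive unescaped; re-escape
                out += ' %s="%s"' % (name, escape(value, quote=True))
        return out

    def _emit_start(self, tag, attrs, close):
        attr_map = dict(attrs)
        if tag == 'img' and attr_map.get('src'):
            self.images.append(attr_map['src'])
            # Assumed DTML form; adapt to whatever your site needs.
            self._parts.append(
                '<dtml-var %s>' % self._object_id(attr_map['src']))
            return
        if tag == 'a' and attr_map.get('href'):
            self.links.append(attr_map['href'])
        self._parts.append('<%s%s%s' % (tag, self._format_attrs(attrs), close))

    def handle_starttag(self, tag, attrs):
        self._emit_start(tag, attrs, '>')

    def handle_startendtag(self, tag, attrs):
        self._emit_start(tag, attrs, ' />')

    def handle_endtag(self, tag):
        self._parts.append('</%s>' % tag)

    def handle_data(self, data):
        self._parts.append(data)

    def handle_entityref(self, name):
        self._parts.append('&%s;' % name)

    def handle_charref(self, name):
        self._parts.append('&#%s;' % name)

    def handle_comment(self, data):
        self._parts.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self._parts.append('<!%s>' % decl)
```

To use it, create a DTMLRewriter, call feed() with the page text, then close(), and read get_result(); the images and links lists then hold the index of what was found. Note that HTMLParser lowercases tag and attribute names, so the output is not byte-for-byte identical to the input.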


Oleg Broytmann

4:41 p.m.

New subject: [Zope] A new python object which analyse HTML files and...

On Mon, 29 May 2000, Frederic QUIN wrote:

I would like to create a Python object which:

* analyses traditional HTML files
* indexes "IMG" tags and "A" tags
* replaces "IMG" tags with appropriate DTML tags
* replaces "A" tags with appropriate DTML tags
* creates all the resulting objects

Has anyone ever done that? Anyway, if someone with more experience than me has some advice, I'd be glad to hear it.

You could start by adapting http://www.zope.org/Members/itamar/load_site to your needs. A generic HTML parser is already there.

Oleg. (All opinions are mine and not of my employer)
----
Oleg Broytmann    Foundation for Effective Policies    phd@phd.russ.ru
Programmers don't die, they just GOSUB without RETURN.


Itamar Shtull-Trauring

4:59 p.m.

New subject: [Zope] A new python object which analyse HTML files and...

Oleg Broytmann wrote:

...
I would like to create a Python object which:

* analyses traditional HTML files
* indexes "IMG" tags and "A" tags
* replaces "IMG" tags with appropriate DTML tags
* replaces "A" tags with appropriate DTML tags
* creates all the resulting objects

You could start by adapting http://www.zope.org/Members/itamar/load_site to your needs. A generic HTML parser is already there.

The latest version (1.4.0) stops parsing the HTML after the <body> tags, because illegal HTML (e.g. caused by embedded DTML tags) would cause problems. You could turn it back on, though, if we assume there's no DTML. And if you do decide to do it, add converting file.htm to file_html to your list: it was one of the feature requests I don't have time to do, and it would seem to fit in your list.

--
Itamar S.T. itamar@maxnm.com
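The file.htm to file_html conversion mentioned above could look roughly like this. It is a minimal sketch, not load_site's code: the function name zope_id_for is hypothetical, and treating .html the same way as .htm is an assumption.

```python
import posixpath


def zope_id_for(filename):
    """Map an HTML file name to an object id, e.g. file.htm -> file_html."""
    base = posixpath.basename(filename)
    stem, ext = posixpath.splitext(base)
    # Assumption: .htm and .html are both rewritten; other files keep
    # their name unchanged.
    if ext.lower() in ('.htm', '.html'):
        return stem + '_html'
    return base
```

Under this rule an index.html file becomes index_html, which matches the id Zope looks up as a folder's default page.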
