A new python object which analyse HTML files and...
Hello everybody,

I would like to create a Python object which:

* analyses traditional HTML files
* indexes "IMG" tags and "A" tags
* replaces "IMG" tags with appropriate DTML tags
* replaces "A" tags with appropriate DTML tags
* creates all the resulting objects

Has anyone ever done that? In any case, if someone with more experience than me has some advice, I'd welcome it.

Thanks,
Frederic
At 06:17 PM 5/29/00 +0100, Frederic QUIN wrote:
Hello everybody,

I would like to create a Python object which:

* analyses traditional HTML files
* indexes "IMG" tags and "A" tags
* replaces "IMG" tags with appropriate DTML tags
* replaces "A" tags with appropriate DTML tags
* creates all the resulting objects

Has anyone ever done that? In any case, if someone with more experience than me has some advice, I'd welcome it.

Thanks,
Frederic
You might try sgmllib, an SGML parser from the Python standard library. (There's also an HTML parser called htmllib, derived from sgmllib, but it tries to understand all the common HTML tags such as H1. If you just want to pass through all tags except a couple of specific ones, I found it easier to use sgmllib.)

sgmllib parses the input but doesn't produce any output, so the hooks you'd add are ones that recreate your HTML files from the results of the parsing. Something like the following:

    from sgmllib import SGMLParser

    class MyParser(SGMLParser):

        def __init__(self):
            SGMLParser.__init__(self)
            self._result = ''

        def _write(self, data):
            # Accumulate the regenerated HTML.
            self._result = self._result + data

        def getResult(self):
            return self._result

        def unknown_starttag(self, tag, attributes):
            # Rebuild any start tag that has no specific handler.
            r = '<' + tag
            for attribute in attributes:
                (name, value) = attribute
                r = r + ' ' + name + '="' + value + '"'
            r = r + '>'
            self._write(r)

        def unknown_endtag(self, tag):
            self._write('</' + tag + '>')

        def handle_data(self, data):
            self._write(data)

        def handle_charref(self, ref):
            # Numeric character reference, e.g. &#233;
            self._write('&#' + ref + ';')

        def handle_entityref(self, ref):
            # Named entity reference, e.g. &amp;
            self._write('&' + ref + ';')

        def handle_comment(self, comment):
            self._write('<!--' + comment + '-->')

Then you can add specific handlers to do special things with particular tags:

        def do_img(self, attributes):
            ...write out special DTML code here...
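For readers on Python 3, where sgmllib is no longer part of the standard library, the same pass-through idea can be sketched with html.parser.HTMLParser instead. This is a minimal sketch, not the code from this thread: the class name DTMLRewriter, the helper zope_id, and the `<dtml-var ...>` output format are assumptions made for illustration. "A" tags are only indexed here, since the thread does not say what DTML should replace them.

```python
from html.parser import HTMLParser
import posixpath


def zope_id(src):
    # Hypothetical rule: use the file name with dots replaced by underscores.
    return posixpath.basename(src).replace('.', '_')


class DTMLRewriter(HTMLParser):
    """Pass tags through unchanged, except IMG, which becomes a DTML tag."""

    def __init__(self):
        # convert_charrefs=False keeps &amp; and &#38; as references so the
        # handle_entityref / handle_charref hooks can write them back out.
        super().__init__(convert_charrefs=False)
        self._parts = []
        self.images = []  # src values of the IMG tags seen
        self.links = []   # href values of the A tags seen

    def get_result(self):
        return ''.join(self._parts)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'img' and attrs.get('src'):
            self.images.append(attrs['src'])
            # Assumed output format; adapt to the DTML you actually need.
            self._parts.append('<dtml-var %s>' % zope_id(attrs['src']))
            return
        if tag == 'a' and attrs.get('href'):
            self.links.append(attrs['href'])
        # Re-emit the start tag exactly as it appeared in the source.
        self._parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit once, with no end tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag != 'img':
            self._parts.append('</' + tag + '>')

    def handle_data(self, data):
        self._parts.append(data)

    def handle_charref(self, name):
        self._parts.append('&#' + name + ';')

    def handle_entityref(self, name):
        self._parts.append('&' + name + ';')

    def handle_comment(self, data):
        self._parts.append('<!--' + data + '-->')

    def handle_decl(self, decl):
        self._parts.append('<!' + decl + '>')


parser = DTMLRewriter()
parser.feed('<p>Hi &amp; bye <img src="pics/logo.gif" alt="Logo">'
            '<a href="next.htm">next</a></p>')
parser.close()
print(parser.get_result())
print(parser.images, parser.links)
```

After parsing, the images and links lists hold the indexed IMG and A targets, which could then drive the creation of the corresponding objects.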
On Mon, 29 May 2000, Frederic QUIN wrote:
I would like to create a Python object which:

* analyses traditional HTML files
* indexes "IMG" tags and "A" tags
* replaces "IMG" tags with appropriate DTML tags
* replaces "A" tags with appropriate DTML tags
* creates all the resulting objects

Has anyone ever done that? In any case, if someone with more experience than me has some advice, I'd welcome it.
You could start by adapting http://www.zope.org/Members/itamar/load_site to your needs. A generic HTML parser is already there.

Oleg. (All opinions are mine and not of my employer)
----
Oleg Broytmann    Foundation for Effective Policies    phd@phd.russ.ru
Programmers don't die, they just GOSUB without RETURN.
Oleg Broytmann wrote:
I would like to create a Python object which:

* analyses traditional HTML files
* indexes "IMG" tags and "A" tags
* replaces "IMG" tags with appropriate DTML tags
* replaces "A" tags with appropriate DTML tags
* creates all the resulting objects

You could start by adapting http://www.zope.org/Members/itamar/load_site to your needs. A generic HTML parser is already there.
The latest version (1.4.0) stops parsing the HTML after the <body> tags, because illegal HTML (e.g. HTML with embedded DTML tags) would cause problems. You could turn parsing back on, though, if we assume there's no DTML in the files.

And if you do decide to do it, add converting file.htm to file_html to your list. It was one of the feature requests I don't have time to do, and it would seem to fit in your list.

--
Itamar S.T.    itamar@maxnm.com
participants (4)

- Andrew Wilcox
- Frederic QUIN
- Itamar Shtull-Trauring
- Oleg Broytmann