At 06:17 PM 5/29/00 +0100, Frederic QUIN wrote:
Hello everybody,
I would like to create a Python object which:
* analyses traditional HTML files
* indexes "IMG" tags and "A" tags
* replaces "IMG" tags with appropriate DTML tags
* replaces "A" tags with appropriate DTML tags
* creates all the resulting objects
Has anyone ever done that? Anyway, if someone with more experience than me has some advice, I'd be glad to hear it...
Thanks,
Frederic
You might try sgmllib, an SGML parser from the Python library. (There's also an HTML parser called htmllib derived from sgmllib, but it tries to understand all the common HTML tags such as H1, etc. If you just want to pass through all tags except a couple of specific ones, I found it easier to use sgmllib.)

The sgmllib parser calls handler hooks as it parses but doesn't produce any output itself, so the hooks you'd add would be ones to recreate your HTML files from the results of the parsing. Something like the following:

    from sgmllib import SGMLParser

    class MyParser(SGMLParser):

        def __init__(self):
            SGMLParser.__init__(self)
            self._result = ''

        def _write(self, data):
            self._result = self._result + data

        def getResult(self):
            return self._result

        # Called for every tag that has no specific handler:
        # write it back out unchanged.
        def unknown_starttag(self, tag, attributes):
            r = '<' + tag
            for attribute in attributes:
                (name, value) = attribute
                r = r + ' ' + name + '="' + value + '"'
            r = r + '>'
            self._write(r)

        def unknown_endtag(self, tag):
            self._write('</' + tag + '>')

        def handle_data(self, data):
            self._write(data)

        def handle_charref(self, ref):
            self._write('&#' + ref + ';')

        def handle_entityref(self, ref):
            self._write('&' + ref + ';')

        def handle_comment(self, comment):
            self._write('<!--' + comment + '-->')

Then you can add specific handlers (as methods of the same class) to do special things with particular tags:

        def do_img(self, attributes):
            # ...write out special DTML code here...
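
For anyone trying this on Python 3, where sgmllib is no longer in the standard library, here is a minimal sketch of the same pass-through idea using html.parser.HTMLParser instead (a swapped-in parser, not the sgmllib code above). The class name DTMLRewriter, the images and links lists, and the `<dtml-var name="...">` output are all hypothetical choices for illustration: the sketch assumes each image has already been uploaded as a Zope object whose id is the file name from the src attribute. Note that HTMLParser lowercases tag and attribute names, so the output is not byte-for-byte identical to the input.

```python
from html import escape
from html.parser import HTMLParser
import posixpath


class DTMLRewriter(HTMLParser):
    """Echo HTML back out unchanged, except IMG tags, which become DTML."""

    def __init__(self):
        # convert_charrefs=False keeps &amp; and &#169; flowing through
        # handle_entityref / handle_charref instead of being decoded.
        super().__init__(convert_charrefs=False)
        self._parts = []
        self.images = []   # every IMG src seen, in document order
        self.links = []    # every A href seen, in document order

    def get_result(self):
        return ''.join(self._parts)

    def _format_tag(self, tag, attrs, closing='>'):
        out = '<' + tag
        for name, value in attrs:
            if value is None:          # bare attribute, e.g. <td nowrap>
                out += ' ' + name
            else:                      # the parser unescaped it; re-escape
                out += ' %s="%s"' % (name, escape(value, quote=True))
        return out + closing

    def _write_img(self, attrs):
        src = dict(attrs).get('src') or ''
        self.images.append(src)
        # Assumption: the image exists in Zope under its bare file name.
        self._parts.append('<dtml-var name="%s">' % posixpath.basename(src))

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            self._write_img(attrs)
            return
        if tag == 'a':
            # Index the link target; the tag itself is passed through.
            self.links.extend(v for k, v in attrs if k == 'href' and v)
        self._parts.append(self._format_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/> or <img ... />.
        if tag == 'img':
            self._write_img(attrs)
        else:
            self._parts.append(self._format_tag(tag, attrs, ' />'))

    def handle_endtag(self, tag):
        if tag != 'img':               # a stray </img> has nothing to close
            self._parts.append('</' + tag + '>')

    def handle_data(self, data):
        self._parts.append(data)

    def handle_charref(self, name):
        self._parts.append('&#' + name + ';')

    def handle_entityref(self, name):
        self._parts.append('&' + name + ';')

    def handle_comment(self, data):
        self._parts.append('<!--' + data + '-->')

    def handle_decl(self, decl):
        self._parts.append('<!' + decl + '>')


parser = DTMLRewriter()
parser.feed('<p>See <a href="next.html">the next page</a> '
            '<img src="pics/logo.gif" alt="Logo"></p>')
parser.close()
print(parser.get_result())
print(parser.images, parser.links)
```

The "A" tags are only indexed here, not rewritten; replacing them with DTML would follow the same pattern as _write_img once you decide what the DTML for a link should look like.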