[Zope-CMF] some issues from the CMF Collector
Yuppie
schubbe@web.de
Thu, 16 Jan 2003 13:20:40 +0100
Hi Seb!
Thanks for the feedback.
seb bacon wrote:
>> <http://collector.zope.org/CMF/41>: CMFDefault.utils.bodyfinder
>> ---------------------------------------------------------------
>
> ...
>
>> PROPOSED SOLUTION (1):
>> Make the regex even more complex:
>> _bodyre = re.compile( r'^(\s|(<[^<>]*?>))*<html.*<body.*?>',
>> re.DOTALL | re.I )
>>
>> PROPOSED SOLUTION (2):
>> 'bodyfinder' is only useful for html documents, it should only be used
>> if we made sure we have a html document.
>>
>> QUESTIONS:
>> Solution (1) is just a one line change, (2) seems to be cleaner but
>> needs much more changes. Other ideas? Is there a way to make the regex
>> in (1) less complex? Which solution should be implemented?
>
>
> For solution (2) you'd usually end up by having to do some kind of regex
> anyway.
Meanwhile I had a closer look at this issue. There is already a function
that tries to make sure we have an HTML document: html_headcheck
<code>
import re

_htfinder = re.compile( r'<html', re.DOTALL | re.I )

def html_headcheck( html ):
    """ Return 'true' if document looks HTML-ish enough.
    """
    if not _htfinder.search(html):
        return 0

    lines = re.split(r'[\n\r]+?', html)

    for line in lines:
        line = line.strip()

        if not line:
            continue
        elif line.lower().startswith( '<html' ):
            return 1
        elif line[0] != '<':
            return 0
</code>
A PUT of a document with Content-Type != 'text/html' already calls this
function twice! Doing a similar check inside bodyfinder seems to be
overkill.
The current implementation of html_headcheck fails with multiline tags like
<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Transitional//EN"
"DTD/xhtml1-transitional.dtd">
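The second line of that tag does not start with '<', so the function
returns 0 before it ever reaches the <html> tag. So I guess the
line-by-line check should be replaced anyway with a regex like the one I
proposed for bodyfinder.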
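To illustrate the failure, here is a minimal sketch that runs a copy of the line-by-line check against a document with a multiline DOCTYPE and compares it with a regex-based check. The name html_headcheck_re and its pattern are assumptions, adapted from the regex proposed for bodyfinder; this has not been tested against the CMF code base.

```python
import re

_htfinder = re.compile(r'<html', re.DOTALL | re.I)

def html_headcheck(html):
    """Copy of the existing line-by-line check."""
    if not _htfinder.search(html):
        return 0
    for line in re.split(r'[\n\r]+?', html):
        line = line.strip()
        if not line:
            continue
        elif line.lower().startswith('<html'):
            return 1
        elif line[0] != '<':
            return 0

# Hypothetical replacement: only whitespace and complete tags
# (which may span several lines) are allowed in front of "<html".
_headcheck_re = re.compile(r'^(\s|(<[^<>]*?>))*<html', re.I)

def html_headcheck_re(html):
    """Return 1 if the document looks HTML-ish enough, else 0."""
    if _headcheck_re.search(html):
        return 1
    return 0

doc = ('<!DOCTYPE html PUBLIC\n'
       '    "-//W3C//DTD XHTML 1.0 Transitional//EN"\n'
       '    "DTD/xhtml1-transitional.dtd">\n'
       '<html><head><title>Test</title></head>\n'
       '<body><p>Hello</p></body></html>\n')
plain = 'Some notes that mention <html> in passing.\n'

print(html_headcheck(doc))       # 0: the second DOCTYPE line starts with '"'
print(html_headcheck_re(doc))    # 1: the whole DOCTYPE counts as one tag
print(html_headcheck_re(plain))  # 0: text in front of the first tag
```

One limitation of this sketch: a comment or tag that itself contains '<' or '>' in front of <html would still be rejected.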
Using html_headcheck before calling bodyfinder would make it possible to
run the check only once in each case.
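A rough sketch of that control flow, assuming a regex-based check like the one proposed for bodyfinder. All names here (looks_like_html, body_of, handle_upload) are hypothetical and simplified stand-ins, not the actual CMFDefault functions.

```python
import re

# Assumed check: only whitespace and complete tags in front of "<html".
_html_start = re.compile(r'^(\s|(<[^<>]*?>))*<html', re.I)
# Once we know the text is HTML, the body patterns can stay simple.
_body_start = re.compile(r'<body.*?>', re.DOTALL | re.I)
_body_end = re.compile(r'</body', re.I)

def looks_like_html(text):
    """Hypothetical stand-in for html_headcheck."""
    return _html_start.search(text) is not None

def body_of(text):
    """Hypothetical stand-in for bodyfinder; expects an HTML document."""
    start = _body_start.search(text)
    if start is None:
        return text
    end = _body_end.search(text, start.end())
    if end is None:
        return text
    return text[start.end():end.start()]

def handle_upload(text):
    """Run the HTML check once, then strip the body only for HTML."""
    if looks_like_html(text):
        return 'html', body_of(text)
    return 'text', text

html_doc = ('<html><head><title>T</title></head>'
            '<body>\n<p>Hi</p>\n</body></html>')
text_doc = 'A note about <html> and <body> tags.'

print(handle_upload(html_doc))  # ('html', '\n<p>Hi</p>\n')
print(handle_upload(text_doc))  # ('text', 'A note about <html> and <body> tags.')
```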
> (1) seems fine to me, though I guess the "<html.*" bit is now redundant.
We need "<html.*" because the part in front of it doesn't allow any text
outside of tags, but after "<html" comes the title, which has text
between its tags.
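A small check of this point, comparing the proposed pattern with a variant that drops "<html.*". The sample document and the variable names are made up for illustration.

```python
import re

FLAGS = re.DOTALL | re.I

# Pattern from proposed solution (1).
with_html = re.compile(r'^(\s|(<[^<>]*?>))*<html.*<body.*?>', FLAGS)
# Same pattern with the "<html.*" part removed.
without_html = re.compile(r'^(\s|(<[^<>]*?>))*<body.*?>', FLAGS)

doc = ('<html>\n'
       '<head><title>My title</title></head>\n'
       '<body class="x">\n'
       '<p>Hello</p>\n'
       '</body>\n'
       '</html>\n')

m = with_html.search(doc)
found_with = m is not None
found_without = without_html.search(doc) is not None

print(found_with)     # True: ".*" skips over the title text
print(found_without)  # False: "My title" is text outside of a tag
```

Without "<html.*" the leading group would have to consume everything up to "<body", and it stops at the first character of the title text.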
But after looking at html_headcheck I'd prefer solution 2 anyway.
Comments are welcome.
Cheers,
Yuppie