[Zope-CMF] some issues from the CMF Collector

Thu, 16 Jan 2003 13:20:40 +0100

Hi Seb!

Thanks for the feedback.

seb bacon wrote:
>> <http://collector.zope.org/CMF/41>: CMFDefault.utils.bodyfinder
>> ---------------------------------------------------------------
> 
> ...
> 
>> PROPOSED SOLUTION (1):
>> Make the regex even more complex:
>>     _bodyre = re.compile( r'^(\s|(<[^<>]*?>))*<html.*<body.*?>',
>>                           re.DOTALL | re.I )
>>
>> PROPOSED SOLUTION (2):
>> 'bodyfinder' is only useful for html documents, it should only be used 
>> if we made sure we have a html document.
>>
>> QUESTIONS:
>> Solution (1) is just a one line change, (2) seems to be cleaner but 
>> needs much more changes. Other ideas? Is there a way to make the regex 
>> in (1) less complex? Which solution should be implemented?
> 
> 
> For solution (2) you'd usually end up by having to do some kind of regex 
> anyway.

In the meantime I had a closer look at this issue. There is already a 
function that tries to make sure we have an HTML document: html_headcheck

<code>
_htfinder = re.compile( r'<html', re.DOTALL | re.I )

def html_headcheck( html ):
     """ Return 'true' if document looks HTML-ish enough.
     """
     if not _htfinder.search(html):
         return 0

     lines = re.split(r'[\n\r]+?', html)

     for line in lines:
         line = line.strip()

         if not line:
             continue
         elif line.lower().startswith( '<html' ):
             return 1
         elif line[0] != '<':
             return 0
</code>

A PUT of a document with Content-Type != 'text/html' already calls this 
function twice! Doing a similar check inside bodyfinder seems to be 
overkill.

The current implementation of html_headcheck fails on multi-line tags such 
as the following, because the second line does not start with '<':

<!DOCTYPE html PUBLIC
     "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "DTD/xhtml1-transitional.dtd">

So I guess the line-by-line check should be replaced anyway with a regex 
like the one I proposed for bodyfinder.
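
A minimal sketch of what such a regex-based check could look like, reusing 
the prefix of the regex from solution (1). The name _headcheck_re is made 
up for illustration, and this has not been tested against CMF itself:

```python
import re

# Hypothetical pattern (not the CMF code): in front of "<html" only
# whitespace and complete tags are allowed. Because [^<>] also matches
# newlines, a tag may span several lines.
_headcheck_re = re.compile(r'^(\s|<[^<>]*?>)*<html', re.I)


def html_headcheck(html):
    """Return true if the document looks HTML-ish enough."""
    return _headcheck_re.search(html) is not None
```

With this pattern the multi-line DOCTYPE above is consumed as one tag. A 
comment or declaration that itself contains '<' or '>' in front of "<html" 
would still be rejected, so this is not a complete answer either.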

Using html_headcheck before calling bodyfinder would make it possible to 
run the check only once in each case.
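
Roughly, the call order could look like the sketch below. All three 
functions are simplified stand-ins written for this example (extract_body 
in particular is a hypothetical caller), not the actual CMFDefault code:

```python
import re

# Stand-in patterns for illustration only.
_headcheck_re = re.compile(r'^(\s|<[^<>]*?>)*<html', re.I)
_body_open_re = re.compile(r'<body.*?>', re.DOTALL | re.I)
_body_close_re = re.compile(r'</body', re.I)


def html_headcheck(html):
    """Return true if the document looks HTML-ish enough."""
    return _headcheck_re.search(html) is not None


def bodyfinder(text):
    """Return the content of <body>; assumes the caller checked for HTML."""
    start = _body_open_re.search(text)
    if start is None:
        return text
    # Look for the closing tag only after the opening one.
    end = _body_close_re.search(text, start.end())
    if end is None:
        return text
    return text[start.end():end.start()]


def extract_body(text):
    """Hypothetical caller: run the HTML check once, then strip the body."""
    if html_headcheck(text):
        return bodyfinder(text)
    return text
```

Since the caller has already decided whether the text is HTML, bodyfinder 
itself can get by with a much simpler regex in this arrangement.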

> (1) seems fine to me, though I guess the "<html.*" bit is now redundant.

We need "<html.*" because in front of it we don't allow any text outside 
of tags, while after "<html" there is the title, with text between the tags.
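
A small comparison illustrates the point. The first pattern is the regex 
from solution (1); the second is the same regex with the "<html.*" part 
dropped. The sample document and the variable names are made up for this 
example:

```python
import re

flags = re.DOTALL | re.I

# Regex from solution (1).
with_html = re.compile(r'^(\s|(<[^<>]*?>))*<html.*<body.*?>', flags)
# Same regex without the "<html.*" part.
without_html = re.compile(r'^(\s|(<[^<>]*?>))*<body.*?>', flags)

doc = ('<!DOCTYPE html PUBLIC "x">\n'
       '<html><head><title>Hello</title></head>\n'
       '<body class="x">Content</body></html>')

# ".*" after "<html" skips over the title text and finds "<body ...>".
found = with_html.search(doc)

# Without it, the tag-only prefix gets stuck at the title text "Hello",
# so "<body" is never reached and there is no match.
missed = without_html.search(doc)
```

Here found is a match object ending right after the opening body tag, 
while missed is None.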

But after looking at html_headcheck I'd prefer solution 2 anyway.

Comments are welcome.

Cheers,

Yuppie