Re: [Zope-dev] Content Type Meta tag stripping in zope.pagetemplate

24 Feb 2012

      On Fri, Feb 24, 2012 at 09:57:57PM +0100, Charlie Clark wrote:
...
On 24.02.2012 at 09:47, Miano Njoka <mianonjoka@gmail.com> wrote:
...
While it is not essential, it is necessary in some cases where the
finished document will be read from disk or is used by other
applications, e.g. Deliverance [http://packages.python.org/Deliverance/].
In fact the W3C's HTML validator warns that one should declare
the character encoding in the document itself if it is missing.
This is actually what the validator says:
"""
No character encoding information was found within the document,
either in an HTML meta element or an XML declaration. It is often
recommended to declare the character encoding in the document
itself, especially if there is a chance that the document will be
read from or saved to disk, CD, etc.
"""
As ZPT produces XHTML, the proper place for any encoding declaration
is the XML declaration, with UTF-8 as the default; an incorrect
declaration should then throw a validation error.
A strong -1 for zope.pagetemplate adding <?xml ... ?> declarations
automatically.
...
Like much of the HTML standard, the
meta tags were never really thought through and, because they are
invisible to the user, all too often copied mindlessly from one
project to another: I have customers today with completely invalid
and misleading meta tags of which they and the rest of the world are
blissfully unaware. As a result, browsers (the main consumers of
the format) were made fault tolerant; after all, the user often had
no idea what was causing the problem or how to rectify it. I have
seen many examples of the server saying one thing and the meta
something else entirely. I think nearly all browsers believe what
the server says over what's in the meta tag.
The HTML spec requires that:

  "To sum up, conforming user agents must observe the following
   priorities when determining a document's character encoding (from
   highest priority to lowest):

    1. An HTTP "charset" parameter in a "Content-Type" field.
    2. A META declaration with "http-equiv" set to "Content-Type" and a
       value set for "charset".
    3. The charset attribute set on an element that designates an
       external resource."

            -- http://www.w3.org/TR/html4/charset.html#h-5.2.2
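
To make that ordering concrete, here is a rough Python sketch of how a
client could apply the three priorities. The names detect_charset,
charset_from_content_type and link_charset are made up for this
illustration; none of this is zope.pagetemplate code, and real browsers
do considerably more sniffing than this.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_content_type(value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    if not value:
        return None
    msg = Message()
    msg["Content-Type"] = value
    charset = msg.get_param("charset")
    return charset.lower() if isinstance(charset, str) else None


class _MetaCharsetFinder(HTMLParser):
    """Remember the charset of the first <meta http-equiv="Content-Type">."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() == "content-type":
            self.charset = charset_from_content_type(attrs.get("content"))


def detect_charset(http_content_type, body, link_charset=None):
    """Apply the HTML 4 priorities, highest first; None if nothing is declared."""
    # 1. The charset parameter of the HTTP Content-Type header.
    charset = charset_from_content_type(http_content_type)
    if charset:
        return charset
    # 2. A META http-equiv declaration inside the document.
    #    latin-1 maps every byte to a character, so the scan cannot fail.
    finder = _MetaCharsetFinder()
    finder.feed(body.decode("latin-1"))
    if finder.charset:
        return finder.charset
    # 3. The charset attribute on the element that referenced the resource.
    return link_charset.lower() if link_charset else None
```

With a header of "text/html; charset=UTF-8" the meta tag is never
consulted; it only comes into play once the header carries no charset,
which is exactly the saved-to-disk situation.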

(Aside: The rationale for this ordering, IIRC, is that it allows HTTP
servers to do on-the-fly charset conversion from one 8-bit charset to a
different one, without having to parse HTML and modify the charset name
in the <meta> declaration.)
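
A small sketch of what such a recoding server would do, assuming a
hypothetical transcode() helper (the name and signature are mine, not
taken from any real server): it rewrites the bytes and the header, and
never looks inside the markup.

```python
def transcode(body, content_type, source, target):
    """Re-encode body from source to target and build the new header value.

    The markup is treated as opaque text: nothing inside it is parsed
    or rewritten, only the bytes and the HTTP header change.
    """
    text = body.decode(source)
    return text.encode(target), "%s; charset=%s" % (content_type, target)


page = ('<meta http-equiv="Content-Type" '
        'content="text/html; charset=koi8-r"><p>Привет</p>')
original = page.encode("koi8-r")
recoded, header = transcode(original, "text/html", "koi8-r", "windows-1251")
```

After the call the meta element inside recoded still claims koi8-r,
while the header says windows-1251. Because the header has priority,
a conforming user agent still decodes the page correctly.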
...
According to MAMA, which was instrumental in developing HTML 5 based
on what has actually been written, the charset was set in the
HTTP headers over 99% of the time. Unfortunately, it doesn't contain
any stats on discrepancies between the HTTP header and the meta.
http://dev.opera.com/articles/view/mama
While there is apparently a possible security risk in not
declaring the charset, I think the Pythonic principle of "there
should be preferably one obvious way to do something" should apply
within Zope when deciding the charset of a file, and that
should be well documented. I'd suggest keeping the stripping but
implementing a more rigorous approach such as you suggest.
I'm not a big fan of the stripping.

Consider people using wget to mirror websites (or some equivalent way --
hitting Save As in a browser and selecting "Web Page (original)" instead
of "Web Page (complete)").  The Content-Type header is not going to be
saved on disk.

Why should zope.pagetemplate forbid programmers from duplicating the
charset information in the <meta> element, at least as long as that
information is correct (i.e. matches the content type)?
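
For illustration, a minimal sketch of that alternative: keep the
<meta> declaration when it agrees with the Content-Type being sent, and
drop it only on a real conflict. strip_mismatched_meta is a
hypothetical helper, not what zope.pagetemplate currently does, and the
regular expressions are deliberately simplistic.

```python
import codecs
import re

META_RE = re.compile(
    r"""<meta\s[^>]*http-equiv\s*=\s*["']?content-type["']?[^>]*>""",
    re.IGNORECASE,
)
CHARSET_RE = re.compile(r"""charset\s*=\s*["']?([\w.:-]+)""", re.IGNORECASE)


def _canonical(name):
    """Normalise aliases such as UTF8 and utf-8 to one spelling."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return name.lower()


def strip_mismatched_meta(markup, content_type):
    """Remove Content-Type meta tags whose charset contradicts content_type."""
    header = CHARSET_RE.search(content_type or "")

    def replace(match):
        declared = CHARSET_RE.search(match.group(0))
        if declared is None or header is None:
            return match.group(0)  # nothing to compare, leave the tag alone
        if _canonical(declared.group(1)) == _canonical(header.group(1)):
            return match.group(0)  # consistent duplicate, keep it
        return ""  # contradicts the header, so drop it

    return META_RE.sub(replace, markup)
```

A page declaring utf-8 passes through unchanged when the response is
sent as "text/html; charset=UTF-8", and loses its meta tag only when
the response says something else.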

Marius Gedminas
-- 
http://pov.lt/ -- Zope 3/BlueBream consulting and development