[Zope-dev] redirect burps on unicode URLs

Mon Mar 1 10:40:30 EST 2010

On 03/01/2010 03:34 PM, Wichert Akkerman wrote:
> On 3/1/10 15:09 , Christian Theune wrote:
>> Hi,
>>
>> On 03/01/2010 02:28 PM, Martin Aspeli wrote:
>>>
>>> I'm with Wichert here.
>>>
>>> In most places, we tend to carry around unicode strings internally, and
>>> only encode on the boundaries, e.g. when the URL is "rendered". I don't
>>> see why redirect() can't have a sensible and predictable policy for
>>> unicode strings, making life easier for everyone.
>>>
>>> If we think that non-ASCII URLs are illegal, then maybe we should
>>> validate for that and throw an error. However, I don't think that's the
>>> case (anymore?). In that case, passing a unicode object to the function
>>> seems entirely consistent with other places, e.g. when we pass unicode
>>> to the page template engine or return unicode from a view, which the
>>> publisher then encodes before it's pushed down to the client.
>>
>> I opened a question in another part of the thread, but haven't gotten an
>> answer yet. In my understanding, a Unicode string is not able to
>> represent the structural properties of a URL in http scheme properly,
>> thus encoding back to ASCII is not possible.
>>
>> Can someone confirm or disprove this?
> 
> I am not sure what you mean. On the wire you get a path component in a 
> HTTP get request which is UTF-8 encoded and escaped. For example 
> http://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8 
> , which is a Japanese string if you decode it back to unicode. That 
> encoding works fine in two directions, and all other properties used in 
> the http scheme such as query strings and fragments work normally. Can 
> you provide an example of something that might not work?

The problem is that a URI has internal structure which looks to me like
it can't be reconstructed properly if it was decoded into a "regular"
unicode string.

E.g. escaped reserved characters are probably decoded into their literal
symbols (a slash embedded in a path segment, or an ampersand inside a
query argument), so they need to be re-escaped (manually) before encoding.
Also, different parts of a URI use different ways to encode symbols:
hostnames want to be encoded as punycode, whereas URIs don't even say
which character set non-ASCII characters elsewhere should be encoded to.
That would be up to the application (e.g. our publisher, so that's
manageable).
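
To make the per-component point concrete, here is a minimal sketch using
Python 3's urllib.parse. The build_url helper, the example hostname and the
choice of UTF-8 for the path are my assumptions for illustration, not
anything the publisher does today:

```python
from urllib.parse import quote, urlunsplit

def build_url(host, segments, charset="utf-8"):
    # Hypothetical helper: encode each URI component with its own rules.
    # Hostname: IDNA/punycode, independent of the path charset.
    netloc = host.encode("idna").decode("ascii")
    # Path: escape every segment on its own with safe="" so that a slash
    # *inside* a segment becomes %2F and is not mistaken for a separator.
    path = "/" + "/".join(
        quote(segment, safe="", encoding=charset) for segment in segments
    )
    return urlunsplit(("http", netloc, path, "", ""))

url = build_url("bücher.example", ["wiki", "a/b", "メインページ"])
print(url)
```

Note that this only works because the caller hands over the structure
(host, list of segments) and not one flat unicode string.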

I have the feeling that the round trip URI -> unicode string -> URI can't
be done fully correctly, and may thus be susceptible to interference from
the outside.
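
A small sketch of the kind of round-trip loss I have in mind, again with
Python 3's urllib.parse. The URL is made up, and the "naive" re-encoding
step is just my assumption of what calling encode/quote on the whole
string would amount to:

```python
from urllib.parse import parse_qsl, quote, unquote, urlsplit

# One path segment contains an escaped slash, one query value contains
# an escaped ampersand.
original = "http://example.com/a%2Fb/c?x=1%262&y=3"

# Decode the whole URI into a "regular" string ...
decoded = unquote(original)
# ... and naively encode it again, leaving the delimiters alone.
reencoded = quote(decoded, safe=":/?&=")

orig_parts = urlsplit(original)
new_parts = urlsplit(reencoded)

# Path segments and query arguments before and after the round trip.
orig_segments = orig_parts.path.split("/")[1:]
new_segments = new_parts.path.split("/")[1:]
orig_query = parse_qsl(orig_parts.query)
new_query = parse_qsl(new_parts.query)

print(original, "->", reencoded)
print(orig_segments, "->", new_segments)
print(orig_query, "->", new_query)
```

The original has two path segments and the argument x="1&2"; after the
round trip there are three segments and x="1", so the structure has
silently changed.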

I still hope we can do better than doing nothing about it. I just think
it's more complex than calling encode('something'). ;)

Christian

-- 
Christian Theune · ct at gocept.com
gocept gmbh & co. kg · forsterstraße 29 · 06112 halle (saale) · germany
http://gocept.com · tel +49 345 1229889 0 · fax +49 345 1229889 1
Zope and Plone consulting and development