Non-ASCII characters in URLs
Hi,
Is there a good technical explanation for why Zope doesn't allow non-ASCII characters in URLs?
I'd like to be able to let URLs work like this example from Wikipedia: http://ja.wikipedia.org/wiki/メインページ
When I try adding an object with ID "メインページ" in Zope 2, I get the following error message:
Error Type: BadRequest
Error Value: The id "メインページ" contains characters illegal in URLs.
Is there a fundamental reason (i.e. Python objects can only be ASCII) or is it simply bugs that need to be fixed?
Curiously yours,
-- Alexander Limi · http://limi.net
On Sun, Apr 06, 2008 at 04:37:22PM -0700, Alexander Limi wrote:
Hi,
Is there a good technical explanation for why Zope doesn't allow non-ASCII characters in URLs?
I suspect it's only for hysterical raisins. The code in question is in OFS/ObjectManager.py, in the checkValidId() function. Non-ASCII characters trigger a match on the bad_id regular expression search. As I recall, if you look at the revision history, that code is very old. There might even be an existing bug filed about this; I don't remember. -- Paul Winkler http://www.slinkp.com
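To illustrate the kind of check Paul describes, here is a minimal Python 3 sketch. The character whitelist, the `BadRequest` stand-in class and the name `check_valid_id` are assumptions made for illustration; the real code lives in OFS/ObjectManager.py and may well differ.

```python
import re

# Hypothetical whitelist of characters accepted in an id. This is an
# assumption for illustration, not a copy of the pattern in Zope.
_bad_id = re.compile(r"[^a-zA-Z0-9\-_~,.$()# @]").search


class BadRequest(Exception):
    """Stand-in for the BadRequest error quoted in the thread."""


def check_valid_id(object_id):
    """Return object_id, or raise BadRequest if a character is not whitelisted."""
    if _bad_id(object_id) is not None:
        raise BadRequest(
            'The id "%s" contains characters illegal in URLs.' % object_id
        )
    return object_id
```

With a whitelist like this, any character outside the listed ASCII set triggers the error, which would explain why "メインページ" is rejected while "index_html" passes.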
--On 6. April 2008 16:37:22 -0700 Alexander Limi <limi@plone.org> wrote:
Hi,
Is there a good technical explanation for why Zope doesn't allow non-ASCII characters in URLs?
I'd like to be able to let URLs work like this example from Wikipedia:
http://ja.wikipedia.org/wiki/メインページ
When I try adding an object with ID "メインページ" in Zope 2, I get the following error message:
Error Type: BadRequest Error Value: The id "メインページ" contains characters illegal in URLs.
Is there a fundamental reason (ie. Python objects can only be ASCII) or is it simply bugs that need to be fixed?
As Paul indicated: the issue dates back to the times when there was only ASCII in the URL world. Object IDs in particular have to be ASCII. Well... Zope came from the US :-)
Andreas
On Mon, Apr 7, 2008 at 1:37 AM, Alexander Limi <limi@plone.org> wrote:
Is there a good technical explanation for why Zope doesn't allow non-ASCII characters in URLs?
Because URLs don't allow non-ASCII characters?
I'd like to be able to let URLs work like this example from Wikipedia:
Your browser translates that into http://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E...
Is there a fundamental reason (ie. Python objects can only be ASCII) or is it simply bugs that need to be fixed?
RFC 1738 (http://www.ietf.org/rfc/rfc1738.txt) doesn't allow non-ASCII characters in URLs:
No corresponding graphic US-ASCII:
URLs are written only with the graphic printable characters of the US-ASCII coded character set. The octets 80-FF hexadecimal are not used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent control characters; these must be encoded.
Now, Zope could well support UTF-8 ids, and translate URLs appropriately, but in the meantime you could use the same scheme?
-- Martijn Pieters
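The browser-side translation Martijn mentions can be reproduced with Python 3's standard library. This is only a sketch of the UTF-8-then-percent-escape scheme; the variable names are illustrative.

```python
from urllib.parse import quote, unquote

page_id = "メインページ"

# Encode the id as UTF-8 (the default for quote) and percent-escape every byte.
escaped = quote(page_id, safe="")

# Reverse the process: unescape the bytes and decode them as UTF-8 again.
restored = unquote(escaped)

print(escaped)
print(restored == page_id)
```

The escaped form contains only ASCII characters, so it satisfies the RFC 1738 rule quoted above while still round-tripping to the original id.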
----- Original Message ----- From: "Martijn Pieters" <mj@zopatista.com> To: "Alexander Limi" <limi@plone.org> Cc: <zope-dev@zope.org> Sent: Monday, April 07, 2008 4:39 AM Subject: Re: [Zope-dev] Non-ASCII characters in URLs
On Mon, Apr 7, 2008 at 1:37 AM, Alexander Limi <limi@plone.org> wrote:
Is there a good technical explanation for why Zope doesn't allow non-ASCII characters in URLs?
Because URLs don't allow non-ASCII characters?
I'd like to be able to let URLs work like this example from Wikipedia:
Your browser translates that into http://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E...
Is there a fundamental reason (ie. Python objects can only be ASCII) or is it simply bugs that need to be fixed?
RFC 1738 (http://www.ietf.org/rfc/rfc1738.txt) doesn't allow non-ascii characters in URLs.
No corresponding graphic US-ASCII:
URLs are written only with the graphic printable characters of the US-ASCII coded character set. The octets 80-FF hexadecimal are not used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent control characters; these must be encoded.
Now, Zope could well support UTF-8 ids, and translate URLs appropriately, but in the meantime you could use the same scheme?
IDNA (http://www.ietf.org/rfc/rfc3490.txt) and Punycode (http://www.faqs.org/rfcs/rfc3492.html) may be of some use. Jonathan
On Mon, 07 Apr 2008 05:32:17 -0700, Jonathan <dev101@magma.ca> wrote:
IDNA (http://www.ietf.org/rfc/rfc3490.txt) and Punycode (http://www.faqs.org/rfcs/rfc3492.html) may be of some use.
I'm not looking for non-ASCII domain names, just object IDs. :) -- Alexander Limi · http://limi.net
Martijn Pieters wrote at 2008-4-7 10:39 +0200:
On Mon, Apr 7, 2008 at 1:37 AM, Alexander Limi <limi@plone.org> wrote:
Is there a good technical explanation for why Zope doesn't allow non-ASCII characters in URLs?
Because URLs don't allow non-ASCII characters?
Almost surely, Alexander wants to ask why Zope does not allow non-ASCII characters in ids. And, in fact, there are only two reasons:
* laziness of the Zope developers: without the restriction to ASCII characters, careful quoting (and unquoting) is necessary in order to adhere to RFC 2396 (the modern URI syntax specification)
* there is no way to specify the encoding used for non-ASCII characters. HTML 4 suggests converting non-ASCII characters to UTF-8 first and then URL-escaping the result, but most HTTP clients do not follow this suggestion. Instead, they use the charset of the page that caused them to construct the URI.
I have observed that MS WebDAV transfers the URL as given for some WebDAV commands and recodes it into UTF-8 for others.
Thus, supporting non-ASCII ids may occasionally cause surprises.
-- Dieter
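The encoding ambiguity in the second point can be shown with a short Python 3 sketch. The helper names `decode_segment` and `try_decode` are hypothetical, and Latin-1 is used only as an example of a non-UTF-8 page charset.

```python
from urllib.parse import quote, unquote_to_bytes

object_id = "tüv"

# The same id escapes differently depending on the charset the client used.
escaped_latin1 = quote(object_id, encoding="latin-1")
escaped_utf8 = quote(object_id, encoding="utf-8")


def decode_segment(escaped, charset):
    """Unescape a path segment and decode its bytes with the given charset."""
    return unquote_to_bytes(escaped).decode(charset)


def try_decode(escaped, charset):
    """Like decode_segment, but return None when the bytes do not fit the charset."""
    try:
        return decode_segment(escaped, charset)
    except UnicodeDecodeError:
        return None
```

Nothing in the escaped URL says which charset produced it, so the server has to guess. Decoding the Latin-1 form as UTF-8 fails outright, while decoding the UTF-8 form as Latin-1 silently yields a different id.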
Previously Dieter Maurer wrote:
Martijn Pieters wrote at 2008-4-7 10:39 +0200:
On Mon, Apr 7, 2008 at 1:37 AM, Alexander Limi <limi@plone.org> wrote:
Is there a good technical explanation for why Zope doesn't allow non-ASCII characters in URLs?
Because URLs don't allow non-ASCII characters?
Almost surely, Alexander wants to ask why Zope does not allow non-ASCII characters in ids.
And, in fact, there are only two reasons:
* laziness of the Zope developers:
without the restriction to ASCII characters, careful quoting (and unquoting) is necessary in order to adhere to RFC 2396 (the modern URI syntax specification)
This is becoming increasingly painful: it means we can't really use Active Directory's ObjectGUID as a user ID, and it breaks with LDAP DNs that contain non-ASCII characters (all too common). I really wish Zope IDs were either binary strings or Unicode strings.
* there is no way to specify the encoding used for non ASCII characters.
HTML 4 suggests converting non-ASCII characters to UTF-8 first and then URL-escaping the result, but most HTTP clients do not follow this suggestion. Instead, they use the charset of the page that caused them to construct the URI.
I have observed that MS WebDAV transfers the URL as given for some WebDAV commands and recodes it into UTF-8 for others.
Thus, supporting non-ASCII ids may occasionally cause surprises.
You mean non-ASCII URIs, not non-ASCII ids here, I suspect. Somehow I'm not surprised those are painful :(
Wichert.
--
Wichert Akkerman <wichert@wiggy.net>
http://www.wiggy.net/
It is simple to make things. It is hard to make things simple.
Wichert Akkerman wrote at 2008-4-7 20:45 +0200:
...
Almost surely, Alexander wants to ask why Zope does not allow non-ASCII characters in ids.
And, in fact, there are only two reasons:
* laziness of the Zope developers:
without the restriction to ASCII characters, careful quoting (and unquoting) is necessary in order to adhere to RFC 2396 (the modern URI syntax specification)
This is becoming increasingly painful
I will soon have a patch against Zope 2.11b1 which gets rid of this restriction. If there is consensus, I can add it to the Zope repository.
...
* there is no way to specify the encoding used for non ASCII characters.
HTML 4 suggests converting non-ASCII characters to UTF-8 first and then URL-escaping the result, but most HTTP clients do not follow this suggestion. Instead, they use the charset of the page that caused them to construct the URI.
I have observed that MS WebDAV transfers the URL as given for some WebDAV commands and recodes it into UTF-8 for others.
Thus, supporting non-ASCII ids may occasionally cause surprises.
You mean non-ASCII URIs, not non-ASCII ids here, I suspect. Somehow I'm not surprised those are painful :(
No, I mean non-ASCII ids. They lead to URIs with some escaped characters, and for some commands MS WebDAV unescapes the URIs, interprets them in some default charset ("windows-1252" in our case), recodes them in UTF-8, escapes them again and then uses them in the commands. Examples are the COPY and MOVE commands.
If an object has a non-ASCII character in its id, say "tüv", its URL may look like "http:.../t%FCv". Used in a "COPY" or "MOVE", however, it is represented as "http:.../t%C3%BCv".
-- Dieter
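The unescape, reinterpret, recode and re-escape sequence Dieter describes can be sketched in Python 3 as follows. The function name `recode_like_client` is hypothetical; this models the described behaviour and is not taken from any WebDAV client.

```python
from urllib.parse import quote, unquote


def recode_like_client(escaped, assumed_charset="windows-1252"):
    """Unescape with the client's default charset, then re-escape as UTF-8."""
    text = unquote(escaped, encoding=assumed_charset)
    return quote(text, encoding="utf-8")


original = "t%FCv"                       # "tüv" escaped from a single-byte charset
recoded = recode_like_client(original)   # the form sent with COPY or MOVE
print(original, "->", recoded)
```

The byte FC becomes the two-byte UTF-8 sequence C3 BC for "ü", so the server sees two different escaped paths for the same object depending on the command.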
On Mon, 07 Apr 2008 12:45:00 -0700, Dieter Maurer <dieter@handshake.de> wrote:
Wichert Akkerman wrote at 2008-4-7 20:45 +0200:
This is becoming increasingly painful
I will soon have a patch against Zope 2.11b1 which gets rid of this restriction.
If there is consensus, I can add it to the Zope repository.
I would love to see support for non-ASCII object IDs, +1. (obviously not based on any technical understanding from my side :) -- Alexander Limi · http://limi.net
Dieter Maurer wrote:
Wichert Akkerman wrote at 2008-4-7 20:45 +0200:
...
Almost surely, Alexander wants to ask why Zope does not allow non-ASCII characters in ids.
And, in fact, there are only two reasons:
* laziness of the Zope developers:
without the restriction to ASCII characters, careful quoting (and unquoting) is necessary in order to adhere to RFC 2396 (the modern URI syntax specification)
This is becoming increasingly painful
I will soon have a patch against Zope 2.11b1 which gets rid of this restriction.
If there is consensus, I can add it to the Zope repository.
+1 from my side. Saves me the work of cleaning up my own dirty patch :-))
Tino Wildenhain wrote:
Dieter Maurer wrote:
Wichert Akkerman wrote at 2008-4-7 20:45 +0200:
...
Almost surely, Alexander wants to ask why Zope does not allow non-ASCII characters in ids.
And, in fact, there are only two reasons:
* laziness of the Zope developers:
without the restriction to ASCII characters, careful quoting (and unquoting) is necessary in order to adhere to RFC 2396 (the modern URI syntax specification)
This is becoming increasingly painful
I will soon have a patch against Zope 2.11b1 which gets rid of this restriction.
If there is consensus, I can add it to the Zope repository.
+1 from my side. Saves me the work of cleaning up my own dirty patch :-))
-1 without *careful* analysis of how the patch is going to break existing applications which rely on the fact that IDs are only ASCII (and therefore don't need to be quoted). At a minimum, this kind of change is going to require documenting the risks, and getting some feedback, before any merge to a production release.
Please check the patch in on a "private" branch and ask for comments here.
Tres.
--
Tres Seaver +1 540-429-0999 tseaver@palladion.com
Palladion Software "Excellence by Design" http://palladion.com
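The risk Tres raises can be pictured with a small Python 3 sketch. The helper names and the example.org base URL are hypothetical; the point is only that code which concatenates ids into URLs without quoting is correct as long as ids stay ASCII.

```python
from urllib.parse import quote


def naive_url(container_url, object_id):
    """Build a URL by plain concatenation, as code relying on ASCII ids might."""
    return container_url + "/" + object_id


def quoted_url(container_url, object_id):
    """Build a URL with the id escaped as UTF-8 before concatenation."""
    return container_url + "/" + quote(object_id, safe="")


base = "http://example.org/folder"  # hypothetical container URL

# For a plain ASCII id the two helpers agree, so the missing quoting is invisible.
same_for_ascii = naive_url(base, "index_html") == quoted_url(base, "index_html")

# For a non-ASCII id only the quoted form is a valid, ASCII-only URL.
differs_for_unicode = naive_url(base, "tüv") != quoted_url(base, "tüv")
```

Any application written like `naive_url` would start emitting invalid URLs once non-ASCII ids are allowed, which is why an audit of existing callers is asked for before merging.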
--On 12. April 2008 21:43:40 -0400 Tres Seaver <tseaver@palladion.com> wrote:
This is becoming increasingly painful
I will soon have a patch against Zope 2.11b1 which gets rid of this restriction.
If there is consensus, I can add it to the Zope repository.
+1 from my side. Saves me the work of cleaning up my own dirty patch :-))
-1 without *careful* analysis of how the patch is going to break existing applications which rely on the fact that IDs are only ASCII (and therefore don't need to be quoted). At a minimum, this kind of change is going to require documenting the risks, and getting some feedback, before any merge to a production release.
Please check the patch in on a "private" branch and ask for comments here.
@Dieter: please create a branch for this (and not a patch on Launchpad).
The patch has been working for a long time (possibly several years) within our private Zope, so I would not expect many problems. Of course it needs testing and documentation.
Andreas
Tres Seaver wrote at 2008-4-12 21:43 -0400:
...
Dieter Maurer wrote:
Wichert Akkerman wrote at 2008-4-7 20:45 +0200:
...
Almost surely, Alexander wants to ask why Zope does not allow non-ASCII characters in ids.
And, in fact, there are only two reasons:
* laziness of the Zope developers:
without the restriction to ASCII characters, careful quoting (and unquoting) is necessary in order to adhere to RFC 2396 (the modern URI syntax specification)
This is becoming increasingly painful
I will soon have a patch against Zope 2.11b1 which gets rid of this restriction.
If there is consensus, I can add it to the Zope repository.
+1 from my side. Saves me the work of cleaning up my own dirty patch :-))
-1 without *careful* analysis of how the patch is going to break existing applications which rely on the fact that IDs are only ASCII (and therefore don't need to be quoted). At a minimum, this kind of change is going to require documenting the risks, and getting some feedback, before any merge to a production release.
Please check the patch in on a "private" branch and ask for comments here.
Implemented on "http://svn.zope.org/Zope/branches/dm-arbitrary-ids/". -- Dieter
participants (9)
- Alexander Limi
- Andreas Jung
- Dieter Maurer
- Jonathan
- Martijn Pieters
- Paul Winkler
- Tino Wildenhain
- Tres Seaver
- Wichert Akkerman