Prevent recursive and multiple URLs in Zope
Hi Zope

I am quite new to Zope and it is fun ;-) There are, however, two things about it I do not like very much:

1) URL trailing slash handling:

http://example.com/some_doc
http://example.com/some_doc/
are both valid URLs to access the method or document some_doc in the given root folder. In file-based publishing (as with Apache) the second URL would be invalid, because some_doc is not a folder.

http://example.com/some_folder
http://example.com/some_folder/
are both valid URLs to access the folder some_folder in the root folder. Apache would allow the first URL, but would redirect to the second, because some_folder is not a document, it is a folder.

2) recursive acquisition:

http://example.com/some_folder/some_folder/some_folder/some_folder/
is a valid URL to access the folder some_folder in the root folder.

---

WHY do I dislike these two things?

a) Philosophically: As the name "UNIQUE resource locator" already says: it is generally not good to have the same content available via different locators.

b) Technically: Working with relative links becomes unreliable and dangerous. Problem #1 causes a relative URL to sometimes work and sometimes not work, depending on whether the visitor accesses "foo/bar/" or "foo/bar". Problem #2 makes relative links the door to infinite recursion. A simple link like "<a href="foo/">clickme</a>" will be the trap where dumb spiders lose themselves in an infinite loop (this was discussed briefly on this list under the subject "htdig indexing problem").

---

Experiences?

Since there are lots of Zope sites out there and I have not found much discussion of this matter so far, am I maybe putting too much weight on it?

---

Workarounds

I still hope to find a relatively simple solution to change that behaviour. So far, however, I have only found some workarounds:
- avoid relative URLs
- work with absolute_url(), URL0, URL1 etc. instead
- work with <base href=...>

If my editors were all technical guys, this would be a solution (but there is still Murphy's law...). But as I know my editors, they just type something in as the link and test whether it works, and because it DOES work, they do not notice that they have just opened the door to infinite recursion ;-).

Other workarounds I was told:
- (for problem 2): put an access-restricted subfolder with the same name into any folder
- (for problem 2): disallow access to any some_folder/some_folder combinations in a robots.txt

But these seem very tiresome and can only be automated with lots of work (or has somebody tried this?).

---

Solution?

Probably someone who knows the Zope internals well could easily create a plugin (product) which defines the following rules:
- if the request URL has a trailing slash, and the invoked object is not a folder: response 404 (even if generic Zope would serve an object then)
- if the request URL has no trailing slash, and the invoked object IS a folder: redirect to URL + '/'
- if the acquisition path invoked by the request URL contains an identical object multiple times: response 404

Does this make sense?

I tried to do it using an Access Rule with SiteAccess2, but this doesn't seem to lead to a sensible solution, because an Access Rule is invoked when a folder is traversed FIRST, and at that moment it is not known which type of object the URL will finally call. So there should be something like an Access Rule to be called _at the very end_ of the traversal/acquisition process.

I would be very thankful for any hint regarding this story, because it is really something that makes me a bit uneasy when starting to use Zope as my platform of choice for more complex and extended sites (which for all other reasons I would do, of course ;-) ).

Kind regards,
Urs

-------------------------
Urs van Binsbergen
van.binsbergen@taktik.ch
bureau taktik GmbH
Zentralstrasse 76b
8003 Zürich
Telefon 01 450 34 05
-------------------------
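For illustration, the two trailing-slash rules proposed above can be written down as a small decision function. This is only a sketch in plain Python, not Zope or SiteAccess code: the function name, the `is_folder` flag (standing in for whatever test decides that the published object is a folder) and the choice of status 301 for the redirect are all assumptions made for the example.

```python
from typing import Tuple


def trailing_slash_action(path: str, is_folder: bool) -> Tuple[int, str]:
    """Return (status, location) for the proposed trailing-slash rules.

    `is_folder` is a hypothetical flag supplied by the caller; how a real
    Zope site would determine it is deliberately left open here.
    """
    has_slash = path.endswith("/")
    if has_slash and not is_folder:
        # Rule 1: a trailing slash on a non-folder is rejected.
        return 404, ""
    if not has_slash and is_folder:
        # Rule 2: a folder without a trailing slash is redirected
        # (301 is an assumption; the proposal only says "redirect").
        return 301, path + "/"
    # Everything else is served as requested.
    return 200, path


if __name__ == "__main__":
    for path, is_folder in [("/some_doc", False), ("/some_doc/", False),
                            ("/some_folder", True), ("/some_folder/", True)]:
        print(path, trailing_slash_action(path, is_folder))
```

The sketch also shows where the difficulty described above lies: the decision needs `is_folder`, which is only known once traversal has reached the final object.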
Urs van Binsbergen writes:
... There are however 2 things about it I do not like very much:
1) URL trailing slash handling:
http://example.com/some_doc http://example.com/some_doc/ are both valid URLs to access the method or document some_doc in the given root folder. In file-based publishing (like with apache) the second URL would be invalid, because some_doc is not a folder.
http://example.com/some_folder http://example.com/some_folder/ are both valid URLs to access the folder some_folder in the root folder. Apache would allow the first URL, but would redirect to the second, because some_folder is not a document, it is a folder.
2) recursive acquisition:
http://example.com/some_folder/some_folder/some_folder/some_folder/ is a valid URL to access the folder some_folder in the root folder.
---
WHY do I dislike these two things?
a) Philosophically: As the name "UNIQUE resource locator" already says: it is generally not good to have the same content available via different locators.
Maybe your philosophical argument is weakened when you learn that URL stands for "*UNIFORM* resource locator".
It is a uniform syntax (!) for locating a resource accessible through a wide variety of protocols. It is quite common to have the same resource accessed through different URLs: often the same resource can be accessed via both HTTP and FTP; often the same (local) resource can be accessed with the "file", "ftp" and "http" protocols; often the same resource can be accessed via both "ftp" and "webdav" (which is HTTP-based).
b) Technically: Working with relative links becomes unreliable and dangerous. Problem #1 causes a relative URL to sometimes work and sometimes not work, depending on whether the visitor accesses "foo/bar/" or "foo/bar".
Only when you do strange things. Usually, Zope sets the HTML base tag, so that it does not matter whether the user uses "foo/bar/" or "foo/bar".
Problem #2 makes relative links the door to infinite recursion. A simple link like "<a href="foo/">clickme</a>" will be the trap where dumb spiders lose themselves in an infinite loop (this was discussed briefly on this list under the subject "htdig indexing problem").
When you use relative links in the same way you are forced to in a file-system-based publishing environment, there will be no infinite recursion. Simply avoid relative links containing a "/" that is not preceded by "..". Use an absolute URL otherwise.
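The behaviour discussed here follows from ordinary relative-URL resolution and can be reproduced outside Zope. The sketch below uses Python's standard `urllib.parse.urljoin`; the helper `follow` and the example URLs are made up for the illustration. It shows the trailing-slash dependence, the ever-growing path produced by a link such as "foo/" when the server keeps answering, and how a link written with ".." stays put.

```python
from urllib.parse import urljoin


def follow(start: str, href: str, hops: int) -> list:
    """Resolve the same relative link repeatedly, as a naive spider would."""
    urls = [start]
    for _ in range(hops):
        urls.append(urljoin(urls[-1], href))
    return urls


if __name__ == "__main__":
    # Problem #1: the same relative link resolves differently with and
    # without a trailing slash on the page URL.
    print(urljoin("http://example.com/foo/bar", "baz"))   # http://example.com/foo/baz
    print(urljoin("http://example.com/foo/bar/", "baz"))  # http://example.com/foo/bar/baz

    # Problem #2: a link "foo/" on a page under /foo/ grows the path on each hop.
    for url in follow("http://example.com/foo/", "foo/", 3):
        print(url)

    # A link preceded by ".." resolves to the same URL every time.
    for url in follow("http://example.com/foo/", "../foo/", 3):
        print(url)
```

On a file-based server the second hop of the "foo/" chain would normally end in a 404, which is why the loop only appears where acquisition makes every longer URL valid.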
Experiences?
Since there are lots of Zope sites out there and I have not found much discussion of this matter so far, am I maybe putting too much weight on it?
I feel you do.
Workarounds ... - work with <base href=...>
This is done automatically unless your pages are strange.
... Other workarounds I was told: - (for problem 2): put an access-restricted subfolder with the same name into any folder - (for problem 2): disallow access to any some_folder/some_folder combinations in a robots.txt
You may also learn about SiteAccess AccessRules (--> documentation on Zope.org).
Solution? ... - if the request URL has a trailing slash, and the invoked object is not a folder: response 404 (even if generic Zope would serve an object then)
While a file system folder is a very narrow concept, there are many folder variants in Zope. In fact, most objects in Zope can act like a folder (in the sense that they support a default presentation called "index_html").
Forget about the trailing "/" problem. Give your pages an HTML "head" element (as you should anyway) and do not include a "base" tag; Zope will then put such a tag in when it has modified the URL.
- if the acquisition path invoked by the request URL contains an identical object multiple times: response 404
--> SiteAccess AccessRule in your root folder.
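The test such a rule would need can be sketched independently of Zope. The function below is a hypothetical helper in plain Python, not part of SiteAccess, and how it would be wired into an Access Rule is not shown. It also makes a simplifying assumption: it compares path segment names, whereas the proposal speaks of identical objects, so two different folders that happen to share a name would be flagged as well.

```python
def has_repeated_segment(path: str) -> bool:
    """Return True if any non-empty path segment occurs more than once.

    Assumption: equal names are treated as the same object, which is
    stricter than comparing the objects actually reached by traversal.
    """
    segments = [segment for segment in path.split("/") if segment]
    return len(segments) != len(set(segments))


if __name__ == "__main__":
    for path in ["/some_folder/",
                 "/some_folder/some_doc",
                 "/some_folder/some_folder/some_folder/"]:
        verdict = "reject (404)" if has_repeated_segment(path) else "serve"
        print(path, "->", verdict)
```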
Does this make sense?
Maybe for you. I would not go this way.
Dieter