Hi,

We are having a problem with htdig indexing Zope documents with multiple directory listings. From htdig -i -vvv:

href:http://dev.website.com/org/org/org/org/org/org/org/org/org/core/index_a.html
resolving 'http://dev.website.com/org/org/org/org/org/org/org/org/org/org/core/index_a....'

At the moment, I am referencing

http://dev.ohlone.cc.ca.us/ (Zope)
http://dev2.website.com/ (Apache/filesystem)

in the htdig config start_url. I am using mod_rewrite in httpd.conf to redirect requests for dev.website.com to http://dev.website.com:8080/. The website functions exactly as expected, but the indexing is extreme.

I assume that this issue is associated with the Zope rewrite, but any info to help resolve this would be appreciated. I am also posting this to the htdig list, hoping to figure out how to work around it and what the heck is going on.

TIA,
-Tj
On Tue, 16 Jul 2002, Tiffany Webb wrote:
!hi,
We are having a problem with htdig indexing Zope documents with multiple directory listings from htdig -i -vvv:
href:http://dev.website.com/org/org/org/org/org/org/org/org/org/core/index_a.html
resolving 'http://dev.website.com/org/org/org/org/org/org/org/org/org/org/core/index_a....'
Get rid of EVERY relative URL on your site. Acquisition is very powerful, but with relative URLs it also means spiders get lost pretty much permanently and index the same content again and again under URLs they cannot tell are duplicates. You get the same behavior with wget, which just gets lost.
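One way to find the links this advice targets is to scan rendered pages for relative hrefs. The following is a minimal sketch, assuming the pages are available as HTML strings; the names RelativeLinkFinder, is_relative and find_relative_links are made up for illustration, and server-absolute paths such as /org/core/index_a.html are treated as safe because they do not depend on the URL of the page that contains them.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


def is_relative(href):
    """True for path-relative references such as 'org/core/index_a.html'."""
    parts = urlparse(href)
    if parts.scheme or parts.netloc:
        return False  # absolute URL (http://..., mailto:..., //host/...)
    # An empty path means a pure fragment or query reference; a leading
    # "/" means a server-absolute path. Neither depends on the page's directory.
    return bool(parts.path) and not parts.path.startswith("/")


class RelativeLinkFinder(HTMLParser):
    """Collects the relative href values of all <a> tags fed to it."""

    def __init__(self):
        super().__init__()
        self.relative_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and is_relative(value):
                self.relative_hrefs.append(value)


def find_relative_links(html):
    """Return every relative <a href> found in an HTML string."""
    finder = RelativeLinkFinder()
    finder.feed(html)
    return finder.relative_hrefs
```

Running find_relative_links over each page's output would give a list of links to convert to absolute URLs.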
Tiffany Webb writes:
We are having a problem with htdig indexing Zope documents with multiple directory listings from htdig -i -vvv:
href:http://dev.website.com/org/org/org/org/org/org/org/org/org/core/index_a.html
Looks like a non-trivial relative URL reference. A relative URL reference is non-trivial when it contains a "/" which is not preceded by "..".

Due to acquisition, Zope resolves such URL references quite well. But when you have a reference cycle containing one (or more) non-trivial URL references, the URLs get longer and longer with each round through the circle. Humans eventually stop going around the circle, but spiders may be stupid...

Dieter
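The growth described above can be reproduced with ordinary URL resolution. This is a minimal sketch, not a trace of the actual site: the page /org/index.html and the two links in the cycle are hypothetical, chosen only so that the result resembles the URLs in the htdig output. is_non_trivial follows the definition given above literally, and crawl_cycle mimics a spider that follows the same two links over and over.

```python
from urllib.parse import urljoin, urlparse


def is_non_trivial(href):
    """A relative reference is non-trivial if some "/" is not preceded by ".."."""
    parts = urlparse(href)
    if parts.scheme or parts.netloc or parts.path.startswith("/"):
        return False  # not a path-relative reference at all
    segments = parts.path.split("/")
    # Every segment except the last one is followed by a "/".
    return any(segment != ".." for segment in segments[:-1])


def crawl_cycle(start_url, hrefs, rounds):
    """Follow a fixed cycle of links, resolving each one like a spider would."""
    url = start_url
    visited = [url]
    for _ in range(rounds):
        for href in hrefs:
            url = urljoin(url, href)
            visited.append(url)
    return visited


if __name__ == "__main__":
    # Hypothetical cycle: index.html links down with a non-trivial reference,
    # index_a.html links back up with a trivial one.
    cycle = ["org/core/index_a.html", "../index.html"]
    start = "http://dev.website.com/org/index.html"
    for url in crawl_cycle(start, cycle, rounds=3):
        print(url)
```

Each pass through the cycle adds one more /org segment. Zope's acquisition still finds an object for every one of these paths, so the spider never hits a 404 that would stop it.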
The best bet is to stop using relative URLs in Zope, i.e. use
<a href="&dtml.url-yourobject;">link</a>
instead of
<a href="yourobject">link</a>
but if that causes a lot of pain, here are a couple of other ideas:

Try putting a rule in your robots.txt file such as:

User-agent: *
Disallow: /org/org

Or, try using the max_hop_count parameter in the htdig.conf file. You'd still get some repeats, but at least it would stop at some point. This is only reliable for complete indexes rather than updates.

-Paul

Dieter Maurer wrote:
Tiffany Webb writes:
We are having a problem with htdig indexing Zope documents with multiple directory listings from htdig -i -vvv:
href:http://dev.website.com/org/org/org/org/org/org/org/org/org/core/index_a.html
Looks like a non-trivial relative URL reference. A relative URL reference is non-trivial when it contains a "/" which is not preceded by "..".
Due to acquisition, Zope resolves such URL references quite well. But, when you have a reference cycle containing one (or more) non-trivial URL references, then the URLs get longer and longer for each round through the circle. Humans finally stop turning around the circle, but spiders may be stupid...
Dieter
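The effect of the robots.txt rule suggested in Paul's message can be checked offline before deploying it. This is a minimal sketch using Python's standard urllib.robotparser as a stand-in for a crawler; htdig's own robots.txt handling may differ in detail, and the user-agent string "htdig" and the orgchart.html URL are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rule from the message above, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /org/org
"""


def build_parser(robots_txt):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, agent="htdig"):
    """True if the given user agent may fetch the URL under the parsed rules."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for url in (
        "http://dev.website.com/org/core/index_a.html",
        "http://dev.website.com/org/org/core/index_a.html",
        "http://dev.website.com/org/orgchart.html",  # hypothetical page
    ):
        print(url, is_allowed(parser, url))
```

Disallow rules match by path prefix, so /org/org also blocks any legitimate path under /org/ whose name begins with "org", such as the hypothetical orgchart.html above. Writing the rule as /org/org/ would narrow it to the repeated directory.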
participants (4)
- Dieter Maurer
- kosh@aesaeion.com
- Paul Erickson
- Tiffany Webb