I want to set up a process for dumping my Zope CMF site to the filesystem, to be served by Apache. I'm interested in hearing from anyone who's doing this: what tools are you using?

I'm trying Wget, but the main problem is dealing with absolute URLs. I can use the Wget --convert-links option, which removes the href attribute from the <base> tag and makes internal links relative.

However, I still have a problem with folders. The absolute_url() method does not return a trailing slash for folders. Wget downloads the URL folder_name as a file called folder_name, but it downloads folder_name/ as folder_name/index.html. I have already written a relativeURL() script based on portal_url.getRelativeUrl(), but it doesn't return a trailing slash either, so I'll have to add one.

Thanks,
David

--
David Chandek-Stark
Web Applications Developer
Duke University - Perkins Library
(919) 660-5859
dc@duke.edu
We used wget followed by some ad hoc URL mangling. Our main problem was files without extensions. We captured the MIME type from wget's output, appended an appropriate extension, and then replaced all matching URLs.

Steven

Steven Hayles - Computer Systems Developer, sh23@le.ac.uk
Learning Technology Section, Computer Centre, University of Leicester,
University Rd, Leicester, LE1 7RH
Fax (0/+44)116 2525027
WWW <URL:http://www.le.ac.uk/home/sh23>

On Wed, 7 Jul 2004, David Chandek-Stark wrote:
> I want to set up a process for dumping my Zope CMF site to the filesystem, to be served by Apache. [...]
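The extension fix-up Steven describes might look roughly like the sketch below. It is only an illustration under stated assumptions: the function names `ext_for_mime` and `add_extension` are hypothetical, the MIME table is a small sample rather than a complete mapping, and the MIME type is passed in as an argument instead of being parsed from wget's output. Rewriting the URLs that point at the renamed files is not covered here.

```shell
#!/bin/sh
# Hypothetical helper: map a MIME type to a file extension.
# The table is illustrative, not exhaustive; unknown types map to "".
ext_for_mime() {
    case "$1" in
        text/html)       echo ".html" ;;
        text/css)        echo ".css" ;;
        text/plain)      echo ".txt" ;;
        image/jpeg)      echo ".jpg" ;;
        image/png)       echo ".png" ;;
        image/gif)       echo ".gif" ;;
        application/pdf) echo ".pdf" ;;
        *)               echo "" ;;
    esac
}

# Hypothetical helper: rename FILE to FILE<ext> for the given MIME type,
# unless FILE already ends in <ext> or the type is unknown.
# Prints the resulting file name.
add_extension() {
    file=$1
    ext=$(ext_for_mime "$2")
    case "$file" in
        # Also taken when ext is empty, so unknown types are left alone.
        *"$ext") echo "$file"; return 0 ;;
    esac
    mv -- "$file" "$file$ext" && echo "$file$ext"
}
```

For example, `add_extension ./docs/report application/pdf` would rename the file to `./docs/report.pdf` and print the new name; the printed name can then be used to build the search-and-replace for matching URLs.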
Hello David,

DCS> I want to set up a process for dumping my Zope CMF site to the
DCS> filesystem, to be served by Apache. I'm interested in anyone who's doing
DCS> this - what tools are you using. I'm trying Wget, but the main problem
DCS> is dealing with absolute URLs. I can use the Wget --convert-links
DCS> option, which removes the href attribute from the <base> tag and makes
DCS> internal links relative. However, I still have a problem with folders.
DCS> The absolute_url() method does not return a trailing slash for folders.
DCS> Wget downloads the URL folder_name as a file called folder_name, but it
DCS> downloads folder_name/ as folder_name/index.html. I have already written
DCS> a relativeURL() script based on
DCS> portal_url.getRelativeUrl(), but it
DCS> doesn't return a trailing slash either, so I'll have to add one.

I recently solved this problem. Here is what I did.

1. Make all your URLs end with a slash. At first I did it manually, by correcting some lists in portlets; after that I found out how to redefine the absolute_url() function. Please look for it here:

2. Run wget. I trigger it from Zope in response to a user action, but it can also be done with a shell script like the one below. Convert the links in the downloaded files and erase the <base ..> tag. I also edit the HTML files to delete 'index.html' from links, so every URL now ends with '/'. (*) If you wish, you can also shrink the files by stripping white space; I found that white space takes up about 30-40% of an HTML file.

3. Publish your files.
Here's the script:

el@test[<<debug-1/bin]%cat mirror.sh
#!/bin/sh
# With no argument, mirror the URLs listed in ../etc/wget-list one level deep;
# otherwise fetch only the given path below http://www.test/.
param=$1
if test "$param" = ""; then
  param='-r -l 1 -i ../etc/wget-list'
else
  param="http://www.test/$param"
fi
# $param is deliberately unquoted so that the default options split into words.
wget -v -nH -k -p -X images -x -R index_html $param
for i in `find ./ -name '*.html'`; do
  infa=`cat $i`
  # Strip 'index.html' from links and replace the emptied <base> tag.
  # The unquoted $infa also collapses runs of white space, newlines included.
  infa=`echo $infa|sed -e 's/href="\([a-zA-Z0-9._/-]*\)\/index.html"/href="\1\/"/g' \
    -e 's/="index.html"/=".\/"/g' -e 's/<base href=""[^/]*\/>/<!--here was base tag-->/'`
  echo $infa > $i
done
======

The file wget-list contains the extra files that need to be downloaded:

el@test[<<debug-1/bin]%cat ../etc/wget-list
http://www.test/
http://www.test/xtra/head.css
http://www.test/xtra/default.css
http://www.test/xtra/inside.css
====

Addition:
(*) - It's my mania. I hate URLs with a lot of junk, like
http://site/print1.html?foo=bar&sid=4759436545&vasya=pupkine&junks=true&noth.......
The best URL is in the format proposed by Tim Berners-Lee:
http://site/section/subsection/page/

--
Best regards,
Eugene
mailto:el-spam@yandex.ru
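The link clean-up in mirror.sh can also be tried on its own, without running wget. The sketch below is a variant, not Eugene's script: the function names `clean_links` and `clean_tree` are hypothetical, it reuses the same three substitutions, and it differs in two assumed preferences, namely that line breaks in the HTML are kept (no white-space collapsing) and that file names containing spaces should survive the loop.

```shell
#!/bin/sh
# Hypothetical filter: read HTML on stdin, write the cleaned HTML to stdout.
# Same substitutions as mirror.sh, with '|' as the sed delimiter so that
# the slashes need no escaping.
clean_links() {
    sed -e 's|href="\([a-zA-Z0-9._/-]*\)/index\.html"|href="\1/"|g' \
        -e 's|="index\.html"|="./"|g' \
        -e 's|<base href=""[^/]*/>|<!--here was base tag-->|'
}

# Hypothetical wrapper: apply clean_links to every *.html file below DIR.
# Each file is rewritten through a temporary file and replaced only if
# sed succeeded. File names containing newlines are not supported.
clean_tree() {
    find "$1" -type f -name '*.html' | while IFS= read -r f; do
        tmp="$f.tmp.$$"
        clean_links < "$f" > "$tmp" && mv -- "$tmp" "$f"
    done
}
```

After a mirror run, `clean_tree .` would process the downloaded tree in place. A single line can be checked with `printf '%s\n' '<a href="docs/index.html">d</a>' | clean_links`, which turns the link target into `docs/`.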
participants (3)
- David Chandek-Stark
- Eugene
- Steven Hayles