easy regular expression for URL fixup
Hey!

I'm working up a quick re to give me the folder above a webpage... For instance:

    ### I want: http://www.the.net/bigfolder/ ###
    import re
    url = "http://www.the.net/bigfolder/somepage.html"
    htmlfile = re.compile("/\w*\.html")
    htmlfile.match(href_url)
    if htmlfile:
        folder_url = htmlfile.sub(href_url, "/")

For some reason I cannot get my re to do this right...

I got a bunch of hits while searching that indicated many people were having trouble with re in Python 2.2. Is this the case? Anyone know what the syntax is to compile that re?

TIA
-ed-
Double each backslash, otherwise Python interprets them and removes them. This isn't new behavior, though; it's been around for years.

Also, htmlfile.match should return a match object, so you really want something like:

    m=htmlfile.match(href_url)
    if m:
        .....

Cheers,
Tom P
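A minimal sketch of the two points above, written for a current Python (print as a function) rather than 2.2. The variable names are only for illustration. It assumes nothing beyond the standard re module, and shows that the doubled-backslash string and the raw string spell the same pattern, and that the return value of match() is the thing to test, since a compiled pattern is always true.

```python
import re

doubled = "/\\w*\\.html"   # backslashes doubled in an ordinary string
raw = r"/\w*\.html"        # the same pattern written as a raw string
same_pattern = (doubled == raw)
print(same_pattern)

htmlfile = re.compile(raw)
url = "http://www.the.net/bigfolder/somepage.html"

# A compiled pattern object is always truthy, so "if htmlfile:" says nothing
# about whether anything matched.
pattern_is_truthy = bool(htmlfile)

# Keep the return value: match() gives a match object, or None on failure.
m = htmlfile.match(url)
if m:
    print("matched:", m.group(0))
else:
    print("no match")
```

With this url the else branch runs: match() only tries the pattern at the very start of the string.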
Hey Tom

Thanks for the reply...

Those backslashes are for escaping the special characters (\w and .). Do they need to be doubled in this case?

This still is not working for me

    ### I want: http://www.the.net/bigfolder/ ###
    import re
    url = "http://www.the.net/bigfolder/somepage.html"
    htmlfile = re.compile("/\\w*\\.html")
    m = htmlfile.match(url)
    if m:
        folder_url = htmlfile.sub(url, "/")

I'm also trying different variations to try and get a match. None of these are working either:

    htmlfile = re.compile("/.*$")      (this one should really be working yes?)
    htmlfile = re.compile("[a-z]*$")
    htmlfile = re.compile("\w*$")

the only match I can make is this (which will match anything):

    htmlfile = re.compile(".*$")
[Ed Colmar]
Those backslashes are for escaping the special characters (\w and .). Do they need to be doubled in this case?
Yes, they are for escaping the special characters once they get to the regular expression, but they have to get there first. They have to be doubled, or an alternative is

    HTMLFILE=r'/\\w*\\.html'
    htmlfile=re.compile(HTMLFILE)

Here the "r" indicates for Python to use the "raw" string, and not to escape the backslashes (at least it used to be this way - I'm not quite sure about 2.2).
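One caution on the raw-string form, which comes up again later in the thread: with the r prefix the backslashes should not be doubled as well. A small sketch of the difference, using search() so that the position of the match in the string is not a factor. The names raw_single, raw_doubled and odd are made up for this example.

```python
import re

url = "http://www.the.net/bigfolder/somepage.html"

raw_single = re.compile(r"/\w*\.html")     # the regex engine sees \w and \.
raw_doubled = re.compile(r"/\\w*\\.html")  # the regex engine sees \\, a literal backslash

found_single = raw_single.search(url)
found_doubled = raw_doubled.search(url)

print(found_single.group(0))
print(found_doubled)

# The doubled raw pattern only matches text that really contains backslashes.
odd = "/\\www\\.html"  # the characters: / \ w w w \ . h t m l
odd_matches = raw_doubled.search(odd) is not None
print(odd_matches)
```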
This still is not working for me
    ### I want: http://www.the.net/bigfolder/ ###
    import re
    url = "http://www.the.net/bigfolder/somepage.html"
    htmlfile = re.compile("/\\w*\\.html")
    m = htmlfile.match(url)
    if m:
        folder_url = htmlfile.sub(url, "/")

I'm also trying different variations to try and get a match. None of these are working either:

    htmlfile = re.compile("/.*$")      (this one should really be working yes?)
    htmlfile = re.compile("[a-z]*$")
    htmlfile = re.compile("\w*$")

the only match I can make is this (which will match anything):

    htmlfile = re.compile(".*$")
I suggest you do

    print url
    matches=htmlfile.findall(url)
    print matches

or

    from pprint import pprint
    pprint(matches)

You can best work this out in regular python, then copy the working code into your Zope script. This will show you exactly what the match found.

Regular expressions are notoriously hard to get working right (not Python's fault, that's just how they are), don't feel bad. You need to get more systematic about debugging - check every step of the way to make sure you understand what is going on, and read the docs for the re library.

Cheers,
Tom P
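The findall() check suggested above can be run over several candidate patterns at once. This is only a sketch for a current Python, not the Zope script itself; the candidates list and the results dictionary are names invented here, and the patterns are the ones tried earlier in the thread, written as raw strings.

```python
import re
from pprint import pprint

url = "http://www.the.net/bigfolder/somepage.html"

# Patterns tried earlier in the thread, as raw strings.
candidates = [r"/\w*\.html", r"/.*$", r"[a-z]*$", r"\w*$", r".*$"]

# findall() scans the whole string, so it shows what each pattern can find
# anywhere in the url, not just at the start.
results = {}
for pattern in candidates:
    results[pattern] = re.findall(pattern, url)

pprint(results)
```

Seeing the actual matched text for each pattern makes it easier to tell a pattern that is wrong from one that is fine but being applied with the wrong method.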
Hey Tom

Thanks again...

I'm trying these out in python for checking... Still I cannot get any matches... I am really wondering if my install is botched. Does this work on yours?

    import re
    HTMLFILE=r'/\\w*\\.html'
    htmlfile=re.compile(HTMLFILE)
    url = "http://www.somewhere.com/folder/test.html"
    m = htmlfile.match(url)
    print m
    None

    HTMLFILE=r'/.*\.html'
    htmlfile=re.compile(HTMLFILE)
    m = htmlfile.match(url)
    print m
    None

    HTMLFILE=r'[a-z]*\.html'
    htmlfile=re.compile(HTMLFILE)
    m = htmlfile.match(url)
    print m
    None

    HTMLFILE=r'.*\.$'
    htmlfile=re.compile(HTMLFILE)
    m = htmlfile.match(url)
    print m
    None

    HTMLFILE=r'.*'
    htmlfile=re.compile(HTMLFILE)
    m = htmlfile.match(url)
    print m
    <_sre.SRE_Match object at 0x00AB2110>

    HTMLFILE=r'.*\.html'
    htmlfile=re.compile(HTMLFILE)
    m = htmlfile.match(url)
    print m
    <_sre.SRE_Match object at 0x00ABCEB0>
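The results above look consistent with how match() behaves rather than with a broken install: match() only tries the pattern at the start of the string, so a pattern beginning with "/" cannot succeed on a url that begins with "http", while one beginning with ".*" can. A hedged sketch of the difference between match() and search(), written for a current Python:

```python
import re

url = "http://www.somewhere.com/folder/test.html"
pattern = re.compile(r"/\w*\.html")

at_start = pattern.match(url)    # only tries at index 0, where the text is "http..."
anywhere = pattern.search(url)   # scans forward for the first position that matches

print(at_start)
print(anywhere.group(0))
```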
Regular expressions for fully general urls are hard, as you are finding out.

1) Either use r'...' syntax or double the backslashes, but don't do both.
2) Remember that the path includes "/" characters, which \w does not match.
3) You may want to use non-greedy matches (see the docs).
4) What are you really trying to do? There might be an easier way. Do you want to get the path to the object, the object name at the end of the path, or what?

If you want the name at the end of the url (and the url will not have a query string or a fragment identifier):

    import string
    url = "http://www.somewhere.com/folder/test.html"
    split=string.split(url,'/')
    print split[-1]
    # prints test.html

To get the path up to but not including the final object name:

    split=string.split(url,'/')
    print string.join(split[:-1],'/')
    # prints http://www.somewhere.com/folder

To get the path up to but not including the ".html":

    split=string.split(url,'.')
    print string.join(split[:-1],'.')
    # prints http://www.somewhere.com/folder/test

So it may be a lot easier to use string functions, depending on what you want to do. In dtml, you use _string to use the string module.

Cheers,
Tom P
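For the folder url asked about at the top of the thread, both routes can be sketched as follows. This is one possible approach for a current Python, where split and join are string methods rather than functions in the string module. The pattern r"/[^/]*$" is a choice made for this example, not something proposed in the thread, and it assumes a url with no query string or fragment.

```python
import re

url = "http://www.the.net/bigfolder/somepage.html"

# Regex route: replace the last "/" plus everything after it with a single "/".
# Note the argument order for sub(): replacement first, then the string.
last_segment = re.compile(r"/[^/]*$")
folder_via_re = last_segment.sub("/", url)

# String route: split once from the right on "/" and put the slash back.
folder_via_split = url.rsplit("/", 1)[0] + "/"

print(folder_via_re)
print(folder_via_split)
```

Using [^/]* instead of \w* sidesteps the question of which characters may appear in the file name, since it accepts anything up to the end of the string that is not a slash.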
participants (3)
- Ed Colmar
- Ed Colmar
- Thomas B. Passin