I have never been much of a regex master, and I am having difficulty constructing one that should be fairly simple. I want to find all the text between <body> and </body> in a variable that may (and probably will) contain newlines. I am leaving case-insensitive matching out of the following examples in order to keep the regexes simpler.
To find the opening <body> tag, I have '<[\t ]*body[\t ]*.*>', which is almost correct, but not quite. This expression matches <bodystuff>, too, so I really need something like '<[\t ]*body([\t ]+.*>)|(>)', but I can't find a construct that works. Basically, it needs at least one whitespace character followed by stuff followed by '>', or else no whitespace followed by '>'.
I can use similar code to find the </body> tag.
However, putting those two together around a \(\(.*\n*\)*\), in order to match all the text between the <body> and </body> tags, sends Python into an infinite loop. It doesn't like it when I try to match an unlimited number of lines before the </body> tag.
I could use two regsub.split calls to break the variable into its respective parts (assuming I can sort out the first problem), but regsub is written in Python, while regex is written in C (or so I am told), so I would prefer to use regex.
Please help. Generally, whenever I ask a mailing list for regex help, it always turns out to be something boneheaded that I am missing, so try not to laugh at me ;-)
--sam
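The "whitespace plus stuff, or nothing at all" requirement described above can be expressed with an optional group. The following is only a sketch, and it assumes the newer `re` module rather than the `regex` module the message refers to; the name `OPEN_BODY` and the sample strings are made up for illustration.

```python
import re

# Hypothetical pattern: "<", optional blanks, "body", then either ">" right
# away or at least one blank followed by anything up to the closing ">".
OPEN_BODY = re.compile(r'<[\t ]*body(?:[\t ]+[^>]*)?>')

samples = ['<body>', '<body bgcolor="white">', '< body >', '<bodystuff>']

# True where the sample looks like an opening body tag.
matches = [bool(OPEN_BODY.search(s)) for s in samples]
```

Using `[^>]*` instead of `.*` keeps the match from running past the end of the tag, which also avoids much of the backtracking that nested repeats such as `\(\(.*\n*\)*\)` can trigger.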
OK, I solved this one. I can now tell the difference between <bodykjhsd> and <body kjhsd>.
I gave up on doing it correctly. I am now compiling two different regexes, one that finds the <body> tag and one that finds the </body> tag. I then use object.regs[index] to slice the string into the correct substring. UGLY.
--sam
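For comparison, the two-regex approach might look roughly like this with the `re` module. This is a sketch under assumptions: the patterns, the function name `extract_body`, and the sample page are all hypothetical, and it uses the match object's `start()` and `end()` methods in place of `object.regs[index]`.

```python
import re

# Hypothetical patterns for the opening and closing tags.
OPEN_BODY = re.compile(r'<body(?:\s[^>]*)?>', re.IGNORECASE)
CLOSE_BODY = re.compile(r'</body\s*>', re.IGNORECASE)

def extract_body(text):
    """Return the text between the body tags, or None if either tag is missing."""
    open_match = OPEN_BODY.search(text)
    if open_match is None:
        return None
    # Start looking for the closing tag only after the opening tag ends.
    close_match = CLOSE_BODY.search(text, open_match.end())
    if close_match is None:
        return None
    # Slice out everything between the end of <body ...> and the start of </body>.
    return text[open_match.end():close_match.start()]

page = '<html><BODY bgcolor="white">\nline one\nline two\n</body></html>'
body = extract_body(page)
```

Because neither pattern has to span the body text itself, newlines in the content never reach the regex engine.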
Here's the pattern I use:
    import re
    _find_body_re = re.compile(
        r'<body[^>]*>\s*(?P<body>.*?)\s*</body>',
        re.IGNORECASE | re.MULTILINE | re.DOTALL)
You can then say:
    matchObj = _find_body_re.search(textString)
    bodyString = matchObj.group('body')
Doug
-----------------------------------------------------------------------------
Doug Hellmann                     Healtheon / WebMD
Software Engineer                 http://www.webmd.com
hellmann@gnncast.net              404.541.2021
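A small, self-contained way to try the pattern above is sketched below. Two things here are assumptions rather than part of the original reply: the `\b` after `body`, added so that a tag such as <bodystuff> is not accepted, and the check for a failed search, since `search` returns None when nothing matches. The names `FIND_BODY`, `get_body`, and the sample HTML are hypothetical.

```python
import re

# Variant of the pattern above. The \b after "body" is an added assumption
# so that "<bodystuff>" is rejected; re.DOTALL lets "." match newlines.
FIND_BODY = re.compile(
    r'<body\b[^>]*>\s*(?P<body>.*?)\s*</body>',
    re.IGNORECASE | re.DOTALL)

def get_body(text):
    """Return the body text with surrounding whitespace trimmed, or None."""
    match = FIND_BODY.search(text)
    return match.group('body') if match else None

html = '<html><body class="x">\n  Hello\n  world\n</body></html>'
result = get_body(html)
```

The non-greedy `.*?` stops at the first </body> it can reach, so the pattern does not need an explicit repeat over lines.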
participants (2)
- Doug Hellmann
- Sam Gendler