How extensively is STX actually used? I've been looking at it myself recently, and the whole system seems rather simplistic to me in how it parses the text. I'm talking specifically about the STX version currently standard in Zope 2.5 (and 2.4, I think), which I believe is STXNG; I haven't gone back to look at how the previous version worked. I have looked at some of the past mailing list posts on this topic, and although it is clear that things have improved a lot, I am surprised that more hasn't been done so far. I describe the problems I see below, followed by a proposed algorithm change and some rough code to make things "better".

I'm doing this now for two reasons. First, if I'm missing something important about why things are the way they are, I'm obviously interested to know before I spend any more time on this. Second, maybe these proposed changes are actually a step in the right direction, or might help someone else do what they need, so I'm providing the code of what I've done so far, as is, in case it can be of help (I am unlikely to have the time to come up with a polished set of changes myself any time soon). Or I guess what I'm saying is (wink wink, nudge nudge): if someone else feels like picking up on this and finishing it up so I don't have to, feel free ;-)

The biggest problem I see is that the various text types are given somewhat arbitrary preference by the order in which they appear in the text_types list. Since the patterns in text_types are tried in order, the first match breaks the raw string in half, and any other structure that would have spanned a larger part of the string but sits lower in the text_types list is effectively ignored.
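To make the problem concrete, here is a minimal sketch using plain `re` outside of Zope; the two patterns are simplified stand-ins for the real text_types entries (the actual regexps also restrict which characters may appear between the delimiters):

```python
import re

# Hypothetical, simplified stand-ins for the real doc_strong and
# doc_emphasize patterns -- for illustration only:
doc_strong = re.compile(r'\*\*([^*]+)\*\*')
doc_emphasize = re.compile(r'\*([^*]+)\*')

text = "*emphasized **strong** emphasized*"

# doc_strong is tried first, and its match splits the string in three:
m = doc_strong.search(text)
left, middle, right = text[:m.start()], m.group(1), text[m.end():]
print(left, middle, right, sep='|')   # *emphasized |strong| emphasized*

# Each half now holds a lone '*', so doc_emphasize can never pair
# them up again -- the outer emphasis is silently dropped:
print(doc_emphasize.search(left), doc_emphasize.search(right))  # None None
```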
For example, since doc_strong is currently listed before doc_emphasize, "*emphasized **strong** emphasized*" does enclose "strong" in <strong> tags, but completely ignores the single *'s: they no longer form a matching pair, because the parsing of **strong** breaks the rest into the two separate strings "*emphasized " and " emphasized*". Following the same reasoning, I assumed that "**strong *emphasized* strong**" would work better, but it did not! This time it's because the regexp for doc_strong is rather simplistic: it does not allow ANY '*' inside (strongem_punc), whereas it should only refuse the specific sequence '**' inside.

In this case, wouldn't the easiest solution be to simply use non-greedy matching? I.e., replace

    r'\*\*([%s%s%s\s]+?)\*\*' % (letters, digits, strongem_punc)

with

    r'\*\*(.*?)\*\*'

or, better still,

    r'(?!\*\*\*)\*\*(?<!\*\*\*)(.*?)(?!\*\*\*)\*\*(?<!\*\*\*)'

I think the last pattern is best because it will not recognize the middle **** as markup in "**this: **** does not matter**", which is what I think should normally be expected. BTW, I make no claim that the regexp above is either the most elegant or the most efficient; it is simply the first one I came up with that did what I wanted ;-)

Now, back to the problem with the ordered nature of text_types (the reason "*emphasized **strong** emphasized*" does not work as expected). Besides the extra computation required, is there any reason why the structures with the largest span shouldn't be recognized first, regardless of the order of text_types? That is, I propose to go through all the text_types, collect the matching patterns, and only once this is done choose the one with the largest span, then proceed recursively on the enclosed text until no pattern matches. This makes it possible to quote structured text patterns, as in "'**not bold**'", and to bold quoted text, as in "** some text '<this is quoted>' **". With the current implementation, neither works:
"'**not bold**'" ends up being bolded and not quoted, and "** some text '<this is quoted>' **" is a total mess, because the text in <> is interpreted as SGML instead of being quoted as requested.

The changes I have made so far (all in DocumentClass.py): the simplest are the few regexp changes I have made for doc_strong, doc_emphasize, and doc_literal (doc_literal probably doesn't matter -- I used the pattern proposed by someone else on this list to make the quoting more obvious -- but the changes to doc_strong and doc_emphasize are required to make my other changes work).

doc_strong becomes:

    r'(?!\*\*\*)\*\*(?<!\*\*\*)(.*?)(?!\*\*\*)\*\*(?<!\*\*\*)'

doc_emphasize becomes:

    r'(?!\*\*)\*(?<!\*\*)(.*?)(?!\*\*)\*(?<!\*\*)'

and doc_literal becomes:

    r"(\W+|^)``([%s%s%s\s]+)''([%s]+|$)" % (letters, digits, literal_punc, phrase_delimiters)

The big changes are to the parse and color_text methods. parse now returns only the first match found of a type rather than all of them, and it returns the start and end indices of the match so that the span size can be computed in color_text:

    def parse(self, raw_string, text_type, type=type, st=type(''), lt=type([])):
        """Parse accepts a raw_string, an expr to test the raw_string,
        and the raw_string's subparagraphs.

        Parse will continue to search through raw_string until all
        instances of expr in raw_string are found.  NOT!!!  If no
        instances of expr are found, raw_string is returned.
        Otherwise a list of substrings and A SINGLE instance is
        returned.
        """
        tmp = []  # the list to be returned if raw_string is split
        append = tmp.append
        if type(text_type) is st:
            text_type = getattr(self, text_type)
        start = end = 0  # because I'm returning those now
        while 1:
            t = text_type(raw_string)
            if not t:
                break
            # an instance of expr was found
            t, start, end = t
            if start:
                append(raw_string[0:start])
            tt = type(t)
            if tt is st:
                # if we get a string back (when would this happen?),
                # add it to the text to be parsed
                raw_string = t + raw_string[end:len(raw_string)]
                # should I break or not here?  If I break, this is the
                # same as removing the while
            else:
                if tt is lt:
                    # if we get a list, append its elements
                    tmp[len(tmp):] = t
                else:
                    # normal case, an object
                    append(t)
                # do not keep processing once a match is found!
                raw_string = raw_string[end:len(raw_string)]
                break
        if not tmp:
            return (raw_string, 0, 0)  # nothing found
        if raw_string:
            append(raw_string)
        elif len(tmp) == 1:
            return (tmp[0], start, end)
        return (tmp, start, end)

In color_text, instead of looping over the text_types only once, I loop over all of them on every recursive pass. On each pass, I select the match (no matter what type) with the largest span, and recurse on its content. I think the code could be made a lot more efficient (among other things, I should be able to collect more than one match in a single pass, as long as they are not overlapping), but for now I just wanted to see whether the parsing would give the results I wanted, and it seems it does (but I repeat, I haven't tested extensively so far).
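As a standalone sanity check of the two ideas together -- the lookaround regexps plus largest-span selection -- here is a rough sketch outside of Zope. The `colorize` function and its tuple output are mine, purely for illustration, and not the DocumentClass API:

```python
import re

# The lookaround patterns proposed above:
doc_strong = re.compile(r'(?!\*\*\*)\*\*(?<!\*\*\*)(.*?)(?!\*\*\*)\*\*(?<!\*\*\*)')
doc_emphasize = re.compile(r'(?!\*\*)\*(?<!\*\*)(.*?)(?!\*\*)\*(?<!\*\*)')

# '*' is now allowed inside '**...**', so nested emphasis survives:
assert doc_strong.search("**strong *emphasized* strong**").group(1) \
       == "strong *emphasized* strong"
# ...and a bare '****' in the middle is swallowed as content, not markup:
assert doc_strong.search("**this: **** does not matter**").group(1) \
       == "this: **** does not matter"

# Largest-span selection, stripped to its core: try every pattern,
# keep the widest match, then recurse into its contents and its tail.
PATTERNS = [('strong', doc_strong), ('em', doc_emphasize)]

def colorize(text):
    best = None
    for name, pat in PATTERNS:
        m = pat.search(text)
        if m and (best is None
                  or m.end() - m.start() > best[1].end() - best[1].start()):
            best = (name, m)
    if best is None:
        return text
    name, m = best
    return (text[:m.start()], (name, colorize(m.group(1))),
            colorize(text[m.end():]))

# doc_emphasize spans the whole string, so it beats doc_strong, and the
# nested '**strong**' is then found on the recursive pass:
print(colorize("*emphasized **strong** emphasized*"))
# ('', ('em', ('emphasized ', ('strong', 'strong'), ' emphasized')), '')
```

With first-match-wins ordering the outer emphasis would have been lost entirely; here it survives as the outermost node.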
    def color_text(self, str, types=None):
        """Search the paragraph for each special structure"""
        if types is None:
            types = self.text_types
        if type(str) is StringType:
            max = 0
            parsed = 0
            for text_type in types:
                res, start, end = self.parse(str, text_type)
                if res != str:
                    parsed = 1
                # keep only the option with the largest span
                if end - start >= max:
                    finalres = res
                    max = end - start
            if parsed and type(finalres) is ListType:
                # *** this may cause other problems ***
                return self.color_text(finalres)
            else:
                return finalres  # end recursion
        elif type(str) is ListType:
            res = []
            for sub in str:
                subres = self.color_text(sub)
                if type(subres) is not ListType:
                    subres = [subres]
                res += subres
            return res
        else:
            res = map(self.color_text, str.getColorizableTexts())
            # To avoid stuff like StructuredTextSGML(StructuredTextSGML('<I>'))
            if len(res) == 0 or type(res[0]) != type(str) or \
               res[0].getColorizableTexts() != str.getColorizableTexts():
                str.setColorizableTexts(res)
            return str

So there you have it. I find that the results produced by this code make a lot more sense than what is produced by the current implementation. I guess one problem with structured text may be that there are differences of opinion as to what the actual rules and output should be (e.g., should list items be singly or doubly spaced?), but I really don't see who could argue that a preference in the order of markup like emphasis, strong, and underline makes any sense. Or am I missing something?

Cheers,
Jean