I have got a working solution for this. I wrote it to read in user-supplied datasets. The solution is ugly and slow, but it works for me. Items are separated by commas. If a data item contains a comma, it must be enclosed in '' or ""; otherwise quoting is not needed. You can also include escape characters in a quoted item, like "He said, \"Hello.\"". The function first cuts each line at the commas, then evaluates each quoted item to its Python string value, thus getting rid of the quotation marks and escape characters. Code below:

=====================================================
import string, re
from cStringIO import StringIO

def cut_comma(lines):
    """This function is used to read comma delimited files (.csv).

    Input: the rows of the data file in a list 'lines'.
    Output: a list of lists x, where x[i] is the i'th row of the input
    and x[i][j] is the j'th data item on the i'th row.

    The scanning respects quoting, e.g., the row
        "Bond, James", 13, 45
    will be recognized as three columns ["Bond, James", "13", "45"]
    successfully. Escape characters like
        'That\'s absolutely possible.', 1999, 8, 30
    are treated correctly.

    Shortcoming: very slow when reading large datasets. On a
    SUN Sparc 2, reading 10000 lines of input took 20 seconds.
    Suggestions are welcome.

    Author: Li Dongfeng, ldf@statms.stat.pku.edu.cn
    Last modified: 1999.9.10
    """
    # here I used Tim Peters' suggestion, but it doesn't seem any faster.
    r = re.compile(r"""
        \s*          # data item can start with any number of spaces
        (?P<item>    # start of data item we need to extract
            '[^'\\\n]*(?:\\.[^'\\\n]*)*'
                     # matches anything enclosed in '...' (can have comma)
                     # but escaped characters are escaped. without newline.
          | "[^"\\\n]*(?:\\.[^"\\\n]*)*"
                     # matches anything enclosed in "..." (can have comma)
                     # but escaped characters are escaped. without newline.
          | [^,]*
        )
        (?:\s* ,     # end with a comma (can have spaces before it)
        )
        """, re.VERBOSE)
    r2 = re.compile(r"""^(['"]).*\1$""", re.MULTILINE)  # anything quoted

    def quote(match):
        return eval(match.group())

    if type(lines) is type(()) or type(lines) is type([]):
        # sub comma with newline, but respect string quoting
        x = map(lambda s, r=r: r.sub("\\g<item>\n", s), lines)
        # unquote all quoted
        x = map(lambda s, r=r2, f=quote: r.sub(f, s), x)
        # split the rows
        x = map(lambda s: string.split(s, "\n"), x)
    else:
        x = r.sub("\\g<item>\n", lines)
        x = r2.sub(quote, x)
        x = string.split(x, "\n")
    return x
==========================================

Max M wrote:
I am trying to import a comma delimited ascii file via Python.
Naturally, the pattern , will not work, as there might be a comma inside the text.
The pattern "," won't work either, as the numbers are not enclosed in quotes.
There must be a simple pattern doing it right but I cannot seem to figure it out myself.
regards
------------------------------------------------------------------------
Max M Rasmussen, New Media Director    http://www.normik.dk    Denmark
e-mail mailto:maxm@normik.dk