Comma delimited file and the re python object
I am trying to import a comma delimited ASCII file via Python.

Naturally the pattern , will not work, as there might be a comma inside the text. The pattern "," won't work either, as the numbers are not enclosed in quotes. There must be a simple pattern that does it right, but I cannot seem to figure it out myself.

regards
------------------------------------------------------------------------
Max M Rasmussen, New Media Director
http://www.normik.dk
Denmark
e-mail mailto:maxm@normik.dk
On Thu, 04 Nov 1999, Max M wrote:
I am trying to import a comma delimited ascii file via Python.
naturally the pattern , will not work, as there might be a comma inside the text.
the pattern "," won't work either, as the numbers are not enclosed in quotes.
There must be a simple pattern that does it right, but I cannot seem to figure it out myself.
Look, then your datafile is corrupt. If you have commas in the text fields, then the comma as separator does not work. Period. Choose another delimiter such as TAB. A long while back I wrote a parser for these things. It is trivial: just a few lines of code.

regards,
stephan
--
Stephan Richter
iXL - Software Designer and Engineer
CBU - Physics, Computer Science and Chemistry Student
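For reference, splitting a tab delimited line needs no regular expressions at all; a minimal sketch in present-day Python syntax (the sample record is made up):

```python
# A tab delimited record: commas inside fields are harmless now.
line = 'Bond, James\t13\t45\n'

# Strip the trailing newline, then split on the tab delimiter.
fields = line.rstrip('\n').split('\t')
print(fields)  # ['Bond, James', '13', '45']
```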
Stephan Richter wrote:
On Thu, 04 Nov 1999, Max M wrote:
I am trying to import a comma delimited ascii file via Python.
naturally the pattern , will not work, as there might be a comma inside the text.
the pattern "," won't work either, as the numbers are not enclosed in quotes.
There must be a simple pattern that does it right, but I cannot seem to figure it out myself.
Look, then your datafile is corrupt. If you have commas in the text fields, then the comma as separator does not work. Period.
Not if all fields were delimited by double quotes ("). The algorithm would still be fairly simple then. But now numbers *aren't* enclosed by quotes, which makes it rather more complicated to write, but it's still possible. This would get even more complicated if you do have escaped \" in strings. :)
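A minimal sketch of that quoted-text-or-bare-number alternation, in present-day Python syntax (escaped \" inside strings is not handled, and the number pattern is deliberately naive):

```python
import re

# Either a double-quoted string (group 1) or a bare number (group 2).
FIELD = re.compile(r'"([^"]*)"|([-+]?[\d.]+)')

def parse_line(line):
    fields = []
    for m in FIELD.finditer(line):
        if m.group(1) is not None:
            fields.append(m.group(1))  # quoted text, quotes stripped
        else:
            num = m.group(2)
            fields.append(float(num) if '.' in num else int(num))
    return fields

print(parse_line('-1, "two", "three", 4.0'))  # [-1, 'two', 'three', 4.0]
```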
Choose another delimiter such as TAB. A long while back I wrote a parser for these things. It is trivial: just a few lines of code.
If you can choose your own datafile format, that's the best way to do it. If you already have a datafile that you can't control, you need to convert it to a tab delimited file first, for which you'd use Python, and you're stuck with the same problem. ;)

Anyway, if you need more help with Python I'd suggest looking in comp.lang.python. Possibly there are also modules you can download that do this, at www.python.org.

Regards, Martijn
If I understand the discussion of the file format correctly, I think I might have a solution for you. If not, it is probably close.

Here's a sample data file I was working with to test (notice no escaped quotes in text fields -- not supported):

"Text column one", 2, 3, "Text column four"
-1, "two", "three", 4.0

I've attached a code file which parses it, creating strings (without the external quotes) and evaluating the numerical values to create the right kind of numbers. The resulting object is a subclass of UserList. The patterns used to parse the file are:

    TEXT_FIELD_PATTERN = r'"(?P<text>[^"]*)"'
    NUMBER_FIELD_PATTERN = r'(?P<number>([-+.]|\d)+)'
    FIELD_PATTERN = r'(' + TEXT_FIELD_PATTERN + r'|' + NUMBER_FIELD_PATTERN + r'),?\s*'

Doug

#!/usr/bin/env python
#
# Time-stamp: <99/11/04 11:03:44 hellmann>
#
"""
Permission to use, copy, modify, and distribute this software and its
documentation for any purpose and without fee is hereby granted,
provided that the above copyright notice appear in all copies and that
both that copyright notice and this permission notice appear in
supporting documentation, and that the name of Doug Hellmann not be
used in advertising or publicity pertaining to distribution of the
software without specific, written prior permission.

DOUG HELLMANN DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE,
INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO
EVENT SHALL DOUG HELLMANN BE LIABLE FOR ANY SPECIAL, INDIRECT OR
CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF
USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR
OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
PERFORMANCE OF THIS SOFTWARE.

DESCRIPTION:

    Parse a comma delimited text file and turn it into a list of tuples.

CHANGE HISTORY:

$Log: cdf.py,v $
Revision 1.1  1999/11/04 16:02:00  hellmann
Class for parsing simple comma delimited files.  Does not handle the
case where a text field contains an embedded quote.

"""

#
# Creation Information
#
__rcs_module_name__ = '$RCSfile: cdf.py,v $'
__creator__ = ' <hellmann@gnncast.net>'
__project__ = 'Python Pybox'
__created__ = 'Thu, 04-Nov-1999 10:44:46 EST'

#
# Current Information
#
__author__ = '$Author: hellmann $'
__version__ = '$Revision: 1.1 $'
__date__ = '$Date: 1999/11/04 16:02:00 $'
__locker__ = '$Locker: $'

#
# Import system modules
#
import UserList
import re
import string

#
# Import Local modules
#

#
# Module
#

class CDF(UserList.UserList):

    TEXT_FIELD_PATTERN = r'"(?P<text>[^"]*)"'
    NUMBER_FIELD_PATTERN = r'(?P<number>([-+.]|\d)+)'
    FIELD_PATTERN = r'(' + TEXT_FIELD_PATTERN + r'|' + NUMBER_FIELD_PATTERN + r'),?\s*'

    def __init__(self, input):
        UserList.UserList.__init__(self)
        reObj = re.compile(self.FIELD_PATTERN)
        if type(input) == type(''):
            input = open(input, 'rt')
        for line in input.readlines():
            matchObj = reObj.search(line)
            rec = ()
            while matchObj:
                #print matchObj.groupdict()
                groupdict = matchObj.groupdict()
                if groupdict['text']:
                    rec = rec + (groupdict['text'],)
                else:
                    numberText = string.strip(groupdict['number'])
                    if numberText:
                        rec = rec + (eval(groupdict['number']),)
                matchObj = reObj.search(line, matchObj.end())
            self.append(rec)
        return

if __name__ == '__main__':
    print CDF('cdf.txt')
On Thu, 4 Nov 1999, Max M wrote:
I am trying to import a comma delimited ascii file via Python.
naturally the pattern , will not work, as there might be a comma inside the text.
the pattern "," won't work either, as the numbers are not enclosed in quotes.
There must be a simple pattern that does it right, but I cannot seem to figure it out myself.
Have a look at the shlex library distributed with Python 1.5.2.

 ___
// Zen (alias Stuart Bishop)       Work: zen@cs.rmit.edu.au
// E N Senior Systems Alchemist    Play: zen@shangri-la.dropbear.id.au
//__ Computer Science, RMIT        WWW:  http://www.cs.rmit.edu.au/~zen
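On a present-day Python the shlex idea looks roughly like this: posix mode strips the quotes, and declaring the comma as the only "whitespace" lets quoted commas survive. A sketch, with the caveat that stray apostrophes in unquoted fields will confuse the lexer:

```python
import shlex

def split_csv(line):
    # Treat the comma as the token separator; quotes protect embedded commas.
    lex = shlex.shlex(line, posix=True)
    lex.whitespace = ','
    lex.whitespace_split = True
    # Fields may carry leading/trailing blanks, so strip them afterwards.
    return [field.strip() for field in lex]

print(split_csv('"Bond, James", 13, 45'))  # ['Bond, James', '13', '45']
```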
O'Reilly puts out a great book called Mastering Regular Expressions by Jeffrey Friedl which I highly recommend for anyone who plans to use regular expressions on a regular basis. (badoom, ching)

As others have pointed out, you can't use a simple comma as a delimiter if your fields can contain commas. You'll have to do something like put quotes around the fields if you want to do that. Of course, if you also want to include quotes in the fields then it gets even more complicated. This should all become quite clear once you understand the basics of regular expression parsing, though. It is also possible that using regular expressions alone may not be the best approach.

-jfarr
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Hi! I'm a signature virus.
Copy me into your .sig to join the fun!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I have got a working solution for this. I did it to read in user supplied datasets. The solution is ugly and slow, but it works for me.

The items are separated with commas; if a data item contains a comma, it must be enclosed in '' or "", otherwise enclosing is not needed. You can also include escape characters in a quoted data item, like "He said, \"Hello.\"". The function first cuts each line by commas, then evaluates quoted items to their Python string values, thus getting rid of quotations and escaped characters. Code below:

=====================================================
import string, re
from cStringIO import StringIO

def cut_comma(lines):
    """This function is used to read comma delimited files (.csv).

    Input: the rows of the data file in a list 'lines'.

    Output: a list of lists x, where x[i] is the i'th row of the
    input and x[i][j] is the j'th data item on the i'th row.

    The scanning respects quoting, e.g., the row
        "Bond, James", 13, 45
    will be recognized as three columns ["Bond, James", "13", "45"]
    successfully.  Escape characters, as in
        'That\'s absolutely possible.', 1999, 8, 30
    are treated correctly.

    Shortcoming: very slow when reading large datasets.  On a SUN
    Sparc 2, reading 10000 lines of input took 20 seconds.
    Suggestions are welcome.

    Author: Li Dongfeng, ldf@statms.stat.pku.edu.cn
    Last modified: 1999.9.10
    """
    # Here I used Tim Peters' suggestion, but it doesn't seem any faster.
    r = re.compile(r"""
        \s*              # data item can start with any number of spaces
        (?P<item>        # start of data item we need to extract
          '[^'\\\n]*(?:\\.[^'\\\n]*)*'
                         # anything enclosed in '...' (can contain commas);
                         # escaped characters are respected; no newlines
        | "[^"\\\n]*(?:\\.[^"\\\n]*)*"
                         # anything enclosed in "..." (can contain commas);
                         # escaped characters are respected; no newlines
        | [^,]*
        )
        (?:\s*
          ,              # end with a comma (can have spaces before it)
        )
        """, re.VERBOSE)
    r2 = re.compile(r"""^(['"]).*\1$""", re.MULTILINE)  # anything quoted
    def quote(match):
        return eval(match.group())
    if type(lines) is type(()) or type(lines) is type([]):
        # sub comma with newline, but respect string quoting
        x = map(lambda s, r=r: r.sub("\\g<item>\n", s), lines)
        # unquote all quoted
        x = map(lambda s, r=r2, f=quote: r.sub(f, s), x)
        # split the rows
        x = map(lambda s: string.split(s, "\n"), x)
    else:
        x = r.sub("\\g<item>\n", lines)
        x = r2.sub(quote, x)
        x = string.split(x, "\n")
    return x
==========================================

Max M wrote:
I am trying to import a comma delimited ascii file via Python.
naturally the pattern , will not work, as there might be a comma inside the text.
the pattern "," won't work either, as the numbers are not enclosed in quotes.
There must be a simple pattern that does it right, but I cannot seem to figure it out myself.
regards
------------------------------------------------------------------------
Max M Rasmussen, New Media Director
http://www.normik.dk
Denmark
e-mail mailto:maxm@normik.dk
_______________________________________________ Zope maillist - Zope@zope.org http://lists.zope.org/mailman/listinfo/zope No cross posts or HTML encoding! (Related lists - http://lists.zope.org/mailman/listinfo/zope-announce http://lists.zope.org/mailman/listinfo/zope-dev )
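Li Dongfeng's two-step recipe above (cut on commas while respecting quotes, then evaluate the quoted pieces) can be condensed on a present-day Python roughly as follows. ast.literal_eval replaces the bare eval as the unquoting step (my substitution), and the pattern is a compressed cousin of the verbose one above, not the original:

```python
import ast
import re

# One field: a quoted string (either quote style, backslash escapes allowed)
# or a bare run of non-comma characters, terminated by a comma or end of line.
ITEM = re.compile(
    r"""\s*('[^'\\]*(?:\\.[^'\\]*)*'|"[^"\\]*(?:\\.[^"\\]*)*"|[^,]*)\s*(?:,|$)""")

def cut_comma(line):
    fields = []
    pos = 0
    while pos < len(line):
        m = ITEM.match(line, pos)
        item = m.group(1)
        if item[:1] in ('"', "'"):
            item = ast.literal_eval(item)  # strip quotes, decode escapes
        fields.append(item)
        pos = m.end()
    return fields

print(cut_comma('"Bond, James", 13, 45'))  # ['Bond, James', '13', '45']
```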
Thanks for all the replies. I will keep working with Li Dongfeng's solution, as it seems to take care of most of the details. (As soon as I get it formatted again after extracting it from the e-mail ;-) )

Warm regards
Max M
------------------------------------------------------------------------
Max M Rasmussen, New Media Director
http://www.normik.dk
Denmark
e-mail mailto:maxm@normik.dk

Desktop@home: Win98, RH Linux 6.0, BeOS 4.5
NoteBook@home: Win 2000 Beta3, RH Linux 6.0
1. Desktop@work: Win NT4 SP4
2. Desktop@work: RH Linux 6.0

----- Original Message -----
From: Li Dongfeng <mavip5@inet.polyu.edu.hk>
To: Max M <maxm@normik.dk>
Cc: <Zope@zope.org>
Sent: Friday, November 05, 1999 5:55 AM
Subject: Re: [Zope] Comma delimited file and the re python object
I coded it in C and the performance is about 60x faster than the regular expression solution. And it handles nested quotes. Send me a note if this is something worth finishing up and posting somewhere.

####### Test
txt = """
"Bond, James", 13, 45
'That\'s "absolutely" possible.', 1999, 8, 30
"""
buf = txt
buf = buf * 50000
print len(buf)
t1 = time.time()
l = split(buf)
print len(l)
print "Time: %.2f" % (time.time() - t1)
print l[4]
print l[5]

#### Output:
3400000       # 3.4 meg input
100000        # Number of records
Time: 1.91    # Runtime
# Some output
['Bond', ' James', '13', '45']
['That\'s "absolutely" possible.', '1999', '8', '30']

--Darrell

----- Original Message -----
From: Li Dongfeng <mavip5@inet.polyu.edu.hk>
To: Max M <maxm@normik.dk>
Cc: <Zope@zope.org>
Sent: Thursday, November 04, 1999 11:55 PM
Subject: Re: [Zope] Comma delimited file and the re python object
I have got a working solution for this. I have done this to read in user supplied datasets.
The solution is ugly, slow, but works for me.
participants (8)
- Darrell
- Doug Hellmann
- Jonothan Farr
- Li Dongfeng
- Martijn Faassen
- Max M
- Stephan Richter
- Stuart 'Zen' Bishop