[Zope] Comma delimited file and the re python object

Li Dongfeng mavip5@inet.polyu.edu.hk
Fri, 05 Nov 1999 12:55:59 +0800


I have a working solution for this.
I wrote it to read in user-supplied datasets.

The solution is ugly and slow, but it works for me.

The items are separated by commas.
If a data item contains a comma,
it must be enclosed in '...' or "...";
otherwise the quotes are optional.
A quoted item can also contain
escape characters, as in
"He said, \"Hello.\"".

The function below first cuts each line at the commas,
then evaluates each quoted item to its Python
string value, which gets rid of the quotation marks
and resolves the escaped characters.
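
The two steps just described can be sketched in current Python 3 as
follows. This is only an illustration, not the function posted in this
message: the names ITEM and split_row are made up, and ast.literal_eval
is swapped in for eval so that only literals are evaluated.

```python
import ast
import re

# Hypothetical names for illustration; this is not the posted function.
ITEM = re.compile(r"""
    \s*                                    # leading spaces
    (?:
        (?P<quoted>
            '[^'\\\n]*(?:\\.[^'\\\n]*)*'   # '...' with backslash escapes
          | "[^"\\\n]*(?:\\.[^"\\\n]*)*"   # "..." with backslash escapes
        )
      | (?P<plain>[^,]*)                   # unquoted item
    )
    \s*(?P<sep>,|$)                        # a comma, or the end of the row
""", re.VERBOSE)


def split_row(row):
    """Cut one row at the commas outside quotes, then unquote quoted items."""
    items, pos = [], 0
    while True:
        m = ITEM.match(row, pos)       # always matches: 'plain' may be empty
        if m.group("quoted") is not None:
            # Step 2: evaluate the quoted source text as a string literal.
            items.append(ast.literal_eval(m.group("quoted")))
        else:
            items.append(m.group("plain").strip())
        if m.group("sep") == "":       # reached the end of the row
            return items
        pos = m.end()


print(split_row('"Bond, James", 13, 45'))
```

An item that is not cleanly quoted, such as "abc" def, falls through to
the unquoted branch and is returned unchanged.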

Code below:

=====================================================
import string, re
from cStringIO import StringIO

def cut_comma(lines):
    """This function is used to read comma delimited files(.csv).

    Input: the rows of the data file in a list 'lines'.
    Output: a list of lists x, x[i] is the i'th row of the input,
            x[i][j] is the j'th data item on the i'th row.

    The scanning respects quoting, e.g., the row
          "Bond, James", 13, 45
    is recognized as three columns ["Bond, James", "13", "45"].
    Escape characters, as in
          'That\'s absolutely possible.', 1999, 8, 30
    are handled correctly.

    Shortcoming: very slow when reading large datasets. On a SUN Sparc 2,
    reading 10000 lines of input took 20 seconds. Suggestions are welcome.

    Author: Li Dongfeng, ldf@statms.stat.pku.edu.cn
    Last modified: 1999.9.10
    """
     
    # Here I used Tim Peters's suggestion, but it doesn't seem any faster.
    r = re.compile(r"""
        \s*     # data item can start with any number of spaces
        (?P<item>        # start of data item we need to extract
          '[^'\\\n]*(?:\\.[^'\\\n]*)*'   # matches anything enclosed in '...'
                                         # (can have commas); escaped characters
                                         # are skipped over; no newline allowed
         |
          "[^"\\\n]*(?:\\.[^"\\\n]*)*"   # matches anything enclosed in "..."
                                         # (can have commas); escaped characters
                                         # are skipped over; no newline allowed
         |
          [^,]*          # unquoted item: everything up to the next comma
        )
        (?:\s*
         ,  #  ends with a comma (can have spaces before it)
        )
       """, re.VERBOSE)
    r2=re.compile(r"""^(['"]).*\1$""", re.MULTILINE)  # anything quoted
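    def quote(match):
        # Caution: eval() runs any expression that starts and ends with the
        # same quote character, not just a plain string literal.
        return eval(match.group())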
    if type(lines) is type(()) or type(lines) is type([]):
        # sub comma with newline, but respect string quoting
        x = map(lambda s, r=r: r.sub("\\g<item>\n", s), lines)
        # unquote all quoted
        x = map(lambda s, r=r2, f=quote: r.sub(f, s), x)  
        # split the rows
        x = map(lambda s: string.split(s, "\n"), x)
    else:
        x = r.sub("\\g<item>\n", lines)
        x = r2.sub(quote, x)
        x = string.split(x, "\n")
    return x

==========================================
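
For readers on a current Python 3, the standard library's csv module can
read the double-quoted form of this format without a hand-written
pattern. The sketch below is a different technique from the function
above, and it assumes that only " is used for quoting, since the module
accepts a single quote character. The name read_rows is made up for
illustration.

```python
import csv
import io


def read_rows(text):
    """Parse comma-delimited text with "..." items and backslash escapes."""
    reader = csv.reader(
        io.StringIO(text),
        quotechar='"',
        escapechar="\\",        # treat \" inside quotes as a literal quote
        doublequote=False,      # quotes are escaped, not doubled
        skipinitialspace=True,  # ignore spaces right after each comma
    )
    return [row for row in reader]


print(read_rows('"Bond, James", 13, 45\n'))
```

Rows that quote items with '...' would still need the regular
expression approach, or a second pass with quotechar set to "'".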

Max M wrote:
> 
> I am trying to import a comma delimited ascii file via Python.
> 
> Naturally the pattern , will not work, as there might be a comma inside the
> text.
> 
> The pattern "," won't work either, as the numbers are not enclosed in
> quotes.
> 
> There must be a simple pattern doing it right but I cannot seem to figure it
> out myself.
> 
> regards
> ------------------------------------------------------------------------
> Max M Rasmussen,   New Media Director    http://www.normik.dk   Denmark
> e-mail                                   mailto:maxm@normik.dk