pandas.read_csv from string or package data


Solution 1

The following worked for me in Python 3.3:

>>> import numpy as np, pandas as pd
>>> import io, pkgutil
>>> wells = pkgutil.get_data('pymc.examples', 'data/wells.dat')
>>> type(wells)
<class 'bytes'>
>>> df = pd.read_csv(io.BytesIO(wells), encoding='utf8', sep=" ", index_col="id", dtype={"switch": np.int8})
>>> df.head()
    switch  arsenic       dist  assoc  educ
id                                         
1        1     2.36  16.826000      0     0
2        1     0.71  47.321999      0     0
3        0     2.07  20.966999      0    10
4        1     1.15  21.486000      0    12
5        1     1.10  40.874001      1    14

[5 rows x 5 columns]

N.B. I had to put wells.dat in that location manually, so I can't swear that I copied it correctly or that no trailing whitespace remains, because I deleted some. But passing read_csv a BytesIO object and an encoding parameter should work. (You can probably get away without the encoding parameter, but specifying it is a good habit. io.TextIOWrapper might be another option.)
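
As a rough sketch of the two options mentioned above, the BytesIO route and the io.TextIOWrapper route should produce the same frame. The bytes literal `raw` is a made-up, space-delimited stand-in for wells.dat (not the real file), and UTF-8 is assumed as the encoding:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for pkgutil.get_data(...), which returns bytes in Python 3.
raw = b"id switch arsenic\n1 1 2.36\n2 1 0.71\n3 0 2.07\n"

# Option 1: hand pandas the bytes buffer and let it decode using `encoding`.
df_bytes = pd.read_csv(io.BytesIO(raw), encoding="utf8", sep=" ",
                       index_col="id", dtype={"switch": np.int8})

# Option 2: wrap the bytes buffer so pandas receives a text stream instead.
text_stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf8")
df_text = pd.read_csv(text_stream, sep=" ", index_col="id",
                      dtype={"switch": np.int8})
```

Either way the decoding step is explicit, which avoids relying on a platform default.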

Solution 2

To pass a string to pandas read_csv(), you can wrap it in io.StringIO, e.g.:

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO("csv string..."))
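
A slightly fuller sketch, assuming the data arrives as bytes (as pkgutil.get_data returns in Python 3) and is UTF-8 encoded: decode it first, then wrap the resulting str in StringIO. The payload and its column names are invented for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical bytes payload; in practice it might come from pkgutil.get_data.
payload = b"name,score\nalice,1.5\nbob,2.5\n"

# io.StringIO only accepts str, so decode the bytes explicitly first.
csv_text = payload.decode("utf-8")
df = pd.read_csv(StringIO(csv_text))

print(df)
```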
Author: John Salvatier

Large scale statistics, functional programming, etc

Updated on July 09, 2022

Comments

  • John Salvatier
    John Salvatier almost 2 years

    I have some csv text data in a package which I want to read using read_csv. I was doing this by

    from pkgutil import get_data
    from StringIO import StringIO
    
    data = read_csv(StringIO(get_data('package.subpackage', 'path/to/data.csv')))
    

    However, StringIO.StringIO disappears in Python 3, and io.StringIO only accepts Unicode. Is there a simple way to do this?

    Edit: the following does not appear to work

    import pandas as pd
    
    import pkgutil
    from io import StringIO
    
    def get_data_file(pkg, path):
        f = StringIO()
        contents = unicode(pkgutil.get_data('pymc.examples', 'data/wells.dat'))
        f.write(contents)
        return f
    
    wells = get_data_file('pymc.examples', 'data/wells.dat')
    
    data = pd.read_csv(wells, delimiter=' ', index_col='id',
                       dtype={'switch': np.int8})
    

    failing with

      File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 401, in parser_f
        return _read(filepath_or_buffer, kwds)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 209, in _read
        parser = TextFileReader(filepath_or_buffer, **kwds)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 509, in __init__
        self._make_engine(self.engine)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 611, in _make_engine
        self._engine = CParserWrapper(self.f, **self.options)
      File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 893, in __init__
        self._reader = _parser.TextReader(src, **kwds)
      File "parser.pyx", line 441, in pandas._parser.TextReader.__cinit__ (pandas/src/parser.c:3940)
      File "parser.pyx", line 551, in pandas._parser.TextReader._get_header (pandas/src/parser.c:5096)
    pandas._parser.CParserError: Passed header=0 but only 0 lines in file
    
  • John Salvatier
    John Salvatier over 10 years
    Thanks, I had figured out how to do it with io.StringIO(unicode(wells)), but this seems better.
  • addicted
    addicted over 6 years
    Thanks! This is very helpful. I was losing my mind over how to read a <class 'bytes'> object using pd.read_csv.
  • Nikhil VJ
    Nikhil VJ almost 6 years
    Thanks! I had CSV file contents uploaded through form data in a POST request. This worked: df = pd.read_csv( io.BytesIO( self.request.files['file1'][0]['body']) )
  • embulldogs99
    embulldogs99 almost 4 years
    This works well for .dat files as well. I am replacing the .dat file's spaces with commas then using the above code to convert the comma separated string into a pandas df
  • InnocentBystander
    InnocentBystander over 2 years
    @embulldogs99 you don't need to replace spaces, just use space as the field separator with sep=' '
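
On the "only 0 lines in file" error in the question: judging from the posted code alone, a likely cause is that the StringIO is written to and then returned with its position still at the end, so the parser finds nothing left to read. A minimal sketch of the effect, and of rewinding with seek(0), using made-up sample text rather than the real wells.dat:

```python
import io

import pandas as pd

sample = "id switch\n1 1\n2 0\n"  # made-up stand-in for the package data

f = io.StringIO()
f.write(sample)
leftover = f.read()   # the position sits at the end after write(), so this is empty

f.seek(0)             # rewind before handing the buffer to the parser
df = pd.read_csv(f, sep=" ", index_col="id")
```

Passing the text to the constructor instead, as in io.StringIO(sample), leaves the position at the start and sidesteps the problem.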