Load CSV to Pandas MultiIndex DataFrame

36,155

Solution 1

You could use pd.read_csv:

>>> df = pd.read_csv("test_data2.csv", index_col=[0,1], skipinitialspace=True)
>>> df
                       dep  freq   arr   code  mode
from       to                                      
RGBOXFD    RGBPADTON   127     0    27  99999     2
           RGBPADTON   127     0    33  99999     2
           RGBRDLEY    127     0  1425  99999     2
           RGBCHOLSEY  127     0    52  99999     2
           RGBMDNHEAD  127     0    91  99999     2
RGBDIDCOTP RGBPADTON   127     0    46  99999     2
           RGBPADTON   127     0     3  99999     2
           RGBCHOLSEY  127     0    61  99999     2
           RGBRDLEY    127     0  1430  99999     2
           RGBPADTON   127     0   115  99999     2

where I've used skipinitialspace=True to get rid of those annoying spaces in the header row.

Solution 2

from_csv() works similarly:

import pandas as pd

df = pd.DataFrame.from_csv(
    'data.txt',
    index_col = [0, 1]
)

print df

--output:--
                        dep   freq   arr   code   mode
from        to                                        
RGBOXFD    RGBPADTON    127      0    27  99999      2
           RGBPADTON    127      0    33  99999      2
           RGBRDLEY     127      0  1425  99999      2
           RGBCHOLSEY   127      0    52  99999      2
           RGBMDNHEAD   127      0    91  99999      2
RGBDIDCOTP RGBPADTON    127      0    46  99999      2
           RGBPADTON    127      0     3  99999      2
           RGBCHOLSEY   127      0    61  99999      2
           RGBRDLEY     127      0  1430  99999      2
           RGBPADTON    127      0   115  99999      2

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_csv.html#pandas.DataFrame.from_csv

From this discussion,

https://github.com/pydata/pandas/issues/4916

it looks like read_csv() was implemented to allow you to set more options, which makes from_csv() superfluous.

Share:
36,155
handloomweaver
Author by

handloomweaver

Updated on March 30, 2020

Comments

  • handloomweaver
    handloomweaver about 4 years

    I have a 719mb CSV file that looks like:

    from, to, dep, freq, arr, code, mode   (header row)
    RGBOXFD,RGBPADTON,127,0,27,99999,2
    RGBOXFD,RGBPADTON,127,0,33,99999,2
    RGBOXFD,RGBRDLEY,127,0,1425,99999,2
    RGBOXFD,RGBCHOLSEY,127,0,52,99999,2
    RGBOXFD,RGBMDNHEAD,127,0,91,99999,2
    RGBDIDCOTP,RGBPADTON,127,0,46,99999,2
    RGBDIDCOTP,RGBPADTON,127,0,3,99999,2
    RGBDIDCOTP,RGBCHOLSEY,127,0,61,99999,2
    RGBDIDCOTP,RGBRDLEY,127,0,1430,99999,2
    RGBDIDCOTP,RGBPADTON,127,0,115,99999,2
    and so on... 
    

    I want to load in to a pandas DataFrame. Now I know there is a load from csv method:

     r = pd.DataFrame.from_csv('test_data2.csv')
    

    But I specifically want to load it as a 'MultiIndex' DataFrame where from and to are the indexes:

    So ending up with:

                       dep, freq, arr, code, mode
    RGBOXFD RGBPADTON  127     0   27  99999    2
            RGBRDLEY   127     0   33  99999    2
            RGBCHOLSEY 127     0 1425  99999    2
            RGBMDNHEAD 127     0 1525  99999    2
    

    etc. I'm not sure how to do that?

  • Jeff Tratner
    Jeff Tratner over 10 years
    FYI: from_csv is deprecated (it does the same thing as read_csv, but with some surprising settings, and sometimes behaves strangely with date parsing).
  • 7stud
    7stud over 10 years
    from_csv is deprecated I'm not seeing that.
  • Jeff Tratner
    Jeff Tratner over 10 years
    You're absolutely right, pandas hasn't deprecated it yet. Sorry! That said, we've been thinking about deprecating. It's also just a little weird, like how it defaults to parses ints as dates, and we've been discussing deprecating it I guess).
  • 7stud
    7stud over 10 years
    Np. I read the discussion at your link as well as the discussion it links to in the last post, which is the link I included in my post.
  • safay
    safay about 2 years
    And if you're looking for column-wise multi-index, instead call: df = pd.read_csv("data.csv", header=[0,1], index_col=0))