Reshaping dataframes in pandas based on column labels

15,294

I'm assuming you already have the DataFrame. In which case you can just turn the columns into a MultiIndex and use stack then reset_index. Note that you'll then have to rename and reorder the columns and sort by sample to get exactly what you posted in the question:

In [4]: df = pandas.DataFrame({"s1_x": scipy.randn(10), "s1_y": scipy.randn(10), "s2_x": scipy.randn(10), "s2_y": scipy.randn(10)})

In [5]: df.columns = pandas.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])

In [6]: df.stack(0).reset_index(1)
Out[6]: 
  level_1         x         y
0      s1  0.897994 -0.278357
0      s2 -0.008126 -1.701865
1      s1 -1.354633 -0.890960
1      s2 -0.773428  0.003501
2      s1 -1.499422 -1.518993
2      s2  0.240226  1.773427
3      s1 -1.090921  0.847064
3      s2 -1.061303  1.557871
4      s1 -1.697340 -0.160952
4      s2 -0.930642  0.182060
5      s1 -0.356076 -0.661811
5      s2  0.539875 -1.033523
6      s1 -0.687861 -1.450762
6      s2  0.700193  0.658959
7      s1 -0.130422 -0.826465
7      s2 -0.423473 -1.281856
8      s1  0.306983  0.433856
8      s2  0.097279 -0.256159
9      s1  0.498057  0.147243
9      s2  1.312578  0.111837

You can save the MultiIndex conversion if you can just create the DataFrame with a MultiIndex instead.

Edit: use merge to join original ids back in

In [59]: df
Out[59]: 
   names      s1_x      s1_y      s2_x      s2_y
0      0  0.732099  0.018387  0.299856  0.737142
1      1  0.914755 -0.798159 -0.732868 -1.279311
2      2 -1.063558  0.161779 -0.115751 -0.251157
3      3 -1.185501  0.095147 -1.343139 -0.003084
4      4  0.622400 -0.299726  0.198710 -0.383060
5      5  0.179318  0.066029 -0.635507  1.366786
6      6 -0.820099  0.066067  1.113402  0.002872
7      7  0.711627 -0.182925  1.391194 -2.788434
8      8 -1.124092  1.303375  0.202691 -0.225993
9      9 -0.179026  0.847466 -1.480708 -0.497067

In [60]: id = df.ix[:, ['names']]

In [61]: df.columns = pandas.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])

In [62]: pandas.merge(df.stack(0).reset_index(1), id, left_index=True, right_index=True)
Out[62]: 
  level_1         x         y  names
0      s1  0.732099  0.018387      0
0      s2  0.299856  0.737142      0
1      s1  0.914755 -0.798159      1
1      s2 -0.732868 -1.279311      1
2      s1 -1.063558  0.161779      2
2      s2 -0.115751 -0.251157      2
3      s1 -1.185501  0.095147      3
3      s2 -1.343139 -0.003084      3
4      s1  0.622400 -0.299726      4
4      s2  0.198710 -0.383060      4
5      s1  0.179318  0.066029      5
5      s2 -0.635507  1.366786      5
6      s1 -0.820099  0.066067      6
6      s2  1.113402  0.002872      6
7      s1  0.711627 -0.182925      7
7      s2  1.391194 -2.788434      7
8      s1 -1.124092  1.303375      8
8      s2  0.202691 -0.225993      8
9      s1 -0.179026  0.847466      9
9      s2 -1.480708 -0.497067      9

Alternatively:

    In [64]: df
Out[64]: 
   names      s1_x      s1_y      s2_x      s2_y
0      0  0.744742 -1.123403  0.212736  0.005440
1      1  0.465075 -0.673491  1.467156 -0.176298
2      2 -1.111566  0.168043 -0.102142 -1.072461
3      3  1.226537 -1.147357 -1.583762 -1.236582
4      4  1.137675  0.224422  0.738988  1.528416
5      5 -0.237014 -1.110303 -0.770221  1.389714
6      6 -0.659213  2.305374 -0.326253  1.416778
7      7  1.524214 -0.395451 -1.884197  0.524606
8      8  0.375112 -0.622555  0.295336  0.927208
9      9  1.168386 -0.291899 -1.462098  0.250889

In [65]: df = df.set_index('names')

In [66]: df.columns = pandas.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])

In [67]: df.stack(0).reset_index(1)
Out[67]: 
      level_1         x         y
names                            
0          s1  0.744742 -1.123403
0          s2  0.212736  0.005440
1          s1  0.465075 -0.673491
1          s2  1.467156 -0.176298
2          s1 -1.111566  0.168043
2          s2 -0.102142 -1.072461
3          s1  1.226537 -1.147357
3          s2 -1.583762 -1.236582
4          s1  1.137675  0.224422
4          s2  0.738988  1.528416
5          s1 -0.237014 -1.110303
5          s2 -0.770221  1.389714
6          s1 -0.659213  2.305374
6          s2 -0.326253  1.416778
7          s1  1.524214 -0.395451
7          s2 -1.884197  0.524606
8          s1  0.375112 -0.622555
8          s2  0.295336  0.927208
9          s1  1.168386 -0.291899
9          s2 -1.462098  0.250889
Share:
15,294
Admin
Author by

Admin

Updated on June 05, 2022

Comments

  • Admin
    Admin almost 2 years

    What is the best way to reshape the following dataframe in pandas? This DataFrame df has x,y values for each sample (s1 and s2 in this case) and looks like this:

    In [23]: df = pandas.DataFrame({"s1_x": scipy.randn(10), "s1_y": scipy.randn(10), "s2_x": scipy.randn(10), "s2_y": scipy.randn(10)})
    In [24]: df
    Out[24]: 
           s1_x      s1_y      s2_x      s2_y
    0  0.913462  0.525590 -0.377640  0.700720
    1  0.723288 -0.691715  0.127153  0.180836
    2  0.181631 -1.090529 -1.392552  1.530669
    3  0.997414 -1.486094  1.207012  0.376120
    4 -0.319841  0.195289 -1.034683  0.286073
    5  1.085154 -0.619635  0.396867  0.623482
    6  1.867816 -0.928101 -0.491929 -0.955295
    7  0.920658 -1.132057  1.701582 -0.110299
    8 -0.241853 -0.129702 -0.809852  0.014802
    9 -0.019523 -0.578930  0.803688 -0.881875
    

    s1_x and s1_y are the x/y values for sample 1, s2_x, s2_y are the sample values for sample 2, etc. How can this be reshaped into a DataFrame containing only x, y columns but that contains an additional column sample that says for each row in the DataFrame whether it's from s1 or s2? E.g.

              x         y      sample
    0  0.913462  0.525590          s1
    1  0.723288 -0.691715          s1
    2  0.181631 -1.090529          s1
    3  0.997414 -1.486094          s1
    ...
    5  0.396867  0.623482          s2
    ...
    

    This is useful for plotting things with Rpy2 later on, since many R plotting features can make use of this grouping variable, so that's my motivation for reshaping the dataframe.

    I think the answer given by Chang She doesn't translate to dataframes that have a unique index, like this one:

    In [636]: df = pandas.DataFrame({"s1_x": scipy.randn(10), "s1_y": scipy.randn(10), "s2_x": scipy.randn(10), "s2_y": scipy.randn(10), "names": range(10)})
    In [637]: df
    Out[637]: 
       names      s1_x      s1_y      s2_x      s2_y
    0      0  0.672298  0.415366  1.034770  0.556209
    1      1  0.067087 -0.851028  0.053608 -0.276461
    2      2 -0.674174 -0.099015  0.864148 -0.067240
    3      3  0.542996 -0.813018  2.283530  2.793727
    4      4  0.216633 -0.091870 -0.746411 -0.421852
    5      5  0.141301 -1.537721 -0.371601 -1.594634
    6      6  1.267148 -0.833120  0.369516 -0.671627
    7      7 -0.231163 -0.557398  1.123155  0.865140
    8      8  1.790570 -0.428563  0.668987  0.632409
    9      9 -0.820315 -0.894855  0.673247 -1.195831
    In [638]: df.columns = pandas.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])
    
    In [639]: df.stack(0).reset_index(1)
    Out[639]: 
      level_1         x         y
    0      s1  0.672298  0.415366
    0      s2  1.034770  0.556209
    1      s1  0.067087 -0.851028
    1      s2  0.053608 -0.276461
    2      s1 -0.674174 -0.099015
    2      s2  0.864148 -0.067240
    3      s1  0.542996 -0.813018
    3      s2  2.283530  2.793727
    4      s1  0.216633 -0.091870
    4      s2 -0.746411 -0.421852
    5      s1  0.141301 -1.537721
    5      s2 -0.371601 -1.594634
    6      s1  1.267148 -0.833120
    6      s2  0.369516 -0.671627
    7      s1 -0.231163 -0.557398
    7      s2  1.123155  0.865140
    8      s1  1.790570 -0.428563
    8      s2  0.668987  0.632409
    9      s1 -0.820315 -0.894855
    9      s2  0.673247 -1.195831
    

    The transformation worked but in the process the column "names" was lost. How can I keep the "names" column in the df while still doing the melting transformation on the columns that have _ in their names? The "names" column just assigns a unique name to each row in the dataframe. It's numeric here for example but in my data they are string identifiers.

    thanks.

  • Admin
    Admin about 11 years
    Could you explain the reset_index call and what it is doing here? Also, is it possible to get rid of the level_1 index? I also don't see how I can access the labels columns for each row using this
  • Chang She
    Chang She about 11 years
    After the call to stack, "s1 s2 s1 s2 ..." is part of the row labels (index). reset_index simply makes it a column instead. After this it's still a DataFrame so index/columns access is still the same.
  • Admin
    Admin about 11 years
    I modified my question to reflect a problem with this solution for uniquely indexed dataframes. Do you have an idea of how this solution could be adapted to those? thank you
  • Chang She
    Chang She about 11 years
    If that's the case, use merge or set the name to be the index to begin with. I'm updating the solution with this