Selecting rows from a Pandas dataframe with a compound (hierarchical) index
Solution 1
Try using xs
to be very precise:
In [5]: df.xs('a', level=0)
Out[5]:
value1 value2
group2
c 1.1 7.1
c 2.0 8.0
d 3.0 9.0
In [6]: df.xs('c', level='group2')
Out[6]:
value1 value2
group1
a 1.1 7.1
a 2.0 8.0
Solution 2
Syntax like the following will work:
df.ix['a']
df.ix['a'].ix['c']
since group1
and group2
are indices. Please forgive my previous attempt!
To get at the second index only, I think you have to swap indices:
df.swaplevel(0,1).ix['c']
But I'm sure Wes will correct me if I'm wrong.
Solution 3
In Python 0.19.0 there is a new suggested approach, which is explained here1. I believe the clearest example they give is the following, in which they slice from a four-level indexing. This is how the dataframe is made:
In [46]: def mklbl(prefix,n):
....: return ["%s%s" % (prefix,i) for i in range(n)]
....:
In [47]: miindex = pd.MultiIndex.from_product([mklbl('A',4),
....: mklbl('B',2),
....: mklbl('C',4),
....: mklbl('D',2)])
....:
In [48]: micolumns = pd.MultiIndex.from_tuples([('a','foo'),('a','bar'),
....: ('b','foo'),('b','bah')],
....: names=['lvl0', 'lvl1'])
....:
In [49]: dfmi = pd.DataFrame(np.arange(len(miindex)*len(micolumns)).reshape((len(miindex),len(micolumns))),
....: index=miindex,
....: columns=micolumns).sort_index().sort_index(axis=1)
....:
In [50]: dfmi
Out[50]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
C1 D0 9 8 11 10
D1 13 12 15 14
C2 D0 17 16 19 18
D1 21 20 23 22
C3 D0 25 24 27 26
... ... ... ... ...
A3 B1 C0 D1 229 228 231 230
C1 D0 233 232 235 234
D1 237 236 239 238
C2 D0 241 240 243 242
D1 245 244 247 246
C3 D0 249 248 251 250
D1 253 252 255 254
And this is how they select the different rows:
In [51]: dfmi.loc[(slice('A1','A3'),slice(None), ['C1','C3']),:]
Out[51]:
lvl0 a b
lvl1 bar foo bah foo
A1 B0 C1 D0 73 72 75 74
D1 77 76 79 78
C3 D0 89 88 91 90
D1 93 92 95 94
B1 C1 D0 105 104 107 106
D1 109 108 111 110
C3 D0 121 120 123 122
... ... ... ... ...
A3 B0 C1 D1 205 204 207 206
C3 D0 217 216 219 218
D1 221 220 223 222
B1 C1 D0 233 232 235 234
D1 237 236 239 238
C3 D0 249 248 251 250
D1 253 252 255 254
So quite simply, in df.loc[(indices),:]
, you specify the indices that you want to select per level, from highest level to lowest. If you do not want to make a selection of the lowest level(s) of indices, you can omit specifying them. If you don't want to make a slice between other specified levels, you add slice(None)
. Both cases are shown in the example, where level D is omitted and level B is specified between A and C.
JD Long
Only slightly ashamed creator of disgusting and frustrating code. I'm a data guy not a programmer. But sometimes I have to program my data into submission.
Updated on July 09, 2022Comments
-
JD Long almost 2 years
I'm suspicious that this is trivial, but I yet to discover the incantation that will let me select rows from a Pandas dataframe based on the values of a hierarchical key. So, for example, imagine we have the following dataframe:
import pandas df = pandas.DataFrame({'group1': ['a','a','a','b','b','b'], 'group2': ['c','c','d','d','d','e'], 'value1': [1.1,2,3,4,5,6], 'value2': [7.1,8,9,10,11,12] }) df = df.set_index(['group1', 'group2'])
df looks as we would expect:
If df were not indexed on group1 I could do the following:
df['group1' == 'a']
But that fails on this dataframe with an index. So maybe I should think of this like a Pandas series with a hierarchical index:
df['a','c']
Nope. That fails as well.
So how do I select out all the rows where:
- group1 == 'a'
- group1 == 'a' & group2 == 'c'
- group2 == 'c'
- group1 in ['a','b','c']