How do I refer to the index of my Pandas dataframe?

24,229

Solution 1

I think you have a slight misunderstanding of what indexes are. You don't just "designate" columns as indexes; that is, you don't just "tag" certain columns with info that says "this is an index". The index is a separate data structure that can hold data that aren't even present in the columns. If you do set_index, you move those columns into the index, so they no longer exist as regular columns. This is why you can no longer use them in the ways you mention: they aren't there anymore.
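To make the "moved, not tagged" point concrete, here is a minimal sketch. The toy table and its values are hypothetical, loosely modeled on the question's planets data; only `set_index`, `columns`, and `index.names` are real pandas API.

```python
import pandas as pd

# Hypothetical toy data standing in for the question's planets table.
df = pd.DataFrame({
    "host": ["Sun", "Sun", "Kepler-22"],
    "name": ["Earth", "Mars", "Kepler-22b"],
    "mass": [1.0, 0.107, 36.0],
})

# set_index returns a new frame in which 'host' and 'name' live in the index.
indexed = df.set_index(["host", "name"])

print(list(df.columns))           # ['host', 'name', 'mass']
print(list(indexed.columns))      # ['mass']
print(list(indexed.index.names))  # ['host', 'name']

# The former columns are gone from the column axis, so indexed['name'] would raise KeyError.
print("name" in indexed.columns)  # False
```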

One thing you can do is, when using set_index, pass drop=False to tell it to keep the columns as columns in addition to putting them in the index (effectively copying them to the index rather than moving them), e.g., df.set_index('SomeColumn', drop=False). However, you should be aware that the index and column are still distinct, so for instance if you modify the column values this will not affect what's stored in the index.
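A small sketch of that caveat, assuming a made-up two-column frame (the names `key` and `value` are illustrative only): after `set_index(..., drop=False)`, replacing the column leaves the index untouched.

```python
import pandas as pd

# Hypothetical example frame; 'key' and 'value' are illustrative names.
df = pd.DataFrame({"key": ["a", "b", "c"], "value": [1, 2, 3]})

# drop=False keeps 'key' as a regular column and also copies it into the index.
kept = df.set_index("key", drop=False)

# Overwrite the column; the index is a separate object and does not follow.
kept["key"] = kept["key"].str.upper()

print(list(kept["key"]))  # ['A', 'B', 'C']
print(list(kept.index))   # ['a', 'b', 'c']
```

After this assignment the column and the index disagree, which is exactly the kind of drift to watch for when keeping both.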

The upshot is that indexes aren't really columns of the DataFrame, so if you want to be able to use some data as both an index and a column, you need to duplicate it in both places. There is some discussion of this issue here.

Solution 2

The information is accessible using the index's get_level_values method:

import numpy as np
import pandas as pd
np.random.seed(1)

df = pd.DataFrame(np.random.randint(4, size=(10,4)), columns=list('ABCD'))    
idf = df.set_index(list('AB'))

idf.index.get_level_values('A') is roughly equivalent to df['A']. Note the change in type and dtype, however:

print(df['A'])
# 0    1
# 1    3
# 2    3
# 3    0
# 4    2
# 5    2
# 6    3
# 7    1
# 8    3
# 9    3
# Name: A, dtype: int32

def level(df, lvl):
    return df.index.get_level_values(lvl)

print(level(idf, 'A'))
# Int64Index([1, 3, 3, 0, 2, 2, 3, 1, 3, 3], dtype='int64')

And here again, instead of selecting the column with ['A'], you can get the equivalent information using .index.get_level_values('A'):

print(df.query('3>C>0 and D>0')['A'])
# 8    3
# Name: A, dtype: int32

print(level(idf.query('3>C>0 and D>0'), 'A'))
# Int64Index([3], dtype='int64')

PS. One of the golden rules of database design is, "Never repeat the same data in two places" since sooner or later the data will become inconsistent and thus corrupted. So I would recommend against keeping the data as both a column and an index, primarily because it could lead to data corruption, but also because it could be an inefficient use of memory.
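If the data should live in only one place, one possible pattern (not part of the answer above, just a sketch using the standard `reset_index` method) is to keep the values in the index and turn them back into columns only at the point where column-style access is needed. The small frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical data; the column names mirror the example above.
df = pd.DataFrame({
    "A": [1, 3, 3, 0],
    "B": [0, 1, 2, 3],
    "C": [2, 1, 0, 1],
    "D": [1, 2, 3, 0],
})
idf = df.set_index(["A", "B"])

result = idf.query("3 > C > 0 and D > 0")

# Option 1: read the level straight from the index.
print(result.index.get_level_values("A").tolist())  # [1, 3]

# Option 2: temporarily move the index levels back into columns.
flat = result.reset_index()
print(flat["A"].tolist())                            # [1, 3]

# idf itself is unchanged: 'A' and 'B' are still stored only in the index.
print(list(idf.columns))                             # ['C', 'D']
```

`reset_index` returns a new frame, so the stored data is never duplicated in `idf`; the trade-off is a copy each time the flat view is needed.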

orome
Author by

Updated on April 27, 2020

Comments

  • orome
    orome about 4 years

    I have a Pandas dataframe where I have designated some of the columns as indices:

    planets_dataframe.set_index(['host','name'], inplace=True)
    

    and would like to be able to refer to these indices in a variety of contexts. Using the name of an index works fine in queries

    planets_dataframe.query('host == "PSR 1257 12"')
    

    but results in an error if I try to use it to get a list of the values of an index, as I could when it was a column

    planets_dataframe.name
    #AttributeError: 'DataFrame' object has no attribute 'name'
    

    or to use it to list results as I could when it was a "regular" column

    planets_dataframe.query('30 > mass > 20 and discoveryyear > 2009')['name']
    #KeyError: u'no item named name'
    

    How do I refer to the "columns" of the dataframe that I'm using as indexes?


    Before set_index:

    planets_dataframe.columns
    # Index([u'name', u'lastupdate', u'temperature', u'semimajoraxis', u'discoveryyear', u'calculated', u'period', u'age', u'mass', u'host', u'verification', u'transittime', u'eccentricity', u'radius', u'discoverymethod', u'inclination'], dtype='object')
    

    After set_index:

    planets_dataframe.columns
    #Index([u'lastupdate', u'temperature', u'semimajoraxis', u'discoveryyear', u'calculated', u'period', u'age', u'mass', u'verification', u'transittime', u'eccentricity', u'radius', u'discoverymethod', u'inclination'], dtype='object')
    
  • orome
    orome about 10 years
    Any thoughts on best practices in this case? Using drop=False should solve the problem, but it would lose some semantics provided by the index (especially its hierarchical nature) and cause some redundancy.
  • BrenBarn
    BrenBarn about 10 years
    @raxacoricofallapatorius: How will it "lose some semantics"? Using drop=False won't change the semantics of the Index at all, it will just let you use the columns as both index and regular column. As for the redundancy, it is maybe a bit annoying conceptually, but I haven't found it to be a huge problem in practice.
  • orome
    orome about 10 years
    Sorry. Two scenarios flowed together there. So there's nothing wrong with keeping the columns, I guess (I suppose that's what drop=False is there for in the first place).
  • BrenBarn
    BrenBarn about 10 years
    This is right, but if you are using these columns a lot, the verbosity of getting level values this way quickly becomes a pain. Whether to duplicate or use this approach probably depends on how often you need to use the data in each way.