What is the point of indexing in pandas?

19,017

Like a dict, a DataFrame's index is backed by a hash table. Looking up rows based on index values is like looking up dict values based on a key.

In contrast, the values in a column are like values in a list.

Looking up rows based on index values is faster than looking up rows based on column values.

For example, consider

df = pd.DataFrame({'foo':np.random.random(), 'index':range(10000)})
df_with_index = df.set_index(['index'])

Here is how you could look up any row where the df['index'] column equals 999. Pandas has to loop through every value in the column to find the ones equal to 999.

df[df['index'] == 999]

#           foo  index
# 999  0.375489    999

Here is how you could lookup any row where the index equals 999. With an index, Pandas uses the hash value to find the rows:

df_with_index.loc[999]
# foo        0.375489
# index    999.000000
# Name: 999, dtype: float64

Looking up rows by index is much faster than looking up rows by column value:

In [254]: %timeit df[df['index'] == 999]
1000 loops, best of 3: 368 µs per loop

In [255]: %timeit df_with_index.loc[999]
10000 loops, best of 3: 57.7 µs per loop

Note however, it takes time to build the index:

In [220]: %timeit df.set_index(['index'])
1000 loops, best of 3: 330 µs per loop

So having the index is only advantageous when you have many lookups of this type to perform.

Sometimes the index plays a role in reshaping the DataFrame. Many functions, such as set_index, stack, unstack, pivot, pivot_table, melt, lreshape, and crosstab, all use or manipulate the index. Sometimes we want the DataFrame in a different shape for presentation purposes, or for join, merge or groupby operations. (As you note joining can also be done based on column values, but joining based on the index is faster.) Behind the scenes, join, merge and groupby take advantage of fast index lookups when possible.

Time series have resample, asfreq and interpolate methods whose underlying implementations take advantage of fast index lookups too.

So in the end, I think the origin of the index's usefulness, why it shows up in so many functions, is due to its ability to perform fast hash lookups.

Share:
19,017
user3659451
Author by

user3659451

Updated on June 23, 2022

Comments

  • user3659451
    user3659451 almost 2 years

    Can someone point me to a link or provide an explanation of the benefits of indexing in pandas? I routinely deal with tables and join them based on columns, and this joining/merging process seems to re-index things anyway, so it's a bit cumbersome to apply index criteria considering I don't think I need to.

    Any thoughts on best-practices around indexing?

  • user3659451
    user3659451 over 9 years
    So not just lookups, but also merge operations would be faster if index columns are used, right?
  • unutbu
    unutbu over 9 years
    Yes, join calls merge (in most situations). Merging by index is faster than merging by columns because of the fast lookups. So ultimately it all comes back to the fast lookup ability.
  • puifais
    puifais over 6 years
    So if I index my table, then do a group by based on my_column, would it still be faster? It sounds like group by based on the index would be faster. But what about based on another column values?
  • unutbu
    unutbu over 6 years
    @puifais: For sufficiently large DataFrames, grouping by the index is faster than grouping on column values. For small DataFrames, grouping on column values can actually be faster. Moreover, calling set_index to convert a column to an index level also takes time, so either the DataFrame has to be sufficiently large or there has to be multiple groupby operations (thus amortizing the cost of calling set_index) to make grouping by the index pay off.
  • unutbu
    unutbu over 6 years
    @puifais: In general you have to test your code both ways to know for sure which way is faster.
  • puifais
    puifais over 6 years
    But it would be of no help if I index a table but perform a group by on a different column, correct? The performance would be equally as fast as if I were to do this on this same table but without index?
  • CashIsClay
    CashIsClay almost 6 years
    I imagine indexing a table just to do the merge defeats the purpose, though? I'm guessing it's only worth setting an index that doesn't already exist doesn't make sense unless you plan on more than one index-based operation.
  • John Greenall
    John Greenall over 4 years
    @unutbu are Dataframes indexes really based on hashtables? performance seems to degrade as dataframe size increases in a way dictionary performance does not. if I do: df = pd.DataFrame(data={'a':range(int(10e5))}) %timeit df.loc[4000] then increase size of df to 10e7 then time goes up by around 40%
  • towr
    towr almost 3 years
    You can find the hashmap for the index in df.index._engine.mapping (tested on pandas 1.1.0). The hashmap in pandas is a little slower than a python dict (5 times slower in my test: 45ns for dct[key] vs 232ns for df.index._engine.mapping.get_item(key)). But that effect pales in comparison to the rest of the machinery involved in doing a lookup (which brings it to 85µs for df.loc[key]). So I'd avoid using it anywhere that a normal dict will do.