When to use Category rather than Object?


Solution 1

Use a category when there is lots of repetition that you expect to exploit.

For example, suppose I want the aggregate size per exchange for a large table of trades. Using the default object dtype is totally reasonable:

In [6]: %timeit trades.groupby('exch')['size'].sum()
1000 loops, best of 3: 1.25 ms per loop

But since the list of possible exchanges is pretty small, and because there is lots of repetition, I could make this faster by using a category:

In [7]: trades['exch'] = trades['exch'].astype('category')

In [8]: %timeit trades.groupby('exch')['size'].sum()
1000 loops, best of 3: 702 µs per loop
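
To try this comparison without the original data, here is a minimal sketch. The `trades` frame below is synthetic; the exchange names, row count, and seed are assumptions made purely for illustration, and absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the `trades` table (hypothetical data, not the original).
rng = np.random.default_rng(0)
exchanges = ["NYSE", "NASDAQ", "ARCA", "BATS"]  # assumed small set of exchanges
trades = pd.DataFrame({
    "exch": rng.choice(exchanges, size=100_000),
    "size": rng.integers(1, 1_000, size=100_000),
})

# The same column under both dtypes.
trades_obj = trades.astype({"exch": "object"})
trades_cat = trades.astype({"exch": "category"})

# Both dtypes must produce the same aggregate.
by_obj = trades_obj.groupby("exch")["size"].sum()
by_cat = trades_cat.groupby("exch", observed=True)["size"].sum()

# Time 20 repetitions of each groupby and report the mean per loop.
n = 20
t_obj = timeit.timeit(lambda: trades_obj.groupby("exch")["size"].sum(), number=n)
t_cat = timeit.timeit(
    lambda: trades_cat.groupby("exch", observed=True)["size"].sum(), number=n
)
print(f"object:   {t_obj / n * 1e3:.2f} ms per loop")
print(f"category: {t_cat / n * 1e3:.2f} ms per loop")
```

The conversion itself has a cost, so it tends to pay off when the column is grouped or compared many times.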

Note that categories are really a form of dynamic enumeration. They are most useful if the range of possible values is fixed and finite.
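
The enumeration idea can be seen by looking inside a categorical column. This is a small sketch with made-up exchange names: the distinct labels are stored once, and each row holds only an integer code.

```python
import pandas as pd

# Hypothetical values, chosen only to illustrate the storage layout.
exch = pd.Series(["NYSE", "ARCA", "NYSE", "NASDAQ", "NYSE"], dtype="category")

# The distinct labels are stored once...
categories = list(exch.cat.categories)
# ...and each row holds a small integer code that indexes into that list.
codes = exch.cat.codes.tolist()
print(categories)
print(codes)

# "Dynamic": the set of allowed labels can be extended later.
extended = exch.cat.add_categories(["BATS"])
print(list(extended.cat.categories))
```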

Solution 2

The Pandas documentation has a concise section on when to use the categorical data type:

The categorical data type is useful in the following cases:

  • A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see here.
  • The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order, see here.
  • As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
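
The first two cases can be checked directly. The sketch below uses a made-up column of three repeated strings; the exact byte counts depend on the pandas version, so only the relative sizes matter.

```python
import pandas as pd

# Hypothetical column: three distinct strings repeated many times.
s_obj = pd.Series(["one", "two", "three"] * 10_000, dtype="object")
s_cat = s_obj.astype("category")

# deep=True also counts the Python string objects, not just the pointers.
mem_obj = s_obj.memory_usage(deep=True)
mem_cat = s_cat.memory_usage(deep=True)
print(f"object:   {mem_obj} bytes")
print(f"category: {mem_cat} bytes")

# Logical order differs from lexical order, so declare it explicitly.
order = pd.CategoricalDtype(["one", "two", "three"], ordered=True)
s_ord = s_obj.astype(order)

lexical = sorted(s_obj.unique())              # alphabetical order
logical = list(s_ord.sort_values().unique())  # follows `order`
print(lexical)
print(logical)
print(s_ord.min(), s_ord.max())
```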
Author: user4640449

Updated on June 03, 2022

Comments

  • user4640449
    user4640449 almost 2 years

    I have a CSV dataset with 40 features that I am handling with Pandas. 7 features are continuous (int32) and the rest of them are categorical.

    My question is:

    Should I use Pandas' dtype('category') for the categorical features, or can I keep the default dtype('object')?

  • user4640449
    user4640449 almost 9 years
    Thanks for your answers! So the categorical type is better for memory optimization.
  • Jeff
    Jeff almost 9 years
    The other reason to use Categoricals is that they can provide an ordering for your categories (it's not the default). E.g. maybe ['small', 'medium', 'large']. Then you can sort by this! See the docs here