When to use Category rather than Object?
Solution 1
Use a category when there is lots of repetition that you expect to exploit.
For example, suppose I want the aggregate size per exchange for a large table of trades. Using the default object is totally reasonable:
In [6]: %timeit trades.groupby('exch')['size'].sum()
1000 loops, best of 3: 1.25 ms per loop
But since the list of possible exchanges is pretty small, and because there is lots of repetition, I could make this faster by using a category:
In [7]: trades['exch'] = trades['exch'].astype('category')
In [8]: %timeit trades.groupby('exch')['size'].sum()
1000 loops, best of 3: 702 µs per loop
Note that categories are really a form of dynamic enumeration. They are most useful if the range of possible values is fixed and finite.
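To make the comparison above reproducible, here is a minimal sketch. The `trades` table from the answer is not shown, so the exchange codes, row count and random data below are assumptions made up for illustration; actual timings will depend on your data and pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `trades` table: 100,000 rows, 4 exchange codes.
rng = np.random.default_rng(0)
trades = pd.DataFrame({
    "exch": rng.choice(["NYSE", "NASDAQ", "LSE", "TSE"], size=100_000),
    "size": rng.integers(1, 1_000, size=100_000),
})

# Aggregate while the column still holds plain strings.
by_string = trades.groupby("exch")["size"].sum()

# Convert the repetitive string column to a category and aggregate again.
trades["exch"] = trades["exch"].astype("category")
by_category = trades.groupby("exch", observed=True)["size"].sum()

# Each row now stores a small integer code that points into the short
# list of distinct categories.
print(trades["exch"].cat.categories.tolist())
print(trades["exch"].cat.codes.dtype)

# Both approaches give the same totals; only storage and speed differ.
assert list(by_string.index) == list(by_category.index)
assert (by_string.to_numpy() == by_category.to_numpy()).all()
```

Wrapping each `groupby` line in `%timeit` in IPython, as the answer does, should show the difference on your own machine.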
Solution 2
The Pandas documentation has a concise section on when to use the categorical data type:
The categorical data type is useful in the following cases:
- A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory, see here.
- The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order, see here.
- As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
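The memory claim in the first bullet can be checked directly. This is a rough sketch on made-up data (a hypothetical column of 300,000 strings with three distinct values); the exact byte counts vary by platform and pandas version, so none are quoted here.

```python
import pandas as pd

# Hypothetical column: 300,000 strings drawn from only three distinct values.
as_object = pd.Series(["one", "two", "three"] * 100_000, dtype=object)
as_category = as_object.astype("category")

# deep=True counts the Python string objects themselves, not just pointers.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The saving comes from storing each distinct string once and keeping only a small integer code per row, so it shrinks as the number of distinct values approaches the number of rows.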
user4640449
Updated on June 03, 2022
Comments
-
user4640449 almost 2 years
I have a CSV dataset with 40 features that I am handling with Pandas. 7 features are continuous (int32) and the rest of them are categorical. My question is: should I use the dtype('category') of Pandas for the categorical features, or can I keep the default dtype('object')?
-
user4640449 almost 9 years
Thanks for your answers! So the Categorical type is better for memory optimization.
-
Jeff almost 9 years
The other reason to use Categoricals is that they can provide an ordering to your categories (it is not the default), e.g. ['small','medium','large']. Then you can sort by this! See the docs.
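The ordering mentioned in this comment can be sketched as follows. The DataFrame and its column names are hypothetical; only the ['small','medium','large'] order comes from the comment itself.

```python
import pandas as pd

# Hypothetical data; the column names are illustrative only.
df = pd.DataFrame({
    "item": ["a", "b", "c", "d"],
    "size": ["medium", "large", "small", "medium"],
})

# As plain strings, sorting is lexical: large < medium < small.
lexical = df.sort_values("size")["size"].tolist()

# Declare the logical order explicitly with an ordered categorical dtype.
size_dtype = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
df["size"] = df["size"].astype(size_dtype)

# Sorting, min/max and comparisons now follow the declared order.
logical = df.sort_values("size")["size"].tolist()
smallest, largest = df["size"].min(), df["size"].max()
at_least_medium = df[df["size"] >= "medium"]["item"].tolist()

print(lexical)
print(logical)
print(smallest, largest, at_least_medium)
```

Without `ordered=True`, the column is still a category but `min`, `max` and `>=` comparisons raise an error, which is why the order has to be requested explicitly.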