Pandas groupby with categories with redundant nan


Solution 1

Since Pandas 0.23.0, the groupby method can take a parameter observed which fixes this issue if it is set to True (it is False by default). Below is exactly the code from the question, with just observed=True added:

import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']

df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=(group_cols+['Value']))

for col in group_cols:
    df[col] = df[col].astype('category')

df.groupby(group_cols, as_index=False, observed=True).sum()

[screenshot of the output: only the four observed group combinations, with their summed values]

Solution 2

I was able to get a solution that should work really well. I'll edit my post with a better explanation, but in the meantime, does this work well for you?

import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']

df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', 'D', 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=(group_cols+['Value']))
for col in group_cols:
    df[col] = df[col].astype('category')


# group by the integer codes that back each categorical column
result = df.groupby([df[col].values.codes for col in group_cols]).sum()
result = result.reset_index()

# grouping by raw arrays yields columns named level_0, level_1, ...; restore the real names
level_to_column_name = {f"level_{i}": col for i, col in enumerate(group_cols)}
result = result.rename(columns=level_to_column_name)

# turn the integer codes back into proper categorical columns
for col in group_cols:
    result[col] = pd.Categorical.from_codes(result[col].values, categories=df[col].values.categories)
result

So the answer to this felt more like a proper programming problem than a normal pandas question. Under the hood, every categorical series is just a bunch of integers that index into a list of category names. I did the groupby on these underlying integers because they don't suffer from the same problem as categorical columns. After doing this I had to rename the columns. I then used the from_codes constructor to efficiently turn the list of integers back into a categorical column.

Group1  Group2  Group3  Value
A       B       C       54.34
A       B       D       826.74
B       A       A       765.40
B       A       C       514.50
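
To make the "under the hood" point concrete, here is a minimal sketch (using a throwaway Series, not the df above) of how a categorical decomposes into integer codes plus a categories lookup, and how from_codes reverses that:

import pandas as pd

s = pd.Series(['A', 'B', 'A'], dtype='category')

# a categorical is just integer codes plus a lookup table of categories
print(s.cat.codes.tolist())        # [0, 1, 0]
print(s.cat.categories.tolist())   # ['A', 'B']

# from_codes rebuilds the categorical from those integers
rebuilt = pd.Categorical.from_codes([0, 1, 0], categories=['A', 'B'])
print(list(rebuilt))               # ['A', 'B', 'A']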

I understand that this isn't exactly the answer you were after, but I've turned my solution into a little function for people who run into this problem in the future.

def categorical_groupby(df, group_cols, agg_function="sum"):
    """Do a groupby on a number of categorical columns without expanding to unused category combinations."""
    # group by the underlying integer codes instead of the categorical columns themselves
    result = df.groupby([df[col].values.codes for col in group_cols]).agg(agg_function)
    result = result.reset_index()
    # restore the original column names (groupby on raw arrays yields level_0, level_1, ...)
    level_to_column_name = {f"level_{i}": col for i, col in enumerate(group_cols)}
    result = result.rename(columns=level_to_column_name)
    # map the integer codes back to categorical columns
    for col in group_cols:
        result[col] = pd.Categorical.from_codes(result[col].values, categories=df[col].values.categories)
    return result

Call it like this:

df.pipe(categorical_groupby, group_cols)

Solution 3

I found the behavior similar to what's documented in the operations section of Categorical Data.

In particular, similar to

In [121]: cats2 = pd.Categorical(["a","a","b","b"], categories=["a","b","c"])

In [122]: df2 = pd.DataFrame({"cats":cats2,"B":["c","d","c","d"], "values":[1,2,3,4]})

In [123]: df2.groupby(["cats","B"]).mean()
Out[123]: 
        values
cats B        
a    c     1.0
     d     2.0
b    c     3.0
     d     4.0
c    c     NaN
     d     NaN

The same documentation section describes the related behavior for Series and for groupby, and there is also a pivot table example at the end of the section.

Apart from Series.min(), Series.max() and Series.mode(), the following operations are possible with categorical data:

Series methods like Series.value_counts() will use all categories, even if some categories are not present in the data:

Groupby will also show “unused” categories:

The wording and the example above are quoted from the Categorical Data documentation.
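
As a quick illustration of the value_counts() point quoted above, here is a minimal sketch with a throwaway Series:

import pandas as pd

s = pd.Series(['a', 'a', 'b'], dtype=pd.CategoricalDtype(['a', 'b', 'c']))

# 'c' never appears in the data, but it still gets a (zero) count
print(s.value_counts())
# a    2
# b    1
# c    0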

Solution 4

There are a lot of questions to be answered here.
Let's start by understanding what a 'category' is...

Definition of Categorical dtype

Quoting from pandas docs for "Categorical Data":

Categoricals are a pandas data type, which correspond to categorical variables in statistics: a variable, which can take on only a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood types, country affiliations, observation time or ratings via Likert scales.

There are two points I want to focus on here:

  1. The definition of categoricals as statistical variables:
    basically, this means we have to look at them from a statistical point of view, not the "regular" programming one, i.e. they are not 'enums'. Statistical categorical variables have specific operations and use cases; you can read more about them on Wikipedia.
    I'll talk more about this after the second point.

  2. Categories are levels in R:
    We can understand more about categoricals if we read about R levels and factors.
    I don't know much about R, but I found this source simple and sufficient. Quoting an interesting example from it:

    When a factor is first created, all of its levels are stored along with the factor, and if subsets of the factor are extracted, they will retain all of the original levels. This can create problems when constructing model matrices and may or may not be useful when displaying the data using, say, the table function. As an example, consider a random sample from the letters vector, which is part of the base R distribution.
    
    > lets = sample(letters,size=100,replace=TRUE)
    > lets = factor(lets)
    > table(lets[1:5])
    
    a b c d e f g h i j k l m n o p q r s t u v w x y z
    1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1
    
    Even though only five of the levels were actually represented, the table function shows the frequencies for all of the levels of the original factors. To change this, we can simply use another call to factor
    
    > table(factor(lets[1:5]))
    
    a k q s z
    1 1 1 1 1
    

Basically this tells us that displaying/using all the categories, even when they are not needed, is not that uncommon. Actually, it's the default behavior!
This is due to the usual use cases of categorical variables in statistics: in almost all cases you do care about all the categories, even if they are unused. Take, for example, the pandas function cut (see the sketch below).
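
With cut you usually want every bin reported, even when a bin happens to be empty. A minimal sketch with made-up numbers:

import pandas as pd

ages = pd.Series([5, 7, 41, 45])

# three fixed bins; nobody in this sample falls into the 18-35 bin
groups = pd.cut(ages, bins=[0, 18, 35, 65], labels=['0-18', '18-35', '35-65'])

# the empty bin is still a category and is still reported, with a count of 0
print(groups.value_counts())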

I hope by this point that you understood why this behavior exists in pandas.

GroupBy on Categorical Variables

As for why groupby considers all the combinations of categories: I can't say for sure, but my best guess, based on a quick review of the source code (and the GitHub issue you mentioned), is that a groupby on categorical variables is treated as an interaction between them. Hence, it should consider all the pairs/tuples (like a Cartesian product). AFAIK, this helps a lot when you are trying to do something like ANOVA.
This also means that, in this context, you can't think of it in the usual SQL-like terminology.
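
To see the size of that Cartesian product versus what actually occurs, here is a rough sketch that reuses the df and group_cols from the question (just counting, not pandas internals):

import itertools

# all possible category combinations: 2 x 2 x 3 = 12
all_combos = list(itertools.product(*(df[col].cat.categories for col in group_cols)))
print(len(all_combos))                        # 12

# combinations that actually occur in the data
print(len(df[group_cols].drop_duplicates()))  # 4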

Solutions?

Ok, but what if you don't want this behavior?
To the best of my knowledge, and taking into account that I spent the last night tracing this in the pandas source code, you can't "disable" it. It's hard-coded in every critical step.
However, because of the way groupby works, the actual "expanding" doesn't happen until it's needed, for example when calling sum over the groups or trying to print them.
Hence, you can do any of the following to get only the needed groups:

df.groupby(group_cols).indices
#{('A', 'B', 'C'): array([0]),
# ('A', 'B', 'D'): array([1, 4]),
# ('B', 'A', 'A'): array([3]),
# ('B', 'A', 'C'): array([2])}

df.groupby(group_cols).groups
#{('A', 'B', 'C'): Int64Index([0], dtype='int64'),
# ('A', 'B', 'D'): Int64Index([1, 4], dtype='int64'),
# ('B', 'A', 'A'): Int64Index([3], dtype='int64'),
# ('B', 'A', 'C'): Int64Index([2], dtype='int64')}

# an example
gb = df.groupby(group_cols)
for g in gb.groups:
    print(g, gb.get_group(g)['Value'].sum())
#('A', 'B', 'C') 54.34
#('A', 'B', 'D') 826.74
#('B', 'A', 'A') 765.4
#('B', 'A', 'C') 514.5

I know this is a no-go for you, but I'm 99% sure that there is no direct way to do this.
I agree that there should be a boolean variable to disable this behavior and use the "regular" SQL-like one.

Solution 5

I found this post while debugging something similar. Very good post, and I really like the inclusion of boundary conditions!

Here's the code that accomplishes the initial goal:

r = df.groupby(group_cols, as_index=False).agg({'Value': 'sum'})

# flatten the column index in case agg produced a hierarchical one
# (it does when you aggregate with multiple statistics per column)
if isinstance(r.columns, pd.MultiIndex):
    r.columns = ['_'.join(col).strip('_') for col in r.columns]

[screenshot of the aggregated output]

The downside of this approach is that it can result in a hierarchical column index that you may want to flatten (especially if you aggregate with multiple statistics). I included flattening of the column index in the code above; a more concrete sketch follows below.
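
To make the flattening concrete, here is a sketch (reusing the question's df and group_cols) with more than one statistic, where .agg does produce a two-level column index:

import pandas as pd

r = df.groupby(group_cols, as_index=False).agg({'Value': ['sum', 'mean']})
print(r.columns.tolist())
# [('Group1', ''), ('Group2', ''), ('Group3', ''), ('Value', 'sum'), ('Value', 'mean')]

# join the two levels; strip('_') cleans up the grouping columns, whose second level is empty
r.columns = ['_'.join(col).strip('_') for col in r.columns]
print(r.columns.tolist())
# ['Group1', 'Group2', 'Group3', 'Value_sum', 'Value_mean']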

I don't know why instance methods:

df.groupby(group_cols).sum()
df.groupby(group_cols).mean()
df.groupby(group_cols).std()

use all unique combinations of categorical variables, while the .agg() method:

df.groupby(group_cols).agg(['count', 'sum', 'mean', 'std']) 

ignores the unused level combinations of the groups. That seems inconsistent. I'm just happy that we can use the .agg() method and not have to worry about a Cartesian combination explosion.

Also, I think it is very common to have far fewer unique combinations than the full Cartesian product. Think of all the cases where data has columns like "State", "County", "Zip"... these are all nested variables, and many data sets out there have variables with a high degree of nesting.

In our case, the difference between the Cartesian product of the grouping variables and the naturally occurring combinations is over 1000x (and the starting data set has over 1,000,000 rows).

Consequently, I would have voted for making observed=True the default behavior.

Author: jpp

Updated on June 06, 2022

Comments

  • jpp (almost 2 years)

    I am having issues using pandas groupby with categorical data. Theoretically, it should be super efficient: you are grouping and indexing via integers rather than strings. But it insists that, when grouping by multiple categories, every combination of categories must be accounted for.

    I sometimes use categories even when there's a low density of common strings, simply because those strings are long and it saves memory / improves performance. Sometimes there are thousands of categories in each column. When grouping by 3 columns, pandas forces us to hold results for 1000^3 groups.

    My question: is there a convenient way to use groupby with categories while avoiding this untoward behaviour? I'm not looking for any of these solutions:

    • Recreating all the functionality via numpy.
    • Continually converting to strings/codes before groupby, reverting to categories later.
    • Making a tuple column from group columns, then group by the tuple column.

    I'm hoping there's a way to modify just this particular pandas idiosyncrasy. A simple example is below. Instead of 4 categories I want in the output, I end up with 12.

    import pandas as pd
    
    group_cols = ['Group1', 'Group2', 'Group3']
    
    df = pd.DataFrame([['A', 'B', 'C', 54.34],
                       ['A', 'B', 'D', 61.34],
                       ['B', 'A', 'C', 514.5],
                       ['B', 'A', 'A', 765.4],
                       ['A', 'B', 'D', 765.4]],
                      columns=(group_cols+['Value']))
    
    for col in group_cols:
        df[col] = df[col].astype('category')
    
    df.groupby(group_cols, as_index=False).sum()
    
    #   Group1  Group2  Group3  Value
    #   A   A   A   NaN
    #   A   A   C   NaN
    #   A   A   D   NaN
    #   A   B   A   NaN
    #   A   B   C   54.34
    #   A   B   D   826.74
    #   B   A   A   765.40
    #   B   A   C   514.50
    #   B   A   D   NaN
    #   B   B   A   NaN
    #   B   B   C   NaN
    #   B   B   D   NaN
    

    Bounty update

    The issue is poorly addressed by the pandas development team (cf. github.com/pandas-dev/pandas/issues/17594). Therefore, I am looking for responses that address any of the following:

    1. Why, with reference to pandas source code, is categorical data treated differently in groupby operations?
    2. Why would the current implementation be preferred? I appreciate this is subjective, but I am struggling to find any answer to this question. Current behaviour is prohibitive in many situations without cumbersome, potentially expensive, workarounds.
    3. Is there a clean solution to override pandas treatment of categorical data in groupby operations? Note the 3 no-go routes (dropping down to numpy; conversions to/from codes; creating and grouping by tuple columns). I would prefer a solution that is "pandas-compliant" to minimise / avoid loss of other pandas categorical functionality.
    4. A response from pandas development team to support and clarify existing treatment. Also, why should considering all category combinations not be configurable as a Boolean parameter?

    Bounty update #2

    To be clear, I'm not expecting answers to all of the above 4 questions. The main question I am asking is whether it's possible, or advisable, to overwrite pandas library methods so that categories are treated in a way that facilitates groupby / set_index operations.

  • jpp (over 6 years)
    It works, but I'm well aware of this option. Convert to codes (better than casting as strings) every time you want to groupby, then convert codes back to categories. I've reworded the second of the 3 "routes I don't want to go down" to make it clearer.
  • jpp (over 6 years)
    I was thinking more along the lines of: "can we easily rewire the pandas group indexer for categoricals" so that all these mappings are not necessary.
  • Gabriel A (over 6 years)
    Ah. Sorry for not noticing. Well, if you have to do this sort of calculation often, you could just make a function with all these steps. It's not perfect but I certainly have a few functions like this. Workarounds that are fast but make the code confusing.
  • jpp (over 6 years)
    Much appreciated, upvoted your answer so others can see how to access category code mappings. For large dataframes and large numbers of groupby operations, the to and fro mappings can become expensive.
  • Gabriel A (over 6 years)
    I'm curious, which step is expensive?
  • jpp (over 6 years)
    Getting from categories to codes is cheap. From codes to categories is relatively expensive.
  • Gabriel A (about 6 years)
    @jp_data_analysis So I just ran a test. The only expensive step is the actual groupby. I agree with you that it is a stupid work around.
  • jpp (about 6 years)
    Yep, I agree that, in relative terms, the groupby is more expensive. But it's a pain... as @Wen pointed out, you can't even do df.set_index(group_cols).sum(level=[0,1,2])
  • Gabriel A (about 6 years)
    It has to do with the way that a MultiIndex is set up. Internally, it's basically just a group of categorical columns.
  • jpp (about 6 years)
    agreed, but this only happens with "categories multi index" not "string/object type multi index", right?
  • jpp (about 6 years)
    I appreciate this is the current treatment. Yes, it's well documented. There have been one or two people who have asked for more flexible treatment, e.g. see here, but these arguments seem to be summarily brushed aside. Seems like nobody is proposing to overwrite pandas methods such as set_index or groupby to fix this "properly" - probably for good reason.
  • Tai (about 6 years)
    @jp_data_analysis I think perhaps you can try opening an issue in the pandas repo. It might be better to get an explanation from them, or just propose the feature once again. If enough people ask for the feature, they might consider it.
  • ColinMac (almost 4 years)
    Wonderful... My one categorical field in a 9-column groupby was inflating my groupby to 21 TB... yikes. This solved the problem entirely.
  • SteveS (over 2 years)
    Great, thanks, I ran into the same issue and this parameter solved it!