How to get number of groups in a groupby object in pandas?
Solution 1
As documented, you can get the number of groups with len(dfgroup).
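As a minimal sketch of this approach (the sample frame and the column names A and B are made up for illustration; dfgroup is the name used in the question):

```python
import pandas as pd

# Hypothetical sample data, not taken from the question.
df = pd.DataFrame({"A": list("aabbcccd"), "B": range(8)})
dfgroup = df.groupby("A")

# len() on a GroupBy object returns the number of groups,
# i.e. the number of distinct keys in column "A".
n_groups = len(dfgroup)
print(n_groups)  # 4
```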
Solution 2
[pandas >= 0.23] Simple, Fast, and Pandaic: ngroups
Newer versions of the groupby API provide this (undocumented) attribute which stores the number of groups in a GroupBy object.
import pandas as pd

# setup
df = pd.DataFrame({'A': list('aabbcccd')})
dfg = df.groupby('A')
# call `.ngroups` on the GroupBy object
dfg.ngroups
# 4
Note that this is different from GroupBy.groups, which returns the actual groups themselves.
Why should I prefer this over len?
As noted in BrenBarn's answer, you could use len(dfg) to get the number of groups. But you shouldn't. Looking at the implementation of GroupBy.__len__ (which is what len() calls internally), we see that __len__ makes a call to GroupBy.groups, which returns a dictionary of grouped indices:
dfg.groups
{'a': Int64Index([0, 1], dtype='int64'),
'b': Int64Index([2, 3], dtype='int64'),
'c': Int64Index([4, 5, 6], dtype='int64'),
'd': Int64Index([7], dtype='int64')}
Depending on the number of groups in your operation, generating the dictionary only to find its length is a wasteful step. ngroups, on the other hand, is a stored property that can be accessed in constant time.
This has been documented in GroupBy object attributes. The issue with len, however, is that for a GroupBy object with a lot of groups, it can take a lot longer.
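To check the difference on your own installation, a rough timing sketch along these lines may help. The data, the column name key, and the timed helper are all hypothetical, and the size of the gap (if any) depends on the pandas version and on how many groups there are, so treat the printed timings as indicative only:

```python
import time

import numpy as np
import pandas as pd

# Hypothetical data: 200,000 rows spread over up to 20,000 distinct keys.
rng = np.random.default_rng(0)
df = pd.DataFrame({"key": rng.integers(0, 20_000, size=200_000)})


def timed(func):
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


# Build a fresh GroupBy object for each measurement so that neither call
# benefits from state cached by the other.
n_attr, t_attr = timed(lambda: df.groupby("key").ngroups)
n_len, t_len = timed(lambda: len(df.groupby("key")))

print(n_attr == n_len)  # both report the same number of groups
print(f"ngroups: {t_attr:.4f}s, len: {t_len:.4f}s")
```

Both expressions return the same count; only the work done to obtain it may differ.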
But what if I actually want the size of each group?
You're in luck. We have a function for that: GroupBy.size. But please note that size counts NaNs as well. If you don't want NaNs counted, use GroupBy.count instead.
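The difference between the two is easiest to see on a small frame with missing values. This is an illustrative sketch; the frame and its columns A and B are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical data: column B contains NaNs in both groups.
df = pd.DataFrame({
    "A": ["a", "a", "b", "b", "b"],
    "B": [1.0, np.nan, 3.0, np.nan, np.nan],
})
dfg = df.groupby("A")

sizes = dfg.size()          # rows per group, NaNs in B included
counts = dfg["B"].count()   # non-NaN values of B per group

print(sizes.to_dict())   # {'a': 2, 'b': 3}
print(counts.to_dict())  # {'a': 1, 'b': 1}
```

Here size reports every row in each group, while count on column B only reports the rows where B is not NaN.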
wolfsatthedoor
Updated on July 05, 2022
Comments
- wolfsatthedoor, almost 2 years: This would be useful so I know how many unique groups I have to perform calculations on. Thank you. Suppose the groupby object is called dfgroup.
- cs95, almost 5 years: @U9-Forward Thanks! It isn't a popular question (relatively speaking) but I assume the upvotes here mean the answer is useful. I still feel like I can make improvements so I'll look into that in a bit.
- U12-Forward, almost 5 years: You deserve a little more I guess, ngroups is clever :-)
- BTR, over 4 years: Note that len(g) can be VERY slow the first time it is called if there are a large number of groups!! IPython caches the result thereafter, but g.ngroups is always fast since it is stored as an attribute.
- Shuchita Banthia, over 4 years: As noted below, using len(dfgroup) can be very slow, especially for a large number of groups. dfgroup.ngroups is the fastest way to get this, as this is a stored value!