sql select group by a having count(1) > 1 equivalent in python pandas?

python sql pandas dataframe

11,472

Solution 1

Instead of writing email_cnt[email_cnt.size > 1], just write email_cnt[email_cnt > 1] (there's no need to call.size again). This uses the Boolean series email_cnt > 1 to return only the relevant values of email_cnt.

For example:

>>> customers = pd.DataFrame({'Email':['foo','bar','foo','foo','baz','bar'],
                              'CustomerID':[1,2,1,2,1,1]})
>>> email_cnt = customers.groupby('Email')['CustomerID'].size()
>>> email_cnt[email_cnt > 1]
Email
bar      2
foo      3
dtype: int64

Solution 2

Two other solutions (with modern "method chain" approach):

Using selection by callable:

customers.groupby('Email').size().loc[lambda x: x>1].sort_values()

Using the query method:

(customers.groupby('Email')['CustomerID'].
    agg([len]).query('len > 1').sort_values('len'))

11,472

Author by

Admin

Updated on June 05, 2022

Comments

Admin almost 2 years
I'm having a hard time filtering the groupby items in pandas. I want to do
```
select email, count(1) as cnt 
from customers 
group by email 
having count(email) > 1 
order by cnt desc
```
I did
```
customers.groupby('Email')['CustomerID'].size()
```
and it gives me the list of emails and their respective counts correctly but I am not able to achieve the having count(email) > 1 part.
```
email_cnt[email_cnt.size > 1]
```
returns 1
```
email_cnt = customers.groupby('Email')
email_dup = email_cnt.filter(lambda x:len(x) > 2)
```
gives the whole record of customers with email > 1 but I want the aggregate table.