sql select group by a having count(1) > 1 equivalent in python pandas?

11,472

Solution 1

Instead of writing email_cnt[email_cnt.size > 1], just write email_cnt[email_cnt > 1] (there's no need to call.size again). This uses the Boolean series email_cnt > 1 to return only the relevant values of email_cnt.

For example:

>>> customers = pd.DataFrame({'Email':['foo','bar','foo','foo','baz','bar'],
                              'CustomerID':[1,2,1,2,1,1]})
>>> email_cnt = customers.groupby('Email')['CustomerID'].size()
>>> email_cnt[email_cnt > 1]
Email
bar      2
foo      3
dtype: int64

Solution 2

Two other solutions (with modern "method chain" approach):

Using selection by callable:

customers.groupby('Email').size().loc[lambda x: x>1].sort_values()

Using the query method:

(customers.groupby('Email')['CustomerID'].
    agg([len]).query('len > 1').sort_values('len'))
Share:
11,472
Admin
Author by

Admin

Updated on June 05, 2022

Comments

  • Admin
    Admin almost 2 years

    I'm having a hard time filtering the groupby items in pandas. I want to do

    select email, count(1) as cnt 
    from customers 
    group by email 
    having count(email) > 1 
    order by cnt desc
    

    I did

    customers.groupby('Email')['CustomerID'].size()
    

    and it gives me the list of emails and their respective counts correctly but I am not able to achieve the having count(email) > 1 part.

    email_cnt[email_cnt.size > 1]
    

    returns 1

    email_cnt = customers.groupby('Email')
    email_dup = email_cnt.filter(lambda x:len(x) > 2)
    

    gives the whole record of customers with email > 1 but I want the aggregate table.