Pandas 'count(distinct)' equivalent

566,181

Solution 1

I believe this is what you want:

table.groupby('YEARMONTH').CLIENTCODE.nunique()

Example:

In [2]: table
Out[2]: 
   CLIENTCODE  YEARMONTH
0           1     201301
1           1     201301
2           2     201301
3           1     201302
4           2     201302
5           2     201302
6           3     201302

In [3]: table.groupby('YEARMONTH').CLIENTCODE.nunique()
Out[3]: 
YEARMONTH
201301       2
201302       3

Solution 2

Here is another method and it is much simpler. Let’s say your dataframe name is daat and the column name is YEARMONTH:

daat.YEARMONTH.value_counts()

Solution 3

Interestingly enough, very often len(unique()) is a few times (3x-15x) faster than nunique().

Solution 4

I am also using nunique but it will be very helpful if you have to use an aggregate function like 'min', 'max', 'count' or 'mean' etc.

df.groupby('YEARMONTH')['CLIENTCODE'].transform('nunique') #count(distinct)
df.groupby('YEARMONTH')['CLIENTCODE'].transform('min')     #min
df.groupby('YEARMONTH')['CLIENTCODE'].transform('max')     #max
df.groupby('YEARMONTH')['CLIENTCODE'].transform('mean')    #average
df.groupby('YEARMONTH')['CLIENTCODE'].transform('count')   #count

Solution 5

Distinct of column along with aggregations on other columns

To get the distinct number of values for any column (CLIENTCODE in your case), we can use nunique. We can pass the input as a dictionary in agg function, along with aggregations on other columns:

grp_df = df.groupby('YEARMONTH').agg({'CLIENTCODE': ['nunique'],
                                      'other_col_1': ['sum', 'count']})

# to flatten the multi-level columns
grp_df.columns = ["_".join(col).strip() for col in grp_df.columns.values]

# if you wish to reset the index
grp_df.reset_index(inplace=True)
Share:
566,181
Adriano Almeida
Author by

Adriano Almeida

Pythonist, Machine Learnist and Business Intelligencist

Updated on July 25, 2021

Comments

  • Adriano Almeida
    Adriano Almeida almost 3 years

    I am using Pandas as a database substitute as I have multiple databases (Oracle, SQL Server, etc.), and I am unable to make a sequence of commands to a SQL equivalent.

    I have a table loaded in a DataFrame with some columns:

    YEARMONTH, CLIENTCODE, SIZE, etc., etc.
    

    In SQL, to count the amount of different clients per year would be:

    SELECT count(distinct CLIENTCODE) FROM table GROUP BY YEARMONTH;
    

    And the result would be

    201301    5000
    201302    13245
    

    How can I do that in Pandas?