Pandas 'count(distinct)' equivalent
Solution 1
I believe this is what you want:
table.groupby('YEARMONTH').CLIENTCODE.nunique()
Example:
In [2]: table
Out[2]:
CLIENTCODE YEARMONTH
0 1 201301
1 1 201301
2 2 201301
3 1 201302
4 2 201302
5 2 201302
6 3 201302
In [3]: table.groupby('YEARMONTH').CLIENTCODE.nunique()
Out[3]:
YEARMONTH
201301 2
201302 3
Solution 2
Here is another method and it is much simpler. Let’s say your dataframe name is daat
and the column name is YEARMONTH
:
daat.YEARMONTH.value_counts()
Solution 3
Interestingly enough, very often len(unique())
is a few times (3x-15x) faster than nunique()
.
Solution 4
I am also using nunique
but it will be very helpful if you have to use an aggregate function like 'min', 'max', 'count' or 'mean'
etc.
df.groupby('YEARMONTH')['CLIENTCODE'].transform('nunique') #count(distinct)
df.groupby('YEARMONTH')['CLIENTCODE'].transform('min') #min
df.groupby('YEARMONTH')['CLIENTCODE'].transform('max') #max
df.groupby('YEARMONTH')['CLIENTCODE'].transform('mean') #average
df.groupby('YEARMONTH')['CLIENTCODE'].transform('count') #count
Solution 5
Distinct of column along with aggregations on other columns
To get the distinct number of values for any column (CLIENTCODE
in your case), we can use nunique
. We can pass the input as a dictionary in agg
function, along with aggregations on other columns:
grp_df = df.groupby('YEARMONTH').agg({'CLIENTCODE': ['nunique'],
'other_col_1': ['sum', 'count']})
# to flatten the multi-level columns
grp_df.columns = ["_".join(col).strip() for col in grp_df.columns.values]
# if you wish to reset the index
grp_df.reset_index(inplace=True)
Adriano Almeida
Pythonist, Machine Learnist and Business Intelligencist
Updated on July 25, 2021Comments
-
Adriano Almeida almost 3 years
I am using Pandas as a database substitute as I have multiple databases (Oracle, SQL Server, etc.), and I am unable to make a sequence of commands to a SQL equivalent.
I have a table loaded in a DataFrame with some columns:
YEARMONTH, CLIENTCODE, SIZE, etc., etc.
In SQL, to count the amount of different clients per year would be:
SELECT count(distinct CLIENTCODE) FROM table GROUP BY YEARMONTH;
And the result would be
201301 5000 201302 13245
How can I do that in Pandas?