How to count a boolean in grouped Spark data frame
Probably the simplest solution is a plain CAST (C style, where TRUE -> 1 and FALSE -> 0) combined with SUM:
```python
from pyspark.sql import functions as F

(data
    .groupby("Region")
    .agg(F.avg("Salary"),
         F.sum(F.col("IsUnemployed").cast("long"))))
```
A slightly more universal and idiomatic solution is CASE WHEN with COUNT:
```python
(data
    .groupby("Region")
    .agg(F.avg("Salary"),
         F.count(F.when(F.col("IsUnemployed"), F.col("IsUnemployed")))))
```
but here it is clearly overkill.
Author: MYjx
Updated on June 07, 2022

Comments
MYjx:

I want to count how many records are true in a column of a grouped Spark dataframe, but I don't know how to do that in Python. For example, I have data with Region, Salary and IsUnemployed columns, with IsUnemployed being a Boolean. I want to see how many unemployed people there are in each region. I know we can do a filter and then a groupby, but I want to generate the two aggregations at the same time, as below:

```python
from pyspark.sql import functions as F

data.groupby("Region").agg(F.avg("Salary"), F.count("IsUnemployed"))
```