How to count boolean values in a grouped Spark DataFrame


Probably the simplest solution is a plain CAST (C-style, where TRUE -> 1 and FALSE -> 0) combined with SUM:

from pyspark.sql import functions as F

(data
    .groupby("Region")
    # cast TRUE/FALSE to 1/0, then sum to count the TRUE rows
    .agg(F.avg("Salary"), F.sum(F.col("IsUnemployed").cast("long"))))
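
For a quick end-to-end check, here is a minimal self-contained sketch; the SparkSession setup and the sample rows are made up for illustration:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical rows mirroring the question's schema
data = spark.createDataFrame(
    [("East", 50000.0, False), ("East", 0.0, True), ("West", 60000.0, False)],
    ["Region", "Salary", "IsUnemployed"])

(data
    .groupby("Region")
    .agg(
        F.avg("Salary"),
        F.sum(F.col("IsUnemployed").cast("long")).alias("Unemployed"))
    .show())
# East: avg(Salary)=25000.0, Unemployed=1
# West: avg(Salary)=60000.0, Unemployed=0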

A slightly more universal and idiomatic solution is CASE WHEN with COUNT:

(data
    .groupby("Region")
    .agg(
        F.avg("Salary"),
        # when() without otherwise() yields NULL for FALSE rows,
        # and count() skips NULLs, so only TRUE rows are counted
        F.count(F.when(F.col("IsUnemployed"), F.col("IsUnemployed")))))

but here it is clearly overkill.
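
If you prefer an explicit 1/0 encoding over the implicit NULL handling, the same CASE WHEN idea can be combined with SUM instead; a minimal sketch of that common variant:

from pyspark.sql import functions as F

(data
    .groupby("Region")
    .agg(
        F.avg("Salary"),
        # map TRUE -> 1, FALSE -> 0 explicitly, then sum;
        # equivalent to the COUNT version above
        F.sum(F.when(F.col("IsUnemployed"), 1).otherwise(0)).alias("Unemployed")))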

Author by MYjx

Updated on June 07, 2022

Comments

  • MYjx almost 2 years

    I want to count how many records are true in a column of a grouped Spark DataFrame, but I don't know how to do that in Python. For example, I have data with Region, Salary and IsUnemployed columns, where IsUnemployed is a Boolean. I want to see how many unemployed people there are in each region. I know we can do a filter and then a groupby (a sketch of that approach follows the code), but I want to generate two aggregations at the same time, as below:

    from pyspark.sql import functions as F
    # F.count counts non-NULL values, not TRUE values, so this doesn't work
    data.groupby("Region").agg(F.avg("Salary"), F.count("IsUnemployed"))
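
    For reference, a minimal sketch of the filter-then-groupby approach mentioned above; it only yields the unemployment count, since filtering first means the salary average would no longer cover all rows:

    from pyspark.sql import functions as F

    # keep only unemployed rows, then count them per region
    (data
        .filter(F.col("IsUnemployed"))
        .groupby("Region")
        .count())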