PySpark: Take average of a column after using filter function


Solution 1

In the dict passed to agg, the column name should be the key and the aggregation function the value:

df.filter(df['salary'] > 100000).agg({"age": "avg"})
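
For context, here is a minimal, self-contained sketch of this approach. The SparkSession setup, the sample rows, and the name column are assumptions for illustration; only salary and age come from the question. agg returns a single-row DataFrame, so the scalar can be pulled out with first():

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data; only `salary` and `age` appear in the question.
df = spark.createDataFrame(
    [("Alice", 120000, 34.0), ("Bob", 90000, 28.0), ("Carol", 150000, 45.0)],
    ["name", "salary", "age"],
)

result = df.filter(df['salary'] > 100000).agg({"age": "avg"})
result.show()               # single row, column named avg(age)

avg_age = result.first()[0] # extract the scalar value, here 39.5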

Alternatively, you can use pyspark.sql.functions:

from pyspark.sql.functions import col, avg

df.filter(df['salary'] > 100000).agg(avg(col("age")))
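
Note that avg also accepts a column name directly, so col is optional here; adding an alias keeps the output column readable (a small variation on the answer above, with a hypothetical alias name):

df.filter(df['salary'] > 100000).agg(avg("age").alias("avg_age"))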

It is also possible to use CASE ... WHEN:

from pyspark.sql.functions import avg, when

df.select(avg(when(df['salary'] > 100000, df['age'])))
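
This variant works because when() without an otherwise() yields NULL for non-matching rows, and avg skips NULLs, so no separate filter is needed. A short sketch with an alias (the alias name is hypothetical):

from pyspark.sql.functions import avg, when

# Rows with salary <= 100000 become NULL and are ignored by avg.
df.select(
    avg(when(df['salary'] > 100000, df['age'])).alias('avg_age_high_salary')
)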

Solution 2

You can try this too:

df.filter(df['salary'] > 100000).groupBy().avg('age')
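
Calling groupBy() with no columns treats the entire DataFrame as a single group, so this avoids writing out an explicit grouping key. The same pattern extends to several columns at once, for example:

df.filter(df['salary'] > 100000).groupBy().avg('age', 'salary')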


Comments

  • Harit Vishwakarma, almost 4 years ago

    I am using the following code to get the average age of people whose salary is greater than some threshold.

    dataframe.filter(df['salary'] > 100000).agg({"avg": "age"})
    

    The column age is numeric (float), but I am still getting this error.

    py4j.protocol.Py4JJavaError: An error occurred while calling o86.agg. 
    : scala.MatchError: age (of class java.lang.String)
    

    Do you know any other way to obtain the average without using the groupBy function or SQL queries?