pyspark dataframe, groupby and compute variance of a column
Solution 1
What you can do is convert the dataframe column to an RDD of plain numbers and then use the variance method provided for RDDs. Note that the RDD holds Row objects, so the numeric value has to be extracted first:

df1 = df.groupby('country').agg(func.avg('clicks').alias('avg_clicks'))
rdd = df1.rdd.map(lambda row: row.avg_clicks)  # extract the numeric column from each Row
rdd.variance()  # population variance of the values
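For reference, RDD.variance() computes the population variance, i.e. it divides by n rather than n - 1. A minimal pure-Python sketch of that formula, without Spark, to make the semantics explicit (the sample data is made up for illustration):

```python
from statistics import pvariance

def rdd_style_variance(values):
    """Population variance, as computed by Spark's RDD.variance():
    mean of squared deviations from the mean, dividing by n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

clicks = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(rdd_style_variance(clicks))  # 4.0
print(pvariance(clicks))           # same result from the stdlib
```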
Solution 2
Since the standard deviation is the square root of the variance, a pure PySpark dataframe solution is:

from pyspark.sql.functions import stddev

df = sc.parallelize([(.1, 2.0), (.3, .2)]).toDF()
df.show()
varianceDF = df.select((stddev('_1') * stddev('_1')).alias('variance_1'))
varianceDF.show()

Note that stddev is the sample standard deviation, so squaring it yields the sample variance.
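One caveat worth making explicit: because PySpark's stddev divides by n - 1, Solution 2 yields the sample variance, while RDD.variance() from Solution 1 divides by n and yields the population variance. The stdlib statistics module shows the difference on the same two values used in the example dataframe's _1 column:

```python
from statistics import stdev, variance, pvariance

data = [0.1, 0.3]  # the _1 column from the example dataframe

# Squaring the sample standard deviation reproduces the sample variance ...
assert abs(stdev(data) ** 2 - variance(data)) < 1e-12

# ... which differs from the population variance by the factor (n - 1) / n
n = len(data)
assert abs(pvariance(data) - variance(data) * (n - 1) / n) < 1e-12

print(variance(data), pvariance(data))  # 0.02 vs 0.01
```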
Luca Fiaschi
Updated on June 06, 2022

Comments
Luca Fiaschi, almost 2 years ago:
I would like to groupby a pyspark dataframe and compute the variance of a specific column. For the average this is quite easy and can be done like this
from pyspark.sql import functions as func

AVERAGES = df.groupby('country').agg(func.avg('clicks').alias('avg_clicks')).collect()
However, for the variance there seems to be no aggregation function in the functions sub-module (I also wonder why, since this is quite a common operation).
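For what it's worth, newer PySpark releases do ship variance aggregates in pyspark.sql.functions (variance / var_samp for the sample variance and var_pop for the population variance), so the grouped computation can be written just like the average. When Spark is not at hand, the same grouped result can be sketched in plain Python (the rows below are made-up illustration data):

```python
from collections import defaultdict
from statistics import variance

# Hypothetical (country, clicks) rows standing in for the dataframe
rows = [
    ("DE", 10.0), ("DE", 14.0), ("DE", 12.0),
    ("US", 3.0), ("US", 7.0),
]

# Group the clicks values by country
groups = defaultdict(list)
for country, clicks in rows:
    groups[country].append(clicks)

# Sample variance per country, mirroring a variance('clicks') aggregate
var_by_country = {c: variance(v) for c, v in groups.items()}
print(var_by_country)  # {'DE': 4.0, 'US': 8.0}
```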