pyspark dataframe, groupby and compute variance of a column


Solution 1

What you can do is convert the DataFrame to an RDD and then use the variance method that RDDs provide.

from pyspark.sql import functions as func
df1 = df.groupby('country').agg(func.avg('clicks').alias('avg_clicks'))
# RDD.variance() expects an RDD of numbers, so pull the numeric column out of the Row objects
rdd = df1.rdd.map(lambda row: row['avg_clicks'])
rdd.variance()
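
If the goal from the question is the variance of clicks within each country, the same RDD route can also be written per group. A minimal sketch, not from the original answer, assuming df has 'country' and 'clicks' columns:

# Sketch only: per-country variance of 'clicks' via the RDD API
pairs = df.rdd.map(lambda row: (row['country'], row['clicks']))

def population_variance(values):
    values = list(values)
    mean = sum(values) / len(values)
    # population variance, matching what RDD.variance() computes
    return sum((v - mean) ** 2 for v in values) / len(values)

per_country_variance = pairs.groupByKey().mapValues(population_variance)
per_country_variance.collect()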

Solution 2

Since the standard deviation is the square root of the variance, a pure PySpark DataFrame solution is:

from pyspark.sql.functions import stddev

df = sc.parallelize(((.1, 2.0), (.3, .2))).toDF()
df.show()
# variance = stddev squared
varianceDF = df.select((stddev('_1') * stddev('_1')).alias('variance_1'))
varianceDF.show()
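
The same trick extends to the grouped case asked about in the question. A minimal sketch, assuming df has 'country' and 'clicks' columns; note that stddev in pyspark.sql.functions is the sample standard deviation, so its square is the sample variance:

from pyspark.sql import functions as func

# Square the per-group sample standard deviation to get the sample variance per country
variance_per_country = df.groupby('country').agg(
    (func.stddev('clicks') * func.stddev('clicks')).alias('var_clicks')
)
variance_per_country.show()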
Author by

Luca Fiaschi

Senior Data Science Executive leading top-performing cross-functional teams at large D2C & SaaS technology companies. Deep theoretical understanding and extensive hands-on experience with the latest techniques from machine learning, deep learning, and big data engineering. Track record of delivering end-to-end data science products with millions of euros in ROI. Portfolio of projects on all sides of e-commerce activities, such as marketing, logistics, technology, and personalization. Ph.D. in Computer Vision & AI with hundreds of citations in top-tier conferences.

Updated on June 06, 2022

Comments

  • Luca Fiaschi almost 2 years

    I would like to group a PySpark DataFrame and compute the variance of a specific column. For the average this is quite easy and can be done like this:

    from pyspark.sql import functions as func
    AVERAGES=df.groupby('country').agg(func.avg('clicks').alias('avg_clicks')).collect()
    

    However, for the variance there does not seem to be any aggregation function in the functions sub-module (I am also wondering why, since this is quite a common operation).