Calculate percentile on pyspark dataframe columns

13,815
df.selectExpr('percentile(MOU_G_EDUCATION_ADULT, 0.95)').show()

for large datasets consider using percentile_approx()

Share:
13,815
Wendy De Wit
Author by

Wendy De Wit

Updated on June 20, 2022

Comments

  • Wendy De Wit
    Wendy De Wit almost 2 years

    I have a PySpark dataframe which contains an ID and then a couple of variables for which I want to calculate the 95% point.

    Part of the printSchema():

    root
     |-- ID: string (nullable = true)
     |-- MOU_G_EDUCATION_ADULT: double (nullable = false)
     |-- MOU_G_EDUCATION_KIDS: double (nullable = false)
    

    I found How to derive Percentile using Spark Data frame and GroupBy in python, but this fails with an error message:

    perc95_udf = udf(lambda x: x.quantile(.95))
    
    
    fanscores = genres.withColumn("P95_MOU_G_EDUCATION_ADULT", perc95_udf('MOU_G_EDUCATION_ADULT')) \
                          .withColumn("P95_MOU_G_EDUCATION_KIDS", perc95_udf('MOU_G_EDUCATION_KIDS'))
    
    fanscores.take(2) 
    

    AttributeError: 'float' object has no attribute 'quantile'

    Other UDF trials I already tried:

    def percentile(quantiel,kolom):
        x=np.array(kolom)
        perc=np.percentile(x, quantiel)
        return perc
    
    percentile_udf = udf(percentile, LongType())
    
    
    fanscores = genres.withColumn("P95_MOU_G_EDUCATION_ADULT", percentile_udf(quantiel=95, kolom=genres.MOU_G_EDUCATION_ADULT)) \
                      .withColumn("P95_MOU_G_EDUCATION_KIDS", percentile_udf(quantiel=95, kolom=genres.MOU_G_EDUCATION_KIDS))
    
    fanscores.take(2)   
    

    gives the error: "TypeError: wrapper() got an unexpected keyword argument 'quantiel'"

    My final trial:

    import numpy as np
    
    def percentile(quantiel):
        return udf(lambda kolom: np.percentile(np.array(kolom), quantiel))
    
    fanscores = genres.withColumn("P95_MOU_G_EDUCATION_ADULT", percentile(quantiel=95)(genres.MOU_G_EDUCATION_ADULT)) \
                      .withColumn("P95_MOU_G_EDUCATION_KIDS", percentile(quantiel=95) (genres.MOU_G_EDUCATION_KIDS))
    
    fanscores.take(2)  
    

    Gives the error:

    PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)

    How could I solve this ?