Calculate quantile on grouped data in spark Dataframe

22,521

One solution would be to use percentile_approx :

>>> test_df.registerTempTable("df")
>>> df2 = sqlContext.sql("select agent_id, percentile_approx(payment_amount,0.95) as approxQuantile from df group by agent_id")

>>> df2.show()
# +--------+-----------------+
# |agent_id|   approxQuantile|
# +--------+-----------------+
# |       a|8239.999999999998|
# |       b|7449.999999999998|
# +--------+-----------------+ 

Note 1 : This solution was tested with spark 1.6.2 and requires a HiveContext.

Note 2 : approxQuantile isn't available in Spark < 2.0 for pyspark.

Note 3 : percentile returns an approximate pth percentile of a numeric column (including floating point types) in the group. When the number of distinct values in col is smaller than second argument value, this gives an exact percentile value.

EDIT : From Spark 2+, HiveContext is not required.

Share:
22,521
chessosapiens
Author by

chessosapiens

Updated on July 09, 2022

Comments

  • chessosapiens
    chessosapiens almost 2 years

    I have the following Spark dataframe :

     agent_id|payment_amount|
    +--------+--------------+
    |       a|          1000|
    |       b|          1100|
    |       a|          1100|
    |       a|          1200|
    |       b|          1200|
    |       b|          1250|
    |       a|         10000|
    |       b|          9000|
    +--------+--------------+
    

    my desire output would be something like

    agen_id   95_quantile
      a          whatever is 95 quantile for agent a payments
      b          whatever is 95 quantile for agent b payments
    

    for each group of agent_id I need to calculate the 0.95 quantile, I take the following approach:

    test_df.groupby('agent_id').approxQuantile('payment_amount',0.95)
    

    but I take the following error:

    'GroupedData' object has no attribute 'approxQuantile'
    

    I need to have .95 quantile(percentile) in a new column so later can be used for filtering purposes

    I am using Spark 2.0.0