Calculate quantile on grouped data in a Spark DataFrame
One solution would be to use percentile_approx:
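For the examples below, test_df is assumed to be a DataFrame with an agent_id and a payment_amount column; a minimal sketch that builds one from the data shown in the question:
>>> test_df = sqlContext.createDataFrame(
...     [("a", 1000), ("b", 1100), ("a", 1100), ("a", 1200),
...      ("b", 1200), ("b", 1250), ("a", 10000), ("b", 9000)],
...     ["agent_id", "payment_amount"])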
>>> test_df.registerTempTable("df")
>>> df2 = sqlContext.sql("select agent_id, percentile_approx(payment_amount,0.95) as approxQuantile from df group by agent_id")
>>> df2.show()
# +--------+-----------------+
# |agent_id| approxQuantile|
# +--------+-----------------+
# | a|8239.999999999998|
# | b|7449.999999999998|
# +--------+-----------------+
Note 1: This solution was tested with Spark 1.6.2 and requires a HiveContext.
Note 2: approxQuantile isn't available in Spark < 2.0 for pyspark, and even in Spark 2.0 it is a DataFrame method, not a GroupedData one, which is why the approach in the question fails.
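For reference, the Spark 2.0+ DataFrame method looks like this; note that it returns a plain Python list rather than a DataFrame, and cannot be applied per group:
>>> test_df.approxQuantile("payment_amount", [0.95], 0.01)
# [approximate 0.95 quantile over all rows]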
Note 3: percentile_approx returns an approximate pth percentile of a numeric column (including floating point types) in the group. It also accepts an optional accuracy argument; when the number of distinct values in the column is smaller than that accuracy value, it gives an exact percentile value.
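For instance, the accuracy argument can be passed explicitly (10000 is the Hive default, used here purely as an illustration):
>>> df2 = sqlContext.sql("select agent_id, percentile_approx(payment_amount, 0.95, 10000) as approxQuantile from df group by agent_id")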
EDIT: From Spark 2+, a HiveContext is not required.
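On Spark 2.x the same aggregation can also be written with the DataFrame API instead of a temporary table; a minimal sketch, assuming percentile_approx is available as a built-in SQL function in your version:
>>> from pyspark.sql import functions as F
>>> quantiles = test_df.groupBy("agent_id").agg(
...     F.expr("percentile_approx(payment_amount, 0.95)").alias("approxQuantile"))
>>> quantiles.show()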
chessosapiens
Updated on July 09, 2022

Comments
- chessosapiens, almost 2 years ago:
I have the following Spark DataFrame:
+--------+--------------+
|agent_id|payment_amount|
+--------+--------------+
|       a|          1000|
|       b|          1100|
|       a|          1100|
|       a|          1200|
|       b|          1200|
|       b|          1250|
|       a|         10000|
|       b|          9000|
+--------+--------------+
My desired output would be something like:
agent_id  95_quantile
a         <the 0.95 quantile of agent a's payments>
b         <the 0.95 quantile of agent b's payments>
For each agent_id group I need to calculate the 0.95 quantile, so I tried the following approach:
test_df.groupby('agent_id').approxQuantile('payment_amount',0.95)
but I get the following error:
'GroupedData' object has no attribute 'approxQuantile'
I need to have the 0.95 quantile (percentile) in a new column so it can later be used for filtering purposes. I am using Spark 2.0.0.
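For the filtering step, one option is to join the grouped quantiles (the df2 from the answer above) back onto the original rows; a minimal sketch, where joined and outliers are illustrative names:
>>> joined = test_df.join(df2, on="agent_id")
>>> outliers = joined.filter(joined.payment_amount > joined.approxQuantile)
>>> outliers.show()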