Calculate quantile on grouped data in spark Dataframe

apache-spark dataframe pyspark apache-spark-sql


One solution is to use percentile_approx:

>>> test_df.registerTempTable("df")
>>> df2 = sqlContext.sql("select agent_id, percentile_approx(payment_amount,0.95) as approxQuantile from df group by agent_id")

>>> df2.show()
# +--------+-----------------+
# |agent_id|   approxQuantile|
# +--------+-----------------+
# |       a|8239.999999999998|
# |       b|7449.999999999998|
# +--------+-----------------+

Note 1 : This solution was tested with Spark 1.6.2, where it requires a HiveContext.

Note 2 : approxQuantile isn't available in Spark < 2.0 for pyspark.

Note 3 : percentile_approx returns an approximate pth percentile of a numeric column (including floating point types) in the group. An optional third argument (B in the Hive documentation) controls the approximation accuracy at the cost of memory. When the number of distinct values in the column is smaller than B, the result is an exact percentile value.

EDIT : From Spark 2+, HiveContext is not required.


Author by

chessosapiens

Updated on July 09, 2022

Comments

chessosapiens almost 2 years

I have the following Spark dataframe :

+--------+--------------+
|agent_id|payment_amount|
+--------+--------------+
|       a|          1000|
|       b|          1100|
|       a|          1100|
|       a|          1200|
|       b|          1200|
|       b|          1250|
|       a|         10000|
|       b|          9000|
+--------+--------------+

My desired output would be something like:

agent_id   95_quantile
  a          whatever is 95 quantile for agent a payments
  b          whatever is 95 quantile for agent b payments

For each agent_id group I need to calculate the 0.95 quantile. I tried the following approach:

test_df.groupby('agent_id').approxQuantile('payment_amount',0.95)

but I get the following error:

'GroupedData' object has no attribute 'approxQuantile'

I need the 0.95 quantile (percentile) in a new column so that it can be used later for filtering.

I am using Spark 2.0.0
