How to calculate the percentage of total in Spark SQL
For example like this:

val df = sc.parallelize(Seq(
  ("A", 0), ("A", 1), ("A", 0),
  ("B", 0), ("B", 1), ("B", 1)
)).toDF("name", "flag")

df.groupBy($"name").agg(
  count("*").alias("total"),
  sum($"flag").alias("with_flag"),
  // Do you really want to truncate, not, for example, round?
  mean($"flag").multiply(100).cast("integer").alias("percentage"))
// +----+-----+---------+----------+
// |name|total|with_flag|percentage|
// +----+-----+---------+----------+
// | A| 3| 1| 33|
// | B| 3| 2| 66|
// +----+-----+---------+----------+
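If rounding is preferred over the truncating cast (so B shows 67 rather than 66), the `round` function from `org.apache.spark.sql.functions` can be used instead. A sketch, assuming a local SparkSession instead of the older `sc`/`sqlContext` entry points:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, sum, mean, round}

val spark = SparkSession.builder.master("local[*]").appName("pct").getOrCreate()
import spark.implicits._

val df = Seq(
  ("A", 0), ("A", 1), ("A", 0),
  ("B", 0), ("B", 1), ("B", 1)
).toDF("name", "flag")

df.groupBy($"name").agg(
  count("*").alias("total"),
  sum($"flag").alias("with_flag"),
  // round(..., 0) to the nearest integer instead of truncating:
  // B's 66.67% becomes 67 rather than 66.
  round(mean($"flag") * 100).cast("integer").alias("percentage"))
  .show()
```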
or:

df.registerTempTable("df")

sqlContext.sql("""
  SELECT name, COUNT(*) total, SUM(flag) with_flag,
         CAST(AVG(flag) * 100 AS INT) percentage
  FROM df
  GROUP BY name""")
// +----+-----+---------+----------+
// |name|total|with_flag|percentage|
// +----+-----+---------+----------+
// | A| 3| 1| 33|
// | B| 3| 2| 66|
// +----+-----+---------+----------+
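Note that `registerTempTable` and `sqlContext` were deprecated in Spark 2.0 in favour of `createOrReplaceTempView` and `SparkSession`. A sketch of the same query on a newer Spark, assuming a local SparkSession named `spark`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("pct-sql").getOrCreate()
import spark.implicits._

val df = Seq(
  ("A", 0), ("A", 1), ("A", 0),
  ("B", 0), ("B", 1), ("B", 1)
).toDF("name", "flag")

// Spark 2.x+ replacement for registerTempTable / sqlContext.sql.
df.createOrReplaceTempView("df")
spark.sql("""
  SELECT name, COUNT(*) total, SUM(flag) with_flag,
         CAST(AVG(flag) * 100 AS INT) percentage
  FROM df
  GROUP BY name""").show()
```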
Author: BreakPhreak
Updated on June 27, 2022

Comments
-
BreakPhreak, almost 2 years ago:
Considering the following data:

Name | Flag
A    | 0
A    | 1
A    | 0
B    | 0
B    | 1
B    | 1

I'd like to transform it to:

Name | Total | With Flag | Percentage
A    | 3     | 1         | 33%
B    | 3     | 2         | 66%

Preferably, in Spark SQL.
-
zero323, about 8 years ago: @LokeshKumarP
df.groupBy($"Name").agg(avg(($"Flag" > 50).cast("int")))
-
Lokesh Kumar P, about 8 years ago: Can you please explain in plain English what the above statement does?
-
zero323, about 8 years ago: @LokeshKumarP
($"Flag" > 50).cast("int")
creates a {0, 1} indicator variable. And the rest is just a plain average.
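The same indicator trick can be shown in plain Scala for intuition (the sample values below are hypothetical, not from the question):

```scala
// Boolean-to-{0, 1} indicator: true -> 1, false -> 0, so the average of
// the indicators is the fraction of values exceeding the threshold.
val flags = Seq(10, 60, 70, 40)  // hypothetical sample values
val indicators = flags.map(f => if (f > 50) 1 else 0)
val fraction = indicators.sum.toDouble / indicators.size
// indicators == Seq(0, 1, 1, 0), fraction == 0.5
```

This is exactly what avg(($"Flag" > 50).cast("int")) computes per group, just distributed across the cluster.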