How to calculate the percentage of total in Spark SQL


For example, like this:

val df = sc.parallelize(Seq(
  ("A", 0), ("A", 1), ("A", 0),
  ("B", 0), ("B", 1), ("B", 1)
)).toDF("Name", "Flag")

import org.apache.spark.sql.functions.{count, sum, mean}

df.groupBy($"Name").agg(
  count("*").alias("total"),
  sum($"Flag").alias("with_flag"),
  // Do you really want to truncate rather than round?
  mean($"Flag").multiply(100).cast("integer").alias("percentage"))

// +----+-----+---------+----------+
// |Name|total|with_flag|percentage|
// +----+-----+---------+----------+
// |   A|    3|        1|        33|
// |   B|    3|        2|        66|
// +----+-----+---------+----------+
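If rounding is preferred over truncation (as the comment in the snippet hints), the same aggregation can use `round` instead of a plain cast. A sketch, reusing the `df` defined above:

```scala
import org.apache.spark.sql.functions.{count, sum, avg, round}

df.groupBy($"Name").agg(
  count("*").alias("total"),
  sum($"Flag").alias("with_flag"),
  // round to the nearest integer instead of truncating toward zero
  round(avg($"Flag") * 100).cast("int").alias("percentage"))

// B's percentage becomes 67 (2/3 * 100 ≈ 66.67, rounded) instead of 66
```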

or:

df.registerTempTable("df")
sqlContext.sql("""
  SELECT name, COUNT(*) total, SUM(flag) with_flag, 
         CAST(AVG(flag) * 100 AS INT) percentage
  FROM df
  GROUP BY name""")

// +----+-----+---------+----------+
// |name|total|with_flag|percentage|
// +----+-----+---------+----------+
// |   A|    3|        1|        33|
// |   B|    3|        2|        66|
// +----+-----+---------+----------+
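On Spark 2 and later, `registerTempTable` and `sqlContext` are replaced by `createOrReplaceTempView` and a `SparkSession`. A sketch of the same query in the newer API (assuming a `spark` session is in scope), with `ROUND` shown in place of the truncating cast:

```scala
df.createOrReplaceTempView("df")
spark.sql("""
  SELECT name, COUNT(*) total, SUM(flag) with_flag,
         CAST(ROUND(AVG(flag) * 100) AS INT) percentage
  FROM df
  GROUP BY name""")
```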
Author: BreakPhreak (updated June 27, 2022)
Comments

  • BreakPhreak
    BreakPhreak almost 2 years

    Considering the following data:

    Name | Flag
    A    | 0
    A    | 1
    A    | 0
    B    | 0
    B    | 1
    B    | 1
    

    I'd like to transform it to:

    Name | Total | With Flag | Percentage
    A    | 3     | 1         | 33%
    B    | 3     | 2         | 66%
    

    Preferably, in Spark SQL.

  • zero323
    zero323 about 8 years
    @LokeshKumarP df.groupBy($"Name").agg(avg(($"Flag" > 50).cast("int")))
  • Lokesh Kumar P
    Lokesh Kumar P about 8 years
    Can you please explain in plain English what the above statement does?
  • zero323
    zero323 about 8 years
    @LokeshKumarP ($"Flag" > 50).cast("int") creates a {0, 1} indicator variable. And the rest is just plain average.
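
The indicator-variable trick from the comment above can be spelled out in a short sketch. Here `df2` stands for a hypothetical DataFrame whose `Flag` column holds values compared against the threshold 50:

```scala
import org.apache.spark.sql.functions.avg

// ($"Flag" > 50) is a boolean column; casting it to int maps it to {0, 1}.
// Averaging that 0/1 indicator per group gives the fraction of rows
// above the threshold; multiply by 100 for a percentage.
val pct = df2.groupBy($"Name")
  .agg((avg(($"Flag" > 50).cast("int")) * 100).alias("pct_over_50"))
```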