Spark: How to group by distinct values in a DataFrame


Try the collect_set function inside agg():

import org.apache.spark.sql.functions._

val df = sc.parallelize(Seq(
  (1,3), (1,6), (1,5), (2,1), (2,4),
  (2,1))).toDF("a","b")

+---+---+
|  a|  b|
+---+---+
|  1|  3|
|  1|  6|
|  1|  5|
|  2|  1|
|  2|  4|
|  2|  1|
+---+---+

val df2 = df.groupBy("a").agg(collect_set("b"))
df2.show()

+---+--------------+
|  a|collect_set(b)|
+---+--------------+
|  1|     [3, 6, 5]|
|  2|        [1, 4]|
+---+--------------+
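For reference, the same aggregation works in Spark SQL as well. A minimal sketch, assuming a spark-shell session where the SparkSession is named spark (the view name t and the alias bs are just illustrative):

// "t" and "bs" are illustrative names, not anything required by the API
df.createOrReplaceTempView("t")
spark.sql("SELECT a, collect_set(b) AS bs FROM t GROUP BY a").show()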

And if you want to keep duplicate entries, you can use collect_list:

val df3 = df.groupBy("a").agg(collect_list("b"))
df3.show()

+---+---------------+
|  a|collect_list(b)|
+---+---------------+
|  1|      [3, 6, 5]|
|  2|      [1, 4, 1]|
+---+---------------+
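If you want each group's values as a single comma-separated string rather than an array, concat_ws can wrap the collected column. A sketch (the alias bs is just an illustrative name):

import org.apache.spark.sql.functions._

// Render the collected values as one comma-separated string per group
df.groupBy("a")
  .agg(concat_ws(", ", collect_list("b")).alias("bs"))
  .show(false)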
Author: ts178
Updated on June 29, 2022

Comments

  • ts178, over 1 year ago

    I have data in a file in the following format:

    1,32    
    1,33
    1,44
    2,21
    2,56
    1,23
    

    The code I am executing is the following:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    
    import sqlContext.implicits._
    
    case class Person(a: Int, b: Int)
    
    val ppl = sc.textFile("newfile.txt").map(_.split(","))
        .map(p => Person(p(0).trim.toInt, p(1).trim.toInt))
        .toDF()
    ppl.registerTempTable("people")
    
    val result = ppl.select("a","b").groupBy('a).agg()  // what goes in agg()?
    result.show
    

    Expected output is:

    group 1: 32, 33, 44, 23
    
    group 2: 21, 56
    

    Instead of aggregating by sum, count, mean, etc., I want every element of the group.
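
    Applying the answer above to this pipeline, a minimal sketch of the fix, assuming a spark-shell session and using collect_list so that no elements are dropped:

    import org.apache.spark.sql.functions._

    // Collect every b per group of a; collect_list keeps duplicates
    val result = ppl.groupBy("a").agg(collect_list("b").alias("bs"))
    result.show(false)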