Collect rows as list with group by in Apache Spark

Instead of array, you can use the struct function to combine the columns (a struct, unlike an array, can hold fields of different types), then use groupBy with the collect_list aggregation function:

import org.apache.spark.sql.functions._

// struct preserves each column's type, so mixed column types are fine
df.withColumn("combined", struct("c1", "c2", "c3", "c4", "c5"))
  .groupBy("c1")
  .agg(collect_list("combined").as("combined_list"))
  .show(false)

so that you get a grouped dataset with the following schema:

root
 |-- c1: integer (nullable = false)
 |-- combined_list: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- c1: integer (nullable = false)
 |    |    |-- c2: string (nullable = true)
 |    |    |-- c3: string (nullable = true)
 |    |    |-- c4: string (nullable = true)
 |    |    |-- c5: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: integer (valueContainsNull = false)
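
For a fully self-contained run, here is a minimal sketch with made-up sample data shaped like the schema in the question (the session setup, row values, and the grouped/row names are illustrative only):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("collect-rows-as-list")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// illustrative rows: (BigInt, String, Double, Double, Map[String, Int])
val df = Seq(
  (1L, "a", 1.0, 2.0, Map("x" -> 1)),
  (1L, "b", 3.0, 4.0, Map("y" -> 2)),
  (2L, "c", 5.0, 6.0, Map("z" -> 3))
).toDF("c1", "c2", "c3", "c4", "c5")

val grouped = df
  .withColumn("combined", struct("c1", "c2", "c3", "c4", "c5"))
  .groupBy("c1")
  .agg(collect_list("combined").as("combined_list"))

grouped.show(false)

// to get the original rows back, explode the list and flatten the struct
grouped
  .select(explode(col("combined_list")).as("row"))
  .select("row.*")
  .show(false)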

I hope the answer is helpful.

Comments

  • Prateek Jain almost 2 years

    I have a particular use case with multiple rows for the same customer, where each row looks like:

    root
     -c1: BigInt
     -c2: String
     -c3: Double
     -c4: Double
     -c5: Map[String, Int]
    

    Now I have to group by column c1 and collect all the rows as a list for the same customer, like:

    c1, [Row1, Row3, Row4]
    c2, [Row2, Row5]
    

    I tried doing it this way: dataset.withColumn("combined", array("c1","c2","c3","c4","c5")).groupBy("c1").agg(collect_list("combined")), but I get an exception:

    Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'array(`c1`, `c2`, `c3`, `c4`, `c5`)' due to data type mismatch: input to function array should all be the same type, but it's [bigint, string, double, double, map<string,map<string,double>>];;
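
    A minimal sketch that reproduces the mismatch (the two-column frame below is hypothetical): array() must resolve a single common element type for all of its inputs, and there is none for a bigint column and a map column.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // hypothetical frame with just two of the mismatched columns
    val df = Seq((1L, Map("x" -> 1))).toDF("c1", "c5")

    // array() requires all inputs to share one type, so this fails
    // analysis with the same "data type mismatch" AnalysisException
    df.withColumn("combined", array("c1", "c5"))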