Spark 2.0: groupBy a column and then get max(date) on a date-type column


RelationalGroupedDataset.max is for numeric values only.

You could try agg() with the accompanying max function. In Scala:

import org.apache.spark.sql.functions._
old_df.groupBy($"ID").agg(max("date"))

So in Java it should be:

import static org.apache.spark.sql.functions.*;
Dataset<Row> df = old_df.groupBy("ID").agg(max("date"));
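For a fuller picture, here is a minimal end-to-end sketch in Java. The session setup and the data.csv path are placeholders, and the cast covers the case where the date column comes in as a plain string:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.max;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MaxDatePerId {
    public static void main(String[] args) {
        // Placeholder session; adjust master/appName to your environment.
        SparkSession spark = SparkSession.builder()
                .appName("MaxDatePerId")
                .master("local[*]")
                .getOrCreate();

        // Assumes a CSV with a header row: ID,col1,col2,col3,date
        Dataset<Row> old_df = spark.read()
                .option("header", "true")
                .csv("data.csv")
                // cast in case "date" was read as a string
                .withColumn("date", col("date").cast("date"));

        // agg(max(...)) accepts date columns, unlike RelationalGroupedDataset.max
        Dataset<Row> latest = old_df.groupBy("ID").agg(max("date"));
        latest.show();

        spark.stop();
    }
}

If you want the result column to have a nicer name than max(date), alias it: agg(max("date").alias("max_date")).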
Comments

  • ercaran

    I'm using Spark 2.0 in Java. I have a Dataset which looks like this:

    +------+----+----+----+----------+
    |    ID|col1|col2|col3|      date|
    +------+----+----+----+----------+
    |981200|   a|   q|   z|2010-12-30|
    |968800|   s|   w|   x|2010-12-31|
    |966500|   d|   e|   c|2010-11-02|
    |966500|   f|   r|   v|2010-12-01|
    |981200|   g|   t|   b|2010-12-03|
    +------+----+----+----+----------+
    

    I want to group by ID and get only the row with the "last" date (the most recent one). The type of the "date" column is date.

    In my case the result would be

    +------+----------+
    |    ID| max(date)|
    +------+----------+
    |981200|2010-12-30|
    |968800|2010-12-31|
    |966500|2010-12-01|
    +------+----------+
    

    I tried

        Dataset<Row> df = old_df.groupBy("ID").max("date");
    

    but it fails with the error

    Exception in thread "main" org.apache.spark.sql.AnalysisException: "date" is not a numeric column. Aggregation function can only be applied on a numeric column.;

    Is there a way to get the max date (using type date) in Spark?

    Thanks in advance.