Spark 2.0: groupBy a column and then get max(date) on a date-type column


RelationalGroupedDataset.max is for numeric values only.

You could try agg() with the accompanying max function. In Scala:

import org.apache.spark.sql.functions._
old_df.groupBy($"ID").agg(max("date"))

So in Java it should be:

import static org.apache.spark.sql.functions.*;
Dataset<Row> df = old_df.groupBy("ID").agg(max("date"));
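For a fuller picture, here is a minimal end-to-end sketch in Java. The session setup and the data.csv path are placeholders, and the cast covers the case where the date column comes in as a plain string:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.max;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MaxDatePerId {
    public static void main(String[] args) {
        // Placeholder session; adjust master/appName to your environment.
        SparkSession spark = SparkSession.builder()
                .appName("MaxDatePerId")
                .master("local[*]")
                .getOrCreate();

        // Assumes a CSV with a header row: ID,col1,col2,col3,date
        Dataset<Row> old_df = spark.read()
                .option("header", "true")
                .csv("data.csv")
                // cast in case "date" was read as a string
                .withColumn("date", col("date").cast("date"));

        // agg(max(...)) accepts date columns, unlike RelationalGroupedDataset.max
        Dataset<Row> latest = old_df.groupBy("ID").agg(max("date"));
        latest.show();

        spark.stop();
    }
}

If you want the result column to have a nicer name than max(date), alias it: agg(max("date").alias("max_date")).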
Comments

  • ercaran

    I'm using Spark 2.0 in Java. I have a Dataset which looks like this:

    +------+----+----+----+----------+
    |    ID|col1|col2|col3|      date|
    +------+----+----+----+----------+
    |981200|   a|   q|   z|2010-12-30|
    |968800|   s|   w|   x|2010-12-31|
    |966500|   d|   e|   c|2010-11-02|
    |966500|   f|   r|   v|2010-12-01|
    |981200|   g|   t|   b|2010-12-03|
    +------+----+----+----+----------+
    

    I want to group by ID and get only the row with the "last" date (the most recent one). The type of the "date" column is date.

    In my case the result would be

    +------+----------+
    |    ID| max(date)|
    +------+----------+
    |981200|2010-12-30|
    |968800|2010-12-31|
    |966500|2010-12-01|
    +------+----------+
    

    I tried

        Dataset<Row> df = old_df.groupBy("ID").max("date");
    

    but it fails with the error

    Exception in thread "main" org.apache.spark.sql.AnalysisException: "date" is not a numeric column. Aggregation function can only be applied on a numeric column.;

    Is there a way to get the max date (using type date) in Spark?

    Thanks in advance.