Spark 2.0 groupBy column and then get max(date) on a datetype column
RelationalGroupedDataset.max works on numeric columns only. Use agg() with the accompanying max function instead; that aggregate only needs an ordering on the column, so it works on dates as well. In Scala:
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for the $"..." column syntax

old_df.groupBy($"ID").agg(max("date"))
so in Java it should be:
import static org.apache.spark.sql.functions.*;

Dataset<Row> df = old_df.groupBy("ID").agg(max("date"));
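As a fuller illustration, here is a minimal, self-contained Java sketch of the same approach. The local SparkSession setup, the sample rows, the class name, and the alias("max_date") rename are assumptions added for this example (col1-col3 are omitted for brevity):

import static org.apache.spark.sql.functions.max;

import java.sql.Date;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class MaxDatePerGroup {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("max-date-per-group")
                .master("local[*]")
                .getOrCreate();

        // ID plus a true DateType column, matching the question's schema.
        StructType schema = new StructType()
                .add("ID", DataTypes.IntegerType)
                .add("date", DataTypes.DateType);

        List<Row> rows = Arrays.asList(
                RowFactory.create(981200, Date.valueOf("2010-12-30")),
                RowFactory.create(968800, Date.valueOf("2010-12-31")),
                RowFactory.create(966500, Date.valueOf("2010-11-02")),
                RowFactory.create(966500, Date.valueOf("2010-12-01")),
                RowFactory.create(981200, Date.valueOf("2010-12-03")));
        Dataset<Row> old_df = spark.createDataFrame(rows, schema);

        // agg(max(...)) accepts a date column, because the max aggregate
        // only needs an ordering; groupBy(...).max(...) is numeric-only.
        Dataset<Row> latest = old_df.groupBy("ID").agg(max("date").alias("max_date"));
        latest.show();

        spark.stop();
    }
}

Running this prints one row per ID with its most recent date, matching the expected result in the question below.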
ercaran:
I'm using Spark 2.0 in Java. I have a Dataset which looks like this:
+------+----+----+----+----------+
|    ID|col1|col2|col3|      date|
+------+----+----+----+----------+
|981200|   a|   q|   z|2010-12-30|
|968800|   s|   w|   x|2010-12-31|
|966500|   d|   e|   c|2010-11-02|
|966500|   f|   r|   v|2010-12-01|
|981200|   g|   t|   b|2010-12-03|
+------+----+----+----+----------+
I want to groupBy ID and get only the row with the "last" date (the most recent one). The "date" column is of type date.
In my case the result would be:
+------+----------+
|    ID| max(date)|
+------+----------+
|981200|2010-12-30|
|968800|2010-12-31|
|966500|2010-12-01|
+------+----------+
I tried:
Dataset<Row> df = old_df.groupBy("ID").max("date");
but it fails with the error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: "date" is not a numeric column. Aggregation function can only be applied on a numeric column.;
Is there a way to get the max date (using type date) in Spark?
Thanks in advance.