How to get max(date) from a given set of data grouped by some fields using PySpark?


For non-numeric but Orderable types you can use agg with max directly:

from pyspark.sql.functions import col, max as max_

df = sc.parallelize([
    ("2016-04-06 16:36", 1234, 111, 1),
    ("2016-04-06 17:35", 1234, 111, 5),
]).toDF(["datetime", "userId", "memberId", "value"])

# cast the string column to timestamp so max picks the latest value per group
(df.withColumn("datetime", col("datetime").cast("timestamp"))
    .groupBy("userId", "memberId")
    .agg(max_("datetime")))

## +------+--------+--------------------+
## |userId|memberId|       max(datetime)|
## +------+--------+--------------------+
## |  1234|     111|2016-04-06 17:35:...|
## +------+--------+--------------------+
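If you want the result column to be named datetime, as in the desired output, rather than max(datetime), you can alias the aggregate. A minimal sketch, reusing the df defined above:

from pyspark.sql.functions import col, max as max_

(df
    .withColumn("datetime", col("datetime").cast("timestamp"))
    .groupBy("userId", "memberId")
    .agg(max_("datetime").alias("datetime"))   # rename max(datetime) -> datetime
    .show())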
Comments

  • Bhuvan over 2 years

    I have the data in the dataframe as below:

    datetime              | userId | memberId | value
    2016-04-06 16:36:...  | 1234   | 111      | 1
    2016-04-06 17:35:...  | 1234   | 222      | 5
    2016-04-06 17:50:...  | 1234   | 111      | 8
    2016-04-06 18:36:...  | 1234   | 222      | 9
    2016-04-05 16:36:...  | 4567   | 111      | 1
    2016-04-06 17:35:...  | 4567   | 222      | 5
    2016-04-06 18:50:...  | 4567   | 111      | 8
    2016-04-06 19:36:...  | 4567   | 222      | 9
    

    I need to find max(datetime) grouped by userId, memberId. When I tried the following:

    df2 = df.groupBy('userId','memberId').max('datetime')
    

    I got this error:

    org.apache.spark.sql.AnalysisException: "datetime" is not a numeric
    column. Aggregation function can only be applied on a numeric column.;
    

    The output I desired is as follows:

    userId | memberId | datetime
    1234   |  111     | 2016-04-06 17:50:...
    1234   |  222     | 2016-04-06 18:36:...
    4567   |  111     | 2016-04-06 18:50:...
    4567   |  222     | 2016-04-06 19:36:...
    

    Can someone please help me get the max date from this data using PySpark DataFrames?