How to convert date to the first day of month in a PySpark Dataframe column?

You can use trunc, which truncates a date to the start of the given unit (here, the month):

import pyspark.sql.functions as f

df.withColumn("first_date", f.trunc("date", "month")).show()

+----------+----------+
|      date|first_date|
+----------+----------+
|2017-11-25|2017-11-01|
|2017-12-21|2017-12-01|
|2017-09-12|2017-09-01|
+----------+----------+

Author by

Rakesh Adhikesavan


Updated on January 02, 2021

Comments

  • Rakesh Adhikesavan, over 3 years ago

    I have the following DataFrame:

    +----------+
    |      date|
    +----------+
    |2017-11-25|
    |2017-12-21|
    |2017-09-12|
    +----------+
    

    Here is the code to create the above DataFrame:

    import pyspark.sql.functions as f
    rdd = sc.parallelize([("2017/11/25",), ("2017/12/21",), ("2017/09/12",)])
    df = sqlContext.createDataFrame(rdd, ["date"]).withColumn("date", f.to_date(f.col("date"), "yyyy/MM/dd"))
    df.show()
    

    I want a new column holding the first day of the month for each row, that is, the same date with the day replaced by "01":

    +----------+----------+
    |      date|first_date|
    +----------+----------+
    |2017-11-25|2017-11-01|
    |2017-12-21|2017-12-01|
    |2017-09-12|2017-09-01|
    +----------+----------+
    

    There is a last_day function in pyspark.sql.functions, but there is no first_day function.

    I tried to do this with date_sub, but it did not work. I get a "Column is not iterable" error, because the second argument to date_sub has to be an integer and cannot be a column:

    f.date_sub(f.col('date'), f.dayofmonth(f.col('date')) - 1 )