Changing the date format of the column values in a Spark dataframe


Solution 1

Spark >= 2.2.0

You additionally need the built-in to_date and to_timestamp functions:

import org.apache.spark.sql.functions._

// parse the MM/dd/yy string to a date, then format it as yyyy-MM-dd;
// passing "UTC" to to_utc_timestamp leaves the parsed time unchanged
df.withColumn("modified", date_format(to_date(col("modified"), "MM/dd/yy"), "yyyy-MM-dd"))
  .withColumn("created", to_utc_timestamp(to_timestamp(col("created"), "MM/dd/yy HH:mm"), "UTC"))

and you should get:

+----------+-------------------+
|modified  |created            |
+----------+-------------------+
|null      |2017-12-04 13:45:00|
|2018-02-20|2018-02-02 20:50:00|
|2018-03-20|2018-02-02 21:10:00|
|2018-02-20|2018-02-02 21:23:00|
|2018-02-28|2017-12-12 15:42:00|
|2018-01-25|2017-11-09 13:10:00|
|2018-01-29|2017-12-06 10:07:00|
+----------+-------------------+

Using the UTC timezone didn't alter the time for me.
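
A quick way to see why passing "UTC" here is effectively a no-op (a minimal sketch under Spark 2.2+, assuming a SparkSession in scope as spark; the sample value is taken from the question):

import org.apache.spark.sql.functions._
import spark.implicits._

// to_utc_timestamp(ts, "UTC") interprets ts as already being in UTC,
// so converting UTC -> UTC shifts nothing
val check = Seq("2/2/18 20:50").toDF("created")
  .withColumn("ts", to_timestamp($"created", "MM/dd/yy HH:mm"))
  .withColumn("utc", to_utc_timestamp($"ts", "UTC"))

check.show(false)  // ts and utc show the same wall-clock time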

Spark < 2.2.0

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType

// unix_timestamp parses the string into epoch seconds, which
// from_unixtime formats back or a cast rebuilds as a timestamp
val temp = df.withColumn("modified", from_unixtime(unix_timestamp(col("modified"), "MM/dd/yy"), "yyyy-MM-dd"))
  .withColumn("created", to_utc_timestamp(unix_timestamp(col("created"), "MM/dd/yy HH:mm").cast(TimestampType), "UTC"))

The output dataframe is the same as above.
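
This works because unix_timestamp is available well before 2.2.0: it parses the MM/dd/yy string into epoch seconds, from_unixtime formats those seconds into a yyyy-MM-dd string, and casting the epoch seconds to TimestampType rebuilds a proper timestamp column for to_utc_timestamp to operate on.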

Solution 2

Plain and simple:

import org.apache.spark.sql.functions._
import spark.implicits._  // for the $"colname" syntax

df.select(
  to_date($"modified", "MM/dd/yy").cast("string").alias("modified"),
  date_format(to_timestamp($"created", "MM/dd/yy HH:mm"), "yyyy-MM-dd HH:mm").alias("created"))
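
For reference, a runnable end-to-end sketch of this approach (assumes Spark 2.2+ with its lenient date parser and a SparkSession in scope as spark; the sample rows are taken from the question):

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (null, "12/4/17 13:45"),
  ("2/20/18", "2/2/18 20:50"),
  ("3/20/18", "2/2/18 21:10")
).toDF("modified", "created")

df.select(
  to_date($"modified", "MM/dd/yy").cast("string").alias("modified"),
  date_format(to_timestamp($"created", "MM/dd/yy HH:mm"), "yyyy-MM-dd HH:mm").alias("created")
).show(false)
// +----------+----------------+
// |modified  |created         |
// +----------+----------------+
// |null      |2017-12-04 13:45|
// |2018-02-20|2018-02-02 20:50|
// |2018-03-20|2018-02-02 21:10|
// +----------+----------------+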
Author by

Hemanth

Updated on June 05, 2022

Comments

  • Hemanth (almost 2 years ago)

    I am reading an Excel sheet into a DataFrame in Spark 2.0 and then trying to convert some columns with date values in MM/DD/YY format into YYYY-MM-DD format. The values are in string format. Below is a sample:

    +---------------+--------------+
    |modified       |      created |
    +---------------+--------------+
    |           null| 12/4/17 13:45|
    |        2/20/18|  2/2/18 20:50|
    |        3/20/18|  2/2/18 21:10|
    |        2/20/18|  2/2/18 21:23|
    |        2/28/18|12/12/17 15:42| 
    |        1/25/18| 11/9/17 13:10|
    |        1/29/18| 12/6/17 10:07| 
    +---------------+--------------+
    

    I would like this to be converted to:

    +---------------+-----------------+
    |modified       |      created    |
    +---------------+-----------------+
    |           null| 2017-12-04 13:45|
    |     2018-02-20| 2018-02-02 20:50|
    |     2018-03-20| 2018-02-02 21:10|
    |     2018-02-20| 2018-02-02 21:23|
    |     2018-02-28| 2017-12-12 15:42| 
    |     2018-01-25| 2017-11-09 13:10|
    |     2018-01-29| 2017-12-06 10:07| 
    +---------------+-----------------+
    

    So I tried doing:

     df.withColumn("modified",date_format(col("modified"),"yyyy-MM-dd"))
       .withColumn("created",to_utc_timestamp(col("created"),"America/New_York"))
    

    But it gives me all NULL values in my result. I am not sure where I am going wrong. I know that to_utc_timestamp on created will convert the whole timestamp into UTC. Ideally, I would like to keep the time unchanged and only change the date format. Is there a way to achieve what I am trying to do, and where am I going wrong?

    Any help would be appreciated. Thank you.

  • Hemanth (about 6 years ago)
    Thanks for the answer! But when I use to_date, it only accepts one argument of type Column; it doesn't accept the pattern string as the second argument.
  • Ramesh Maharjan (about 6 years ago)
    It's available from 2.2.0 onwards.
  • Hemanth (about 6 years ago)
    But my Spark version is 2.0. Should I use a different approach?
  • Hemanth (about 6 years ago)
    I am using Spark 2.0; to_timestamp is not available and to_date only accepts one argument. Is there any other method I can use?