Convert String to date SparkSQL

11,210

Solution 1

I'd

  • Choose more precise data type - here TimestampType.
  • coalesce with different formats.
import org.apache.spark.sql.functions._

val df = Seq("20150610120256", "20150611").toDF("IN_DATE")

df.withColumn("IN_DATE", coalesce(
  to_timestamp($"IN_DATE", "yyyyMMddHHmmss"), 
  to_timestamp($"IN_DATE", "yyyyMMdd"))).show


+-------------------+
|            IN_DATE|
+-------------------+
|2015-06-10 12:02:56|
|2015-06-11 00:00:00|
+-------------------+

Solution 2

There are several options to achieve a date parser.

  1. Use the built in spark sql function TODATE(). Here's an example of that implementation.
  2. Create a user defined function where you can do different date parsing based on the input format you like, and return the string. Read more about UDF's here.

Solution 3

2015-06-11 format is spark.sql.types.DateType and 2015-06-10 12:02:56 is spark.sql.types.TimestampType

You can't have two dataType on the same column. A schema should have only one dataType for each columns.

I would suggest you to create two new columns and have the format you desire in them as

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DateType, TimestampType}
df.withColumn("IN_DATE_DateOnly",from_unixtime(unix_timestamp(df("IN_DATE"),"yyyyMMdd")).cast(DateType))
  .withColumn("IN_DATE_DateAndTime",unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmSS").cast(TimestampType)) 

this will give you dataframe as

+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
|TYPE|CODE  |SQ_CODE|RE_TYPE|VERY_ID          |IN_DATE       |IN_DATE_DateOnly|IN_DATE_DateAndTime  |
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
|F   |000544|2017002|OP     |95032015062763298|20150610120256|null            |2015-06-10 12:02:00.0|
|F   |000544|2017002|LD     |95032015062763261|20150611      |2015-06-11      |null                 |
|F   |000544|2017002|AK     |95037854336743246|20150611012356|null            |2015-06-11 01:23:00.0|
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+

You can see that the dataType is different

root
 |-- TYPE: string (nullable = true)
 |-- CODE: string (nullable = true)
 |-- SQ_CODE: string (nullable = true)
 |-- RE_TYPE: string (nullable = true)
 |-- VERY_ID: string (nullable = true)
 |-- IN_DATE: string (nullable = true)
 |-- IN_DATE_DateOnly: date (nullable = true)
 |-- IN_DATE_DateAndTime: timestamp (nullable = true)

I hope the answer is helpful

Share:
11,210

Related videos on Youtube

user1363308
Author by

user1363308

Updated on May 25, 2022

Comments

  • user1363308
    user1363308 almost 2 years

    I'm new to in Scala, I have dataframe where I'm trying one column of dataframe to date from string in other word like below

    1)    yyyyMMddHHmmss(20150610120256) ->yyyy-MM-dd HH:mm:ss(2015-06-10 12:02:56) 
    2)    yyyyMMddHHmmss(20150611      ) ->yyyy-MM-dd(2015-06-11)
    

    First case i'm able achieve successfully but problem with second case where time is miss due to of this i'm not bale to convert into date.More details you could get below.Any help will be appreciated.

    df.printSchema
    root
     |-- TYPE: string (nullable = true)
     |-- CODE: string (nullable = true)
     |-- SQ_CODE: string (nullable = true)
     |-- RE_TYPE: string (nullable = true)
     |-- VERY_ID: long (nullable = true)
     |-- IN_DATE: string (nullable = true)
    
    
    df.show
    Input  
    +-----+-------+---------+---------+-------------------+-----------------+
    | TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE        |
    +-----+-------+---------+---------+-------------------+-----------------+
    |   F | 000544|  2017002|      OP |  95032015062763298| 20150610120256  |
    |   F | 000544|  2017002|      LD |  95032015062763261| 20150611        |
    |   F | 000544|  2017002|      AK |  95037854336743246| 20150611012356  |
    +-----+-------+---------+--+------+-------------------+-----------------+
    
    df=df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
            to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast("date")))
            .otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast("timestamp")))
    
    Actual output
    +-----+-------+---------+---------+-------------------+----------------------+
    | TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE             |
    +-----+-------+---------+---------+-------------------+----------------------+
    |   F | 000544|  2017002|      OP |  95032015062763298| 2015-06-10 12:02:56  |
    |   F | 000544|  2017002|      LD |  95032015062763261| null                 |
    |   F | 000544|  2017002|      AK |  95037854336743246| 2015-06-11 01:23:56  |
    +-----+-------+---------+--+------+-------------------+----------------------+
    
    df=df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
            to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast("timestamp")))
            .otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast("timestamp")))
    
    Actual output
    +-----+-------+---------+---------+-------------------+----------------------+
    | TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE             |
    +-----+-------+---------+---------+-------------------+----------------------+
    |   F | 000544|  2017002|      OP |  95032015062763298| 2015-06-10 12:02:56  |
    |   F | 000544|  2017002|      LD |  95032015062763261| 2015-06-11 00:00:00  |
    |   F | 000544|  2017002|      AK |  95037854336743246| 2015-06-11 01:23:56  |
    +-----+-------+---------+--+------+-------------------+----------------------+
    
    
    Expected output
    +-----+-------+---------+---------+-------------------+----------------------+
    | TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE             |
    +-----+-------+---------+---------+-------------------+----------------------+
    |   F | 000544|  2017002|      OP |  95032015062763298| 2015-06-10 12:02:56  |
    |   F | 000544|  2017002|      LD |  95032015062763261| 2015-06-11           |
    |   F | 000544|  2017002|      AK |  95037854336743246| 2015-06-11 01:23:56  |
    +-----+-------+---------+--+------+-------------------+----------------------+
    
    • Ramesh Maharjan
      Ramesh Maharjan over 6 years
      That would be impossible as both dates datatype is different. one is TimestampType and the other is DateType and you can't really have two schemas for the same column.
  • philantrovert
    philantrovert over 6 years
    Why the regexp_replace? We have trim in spark sql.