Convert String to date SparkSQL
Solution 1
I'd
- Choose more precise data type - here
TimestampType
. -
coalesce
with different formats.
import org.apache.spark.sql.functions._
val df = Seq("20150610120256", "20150611").toDF("IN_DATE")
df.withColumn("IN_DATE", coalesce(
to_timestamp($"IN_DATE", "yyyyMMddHHmmss"),
to_timestamp($"IN_DATE", "yyyyMMdd"))).show
+-------------------+
| IN_DATE|
+-------------------+
|2015-06-10 12:02:56|
|2015-06-11 00:00:00|
+-------------------+
Solution 2
There are several options to achieve a date parser.
- Use the built in spark sql function
TODATE()
. Here's an example of that implementation. - Create a user defined function where you can do different date parsing based on the input format you like, and return the string. Read more about UDF's here.
Solution 3
2015-06-11
format is spark.sql.types.DateType
and 2015-06-10 12:02:56
is spark.sql.types.TimestampType
You can't have two dataType on the same column. A schema should have only one dataType for each columns.
I would suggest you to create two new columns and have the format you desire in them as
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DateType, TimestampType}
df.withColumn("IN_DATE_DateOnly",from_unixtime(unix_timestamp(df("IN_DATE"),"yyyyMMdd")).cast(DateType))
.withColumn("IN_DATE_DateAndTime",unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmSS").cast(TimestampType))
this will give you dataframe
as
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
|TYPE|CODE |SQ_CODE|RE_TYPE|VERY_ID |IN_DATE |IN_DATE_DateOnly|IN_DATE_DateAndTime |
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
|F |000544|2017002|OP |95032015062763298|20150610120256|null |2015-06-10 12:02:00.0|
|F |000544|2017002|LD |95032015062763261|20150611 |2015-06-11 |null |
|F |000544|2017002|AK |95037854336743246|20150611012356|null |2015-06-11 01:23:00.0|
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
You can see that the dataType is different
root
|-- TYPE: string (nullable = true)
|-- CODE: string (nullable = true)
|-- SQ_CODE: string (nullable = true)
|-- RE_TYPE: string (nullable = true)
|-- VERY_ID: string (nullable = true)
|-- IN_DATE: string (nullable = true)
|-- IN_DATE_DateOnly: date (nullable = true)
|-- IN_DATE_DateAndTime: timestamp (nullable = true)
I hope the answer is helpful
Related videos on Youtube
user1363308
Updated on May 25, 2022Comments
-
user1363308 almost 2 years
I'm new to in Scala, I have dataframe where I'm trying one column of dataframe to date from string in other word like below
1) yyyyMMddHHmmss(20150610120256) ->yyyy-MM-dd HH:mm:ss(2015-06-10 12:02:56) 2) yyyyMMddHHmmss(20150611 ) ->yyyy-MM-dd(2015-06-11)
First case i'm able achieve successfully but problem with second case where time is miss due to of this i'm not bale to convert into date.More details you could get below.Any help will be appreciated.
df.printSchema root |-- TYPE: string (nullable = true) |-- CODE: string (nullable = true) |-- SQ_CODE: string (nullable = true) |-- RE_TYPE: string (nullable = true) |-- VERY_ID: long (nullable = true) |-- IN_DATE: string (nullable = true) df.show Input +-----+-------+---------+---------+-------------------+-----------------+ | TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE | +-----+-------+---------+---------+-------------------+-----------------+ | F | 000544| 2017002| OP | 95032015062763298| 20150610120256 | | F | 000544| 2017002| LD | 95032015062763261| 20150611 | | F | 000544| 2017002| AK | 95037854336743246| 20150611012356 | +-----+-------+---------+--+------+-------------------+-----------------+ df=df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) , to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast("date"))) .otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast("timestamp"))) Actual output +-----+-------+---------+---------+-------------------+----------------------+ | TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE | +-----+-------+---------+---------+-------------------+----------------------+ | F | 000544| 2017002| OP | 95032015062763298| 2015-06-10 12:02:56 | | F | 000544| 2017002| LD | 95032015062763261| null | | F | 000544| 2017002| AK | 95037854336743246| 2015-06-11 01:23:56 | +-----+-------+---------+--+------+-------------------+----------------------+ df=df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) , to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast("timestamp"))) .otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast("timestamp"))) Actual output +-----+-------+---------+---------+-------------------+----------------------+ | TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE | +-----+-------+---------+---------+-------------------+----------------------+ | F | 000544| 2017002| OP | 95032015062763298| 2015-06-10 12:02:56 | | F | 000544| 2017002| LD | 95032015062763261| 2015-06-11 00:00:00 | | F | 000544| 2017002| AK | 95037854336743246| 2015-06-11 01:23:56 | +-----+-------+---------+--+------+-------------------+----------------------+ Expected output +-----+-------+---------+---------+-------------------+----------------------+ | TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE | +-----+-------+---------+---------+-------------------+----------------------+ | F | 000544| 2017002| OP | 95032015062763298| 2015-06-10 12:02:56 | | F | 000544| 2017002| LD | 95032015062763261| 2015-06-11 | | F | 000544| 2017002| AK | 95037854336743246| 2015-06-11 01:23:56 | +-----+-------+---------+--+------+-------------------+----------------------+
-
Ramesh Maharjan over 6 yearsThat would be impossible as both dates datatype is different. one is TimestampType and the other is DateType and you can't really have two schemas for the same column.
-
-
philantrovert over 6 yearsWhy the
regexp_replace
? We havetrim
in spark sql.