PySpark dataframe convert unusual string format to Timestamp
Solution 1
Spark >= 2.2
from pyspark.sql import Row
from pyspark.sql.functions import to_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", to_timestamp("dt", "yyyy_MM_dd HH_mm_ss"))
    .show(1, False))
## +-------------------+-------------------+
## |dt |parsed |
## +-------------------+-------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08|
## +-------------------+-------------------+
Spark < 2.2
It is nothing that unix_timestamp cannot handle:
from pyspark.sql import Row
from pyspark.sql.functions import unix_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", unix_timestamp("dt", "yyyy_MM_dd HH_mm_ss")
        # For Spark <= 1.5
        # See issues.apache.org/jira/browse/SPARK-11724
        .cast("double")
        .cast("timestamp"))
    .show(1, False))
## +-------------------+---------------------+
## |dt |parsed |
## +-------------------+---------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08.0|
## +-------------------+---------------------+
In both cases the format string should be compatible with Java SimpleDateFormat.
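As a quick illustration (my addition, not part of the original answer), any other SimpleDateFormat-style pattern works the same way; here is a minimal sketch assuming a hypothetical slash-separated column:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp

spark = SparkSession.builder.getOrCreate()

# hypothetical column holding '2016/08/21 11:31:08'; only the pattern string changes
df = spark.createDataFrame([('2016/08/21 11:31:08',)], ['dt'])
df.withColumn("parsed", to_timestamp("dt", "yyyy/MM/dd HH:mm:ss")).show(1, False)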
Solution 2
zero323's answer answers the question, but I wanted to add that if your datetime string has a standard format, you should be able to cast it directly into timestamp type:
from pyspark.sql.functions import col
df.withColumn('datetime', col('datetime_str').cast('timestamp'))
It has the advantage of handling milliseconds, whereas unix_timestamp has only second precision (to_timestamp works with milliseconds too, but requires Spark >= 2.2 as zero323 stated). I tested it on Spark 2.3.0, using the format '2016-07-13 14:33:53.979' (with milliseconds, but it also works without them).
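A minimal, self-contained sketch of that cast (the example data here is made up, but the format matches the one mentioned above):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# a string in a standard 'yyyy-MM-dd HH:mm:ss.SSS' layout casts directly to timestamp
df = spark.createDataFrame([('2016-07-13 14:33:53.979',)], ['datetime_str'])
df.withColumn('datetime', col('datetime_str').cast('timestamp')).show(truncate=False)
# the millisecond part (.979) is preserved in the parsed column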
Solution 3
I totally agree with the selected answer; however, I would like to point out that the format should be 'yyyy_MM_dd HH_mm_ss' (capital HH, i.e. the 24-hour clock) so as to avoid problems with timestamps like '2019_01_27 16_00_00' -> note the hour is greater than 12.
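To illustrate the point (this example is mine, not the answerer's): 'HH' is the 24-hour hour-of-day field, while 'hh' is the 12-hour clock hour and cannot represent 16 directly:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('2019_01_27 16_00_00',)], ['dt'])

# 'HH' (24-hour clock) parses the afternoon value as expected
df.withColumn('parsed', to_timestamp('dt', 'yyyy_MM_dd HH_mm_ss')).show(truncate=False)

# with 'hh' (12-hour clock) the value 16 is out of range; depending on the Spark
# version and time-parser policy the parse returns null, wraps the hour, or errors out
# df.withColumn('parsed', to_timestamp('dt', 'yyyy_MM_dd hh_mm_ss')).show(truncate=False)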
Solution 4
I am adding a few more lines of code to Florent F's answer, for better understanding and for running the snippet on a local machine:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

sc = pyspark.SparkContext('local[*]')
spark = SparkSession.builder.getOrCreate()

# preparing some example data - df1 with String type and df2 with Timestamp type
df1 = sc.parallelize([{"key": "a", "date": "2016-02-01"},
                      {"key": "b", "date": "2016-02-02"}]).toDF()
df1.show()

df2 = df1.withColumn('datetime', col('date').cast("timestamp"))
df2.show()
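As a small follow-up (my addition), the schema can be printed to confirm that the new column really came out as a timestamp:

# 'datetime' should now be reported as timestamp (the other columns stay strings)
df2.printSchema()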
PR102012
Updated on March 30, 2021
Comments
-
PR102012 about 3 years
I am using PySpark through Spark 1.5.0. I have an unusual String format in rows of a column for datetime values. It looks like this:
Row[(datetime='2016_08_21 11_31_08')]
Is there a way to convert this unorthodox yyyy_mm_dd hh_mm_dd format into a Timestamp? Something that can eventually come along the lines of
df = df.withColumn("date_time", df.datetime.astype('Timestamp'))
I had thought that Spark SQL functions like regexp_replace could work, but of course I need to replace _ with - in the date half and _ with : in the time part.
I was thinking I could split the column in 2 using substring and count backward from the end of time, then do the regexp_replace separately and concatenate (a rough sketch of this route is included after the comments below). But this seems like too many operations. Is there an easier way?
-
zero323 over 7 years
Great. You can check issues.apache.org/jira/browse/SPARK-11724 for some details.
-
ichbinallen about 4 years
Can I use the unix_timestamp function if the timestamp string has timezones? That is, the pattern is yyyy_MM_dd HH_mm_ss z, like 1995_05_20 20_30_11 -400?
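For completeness, here is a rough sketch of the substring/regexp_replace route the question describes (the DataFrame and column names are made up for illustration); it does work, but it takes several steps compared with the single unix_timestamp / to_timestamp call in the answers above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace, substring, concat_ws

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('2016_08_21 11_31_08',)], ['datetime'])

parsed = (df
    # fix the separators in each half of the string ...
    .withColumn('date_part', regexp_replace(substring('datetime', 1, 10), '_', '-'))
    .withColumn('time_part', regexp_replace(substring('datetime', 12, 8), '_', ':'))
    # ... then reassemble the halves and cast the result
    .withColumn('date_time', concat_ws(' ', 'date_part', 'time_part').cast('timestamp')))

parsed.show(truncate=False)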