PySpark: Convert Column from String Type to Timestamp Type


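Your pattern uses a capital M for both the months and the minutes. In the date pattern syntax, MM means month and mm means minute, so the minutes must be written with a lowercase m. That mismatch is why every row comes back as null. A working example using to_timestamp is given below, hope this helps!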

import pyspark.sql.functions as F

df = sqlContext.createDataFrame(
    [
     ('2016/04/14 190',),
     ('2016/04/15 180',),
     ('2016/04/14 1530',),
     ('2016/04/16 1530',),
     ('2016/04/17 160',)
    ],
    ("TIME",)
)

# MM = month, HH = hour of day (0-23), mm = minute (lowercase)
df.withColumn('TIME_timestamp', F.to_timestamp("TIME", "yyyy/MM/dd HHmm")).show()

Output:

+---------------+-------------------+
|           TIME|     TIME_timestamp|
+---------------+-------------------+
| 2016/04/14 190|2016-04-14 19:00:00|
| 2016/04/15 180|2016-04-15 18:00:00|
|2016/04/14 1530|2016-04-14 15:30:00|
|2016/04/16 1530|2016-04-16 15:30:00|
| 2016/04/17 160|2016-04-17 16:00:00|
+---------------+-------------------+
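
The values in the output above can be cross-checked outside Spark. The following is only a sketch in plain Python, not part of the original answer: `parse_time` is a hypothetical helper name, and it assumes (as the question states) that the first two digits after the space are the hour and the rest are the minutes. It zero-pads the minutes first so that an input such as 190 is read unambiguously as 19:00.

```python
from datetime import datetime


def parse_time(raw: str) -> datetime:
    """Parse strings like '2016/04/14 190' (hypothetical helper).

    Assumes the first two digits of the time part are the hour
    and the remaining digits are the minutes.
    """
    date_part, time_part = raw.split()
    hour, minute = time_part[:2], time_part[2:]
    # Zero-pad the minutes so '0' becomes '00' before parsing.
    normalized = f"{date_part} {hour}{minute.zfill(2)}"
    # Note: Python's strptime uses %m for month and %M for minute,
    # the reverse of the letter case in Spark patterns (MM month, mm minute).
    return datetime.strptime(normalized, "%Y/%m/%d %H%M")


rows = [
    "2016/04/14 190",
    "2016/04/15 180",
    "2016/04/14 1530",
    "2016/04/16 1530",
    "2016/04/17 160",
]
parsed = [parse_time(r) for r in rows]
for raw, ts in zip(rows, parsed):
    print(raw, "->", ts)
```

The printed timestamps should match the TIME_timestamp column shown above, which is a quick way to confirm that the lowercase-m pattern gives the intended hours and minutes.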
Author: Ahmad Senousi

Updated on December 09, 2022

Comments

  • Ahmad Senousi (over 1 year ago)

    I have been using PySpark 2.3. I have a DataFrame with a 'TIME' column that stores date-time values as strings. The column looks like this:

    +---------------+
    |           TIME|
    +---------------+
    | 2016/04/14 190|
    | 2016/04/15 180|
    |2016/04/14 1530|
    |2016/04/16 1530|
    | 2016/04/17 160|
    +---------------+
    

    where the first two digits of 190 or 1530 represent the hour and the remaining digits are the minutes. I tried to convert the column to a timestamp type using the following line:

    df.withColumn('TIME_timestamp',fn.unix_timestamp('TIME','yyyy/MM/dd HHMM').cast(TimestampType()))
    

    and also:

    df.withColumn('TIME_timestamp', fn.to_timestamp("TIME", 'yyyy/MM/dd HHMM'))
    

    but the result is:

    +---------------+-------------------+
    |           TIME|     TIME_timestamp|
    +---------------+-------------------+
    | 2016/04/14 190|               null|
    | 2016/04/15 180|               null|
    |2016/04/14 1530|               null|
    |2016/04/16 1530|               null|
    | 2016/04/17 160|               null|
    +---------------+-------------------+
    

    So the desired DataFrame should look like:

    +---------------+
    | TIME_timestamp|
    +---------------+
    | 16-04-15 19:00|
    | 16-04-15 18:00|
    | 16-04-15 15:30|
    | 16-04-15 15:30|
    | 16-04-15 16:00|
    +---------------+
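
The desired layout above uses a two-digit year and no seconds, which differs from the default timestamp display. As a rough sketch of that formatting step in plain Python (`to_short_display` is a hypothetical name, not something from the question or the answer):

```python
from datetime import datetime


def to_short_display(ts: datetime) -> str:
    """Render a timestamp as 'yy-MM-dd HH:mm' (hypothetical helper)."""
    # %y = two-digit year, %m = month, %d = day, %H = hour (0-23), %M = minute
    return ts.strftime("%y-%m-%d %H:%M")


sample = datetime(2016, 4, 14, 19, 0)
print(to_short_display(sample))  # 16-04-14 19:00
```

In Spark itself, the analogous step would presumably be `pyspark.sql.functions.date_format` applied to the timestamp column with the pattern "yy-MM-dd HH:mm". That yields a string column rather than a timestamp, so it is best applied only for display; this has not been checked against the asker's 2.3 setup.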