Format TimestampType in spark DataFrame- Scala
Solution 1
I think what you are missing is that timestamp / datetime fields do NOT have a readable format in native storage. The stored representation is a float, an INT96, or similar, depending on the database. Formatting a datetime / timestamp for readability has always been a reporting concern (i.e., performed by the tool preparing the data for display), which is why, when you supplied a string format for the date, it was correctly stored as a string. The database (Spark) stores only what it needs to identify the exact time value.
You can specify that a timestamp value has no milliseconds, i.e., a millisecond value of 0, but not that it should not display milliseconds.
This would be akin to specifying rounding behavior on a numeric column (also a reporting concern).
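To make the storage-versus-display distinction concrete, here is a small plain-JVM sketch (no Spark needed): the stored value is a single instant, and the different "formats" are just different renderings produced at display time. The object name is illustrative, not from the original answer.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object DisplayFormatDemo {
  def main(args: Array[String]): Unit = {
    // The stored value is just an instant; a "format" only exists at display time.
    val stored = LocalDateTime.of(2017, 1, 1, 12, 2, 0)

    // Two different renderings of the SAME stored value:
    val withMillis    = stored.format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.S"))
    val withoutMillis = stored.format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"))

    println(withMillis)    // 2017-01-01 12:02:00.0
    println(withoutMillis) // 2017-01-01 12:02:00
  }
}
```

Nothing about the underlying value changed between the two `println` calls; only the display pattern did.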
Solution 2
You can use unix_timestamp to convert a string date-time to a timestamp.
unix_timestamp(Column s, String p)
Converts a time string with the given pattern (see http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) to a Unix timestamp (in seconds), returning null on failure.
val format = "yyyy-MM-dd HH:mm:ss"
dataframe.withColumn("column_A", unix_timestamp($"date", format))
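For context: unix_timestamp parses with java.text.SimpleDateFormat pattern semantics and returns whole epoch seconds, so any fractional part of the input is dropped; casting the result back to a timestamp therefore yields second precision. A minimal plain-JVM sketch of that truncation (the `toEpochSeconds` helper and object name are illustrative, not part of Spark's API):

```scala
import java.text.SimpleDateFormat
import java.util.Date

object UnixTimestampSketch {
  def main(args: Array[String]): Unit = {
    val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

    // Like unix_timestamp, keep only whole epoch seconds; SimpleDateFormat
    // ignores trailing text after the pattern, so the ".0" input still parses.
    def toEpochSeconds(s: String): Long = fmt.parse(s).getTime / 1000

    val secs = toEpochSeconds("2017-01-01 12:02:00.0")

    // Rendering those whole seconds back drops the fractional part.
    println(fmt.format(new Date(secs * 1000))) // 2017-01-01 12:02:00
  }
}
```

In Spark, the analogous step is applying unix_timestamp with the same pattern and then casting the resulting seconds column back to timestamp, which leaves no sub-second component to display.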
Hope this helps!
SrinR
Updated on July 21, 2022
Comments
- SrinR, almost 2 years ago
When I try to cast a string field to TimestampType in a Spark DataFrame, the output value comes with microsecond precision (
yyyy-MM-dd HH:mm:ss.S
). But I need the format to be yyyy-MM-dd HH:mm:ss,
i.e., excluding the microsecond precision. Also, I want to save this as a timestamp field while writing into a Parquet file, so the datatype of my field should be a timestamp in the format yyyy-MM-dd HH:mm:ss.
I tried using TimestampType as
col("column_A").cast(TimestampType) or col("column_A").cast("timestamp")
to cast the field to timestamp. These are able to cast the field to timestamp but with the microsecond precision.
Can anyone help in saving the timestamp datatype to a Parquet file with the required format specification?
EDIT
Input:

val a = sc.parallelize(List(("a", "2017-01-01 12:02:00.0"), ("b", "2017-02-01 11:22:30"))).toDF("cola", "colb")

scala> a.withColumn("datetime", date_format(col("colb"), "yyyy-MM-dd HH:mm:ss")).show(false)
+----+---------------------+-------------------+
|cola|colb                 |datetime           |
+----+---------------------+-------------------+
|a   |2017-01-01 12:02:00.0|2017-01-01 12:02:00|
|b   |2017-02-01 11:22:30  |2017-02-01 11:22:30|
+----+---------------------+-------------------+

scala> a.withColumn("datetime", date_format(col("colb"), "yyyy-MM-dd HH:mm:ss")).printSchema
root
 |-- cola: string (nullable = true)
 |-- colb: string (nullable = true)
 |-- datetime: string (nullable = true)
In the above, we are getting the right timestamp format, but when we print the Schema, the datetime field is of type String, but I need a timestamp type here.
Now, if I attempt to cast the field to timestamp, the format is set to microsecond precision, which is not what I intend.
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val a = sc.parallelize(List(("a", "2017-01-01 12:02:00.0"), ("b", "2017-02-01 11:22:30"))).toDF("cola", "colb")
a: org.apache.spark.sql.DataFrame = [cola: string, colb: string]

scala> a.withColumn("datetime", date_format(col("colb").cast(TimestampType), "yyyy-MM-dd HH:mm:ss").cast(TimestampType)).show(false)
+----+---------------------+---------------------+
|cola|colb                 |datetime             |
+----+---------------------+---------------------+
|a   |2017-01-01 12:02:00.0|2017-01-01 12:02:00.0|
|b   |2017-02-01 11:22:30  |2017-02-01 11:22:30.0|
+----+---------------------+---------------------+

scala> a.withColumn("datetime", date_format(col("colb").cast(TimestampType), "yyyy-MM-dd HH:mm:ss").cast(TimestampType)).printSchema
root
 |-- cola: string (nullable = true)
 |-- colb: string (nullable = true)
 |-- datetime: timestamp (nullable = true)
What I am expecting is for the format to be
yyyy-MM-dd HH:mm:ss
and also for the datatype of the field to be timestamp.
Thanks in advance