Format TimestampType in spark DataFrame- Scala
Solution 1
I think what you are missing is that timestamp / datetime fields do NOT have a readable format in native storage. The stored representation is a float, an INT96, or similar, depending on the database. Formatting a datetime / timestamp for readability has always been a reporting concern (i.e., performed by the tool preparing the data for display), which is why, when you supplied a string format for the date, it was correctly stored as a string. The database (Spark) stores only what it needs to identify the exact time value.
You can specify that a timestamp value has no milliseconds, i.e., a millisecond value of 0, but not that it should not display milliseconds.
This would be akin to specifying rounding behavior on a numeric column (also a reporting concern).
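To make the storage-versus-display distinction concrete, here is a small plain-JVM sketch (no Spark needed): the stored value is a single instant, and the different "formats" are just different renderings produced at display time. The object name is illustrative, not from the original answer.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object DisplayFormatDemo {
  def main(args: Array[String]): Unit = {
    // The stored value is just an instant; a "format" only exists at display time.
    val stored = LocalDateTime.of(2017, 1, 1, 12, 2, 0)

    // Two different renderings of the SAME stored value:
    val withMillis    = stored.format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.S"))
    val withoutMillis = stored.format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"))

    println(withMillis)    // 2017-01-01 12:02:00.0
    println(withoutMillis) // 2017-01-01 12:02:00
  }
}
```

Nothing about the underlying value changed between the two `println` calls; only the display pattern did.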
Solution 2
You can use unix_timestamp to convert a string date-time to a timestamp.
unix_timestamp(Column s, String p)
Converts a time string with the given pattern (see http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) to a Unix timestamp (in seconds), returning null on failure.
val format = "yyyy-MM-dd HH:mm:ss"
dataframe.withColumn("column_A", unix_timestamp($"date", format))
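For context: unix_timestamp parses with java.text.SimpleDateFormat pattern semantics and returns whole epoch seconds, so any fractional part of the input is dropped; casting the result back to a timestamp therefore yields second precision. A minimal plain-JVM sketch of that truncation (the `toEpochSeconds` helper and object name are illustrative, not part of Spark's API):

```scala
import java.text.SimpleDateFormat
import java.util.Date

object UnixTimestampSketch {
  def main(args: Array[String]): Unit = {
    val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

    // Like unix_timestamp, keep only whole epoch seconds; SimpleDateFormat
    // ignores trailing text after the pattern, so the ".0" input still parses.
    def toEpochSeconds(s: String): Long = fmt.parse(s).getTime / 1000

    val secs = toEpochSeconds("2017-01-01 12:02:00.0")

    // Rendering those whole seconds back drops the fractional part.
    println(fmt.format(new Date(secs * 1000))) // 2017-01-01 12:02:00
  }
}
```

In Spark, the analogous step is applying unix_timestamp with the same pattern and then casting the resulting seconds column back to timestamp, which leaves no sub-second component to display.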
Hope this helps!
SrinR
Updated on July 21, 2022
Comments
- SrinR, almost 2 years ago
When I try to cast a string field to TimestampType in a Spark DataFrame, the output value comes with microsecond precision (
yyyy-MM-dd HH:mm:ss.S
). But I need the format to be yyyy-MM-dd HH:mm:ss,
i.e., excluding the microsecond precision. Also, I want to save this as a timestamp field while writing into a Parquet file, so the datatype of my field should be a timestamp in the format yyyy-MM-dd HH:mm:ss.
I tried using TimestampType as
col("column_A").cast(TimestampType) or col("column_A").cast("timestamp")
to cast the field to timestamp. These are able to cast the field to timestamp but with the microsecond precision.
Can anyone help in saving the timestamp datatype to a Parquet file with the required format specification?
EDIT
Input:

val a = sc.parallelize(List(("a", "2017-01-01 12:02:00.0"), ("b", "2017-02-01 11:22:30"))).toDF("cola", "colb")

scala> a.withColumn("datetime", date_format(col("colb"), "yyyy-MM-dd HH:mm:ss")).show(false)
+----+---------------------+-------------------+
|cola|colb                 |datetime           |
+----+---------------------+-------------------+
|a   |2017-01-01 12:02:00.0|2017-01-01 12:02:00|
|b   |2017-02-01 11:22:30  |2017-02-01 11:22:30|
+----+---------------------+-------------------+

scala> a.withColumn("datetime", date_format(col("colb"), "yyyy-MM-dd HH:mm:ss")).printSchema
root
 |-- cola: string (nullable = true)
 |-- colb: string (nullable = true)
 |-- datetime: string (nullable = true)
In the above, we are getting the right timestamp format, but when we print the Schema, the datetime field is of type String, but I need a timestamp type here.
Now, if I attempt to cast the field to timestamp, the format is set to microsecond precision, which is not what I intend.
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val a = sc.parallelize(List(("a", "2017-01-01 12:02:00.0"), ("b", "2017-02-01 11:22:30"))).toDF("cola", "colb")
a: org.apache.spark.sql.DataFrame = [cola: string, colb: string]

scala> a.withColumn("datetime", date_format(col("colb").cast(TimestampType), "yyyy-MM-dd HH:mm:ss").cast(TimestampType)).show(false)
+----+---------------------+---------------------+
|cola|colb                 |datetime             |
+----+---------------------+---------------------+
|a   |2017-01-01 12:02:00.0|2017-01-01 12:02:00.0|
|b   |2017-02-01 11:22:30  |2017-02-01 11:22:30.0|
+----+---------------------+---------------------+

scala> a.withColumn("datetime", date_format(col("colb").cast(TimestampType), "yyyy-MM-dd HH:mm:ss").cast(TimestampType)).printSchema
root
 |-- cola: string (nullable = true)
 |-- colb: string (nullable = true)
 |-- datetime: timestamp (nullable = true)
What I am expecting is for the format to be
yyyy-MM-dd HH:mm:ss
and also for the datatype of the field to be timestamp.
Thanks in advance