Partitioning by a timestamp column in PySpark DataFrames


Spark >= 3.1

Instead of cast, use timestamp_seconds:

from pyspark.sql.functions import col, timestamp_seconds, year

year(timestamp_seconds(col("timestamp")))
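For a complete picture, here is a minimal Spark >= 3.1 sketch (assuming df is the example DataFrame defined below, with timestamp holding UNIX seconds):

from pyspark.sql.functions import col, month, timestamp_seconds, year

# timestamp_seconds converts UNIX seconds to a proper timestamp,
# so no explicit cast is needed before extracting the date parts
df_with_year_and_month = (df
    .withColumn("year", year(timestamp_seconds(col("timestamp"))))
    .withColumn("month", month(timestamp_seconds(col("timestamp")))))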

Spark < 3.1

Just extract the fields you want to use and pass the list of columns as an argument to the writer's partitionBy. If timestamp contains UNIX timestamps expressed in seconds:

df = sc.parallelize([
    (1484810378, 1, "sam", 8, 102, "It"),
    (1484815300, 2, "ram", 7, 103, "Accounts")
]).toDF(["timestamp", "id", "name", "hours", "dno", "dname"])
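On recent Spark versions the same frame is usually built through the SparkSession rather than the RDD API; a sketch, assuming a spark session is already in scope:

df = spark.createDataFrame([
    (1484810378, 1, "sam", 8, 102, "It"),
    (1484815300, 2, "ram", 7, 103, "Accounts")
], ["timestamp", "id", "name", "hours", "dno", "dname"])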

add columns:

from pyspark.sql.functions import year, month, col

df_with_year_and_month = (df
    .withColumn("year", year(col("timestamp").cast("timestamp")))
    .withColumn("month", month(col("timestamp").cast("timestamp"))))

and write:

(df_with_year_and_month
    .write
    .partitionBy("year", "month")
    .mode("overwrite")
    .format("parquet")
    .saveAsTable("default.testing"))
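
To confirm the partition layout was registered in the metastore, a quick check (the partition values shown are what the sample data would produce):

spark.sql("SHOW PARTITIONS default.testing").show(truncate=False)
# e.g. year=2017/month=1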

Comments

  • User12345, almost 2 years ago

    I have a DataFrame in PySpark in the format below:

    Date        Id  Name    Hours   Dno Dname
    12/11/2013  1   sam     8       102 It
    12/10/2013  2   Ram     7       102 It
    11/10/2013  3   Jack    8       103 Accounts
    12/11/2013  4   Jim     9       101 Marketing
    

    I want to partition it by Dno and save it as a table in Hive using the Parquet format.

    df.write.saveAsTable(
        'default.testing', mode='overwrite', partitionBy='Dno', format='parquet')
    

    The query worked fine and created the table in Hive in Parquet format.

    Now I want to partition based on the year and month of the date column, which is stored as a Unix timestamp.

    How can we achieve that in PySpark? I have done it in Hive but am unable to do it in PySpark.