Why do I get null results from the date_format() PySpark function?


It doesn't work because your data is not a valid ISO 8601 representation, so the cast to date returns NULL:

sqlContext.sql("SELECT CAST('12-21-1991' AS DATE)").show()
## +----+
## | _c0|
## +----+
## |null|
## +----+
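
To see why the cast fails, note that ISO 8601 writes dates year-first (yyyy-MM-dd). A quick plain-Python illustration of the same mismatch, using only the standard datetime module (no Spark needed):

```python
from datetime import datetime

# ISO 8601 order (year-month-day) parses fine:
ok = datetime.strptime("1991-12-21", "%Y-%m-%d")
print(ok.date())  # 1991-12-21

# The month-first string does not match the ISO layout --
# essentially the same reason Spark's CAST returns NULL:
try:
    datetime.strptime("12-21-1991", "%Y-%m-%d")
except ValueError as e:
    print("parse failed:", e)
```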

You'll have to parse the data first using a custom format:

from pyspark.sql.functions import date_format, unix_timestamp

output_format = ...  # Some SimpleDateFormat string

df.select(date_format(
    unix_timestamp("dates1", "MM-dd-yyyy").cast("timestamp"),
    output_format
))
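
The same two-step parse-then-format idea can be sketched in plain Python (standard datetime only). Spark's `MM-dd-yyyy` pattern corresponds to `%m-%d-%Y` here, and the output format below is just an assumed example, not something from the question:

```python
from datetime import datetime

def reformat_date(s, in_fmt="%m-%d-%Y", out_fmt="%Y/%m/%d"):
    # Step 1: parse the string with the custom input format
    # (the analogue of unix_timestamp(col, "MM-dd-yyyy")).
    parsed = datetime.strptime(s, in_fmt)
    # Step 2: render it in the desired output format
    # (the analogue of date_format(..., output_format)).
    return parsed.strftime(out_fmt)

print(reformat_date("12-21-1991"))  # 1991/12/21
```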
Author: Hugo Reyes

Updated on June 28, 2022

Comments

  • Hugo Reyes almost 2 years

    Suppose there is a dataframe with a column of dates stored as strings. Under that assumption, we create the following dataFrame as an example:

    # Importing sql types
    from pyspark.sql.types import StringType, IntegerType, StructType, StructField, DoubleType, FloatType, DateType
    from pyspark.sql.functions  import date_format
    import random
    import time
    
    def strTimeProp(start, end, format, prop):
        stime = time.mktime(time.strptime(start, format))  # parse the start date into seconds since the epoch
        etime = time.mktime(time.strptime(end, format))    # parse the end date the same way
        ptime = stime + prop * (etime - stime)             # interpolate: prop is a fraction in [0, 1]
        return time.strftime(format, time.localtime(ptime))
    
    def randomDate(start, end, prop):
        return strTimeProp(start, end, '%m-%d-%Y', prop)
    
    # Creating a test dataframe:
    schema = StructType(
        [
            StructField("dates1", StringType(), True),
            StructField("dates2", StringType(), True)
        ]
    )
    
    size = 32
    numCol1 = [str(randomDate("1-1-1991", "1-1-1992", random.random())) for number in range(size)]
    numCol2 = [str(randomDate("1-1-1991", "1-1-1992", random.random())) for number in range(size)]
    # Building dataFrame:
    from pyspark.sql import SQLContext  # needed for SQLContext(sc)
    sqlContext = SQLContext(sc)
    df = sqlContext.createDataFrame(list(zip(numCol1, numCol2)), schema=schema)
    df.show(5)
    

    In the code above, random date columns are generated; here is an example of the result:

    +----------+----------+
    |    dates1|    dates2|
    +----------+----------+
    |12-21-1991|05-30-1991|
    |05-28-1991|01-23-1991|
    |03-01-1991|08-05-1991|
    |07-15-1991|05-13-1991|
    |07-21-1991|11-10-1991|
    +----------+----------+
    

    What I am trying to do is change the date format with the following code (from the pySpark documentation):

    # Changing date formats:
    df.select(date_format('dates1', 'MM-dd-yyy').alias('newFormat')).show(5)
    

    But I get this bad result:

    +---------+
    |newFormat|
    +---------+
    |     null|
    |     null|
    |     null|
    |     null|
    |     null|
    +---------+
    

    I suppose there is a problem related to the string dataType, but at the same time I don't understand why the code below works and the code above doesn't.

    fechas = ['1000-01-01', '1000-01-15']
    df = sqlContext.createDataFrame(list(zip(fechas, fechas)), ['dates', 'd'])
    df.show()
    
    # Changing date formats:
    df.select(date_format('dates', 'MM-dd-yyy').alias('newFormat')).show()
    

    Output:

    +----------+----------+
    |     dates|         d|
    +----------+----------+
    |1000-01-01|1000-01-01|
    |1000-01-15|1000-01-15|
    +----------+----------+
    
    +----------+
    | newFormat|
    +----------+
    |01-01-1000|
    |01-15-1000|
    +----------+
    

    This last result is what I want.