Read and write empty string "" vs NULL in Spark 2.0.1

19,577

A mere two and a half years later, empty strings are no longer considered equal to null values thanks to Spark 2.4.0! See this commit for a bit of detail on functionality. Your code will behave as expected under 2.4.0+:

val df = session.createDataFrame(Seq(
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, ""),
    (4, null)
))  

df.coalesce(1).write.mode("overwrite").format("csv") 
    .option("delimiter", ",")
    .option("nullValue", "unknown")
    .option("treatEmptyValuesAsNulls", "false")
    .save(s"$path/test")

Results in:

0,a
1,b
2,c
3,
4,unknown
Share:
19,577

Related videos on Youtube

Kyro
Author by

Kyro

Updated on June 04, 2022

Comments

  • Kyro
    Kyro almost 2 years

    CSVFileFormat seems to read and write empty values as null for string columns. I have searched around but have been unable to find clear information about this, so I put together a simple test.

    val df = session.createDataFrame(Seq(
        (0, "a"),
        (1, "b"),
        (2, "c"),
        (3, ""),
        (4, null)
    ))  
    
    df.coalesce(1).write.mode("overwrite").format("csv") 
        .option("delimiter", ",")
        .option("nullValue", "unknown")
        .option("treatEmptyValuesAsNulls", "false")
        .save(s"$path/test")
    

    This outputs:

    0,a
    1,b
    2,c
    3,unknown
    4,unknown
    

    So, it appears to be treating both empty strings and null values as null. The same thing happens when reading a CSV file with empty quoted strings and nulls. Is there currently any way to treat these differently?

    • Kyro
      Kyro over 7 years
      It seems they've explicitly set empty string to null in github.com/apache/spark/pull/14118. And there is an open PR to provide for empty string vs null values - github.com/apache/spark/pull/12904.
    • Chitral Verma
      Chitral Verma about 6 years
      can't you simply replace the null and empties with different values before writing and after reading ? I think it's also mentioned in one of the links you posted.
    • Kyro
      Kyro almost 6 years
      @Chitral Verma Sure, this is possible, but not always feasible depending on the source of the file, or the destination. It also adds an additional transformation step to every read and write that I think should not be necessary.
    • Chitral Verma
      Chitral Verma almost 6 years
      whenever those changes get merged, the transformation overhead with still be there. It'll just be abstract though. source and destination don't really matter as this problem only pertains to CSVFileFormat