Read and write empty string "" vs NULL in Spark 2.0.1
19,577
A mere two and a half years later, empty strings are no longer considered equal to null
values thanks to Spark 2.4.0! See this commit for a bit of detail on functionality. Your code will behave as expected under 2.4.0+:
val df = session.createDataFrame(Seq(
(0, "a"),
(1, "b"),
(2, "c"),
(3, ""),
(4, null)
))
df.coalesce(1).write.mode("overwrite").format("csv")
.option("delimiter", ",")
.option("nullValue", "unknown")
.option("treatEmptyValuesAsNulls", "false")
.save(s"$path/test")
Results in:
0,a
1,b
2,c
3,
4,unknown
Related videos on Youtube
Author by
Kyro
Updated on June 04, 2022Comments
-
Kyro almost 2 years
CSVFileFormat
seems to read and write empty values as null for string columns. I have searched around but have been unable to find clear information about this, so I put together a simple test.val df = session.createDataFrame(Seq( (0, "a"), (1, "b"), (2, "c"), (3, ""), (4, null) )) df.coalesce(1).write.mode("overwrite").format("csv") .option("delimiter", ",") .option("nullValue", "unknown") .option("treatEmptyValuesAsNulls", "false") .save(s"$path/test")
This outputs:
0,a 1,b 2,c 3,unknown 4,unknown
So, it appears to be treating both empty strings and
null
values asnull
. The same thing happens when reading a CSV file with empty quoted strings and nulls. Is there currently any way to treat these differently?-
Kyro over 7 yearsIt seems they've explicitly set empty string to null in github.com/apache/spark/pull/14118. And there is an open PR to provide for empty string vs null values - github.com/apache/spark/pull/12904.
-
Chitral Verma about 6 yearscan't you simply replace the
null
and empties with different values before writing and after reading ? I think it's also mentioned in one of the links you posted. -
Kyro almost 6 years@Chitral Verma Sure, this is possible, but not always feasible depending on the source of the file, or the destination. It also adds an additional transformation step to every read and write that I think should not be necessary.
-
Chitral Verma almost 6 yearswhenever those changes get merged, the transformation overhead with still be there. It'll just be abstract though. source and destination don't really matter as this problem only pertains to CSVFileFormat
-