Difference between === null and isNull in Spark DataFrame
Solution 1
First and foremost, don't use null in your Scala code unless you really have to, for compatibility reasons.
Regarding your question, it is plain SQL. col("c1") === null is interpreted as c1 = NULL and, because NULL marks undefined values, the result is undefined for any value, including NULL itself.
spark.sql("SELECT NULL = NULL").show
+-------------+
|(NULL = NULL)|
+-------------+
|         null|
+-------------+
spark.sql("SELECT NULL != NULL").show
+-------------------+
|(NOT (NULL = NULL))|
+-------------------+
|               null|
+-------------------+
spark.sql("SELECT TRUE != NULL").show
+------------------------------------+
|(NOT (true = CAST(NULL AS BOOLEAN)))|
+------------------------------------+
|                                null|
+------------------------------------+
spark.sql("SELECT TRUE = NULL").show
+------------------------------+
|(true = CAST(NULL AS BOOLEAN))|
+------------------------------+
|                          null|
+------------------------------+
The only valid methods to check for NULL are:

- IS NULL:

spark.sql("SELECT NULL IS NULL").show
+--------------+
|(NULL IS NULL)|
+--------------+
|          true|
+--------------+
spark.sql("SELECT TRUE IS NULL").show
+--------------+
|(true IS NULL)|
+--------------+
|         false|
+--------------+

- IS NOT NULL:

spark.sql("SELECT NULL IS NOT NULL").show
+------------------+
|(NULL IS NOT NULL)|
+------------------+
|             false|
+------------------+
spark.sql("SELECT TRUE IS NOT NULL").show
+------------------+
|(true IS NOT NULL)|
+------------------+
|              true|
+------------------+
These are implemented in the DataFrame DSL as Column.isNull and Column.isNotNull respectively.
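As a minimal DSL sketch of these two methods (the tiny two-row DataFrame below is invented for illustration; in spark-shell a SparkSession named spark already exists and getOrCreate simply returns it):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// two rows: one defined value and one NULL in column c1
val df = Seq(Option("a"), Option.empty[String]).toDF("c1")

// Column.isNull keeps only the row where c1 is NULL
println(df.filter(col("c1").isNull).count())    // 1

// Column.isNotNull keeps only the row where c1 is defined
println(df.filter(col("c1").isNotNull).count()) // 1
```

Note that df.filter(col("c1") === null) would return 0 rows here, for the reason explained above: the comparison evaluates to NULL for every row.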
Note: for NULL-safe comparisons use IS DISTINCT FROM / IS NOT DISTINCT FROM:
spark.sql("SELECT NULL IS NOT DISTINCT FROM NULL").show
+---------------+
|(NULL <=> NULL)|
+---------------+
|           true|
+---------------+
spark.sql("SELECT NULL IS NOT DISTINCT FROM TRUE").show
+--------------------------------+
|(CAST(NULL AS BOOLEAN) <=> true)|
+--------------------------------+
|                           false|
+--------------------------------+
or not(_ <=> _) / <=>:
spark.sql("SELECT NULL AS col1, NULL AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
|           true|
+---------------+
spark.sql("SELECT NULL AS col1, TRUE AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
|          false|
+---------------+
in SQL and the DataFrame DSL respectively.
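The null-safe DSL operators can be sketched like this (a small invented DataFrame; the <=> operator is also available under the method name Column.eqNullSafe):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.not

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  (Option("a"), Option("a")),                   // equal values
  (Option.empty[String], Option.empty[String]), // both NULL
  (Option("a"), Option.empty[String])           // one NULL
).toDF("c1", "c2")

// <=> is null-safe equality: NULL <=> NULL is true, never NULL
println(df.filter($"c1" <=> $"c2").count())      // 2

// not(_ <=> _) is the null-safe inequality
println(df.filter(not($"c1" <=> $"c2")).count()) // 1
```

Unlike ===, these predicates always resolve to true or false, never to NULL, which makes them safe to use in filters and join conditions on nullable columns.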
Related:
Including null values in an Apache Spark Join
Solution 2
Usually the best way to shed light onto unexpected results in Spark Dataframes is to look at the explain plan. Consider the following example:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
object Example extends App {
  val session = SparkSession.builder().master("local[*]").getOrCreate()

  case class Record(c1: String, c2: String)

  val data = List(Record("a", "b"), Record(null, "c"))
  val rdd = session.sparkContext.parallelize(data)

  import session.implicits._
  val df: DataFrame = rdd.toDF

  val filtered = df.filter(col("c1") === null)
  println(filtered.count()) // <-- outputs 0, not expected

  val filtered2 = df.filter(col("c1").isNull)
  println(filtered2.count()) // <-- outputs 1, as expected

  filtered.explain(true)
  filtered2.explain(true)
}
The first explain plan shows:
== Physical Plan ==
*Filter (isnotnull(c1#2) && null)
+- Scan ExistingRDD[c1#2,c2#3]
This filter clause looks nonsensical: the && null ensures the predicate can never resolve to true.
The second explain plan looks like:
== Physical Plan ==
*Filter isnull(c1#2)
+- Scan ExistingRDD[c1#2,c2#3]
Here the filter is what we expect and want.
John
Updated on July 09, 2022

Comments
- John almost 2 years: I am a bit confused about the difference when using df.filter(col("c1") === null) and df.filter(col("c1").isNull). On the same dataframe I am getting counts with === null but zero counts with isNull. Please help me understand the difference. Thanks
- John over 7 years: Sorry, I could not understand one thing: why am I getting some records when I use ===? How is that possible @mattinbits
- mattinbits over 7 years: Does your data actually have null in that column?
- John over 7 years: It has some empty records (""), but the number of records returned when I use === null does not match the number of empty strings. It has both empty and non-empty records. But I am getting a null pointer exception when calling a udf that doesn't have a null check.
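A follow-up sketch on that last point: an empty string is a regular value, not NULL, so isNull will not match "" rows; to catch both you must check both conditions explicitly (the column name and data here are invented for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// three rows: a defined value, an empty string, and a NULL
val df = Seq(Option("a"), Option(""), Option.empty[String]).toDF("c1")

// "" is not NULL: only the None row matches isNull
println(df.filter(col("c1").isNull).count())                     // 1

// to match both empty strings and NULLs, combine the conditions
println(df.filter(col("c1").isNull || col("c1") === "").count()) // 2
```

This also explains the NullPointerException mentioned above: a udf receives the NULL rows as null references unless the column is filtered or the udf handles null itself.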