Difference between === null and isNull in Spark DataFrame


Solution 1

First and foremost don't use null in your Scala code unless you really have to for compatibility reasons.
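In idiomatic Scala, a possibly-missing value is modeled with Option rather than null. A minimal sketch (the names here are illustrative):

```scala
// Option(x) yields None when x is null, Some(x) otherwise,
// so downstream code never has to test for null explicitly.
val missing: Option[String] = Option(null)    // None
val present: Option[String] = Option("Alice") // Some("Alice")

// map/getOrElse (or pattern matching) replaces explicit null checks.
val greeting = present.map(n => s"Hello, $n").getOrElse("Hello, stranger")
```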

As for your question, this is plain SQL semantics. col("c1") === null is interpreted as c1 = NULL, and because NULL marks an undefined value, the result of the comparison is undefined for any value, including NULL itself.

spark.sql("SELECT NULL = NULL").show
+-------------+
|(NULL = NULL)|
+-------------+
|         null|
+-------------+
spark.sql("SELECT NULL != NULL").show
+-------------------+
|(NOT (NULL = NULL))|
+-------------------+
|               null|
+-------------------+
spark.sql("SELECT TRUE != NULL").show
+------------------------------------+
|(NOT (true = CAST(NULL AS BOOLEAN)))|
+------------------------------------+
|                                null|
+------------------------------------+
spark.sql("SELECT TRUE = NULL").show
+------------------------------+
|(true = CAST(NULL AS BOOLEAN))|
+------------------------------+
|                          null|
+------------------------------+

The only valid methods to check for NULL are:

  • IS NULL:

    spark.sql("SELECT NULL IS NULL").show
    
    +--------------+
    |(NULL IS NULL)|
    +--------------+
    |          true|
    +--------------+
    
    spark.sql("SELECT TRUE IS NULL").show
    
    +--------------+
    |(true IS NULL)|
    +--------------+
    |         false|
    +--------------+
    
  • IS NOT NULL:

    spark.sql("SELECT NULL IS NOT NULL").show
    
    +------------------+
    |(NULL IS NOT NULL)|
    +------------------+
    |             false|
    +------------------+
    
    spark.sql("SELECT TRUE IS NOT NULL").show
    
    +------------------+
    |(true IS NOT NULL)|
    +------------------+
    |              true|
    +------------------+
    

implemented in DataFrame DSL as Column.isNull and Column.isNotNull respectively.
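In the DataFrame DSL the same checks look like this (a minimal sketch, assuming a DataFrame df with a nullable column c1):

```scala
import org.apache.spark.sql.functions.col

// Keep only rows where c1 IS NULL
val nulls = df.filter(col("c1").isNull)

// Keep only rows where c1 IS NOT NULL
val nonNulls = df.filter(col("c1").isNotNull)
```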

Note:

For NULL-safe comparisons use IS DISTINCT FROM / IS NOT DISTINCT FROM:

spark.sql("SELECT NULL IS NOT DISTINCT FROM NULL").show
+---------------+
|(NULL <=> NULL)|
+---------------+
|           true|
+---------------+
spark.sql("SELECT NULL IS NOT DISTINCT FROM TRUE").show
+--------------------------------+
|(CAST(NULL AS BOOLEAN) <=> true)|
+--------------------------------+
|                           false|
+--------------------------------+

or the null-safe equality operator <=> (and its negation not(_ <=> _)):

spark.sql("SELECT NULL AS col1, NULL AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
|           true|
+---------------+
spark.sql("SELECT NULL AS col1, TRUE AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
|          false|
+---------------+

in SQL and DataFrame DSL respectively.

Related:

Including null values in an Apache Spark Join

Solution 2

Usually the best way to shed light on unexpected results in Spark DataFrames is to look at the explain plan. Consider the following example:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object Example extends App {

  val session = SparkSession.builder().master("local[*]").getOrCreate()
  case class Record(c1: String, c2: String)
  val data = List(Record("a", "b"), Record(null, "c"))
  val rdd = session.sparkContext.parallelize(data)
  import session.implicits._

  val df: DataFrame = rdd.toDF
  val filtered = df.filter(col("c1") === null)
  println(filtered.count()) // <-- outputs 0, not what you might expect

  val filtered2 = df.filter(col("c1").isNull)
  println(filtered2.count()) // <-- outputs 1, as expected

  filtered.explain(true)
  filtered2.explain(true)
}

The first explain plan shows:

== Physical Plan ==
*Filter (isnotnull(c1#2) && null)
+- Scan ExistingRDD[c1#2,c2#3]

This filter clause looks nonsensical: the && null ensures the predicate can never evaluate to true, so no rows pass the filter.

The second explain plan looks like:

== Physical Plan ==
*Filter isnull(c1#2)
+- Scan ExistingRDD[c1#2,c2#3]

Here the filter is what we expect and want.
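For completeness, the null-safe operator from Solution 1 also works with the example above; a sketch (applied to the same df, and not part of the original program):

```scala
// <=> treats NULL = NULL as true, so this should match the Record(null, "c") row.
val filtered3 = df.filter(col("c1") <=> null)
println(filtered3.count())
```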



Author: John

Updated on May 15, 2022

Comments

  • John
    John 13 days

    I am a bit confused about the difference between

     df.filter(col("c1") === null) and df.filter(col("c1").isNull) 
    

    On the same dataframe I am getting counts with === null but zero counts with isNull. Please help me understand the difference. Thanks

  • John
    John over 5 years
    Sorry, I could not understand one thing: why am I getting some records when I use ===? How is that possible @mattinbits
  • mattinbits
    mattinbits over 5 years
    Does your data actually have null in that column?
  • John
    John over 5 years
    It has some empty records (""), but the number of records returned does not match the number of empty strings when I use === null. It has both empty and non-empty records. I also get a null pointer exception when calling a UDF that doesn't have a null check.