Comparison operator in PySpark (not equal/ !=)


Solution 1

To filter on null values, try:

foo_df = df.filter( (df.foo==1) & (df.bar.isNull()) )

https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.Column.isNull

Solution 2

Why is it not filtering?

Because it is SQL, and NULL indicates a missing value. Any comparison with NULL, other than IS NULL and IS NOT NULL, evaluates to NULL rather than true or false, and a filter keeps only the rows where the condition is true. You need either:

col("bar").isNull() | (col("bar") != 1)

or

coalesce(col("bar") != 1, lit(True))

or (PySpark >= 2.3):

~col("bar").eqNullSafe(1)

if you want null safe comparisons in PySpark.
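
To see why the plain != filter drops every row, here is a minimal plain-Python sketch of SQL's three-valued logic, with None standing in for NULL. The helper names (sql_eq, sql_ne, sql_and, sql_or, eq_null_safe, sql_filter) are hypothetical and are not PySpark APIs; they only model the semantics described above, using the question's rows with real missing values.

```python
# Plain-Python model of SQL three-valued logic; None stands in for NULL.
# All helper names are illustrative assumptions, not part of PySpark.

def sql_eq(a, b):
    """a = b under SQL rules: NULL if either side is NULL."""
    return None if a is None or b is None else a == b

def sql_ne(a, b):
    """a != b under SQL rules: NULL if either side is NULL."""
    return None if a is None or b is None else a != b

def sql_and(p, q):
    """Three-valued AND: False dominates, otherwise NULL propagates."""
    if p is False or q is False:
        return False
    if p is None or q is None:
        return None
    return True

def sql_or(p, q):
    """Three-valued OR: True dominates, otherwise NULL propagates."""
    if p is True or q is True:
        return True
    if p is None or q is None:
        return None
    return False

def eq_null_safe(a, b):
    """Null-safe equality: always True or False, never NULL."""
    if a is None or b is None:
        return a is None and b is None
    return a == b

def sql_filter(rows, predicate):
    """Keep only rows whose predicate is exactly True, as WHERE does."""
    return [r for r in rows if predicate(r) is True]

# (id, foo, bar) rows from the question, with None for missing values.
rows = [("a", 1, None), ("b", 1, 1), ("c", 1, None), ("d", None, 1), ("e", 1, 1)]

# foo = 1 AND bar != 1: NULL != 1 is NULL, so rows a and c are dropped.
naive = sql_filter(rows, lambda r: sql_and(sql_eq(r[1], 1), sql_ne(r[2], 1)))

# foo = 1 AND (bar IS NULL OR bar != 1)
with_is_null = sql_filter(
    rows,
    lambda r: sql_and(sql_eq(r[1], 1), sql_or(r[2] is None, sql_ne(r[2], 1))),
)

# foo = 1 AND NOT (bar <=> 1), i.e. negated null-safe equality
with_null_safe = sql_filter(
    rows,
    lambda r: sql_and(sql_eq(r[1], 1), not eq_null_safe(r[2], 1)),
)

print([r[0] for r in naive])           # []
print([r[0] for r in with_is_null])    # ['a', 'c']
print([r[0] for r in with_null_safe])  # ['a', 'c']
```

The key detail is in sql_filter: a row survives only when the predicate is exactly True, so a NULL result is treated the same as False.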

Also, 'null' is not a valid way to introduce a NULL literal; it is just an ordinary string. You should use None to indicate missing objects.

from pyspark.sql.functions import col, coalesce, lit

df = spark.createDataFrame([
    ('a', 1, 1), ('a', 1, None), ('b', 1, 1),
    ('c', 1, None), ('d', None, 1), ('e', 1, 1)
]).toDF('id', 'foo', 'bar')

df.where((col("foo") == 1) & (col("bar").isNull() | (col("bar") != 1))).show()

## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## |  a|  1|null|
## |  c|  1|null|
## +---+---+----+

df.where((col("foo") == 1) & coalesce(col("bar") != 1, lit(True))).show()

## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## |  a|  1|null|
## |  c|  1|null|
## +---+---+----+
Author by

Hendrik F

Updated on January 16, 2020

Comments

  • Hendrik F, over 4 years ago

    I am trying to obtain all rows in a dataframe where two flags are set to '1', and subsequently all those where only one of the two is set to '1' and the other is NOT EQUAL to '1'.

    With the following schema (three columns),

    df = sqlContext.createDataFrame([('a',1,'null'),('b',1,1),('c',1,'null'),('d','null',1),('e',1,1)], #,('f',1,'NaN'),('g','bla',1)],
                                schema=('id', 'foo', 'bar')
                                )
    

    I obtain the following dataframe:

    +---+----+----+
    | id| foo| bar|
    +---+----+----+
    |  a|   1|null|
    |  b|   1|   1|
    |  c|   1|null|
    |  d|null|   1|
    |  e|   1|   1|
    +---+----+----+
    

    When I apply the desired filters, the first filter (foo=1 AND bar=1) works, but the second (foo=1 AND NOT bar=1) does not.

    foobar_df = df.filter( (df.foo==1) & (df.bar==1) )
    

    yields:

    +---+---+---+
    | id|foo|bar|
    +---+---+---+
    |  b|  1|  1|
    |  e|  1|  1|
    +---+---+---+
    

    Here is the misbehaving filter:

    foo_df = df.filter( (df.foo==1) & (df.bar!=1) )
    foo_df.show()
    +---+---+---+
    | id|foo|bar|
    +---+---+---+
    +---+---+---+
    

    Why is it not filtering? How can I get the rows where only foo is equal to '1'?