How to detect null columns in PySpark


Solution 1

Extend the condition to

from pyspark.sql.functions import min, max

((min(c).isNull() & max(c).isNull()) | (min(c) == max(c))).alias(c) 

or use eqNullSafe (PySpark 2.3+):

(min(c).eqNullSafe(max(c))).alias(c) 
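Either form can be dropped straight into the list comprehension from the question. A sketch using the eqNullSafe variant (df and the variable name nullColumns are taken from the original post):

from pyspark.sql.functions import min, max

# Null-safe rewrite of the original list comprehension: a column is kept when
# its min and max compare equal, with NULL == NULL treated as true, so both
# constant and all-null columns are reported.
nullColumns = [
    c for c, const in df.select(
        [min(c).eqNullSafe(max(c)).alias(c) for c in df.columns]
    ).first().asDict().items() if const
]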

Solution 2

How about this? In order to guarantee that a column contains only nulls, two properties must be satisfied:

(1) The min value is equal to the max value

(2) The min or max is null

Or, equivalently

(1) The min AND max are both equal to None

Note that if property (2) is not satisfied, the case where column values are [null, 1, null, 1] would be incorrectly reported since the min and max will be 1.

import pyspark.sql.functions as F

def get_null_column_names(df):
    column_names = []

    for col_name in df.columns:

        # min and max ignore NULLs, so both are None only when the column
        # contains nothing but NULLs.
        min_ = df.select(F.min(col_name)).first()[0]
        max_ = df.select(F.max(col_name)).first()[0]

        if min_ is None and max_ is None:
            column_names.append(col_name)

    return column_names

Here's an example in practice:

>>> rows = [(None, 18, None, None),
            (1, None, None, None),
            (1, 9, 4.0, None),
            (None, 0, 0., None)]

>>> schema = "a: int, b: int, c: float, d: int"

>>> df = spark.createDataFrame(data=rows, schema=schema)

>>> df.show()

+----+----+----+----+
|   a|   b|   c|   d|
+----+----+----+----+
|null|  18|null|null|
|   1|null|null|null|
|   1|   9| 4.0|null|
|null|   0| 0.0|null|
+----+----+----+----+

>>> get_null_column_names(df)
['d']
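A possible single-pass variant (a sketch, not part of the original answer): compute the min and max of every column in one aggregation instead of running two jobs per column:

import pyspark.sql.functions as F

def get_null_column_names_single_pass(df):
    # Aggregate min and max for all columns at once; a column is all-null
    # exactly when both of its aggregates come back as None.
    row = df.agg(
        *[F.min(c).alias(f"min_{c}") for c in df.columns],
        *[F.max(c).alias(f"max_{c}") for c in df.columns],
    ).first()
    return [c for c in df.columns
            if row[f"min_{c}"] is None and row[f"max_{c}"] is None]

>>> get_null_column_names_single_pass(df)
['d']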

Solution 3

One way would be to do it directly: select each column, count its NULL values, and then compare this with the total number of rows. With your data, this would be:

spark.version
# u'2.2.0'

from pyspark.sql.functions import col

nullColumns = []
numRows = df.count()
for k in df.columns:
  nullRows = df.where(col(k).isNull()).count()
  if nullRows ==  numRows: # i.e. if ALL values are NULL
    nullColumns.append(k)

nullColumns
# ['D']
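
The per-column counting can also be folded into a single aggregation, so the data is scanned once rather than once per column. A sketch of that variant (not part of the original answer):

from pyspark.sql.functions import col, count, when

numRows = df.count()

# count() only counts non-NULL values, so count(when(isNull, 1)) gives the
# number of NULL rows in each column, all computed in one pass.
null_counts = df.agg(
    *[count(when(col(k).isNull(), 1)).alias(k) for k in df.columns]
).first().asDict()

nullColumns = [k for k, n in null_counts.items() if n == numRows]

nullColumns
# ['D']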

But there is a simpler way: it turns out that the function countDistinct, when applied to a column with all NULL values, returns zero (0):

from pyspark.sql.functions import countDistinct

df.agg(countDistinct(df.D).alias('distinct')).collect()
# [Row(distinct=0)]

So the for loop can now be:

nullColumns = []
for k in df.columns:
  if df.agg(countDistinct(df[k])).collect()[0][0] == 0:
    nullColumns.append(k)

nullColumns
# ['D']

UPDATE (after comments): It seems possible to avoid collect in the second solution; since df.agg returns a dataframe with only one row, replacing collect with take(1) will safely do the job:

nullColumns = []
for k in df.columns:
  if df.agg(countDistinct(df[k])).take(1)[0][0] == 0:
    nullColumns.append(k)

nullColumns
# ['D']
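
In the same spirit, all the countDistinct aggregations can be combined into one agg call (again a sketch, not from the original answer), so only a single scan of the data is needed:

from pyspark.sql.functions import countDistinct

# countDistinct ignores NULLs, so an all-NULL column reports 0 distinct values.
distinct_counts = df.agg(
    *[countDistinct(df[k]).alias(k) for k in df.columns]
).first().asDict()

nullColumns = [k for k, n in distinct_counts.items() if n == 0]

nullColumns
# ['D']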


Author: Mehdi Ben Hamida

Updated on May 27, 2022

Comments

  • Mehdi Ben Hamida over 1 year

    I have a dataframe defined with some null values. Some columns contain only null values.

    >> df.show()
    +---+---+---+----+
    |  A|  B|  C|   D|
    +---+---+---+----+
    |1.0|4.0|7.0|null|
    |2.0|5.0|7.0|null|
    |3.0|6.0|5.0|null|
    +---+---+---+----+
    

    In my case, I want to return a list of column names that are filled with null values. My idea was to detect the constant columns (as the whole column contains the same null value).

    This is how I did it:

    nullColumns = [c for c, const in df.select([(min(c) == max(c)).alias(c) for c in df.columns]).first().asDict().items() if const]
    

    But this does not consider null columns as constant; it works only with non-null values. How should I do it then?

  • desertnaut almost 6 years
    @MehdiBenHamida Careful! collect here is not applied to the whole of your data, but to an aggregation. Arguably, this is a legitimate use of collect.
  • Mehdi Ben Hamida almost 6 years
    I know that collect is applied to the aggregation, but it is still costly in terms of performance :/
  • desertnaut almost 6 years
    @MehdiBenHamida perhaps you have not realized that what you ask is not at all trivial: one way or another, you'll have to go through all your rows, checking for NULLs; now, if you are not happy even to use countDistinct... (as if min and max would take less time, if applicable)
  • Mehdi Ben Hamida almost 6 years
    This will consume a lot of time to detect all null columns. I tried the first suggested solution; it is better than the second one but still takes too much time. I think there is a better alternative!
  • Mehdi Ben Hamida almost 6 years
    @desertnaut: this is pretty fast, it takes only tenths of a second :D
  • matt over 5 years
    This works for the case when all values in the column are null. But consider the case with column values of [null, 1, 1, null]. In this case, the min and max will both equal 1, so the column will be incorrectly identified as having all nulls.
  • Gunay Anach over 3 years
    If anyone is wondering where F comes from: it is the functions module imported as F, i.e. from pyspark.sql import functions as F
  • matt over 3 years
    Good catch @GunayAnach. I updated the answer to include this.