How to detect null columns in PySpark
Solution 1
Extend the condition to
from pyspark.sql.functions import min, max
((min(c).isNull() & max(c).isNull()) | (min(c) == max(c))).alias(c)
or use eqNullSafe (PySpark >= 2.3):
(min(c).eqNullSafe(max(c))).alias(c)
Solution 2
How about this? In order to guarantee that a column is all nulls, two properties must be satisfied:
(1) The min value is equal to the max value
(2) The min or max is null
Or, equivalently: the min and max are both equal to None.
Note that if property (2) is not checked, a column with values [null, 1, null, 1] would be reported incorrectly, since its min and max are both 1.
import pyspark.sql.functions as F
def get_null_column_names(df):
    column_names = []
    for col_name in df.columns:
        min_ = df.select(F.min(col_name)).first()[0]
        max_ = df.select(F.max(col_name)).first()[0]
        if min_ is None and max_ is None:
            column_names.append(col_name)
    return column_names
Here's an example in practice:
>>> rows = [(None, 18, None, None),
...         (1, None, None, None),
...         (1, 9, 4.0, None),
...         (None, 0, 0., None)]
>>> schema = "a: int, b: int, c: float, d: int"
>>> df = spark.createDataFrame(data=rows, schema=schema)
>>> df.show()
+----+----+----+----+
| a| b| c| d|
+----+----+----+----+
|null| 18|null|null|
| 1|null|null|null|
| 1| 9| 4.0|null|
|null| 0| 0.0|null|
+----+----+----+----+
>>> get_null_column_names(df)
['d']
Solution 3
One way would be to do it implicitly: select each column, count its NULL values, and then compare this with the total number of rows. With your data, this would be:
spark.version
# u'2.2.0'
from pyspark.sql.functions import col
nullColumns = []
numRows = df.count()
for k in df.columns:
    nullRows = df.where(col(k).isNull()).count()
    if nullRows == numRows:  # i.e. if ALL values are NULL
        nullColumns.append(k)
nullColumns
# ['D']
But there is a simpler way: it turns out that the function countDistinct, when applied to a column with all NULL values, returns zero (0):
from pyspark.sql.functions import countDistinct
df.agg(countDistinct(df.D).alias('distinct')).collect()
# [Row(distinct=0)]
So the for loop can now be:
nullColumns = []
for k in df.columns:
    if df.agg(countDistinct(df[k])).collect()[0][0] == 0:
        nullColumns.append(k)
nullColumns
# ['D']
UPDATE (after comments): It seems possible to avoid collect in the second solution; since df.agg returns a dataframe with only one row, replacing collect with take(1) will safely do the job:
nullColumns = []
for k in df.columns:
    if df.agg(countDistinct(df[k])).take(1)[0][0] == 0:
        nullColumns.append(k)
nullColumns
# ['D']
Mehdi Ben Hamida
Updated on May 27, 2022
Comments
-
Mehdi Ben Hamida over 1 year
I have a dataframe defined with some null values. Some columns are entirely null.
>>> df.show()
+---+---+---+----+
|  A|  B|  C|   D|
+---+---+---+----+
|1.0|4.0|7.0|null|
|2.0|5.0|7.0|null|
|3.0|6.0|5.0|null|
+---+---+---+----+
In my case, I want to return a list of the column names that are filled with null values. My idea was to detect the constant columns (since the whole column contains the same null value).
this is how I did it:
nullCoulumns = [c for c, const in df.select([(min(c) == max(c)).alias(c) for c in df.columns]).first().asDict().items() if const]
but this does not consider null columns as constant; it works only with actual values. How should I do it then?
-
desertnaut almost 6 years: @MehdiBenHamida Careful! collect here is not applied to the whole of your data, but to an aggregation. Arguably, this is a legitimate use of collect.
-
Mehdi Ben Hamida almost 6 years: I know that collect is applied to the aggregation, but it still costs a lot of performance :/
-
desertnaut almost 6 years: @MehdiBenHamida perhaps you have not realized that what you ask is not at all trivial: one way or another, you'll have to go through all your rows, checking for NULLs; now, if you are not happy even to use countDistinct... (as if min and max would take less time, if applicable)
-
Mehdi Ben Hamida almost 6 years: this will consume a lot of time to detect all null columns. I tried the first suggested solution; it is better than the second one but still takes too much time. I think there is a better alternative!
-
Mehdi Ben Hamida almost 6 years: @desertnaut: this one is pretty fast, takes only tenths of a second :D
-
matt over 5 years: This works for the case when all values in the column are null. But consider the case with column values of [null, 1, 1, null]. In this case, the min and max will both equal 1, so the column will be identified incorrectly as having all nulls.
-
Gunay Anach over 3 years: If anyone is wondering where F comes from: it is the functions module imported as F, i.e. from pyspark.sql import functions as F
-
matt over 3 yearsGood catch @GunayAnach. I updated the answer to include this.