PySpark DataFrame: how to drop rows with nulls in all columns?

Solution 1

One option is to use functools.reduce to construct the conditions:

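from functools import reduce

# AND together one isNull() test per column, then negate it to keep rows that are not entirely null
df.filter(~reduce(lambda x, y: x & y, [df[c].isNull() for c in df.columns])).show()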
+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
+----+----+----+

where reduce produces the following condition:

~reduce(lambda x, y: x & y, [df[c].isNull() for c in df.columns])
# Column<b'(NOT (((ID IS NULL) AND (TYPE IS NULL)) AND (CODE IS NULL)))'>
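
To see why the condition nests this way, here is a plain-Python sketch that folds strings instead of Spark Column objects. The column names come from the example; the string formatting is only an illustration of how reduce pairs up its operands and is not how Spark itself renders a Column.

```python
from functools import reduce

columns = ["ID", "TYPE", "CODE"]

# One "IS NULL" test per column, mirroring df[c].isNull()
tests = [f"({c} IS NULL)" for c in columns]

# reduce folds left to right: ((t0 AND t1) AND t2)
all_null = reduce(lambda x, y: f"({x} AND {y})", tests)

# Negate the combined test, mirroring the ~ operator on a Column
condition = f"(NOT {all_null})"
print(condition)
# (NOT (((ID IS NULL) AND (TYPE IS NULL)) AND (CODE IS NULL)))
```

Because the fold runs over whatever list it is given, the same expression works unchanged for any number of columns.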

Solution 2

Passing the how="all" strategy to na.drop is all you need:

df = spark.createDataFrame([
    (1, "B", "X1"), (None, None, None), (None, "B", "X1"), (None, "C", None)],
    ("ID", "TYPE", "CODE")
)

df.na.drop(how="all").show()
+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
|null|   C|null|
+----+----+----+

An alternative formulation uses thresh, the minimum number of non-null values a row must have to be kept:

df.na.drop(thresh=1).show()
+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
|null|   C|null|
+----+----+----+
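
The equivalence between how="all" and thresh=1 can be checked on plain tuples. The sketch below is a pure-Python model of the documented semantics (a row is kept when it has at least thresh non-null values), with None standing in for null. The helper name drop_rows is made up for illustration and is not part of PySpark.

```python
rows = [
    (1, "B", "X1"),
    (None, None, None),
    (None, "B", "X1"),
    (None, "C", None),
]

def drop_rows(rows, thresh):
    """Keep rows that have at least `thresh` non-None values (hypothetical helper)."""
    return [r for r in rows if sum(v is not None for v in r) >= thresh]

# how="all" drops a row only when every value is null,
# which is the same as requiring at least one non-null value.
kept_all = [r for r in rows if not all(v is None for v in r)]
kept_thresh = drop_rows(rows, thresh=1)

print(kept_all == kept_thresh)
print(kept_thresh)
```

In this model, setting thresh to the number of columns keeps only fully populated rows, which corresponds to how="any".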
Author by

kww

Updated on June 17, 2022

Comments

  • kww
    kww almost 2 years

    The DataFrame currently looks like this:

    +----+----+----+
    |  ID|TYPE|CODE|
    +----+----+----+
    |   1|   B|  X1|
    |null|null|null|
    |null|   B|  X1|
    +----+----+----+
    

    Afterwards, I'd like it to look like this:

    +----+----+----+
    |  ID|TYPE|CODE|
    +----+----+----+
    |   1|   B|  X1|
    |null|   B|  X1|
    +----+----+----+
    

    I'd prefer a general method that still works when df.columns is very long. Thanks!