Pyspark dataframe how to drop rows with nulls in all columns?
Solution 1
One option is to use functools.reduce to construct the filter condition:
from functools import reduce
df.filter(~reduce(lambda x, y: x & y, [df[c].isNull() for c in df.columns])).show()
+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
+----+----+----+
where reduce produces the following condition:
~reduce(lambda x, y: x & y, [df[c].isNull() for c in df.columns])
# Column<b'(NOT (((ID IS NULL) AND (TYPE IS NULL)) AND (CODE IS NULL)))'>
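To make Solution 1 self-contained, here is a minimal runnable sketch (assuming a local SparkSession and the sample data from the question below):
from functools import reduce

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question.
df = spark.createDataFrame(
    [(1, "B", "X1"), (None, None, None), (None, "B", "X1")],
    ("ID", "TYPE", "CODE"),
)

# AND together one isNull() condition per column, then negate:
# a row is kept unless all of its columns are NULL.
all_null = reduce(lambda x, y: x & y, [df[c].isNull() for c in df.columns])
df.filter(~all_null).show()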
Solution 2
Providing a strategy for na.drop is all you need:
df = spark.createDataFrame(
    [(1, "B", "X1"), (None, None, None), (None, "B", "X1"), (None, "C", None)],
    ("ID", "TYPE", "CODE")
)
df.na.drop(how="all").show()
+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
|null|   C|null|
+----+----+----+
An alternative formulation uses thresh (the minimum number of non-null values a row must contain to be kept):
df.na.drop(thresh=1).show()
+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
|null|   C|null|
+----+----+----+
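As a side note (an extension beyond the original answer): df.dropna is an alias for df.na.drop and accepts the same how, thresh, and subset arguments, so the following variants behave as you would expect:
# dropna is an alias for na.drop.
df.dropna(how="all").show()

# thresh overrides how: keep only rows with at least 2 non-null values,
# which additionally drops the (null, C, null) row.
df.dropna(thresh=2).show()

# subset restricts the null check to the listed columns.
df.dropna(how="all", subset=["TYPE", "CODE"]).show()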
Comments
kww:
For a dataframe that starts out like this:
+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|null|null|
|null|   B|  X1|
+----+----+----+
After, I hope it looks like this:
+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
+----+----+----+
I would prefer a general method that still applies when df.columns is very long. Thanks!