PySpark DataFrame: how to drop rows with nulls in all columns?

python apache-spark pyspark apache-spark-sql pyspark-sql

11,200

Solution 1

One option is to use functools.reduce to construct the conditions:

from functools import reduce
df.filter(~reduce(lambda x, y: x & y, [df[c].isNull() for c in df.columns])).show()
+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
+----+----+----+

where reduce produces a condition like the following:

~reduce(lambda x, y: x & y, [df[c].isNull() for c in df.columns])
# Column<b'(NOT (((ID IS NULL) AND (TYPE IS NULL)) AND (CODE IS NULL)))'>
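To see why the condition nests the way it does, here is a minimal plain-Python sketch of the same left fold. It uses strings in place of Spark Column objects, so it only illustrates the shape of the expression; the names columns, null_checks and expression are made up for this illustration and are not part of PySpark.

```python
from functools import reduce

# Column names taken from the example DataFrame in the question.
columns = ["ID", "TYPE", "CODE"]

# Stand-ins for df[c].isNull(): plain strings instead of Spark Column objects.
null_checks = [f"({c} IS NULL)" for c in columns]

# reduce folds from the left: ((a AND b) AND c), mirroring x & y on Columns.
all_null = reduce(lambda x, y: f"({x} AND {y})", null_checks)

# The leading ~ in the real filter corresponds to wrapping everything in NOT.
expression = f"(NOT {all_null})"
print(expression)
```

Because the fold runs over whatever is in the column list, the same one-liner works unchanged when df.columns is very long.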

Solution 2

Passing how="all" to na.drop is all you need:

df = spark.createDataFrame([
    (1, "B", "X1"), (None, None, None), (None, "B", "X1"), (None, "C", None)],
    ("ID", "TYPE", "CODE")
)

df.na.drop(how="all").show()

+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
|null|   C|null|
+----+----+----+

An alternative formulation uses thresh, the minimum number of non-null values a row must have to be kept:

df.na.drop(thresh=1).show()

+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
|null|   C|null|
+----+----+----+
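The equivalence between the two calls can be checked with a small plain-Python model of the row-level rule. This is only a sketch of the semantics (a row survives thresh=n when it has at least n non-null values), not how Spark implements it; the helper names non_null_count, drop_by_thresh and drop_all_null are hypothetical.

```python
# Same sample rows as in the createDataFrame call above.
rows = [
    (1, "B", "X1"),
    (None, None, None),
    (None, "B", "X1"),
    (None, "C", None),
]

def non_null_count(row):
    """Count the values in a row that are not None."""
    return sum(value is not None for value in row)

def drop_by_thresh(rows, thresh):
    """Model of na.drop(thresh=...): keep rows with at least `thresh` non-null values."""
    return [row for row in rows if non_null_count(row) >= thresh]

def drop_all_null(rows):
    """Model of na.drop(how="all"): drop rows only when every value is None."""
    return [row for row in rows if not all(value is None for value in row)]

kept_by_thresh = drop_by_thresh(rows, thresh=1)
kept_by_how_all = drop_all_null(rows)
print(kept_by_thresh == kept_by_how_all)
```

A row with at least one non-null value is exactly a row that is not null in every column, which is why thresh=1 and how="all" keep the same rows. Note that in PySpark, when thresh is given it takes precedence over how.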


Author by

kww

Updated on June 17, 2022

Comments

kww almost 2 years

Before, the DataFrame looks like this:

+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|null|null|
|null|   B|  X1|
+----+----+----+

Afterwards, I hope it looks like this:

+----+----+----+
|  ID|TYPE|CODE|
+----+----+----+
|   1|   B|  X1|
|null|   B|  X1|
+----+----+----+

I would prefer a general method that still works when df.columns is very long. Thanks!

