Drop if all entries in a spark dataframe's specific column is null

python apache-spark pyspark

11,013

Solution 1

I tried my way. Say, I have a dataframe as below,

from pyspark.sql import functions as F
>>> df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|null|
|null|   3|null|
|   5|null|null|
+----+----+----+
>>> df1 = df.agg(*[F.count(c).alias(c) for c in df.columns])
>>> df1.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   2|   2|   0|
+----+----+----+
>>> nonNull_cols = [c for c in df1.columns if df1[[c]].first()[c] > 0]
>>> df = df.select(*nonNull_cols)
>>> df.show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|null|   3|
|   5|null|
+----+----+

Solution 2

for me it worked in a bit different way than @Suresh answer:

nonNull_cols = [c for c in original_df.columns if original_df.filter(func.col(c).isNotNull()).count() > 0]
new_df = original_df.select(*nonNull_cols)

Solution 3

One of the indirect way to do so is

import pyspark.sql.functions as func
for col in sdf.columns:
if (sdf.filter(func.isnan(func.col(col)) == True).count() == sdf.select(func.col(col)).count()):
    sdf = sdf.drop(col)

Update:
The above code drops columns with all nan. If you are looking for all nulls then

import pyspark.sql.functions as func
for col in sdf.columns:
if (sdf.filter(func.col(col).isNull()).count() == sdf.select(func.col(col)).count()):
    sdf = sdf.drop(col)

Will update my answer if I find some optimal way :-)

Solution 4

This is a function I have in my pipeline to remove null columns. Hope it helps!

# Function to drop the empty columns of a DF
def dropNullColumns(df):
    # A set of all the null values you can encounter
    null_set = {"none", "null" , "nan"}
    # Iterate over each column in the DF
    for col in df.columns:
        # Get the distinct values of the column
        unique_val = df.select(col).distinct().collect()[0][0]
        # See whether the unique value is only none/nan or null
        if str(unique_val).lower() in null_set:
            print("Dropping " + col + " because of all null values.")
            df = df.drop(col)
    return(df)

Solution 5

Here's a much more efficient solution that doesn't involve looping over the columns. It is much faster when you have many columns. I tested the other methods here on a dataframe with 800 columns, which took 17 mins to run. The following method takes only 1 min in my tests on the same dataset.

def drop_fully_null_columns(df, but_keep_these=[]):
    """Drops DataFrame columns that are fully null
    (i.e. the maximum value is null)
    Arguments:
        df {spark DataFrame} -- spark dataframe
        but_keep_these {list} -- list of columns to keep without checking for nulls
    Returns:
        spark DataFrame -- dataframe with fully null columns removed
    """
    # skip checking some columns
    cols_to_check = [col for col in df.columns if col not in but_keep_these]
    if len(cols_to_check) > 0:
        # drop columns for which the max is None
        rows_with_data = df.select(*cols_to_check).groupby().agg(*[F.max(c).alias(c) for c in cols_to_check]).take(1)[0]
        cols_to_drop = [c for c, const in rows_with_data.asDict().items() if const == None]
        new_df = df.drop(*cols_to_drop)
        return new_df
    else:
        return df

View more solutions

11,013

Author by

Naveen Honest Raj K

2+ years of product engineering at startups. Extremely passionate about building Saas products, scaling applications and improving customer experience.

Updated on June 12, 2022

Comments

Naveen Honest Raj K 4 months
Using Pyspark, how can I select/keep all columns of a DataFrame which contain a non-null value; or equivalently remove all columns which contain no data.

Edited: As per Suresh Request,
```
for column in media.columns:
    if media.select(media[column]).distinct().count() == 1:
        media = media.drop(media[column])
```
Here I assumed that if count is one, then it should be Nan. But I wanted to check whether that is Nan. And if there's any other inbuilt spark function, let me know.