How to drop column according to NAN percentage for dataframe?

22,453

Solution 1

You can use isnull with mean for threshold and then remove columns by boolean indexing with loc (because remove columns), also need invert condition - so <.8 means remove all columns >=0.8:

df = df.loc[:, df.isnull().mean() < .8]

Sample:

np.random.seed(100)
df = pd.DataFrame(np.random.random((100,5)), columns=list('ABCDE'))
df.loc[:80, 'A'] = np.nan
df.loc[:5, 'C'] = np.nan
df.loc[20:, 'D'] = np.nan

print (df.isnull().mean())
A    0.81
B    0.00
C    0.06
D    0.80
E    0.00
dtype: float64

df = df.loc[:, df.isnull().mean() < .8]
print (df.head())
         B   C         E
0  0.278369 NaN  0.004719
1  0.670749 NaN  0.575093
2  0.209202 NaN  0.219697
3  0.811683 NaN  0.274074
4  0.940030 NaN  0.175410

If want remove columns by minimal values dropna working nice with parameter thresh and axis=1 for remove columns:

np.random.seed(1997)
df = pd.DataFrame(np.random.choice([np.nan,1], p=(0.8,0.2),size=(10,10)))
print (df)
     0   1    2    3    4    5    6    7   8    9
0  NaN NaN  NaN  1.0  1.0  NaN  NaN  NaN NaN  NaN
1  1.0 NaN  1.0  NaN  NaN  NaN  NaN  NaN NaN  NaN
2  NaN NaN  NaN  NaN  NaN  1.0  1.0  NaN NaN  NaN
3  NaN NaN  NaN  NaN  1.0  NaN  NaN  NaN NaN  NaN
4  NaN NaN  NaN  NaN  NaN  1.0  NaN  NaN NaN  1.0
5  NaN NaN  NaN  1.0  1.0  NaN  NaN  1.0 NaN  1.0
6  NaN NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN  NaN
7  NaN NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN  NaN
8  NaN NaN  NaN  NaN  NaN  NaN  NaN  1.0 NaN  NaN
9  1.0 NaN  NaN  NaN  1.0  NaN  NaN  1.0 NaN  NaN

df1 = df.dropna(thresh=2, axis=1)
print (df1)
     0    3    4    5    7    9
0  NaN  1.0  1.0  NaN  NaN  NaN
1  1.0  NaN  NaN  NaN  NaN  NaN
2  NaN  NaN  NaN  1.0  NaN  NaN
3  NaN  NaN  1.0  NaN  NaN  NaN
4  NaN  NaN  NaN  1.0  NaN  1.0
5  NaN  1.0  1.0  NaN  1.0  1.0
6  NaN  NaN  NaN  NaN  NaN  NaN
7  NaN  NaN  NaN  NaN  NaN  NaN
8  NaN  NaN  NaN  NaN  1.0  NaN
9  1.0  NaN  1.0  NaN  1.0  NaN

EDIT: For non-Boolean data

Total number of NaN entries in a column must be less than 80% of total entries:

 df = df.loc[:, df.isnull().sum() < 0.8*df.shape[0]]

Solution 2

df.dropna(thresh=np.int((100-percent_NA_cols_required)*(len(df.columns)/100)),inplace=True)

Basically pd.dropna takes number(int) of non_na cols required if that row is to be removed.

Solution 3

As suggested in comments, if you use sum() on a boolean test, you can get the number of occurences.

Code:

def get_nan_cols(df, nan_percent=0.8):
    threshold = len(df.index) * nan_percent
    return [c for c in df.columns if sum(df[c].isnull()) >= threshold]  

Used as:

del df[get_nan_cols(df, 0.8)]

Solution 4

You can use the pandas dropna. For example:

df.dropna(axis=1, thresh = int(0.2*df.shape[0]), inplace=True)

Notice that we used 0.2 which is 1-0.8 since the thresh refers to the number of non-NA values

Share:
22,453

Related videos on Youtube

LookIntoEast
Author by

LookIntoEast

Updated on March 13, 2021

Comments

  • LookIntoEast
    LookIntoEast about 3 years

    For certain columns of df, if 80% of the column is NAN.

    What's the simplest code to drop such columns?

    • abhiieor
      abhiieor about 7 years
      something like for col in df.columns: if sum(df[col].isnull())/float(len(df.index)) > 0.8: del df[col] should help
  • iff_or
    iff_or about 7 years
    I also appreciate this approach because it's so evidently adaptable to a different decision threshold!
  • piRSquared
    piRSquared about 7 years
    This is my preferred approach
  • Jim
    Jim over 5 years
    the simplicity here is genius
  • Jinhua Wang
    Jinhua Wang over 5 years
    This answer saved my day!
  • sdasara
    sdasara over 4 years
    df.dropna(thresh=int(ratio_nonNA *(len(df.columns))),inplace=True)
  • HD2000
    HD2000 almost 2 years
    I think with "thresh = int(0.8*df.shape[0])" you are saying to keep columns with at least 80% of non-Nan. The question was how to drop columns with at least 80% of Nan.
  • George Pipis
    George Pipis almost 2 years
    Yes, you are right, Many thanks. I edited it. The threshold is the "Require that many non-NA values."