How to drop columns which have the same value in all rows via pandas or spark dataframe?

28,858

Solution 1

We can use nunique to count the number of unique values in each column of the dataframe, then drop the columns that have only a single unique value:

In [285]:
nunique = df.nunique()
cols_to_drop = nunique[nunique == 1].index
df.drop(cols_to_drop, axis=1)

Out[285]:
   index   id   name  data1
0      0  345  name1      3
1      1   12  name2      2
2      5    2  name6      7
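For a self-contained run, here is the same approach with the sample frame from the question built explicitly (the values are copied from the question's data):

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    "index": [0, 1, 5],
    "id": [345, 12, 2],
    "name": ["name1", "name2", "name6"],
    "value": [1, 1, 1],
    "value2": [99, 99, 99],
    "value3": [23, 23, 23],
    "data1": [3, 2, 7],
    "val5": [66, 66, 66],
})

nunique = df.nunique()                      # unique-value count per column
cols_to_drop = nunique[nunique == 1].index  # columns with a single value
df = df.drop(cols_to_drop, axis=1)
print(list(df.columns))  # ['index', 'id', 'name', 'data1']
```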

Another way is to diff the numeric columns, take the absolute values, and sum them; for a column where every row is the same, the sum is zero:

In [298]:
cols = df.select_dtypes([np.number]).columns
diff = df[cols].diff().abs().sum()
df.drop(diff[diff== 0].index, axis=1)
Out[298]:
   index   id   name  data1
0      0  345  name1      3
1      1   12  name2      2
2      5    2  name6      7

Another approach is to use the property that the standard deviation is zero for a column in which every value is the same:

In [300]:
cols = df.select_dtypes([np.number]).columns
std = df[cols].std()
cols_to_drop = std[std==0].index
df.drop(cols_to_drop, axis=1)

Out[300]:
   index   id   name  data1
0      0  345  name1      3
1      1   12  name2      2
2      5    2  name6      7

Actually the above can be done in a one-liner (recent pandas versions require numeric_only=True for std when the frame has non-numeric columns):

In [306]:
df.drop(df.std(numeric_only=True)[df.std(numeric_only=True) == 0].index, axis=1)

Out[306]:
   index   id   name  data1
0      0  345  name1      3
1      1   12  name2      2
2      5    2  name6      7

Solution 2

A simple one-liner (Python):

df = df[[i for i in df if len(set(df[i])) > 1]]
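As a quick sanity check, the one-liner applied to the question's sample data (note that set() requires hashable cell values; for columns containing e.g. lists, the nunique-based approach is safer):

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    "index": [0, 1, 5],
    "id": [345, 12, 2],
    "name": ["name1", "name2", "name6"],
    "value": [1, 1, 1],
    "value2": [99, 99, 99],
    "value3": [23, 23, 23],
    "data1": [3, 2, 7],
    "val5": [66, 66, 66],
})

# Keep only columns with more than one distinct value
df = df[[i for i in df if len(set(df[i])) > 1]]
print(list(df.columns))  # ['index', 'id', 'name', 'data1']
```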

Solution 3

Another solution is to set_index on the columns that are not being compared, compare the first row (selected by iloc) against the whole DataFrame with eq, and finally use boolean indexing:

df1 = df.set_index(['index','id','name',])
print (~df1.eq(df1.iloc[0]).all())
value     False
value2    False
value3    False
data1      True
val5      False
dtype: bool

print (df1.loc[:, ~df1.eq(df1.iloc[0]).all()].reset_index())
   index   id   name  data1
0      0  345  name1      3
1      1   12  name2      2
2      5    2  name6      7

Solution 4

Select the columns with boolean indexing:

df[df.columns[df.nunique() > 1]]

Solution 5

A Pythonic solution

Original DataFrame

index id   name  value  value2  value3  data1  val5
    0  345  name1    1      99      23     3      66
    1   12  name2    1      99      23     2      66
    5    2  name6    1      99      23     7      66

Solution

for col in df.columns:  # loop over the columns
    if len(df[col].unique()) == 1:  # a single unique value means every row is the same
        df.drop([col], axis=1, inplace=True)  # drop that column

DataFrame after executing the above code:

   index   id   name  data1
0      0  345  name1      3
1      1   12  name2      2
2      5    2  name6      7
Author: CYAN CEVI

Updated on June 17, 2021

Comments

  • CYAN CEVI
    CYAN CEVI almost 3 years

    Suppose I've data similar to following:

      index id   name  value  value2  value3  data1  val5
        0  345  name1    1      99      23     3      66
        1   12  name2    1      99      23     2      66
        5    2  name6    1      99      23     7      66
    

    How can we drop all those columns like (value, value2, value3) where all rows have the same values, in one command or a couple of commands, using Python?

    Consider we have many columns similar to value, value2, value3...value200.

    Output:

       index    id  name   data1
           0   345  name1    3
           1    12  name2    2
           5     2  name6    7
    
  • Ben
    Ben over 2 years
    Caveat: dropping columns by number of unique entries == 1 ends up dropping columns that have only one non-NaN entry. That may or may not be the intent.
  • Marko Tankosic
    Marko Tankosic over 2 years
    Great answer! Alternative would be: df = df[[i for i in df if df[i].nunique()>1]]
  • chandni mirchandani
    chandni mirchandani about 2 years
    Can someone help with how to delete the columns where 95% of the values are the same?
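Regarding the last comment, one possible sketch (assumption: "95% of same values" means the most frequent value covers at least 95% of rows; the column names and threshold below are made up for illustration, and dropna=False also counts NaN as a value, which addresses the NaN caveat above):

```python
import pandas as pd

# Illustrative frame: "mostly_same" is 95% identical, the others are varied
df = pd.DataFrame({
    "id": list(range(20)),
    "mostly_same": [7] * 19 + [8],
    "varied": list(range(20)),
})

threshold = 0.95
# Share of rows covered by each column's most frequent value;
# dropna=False means NaN is counted like any other value.
top_freq = df.apply(lambda s: s.value_counts(dropna=False, normalize=True).max())
df = df.drop(columns=top_freq[top_freq >= threshold].index)
print(list(df.columns))  # ['id', 'varied']
```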