Detect and exclude outliers in a pandas DataFrame

python pandas filtering dataframe outliers

432,793

Solution 1

If you have multiple columns in your dataframe and would like to remove all rows that have outliers in at least one column, the following expression would do that in one shot.

df = pd.DataFrame(np.random.randn(100, 3))

import numpy as np
from scipy import stats
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

description:

For each column, it first computes the Z-score of each value in the column, relative to the column mean and standard deviation.
It then takes the absolute Z-score because the direction does not matter, only if it is below the threshold.
all(axis=1) ensures that for each row, all column satisfy the constraint.
Finally, the result of this condition is used to index the dataframe.

Filter other columns based on a single column

Specify a column for the zscore, df[0] for example, and remove .all(axis=1).

df[(np.abs(stats.zscore(df[0])) < 3)]

Solution 2

For each of your dataframe column, you could get quantile with:

q = df["col"].quantile(0.99)

and then filter with:

df[df["col"] < q]

If one need to remove lower and upper outliers, combine condition with an AND statement:

q_low = df["col"].quantile(0.01)
q_hi  = df["col"].quantile(0.99)

df_filtered = df[(df["col"] < q_hi) & (df["col"] > q_low)]

Solution 3

Use boolean indexing as you would do in numpy.array

df = pd.DataFrame({'Data':np.random.normal(size=200)})
# example dataset of normally distributed data. 

df[np.abs(df.Data-df.Data.mean()) <= (3*df.Data.std())]
# keep only the ones that are within +3 to -3 standard deviations in the column 'Data'.

df[~(np.abs(df.Data-df.Data.mean()) > (3*df.Data.std()))]
# or if you prefer the other way around

For a series it is similar:

S = pd.Series(np.random.normal(size=200))
S[~((S-S.mean()).abs() > 3*S.std())]

Solution 4

This answer is similar to that provided by @tanemaki, but uses a lambda expression instead of scipy stats.

df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))

standard_deviations = 3
df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < standard_deviations)
   .all(axis=1)]

To filter the DataFrame where only ONE column (e.g. 'B') is within three standard deviations:

df[((df['B'] - df['B'].mean()) / df['B'].std()).abs() < standard_deviations]

See here for how to apply this z-score on a rolling basis: Rolling Z-score applied to pandas dataframe

Solution 5

#------------------------------------------------------------------------------
# accept a dataframe, remove outliers, return cleaned data in a new dataframe
# see http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm
#------------------------------------------------------------------------------
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3-q1 #Interquartile range
    fence_low  = q1-1.5*iqr
    fence_high = q3+1.5*iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out

View more solutions

432,793

user1121201

Programmer

Updated on July 08, 2022

Comments

user1121201 almost 2 years

I have a pandas data frame with few columns.

Now I know that certain rows are outliers based on a certain column value.

For instance

column 'Vol' has all values around 12xx and one value is 4000 (outlier).

Now I would like to exclude those rows that have Vol column like this.

So, essentially I need to put a filter on the data frame such that we select all rows where the values of a certain column are within, say, 3 standard deviations from mean.

What is an elegant way to achieve this?
- chandni mirchandani about 2 years
  
  did you got the solution ?
Jeff about 10 years

their is a DataFrame.abs() FYI, also DataFrame.clip()
CT Zhu about 10 years

In the case of clip(), Jeff, the outlines are not removed: df.SOME_DATA.clip(-3std,+3std) assign the outliners to either +3std or -3std
user1121201 about 10 years

What if i need hte same on a pd.Series?
CT Zhu about 10 years

That is almost the same, @AMM
samthebrand over 8 years

Can you explain what this code is doing? And perhaps provide an idea how I might remove all rows that have an outlier in a single specified column? Would be helpful. Thanks.
rafaelvalle almost 8 years

For each column, first it computes the Z-score of each value in the column, relative to the column mean and standard deviation. Then is takes the absolute of Z-score because the direction does not matter, only if it is below the threshold. .all(axis=1) ensures that for each row, all column satisfy the constraint. Finally, result of this condition is used to index the dataframe.
DreamerP about 6 years

How can we do the same thing if our pandas data frame has 100 columns?
user6903745 about 6 years

This article gives a very good overview of outlier removal techniques machinelearningmastery.com/…
Imran Ahmad Ghazali about 6 years

I am getting error "ValueError: Cannot index with multidimensional key" in line " df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)] " Will you help
JE_Muc almost 6 years

Awesome, thanks for that answer @CTZhu. @DreamerP you can just apply it to the whole DataFrame with: df_new = df[np.abs(df - df.mean()) <= (3 * df.std())]. But in contrast to applying it to a Series or single column, this will replace outliers with np.nan and keep the shape of the DataFrame, so interpolation might be needed to fill the missing values.
asimo over 5 years

How would you handle the situation when there are Nulls/Nans in the columns. How can we have them ignored ?
wordsforthewise over 5 years

trimboth was easiest for me.
RajeshM over 5 years

Can't make assumptions about why the OP wants to do something.
BCArg about 5 years

Here you are selecting only data within the interquartile range (IQR), but keep in mind that there can be values outside this range that are not outliers.
ssp about 5 years

how do we deal with str columns for this solution? If some of the columns are non-numeric and we want to remove outliers based on all numeric columns.
Priyansh about 5 years

@rafaelvalle What is the significance of 3 in the code above, can you explain that?
rafaelvalle about 5 years

assuming distribution X with mean mu and standard deviation sigma, the z score measures how many sigmas a value is from mu. algebraically: z-score = (x - mu) / sigma. the 3 is the threshold in number of standard deviations away from the mean.
KeyMaker00 almost 5 years

Succinct and elegant for all dataset's attributes. I like it. I have taken the liberty to extend your answer (see bellow) to handle a data-frame than might contain also non-numerical values. Hope it can help someone.
PascalVKooten almost 5 years

Choosing e.g. 0.1 and 0.9 would be pretty safe I think. Using between and the quantiles like this is a pretty syntax.
sak almost 5 years

Got error: "TypeError: unsupported operand type(s) for /: 'str' and 'int'"
Erfan over 4 years

This should be le(3) since its removing outliers. This way you get True for the outliers. Besides that +1 and this answer should be higher up
RK1 over 4 years

Great solution! As a heads up reduce=False has been deprecated since pandas version 0.23.0
Sam Vanhoutte over 4 years

@sak : that is because you are running this on all columns and rows in your dataset. so, it expects these values to be numeric. typically you can execute the above by column name, or first apply (one-hot?) encoding to make all your values numeric, before executing this.
Sam Vanhoutte over 4 years

@tanemaki - I also use this code for the outliers, but I am now looking to get the list of columns that contain values that fall outside of the 3-sigma range. so I can use this to scan a dataset and get a good indication of which columns contain outliers.
indolentdeveloper over 4 years

this might remove outliers only from upper bound.. not lower?
user6903745 over 4 years

@indolentdeveloper you are right, just invert the inequality to remove lower outliers, or combine them with an OR operator.
indolentdeveloper over 4 years

The idea of comment was to up update the answers ;). Since someone can miss this point.
Ekaba Bisong over 4 years

Substitute result_type='reduce' for reduce=False.
A.B over 4 years

@user6903745 AND statement or "OR"?
JOHN over 4 years

@rafaelvalle. Should we use all or any?
user6903745 over 4 years

@A.B yes that's an AND statement, mistake in my previous comment
Admin about 4 years

@user6903745 df_filtered = df[(df["col"] < q_hi) & (df["col"] > q_low)] I guess this statement is enough to remove both upper and lower outliers. I don't know why this isn't enough
user6903745 about 4 years

@Ashwani Yes, the answer should have been already corrected accordinly.
bendl about 4 years

This fails in the event that an entire column has the same value - in these cases zscore returns NaN and therefore the < 3 check returns False for every row, so it drops every record.
bendl about 4 years

This fails in the event that an entire column has the same value - in these cases zscore returns NaN and therefore the < 3 check returns False for every row dropping every record.
enricw almost 4 years

I get "AttributeError: 'DataFrame' object has no attribute 'Data' ". Anyone know how to tackle this?
seralouk over 3 years

It's better to explicitly state the axis: df[(np.abs(stats.zscore(df, axis=0)) < 3).all(axis=1)]
user140259 about 3 years

@tanemaki I made boxplot graphs before and after using that command (in jupyter notebook, it shows you q1, q3, upper fence, lower_fence). Checking my results after using that command, I have fewer records, but some of them would still be outliers because their values are higher than upper_fence or lower than lower_fence from my original boxplot. Do you know why that happens? Thank you.
taga almost 3 years

Can you tell me what number 3 represents in df[(np.abs(stats.zscore(df)) < 3).all(axis=1)] ?
Keivan over 2 years

The number 3 represent the 3 standard deviation. You can find more information about it here: sixsigmastudyguide.com/z-scores-z-table-z-transformations
Lorenzo Bassetti over 2 years

@sak : if some numbers are wrongly read as strings, you can try this: DF["column"] = pd.to_numeric(DF["column"]) . It will transform strings to numbers, if they contains numbers of course.
sak over 2 years

@LorenzoBassetti Thanks Lorenzo, 2 yrs.
flashliquid over 2 years

@KeyMaker00 I'd really like to use this but I get the following error: ValueError: No axis named 1 for object type Series
kommradHomer over 2 years

for those who have 10k of data and just a dozen outliers , quantile doesn't help. I'd suggest z-score
user6903745 over 2 years

@kommradHomer could you elaborate please?
kommradHomer over 2 years

@user6903745 when you have a few thousands of 0s and 1s in a series of 10.000 values and like only 20-30 values above 1 , you need quantiles like 0.9999 to see something different
user6903745 over 2 years

@kommradHomer I agree, it might depend on the shape of the data distribution
till Kadabra over 2 years

To avoid dropping rows with NaNs in non-numerical columns use df.dropna(how='any', subset=cols, inplace=True)
Aaditya Ura about 2 years

Hi, could you take a look at this question stackoverflow.com/questions/70954791/…
nathan liang almost 2 years

@enricw That means your dataframe has no column named 'Data', you'll need to select the column with the right name for you.
user1259201 almost 2 years

I believe np.logical_or should be np.logical_and to work properly (option 2)