Detect and exclude outliers in a pandas DataFrame

Solution 1

If you have multiple columns in your dataframe and would like to remove all rows that have outliers in at least one column, the following expression would do that in one shot.

import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(np.random.randn(100, 3))

df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

Description:

  • For each column, it first computes the Z-score of each value in the column, relative to the column mean and standard deviation.
  • It then takes the absolute value of the Z-score, because the direction does not matter, only whether the magnitude is below the threshold.
  • .all(axis=1) ensures that, for each row, every column satisfies the constraint.
  • Finally, the result of this condition is used to index the dataframe.
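The steps above can be traced one at a time. The sketch below is not the answer's own code: it swaps stats.zscore for the equivalent plain-pandas arithmetic, using std(ddof=0) because stats.zscore defaults to the population standard deviation. The column names, the seed, and the planted value 25.0 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumption): 100 rows of standard-normal noise.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100, 3)), columns=["a", "b", "c"])
df.loc[0, "a"] = 25.0  # plant one obvious outlier in row 0

# Step 1: per-column z-score (ddof=0 mirrors the stats.zscore default).
z = (df - df.mean()) / df.std(ddof=0)

# Step 2: the sign is irrelevant, so compare the absolute value to the threshold.
within = z.abs() < 3

# Step 3: a row is kept only if every column is within the threshold.
keep = within.all(axis=1)

# Step 4: use the boolean Series to index the DataFrame.
filtered = df[keep]
```

Inspecting z, within and keep separately makes it easy to see which column caused a given row to be dropped.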

Filter other columns based on a single column

  • Specify a column for the zscore, df[0] for example, and remove .all(axis=1).
df[(np.abs(stats.zscore(df[0])) < 3)]
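One caveat raised in the comments on this page: if a column holds a single repeated value, its standard deviation is zero, the Z-scores become NaN, and the `< 3` test is False for every row, so everything is dropped. A minimal sketch of that failure and one possible workaround follows. It uses plain pandas arithmetic rather than stats.zscore, and the column names "signal" and "flag" are hypothetical. Treating NaN scores as "not an outlier" is a design choice, and it also keeps rows whose original values are missing.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "signal": rng.standard_normal(50),
    "flag": np.ones(50),  # constant column: standard deviation is 0
})
df.loc[0, "signal"] = 30.0  # planted outlier

# 0 / 0 gives NaN, so every z-score in "flag" is NaN.
z = (df - df.mean()) / df.std(ddof=0)

# NaN < 3 evaluates to False, so the strict test rejects every row.
strict = df[(z.abs() < 3).all(axis=1)]

# Workaround (assumption): count NaN scores as passing the test.
lenient = df[((z.abs() < 3) | z.isna()).all(axis=1)]
```

Dropping constant columns before computing the scores would be another way to get the same effect.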

Solution 2

For each column of your dataframe, you can get a quantile with:

q = df["col"].quantile(0.99)

and then filter with:

df[df["col"] < q]

If you need to remove both lower and upper outliers, combine the two conditions with an AND (&):

q_low = df["col"].quantile(0.01)
q_hi  = df["col"].quantile(0.99)

df_filtered = df[(df["col"] < q_hi) & (df["col"] > q_low)]
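As one commenter suggests, Series.between can express the same two-sided filter. The sketch below is only an illustration on made-up data (the integers 1 to 100 in a column named "col"). Note that between is inclusive at both ends by default, whereas the expression above uses strict inequalities, so values exactly equal to a quantile are kept here.

```python
import pandas as pd

# Illustrative data (assumption): the integers 1..100.
df = pd.DataFrame({"col": range(1, 101)})

# Passing a list returns both quantiles in one call.
q_low, q_hi = df["col"].quantile([0.01, 0.99])

# Keep rows whose value lies between the two quantiles (inclusive).
df_filtered = df[df["col"].between(q_low, q_hi)]
```

With this data the smallest and largest values fall outside the quantile range and are removed.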

Solution 3

Use boolean indexing, as you would with a numpy.array:

df = pd.DataFrame({'Data':np.random.normal(size=200)})
# example dataset of normally distributed data. 

df[np.abs(df.Data-df.Data.mean()) <= (3*df.Data.std())]
# keep only the ones that are within +3 to -3 standard deviations in the column 'Data'.

df[~(np.abs(df.Data-df.Data.mean()) > (3*df.Data.std()))]
# or if you prefer the other way around

For a series it is similar:

S = pd.Series(np.random.normal(size=200))
S[~((S-S.mean()).abs() > 3*S.std())]
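The two forms above give the same result on complete data, but they differ when values are missing, which is relevant to the question about NaNs in the comments. A comparison against NaN is always False, so the `<=` form drops NaN entries while the negated `>` form keeps them. The sketch below demonstrates this on a small made-up Series; it is an illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumption): four ordinary values and one missing value.
S = pd.Series([1.0, 2.0, 3.0, np.nan, 2.0])

deviation = (S - S.mean()).abs()  # mean() and std() skip NaN by default
limit = 3 * S.std()

# "Keep what is within the limit": NaN <= limit is False, so NaN is dropped.
form_le = S[deviation <= limit]

# "Drop what exceeds the limit": NaN > limit is False, negated to True, so NaN is kept.
form_not_gt = S[~(deviation > limit)]
```

Which behaviour is preferable depends on whether missing values should survive the outlier filter.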

Solution 4

This answer is similar to that provided by @tanemaki, but uses a lambda expression instead of scipy stats.

df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))

standard_deviations = 3
df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < standard_deviations)
   .all(axis=1)]

To filter the DataFrame where only ONE column (e.g. 'B') is within three standard deviations:

df[((df['B'] - df['B'].mean()) / df['B'].std()).abs() < standard_deviations]
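Several comments report a TypeError when the DataFrame also contains string columns. One way around that, sketched here under the assumption that outliers should be judged on the numeric columns only, is to compute the scores on df.select_dtypes(include="number") and then use the resulting mask on the full DataFrame. The "label" column and the planted value 40.0 are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.standard_normal((100, 2)), columns=["A", "B"])
df["label"] = "x"      # non-numeric column that would break x.mean() / x.std()
df.loc[5, "B"] = 40.0  # planted outlier

standard_deviations = 3

# Compute z-scores on the numeric columns only.
numeric = df.select_dtypes(include="number")
z = (numeric - numeric.mean()) / numeric.std()

# The mask shares the original index, so it can filter the full DataFrame.
filtered = df[(z.abs() < standard_deviations).all(axis=1)]
```

The non-numeric columns are carried through untouched; only the rows flagged by the numeric columns are removed.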

For how to apply this z-score on a rolling basis, see the question "Rolling Z-score applied to pandas dataframe".

Solution 5

#------------------------------------------------------------------------------
# accept a dataframe, remove outliers, return cleaned data in a new dataframe
# see http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm
#------------------------------------------------------------------------------
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3-q1 #Interquartile range
    fence_low  = q1-1.5*iqr
    fence_high = q3+1.5*iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out
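A short usage sketch, based on the situation described in the question (a 'Vol' column with values around 12xx and a single 4000). The concrete numbers are made up for illustration, and the function is repeated so the block runs on its own. col_name is passed as a single string; passing a list of column names would make df_in[col_name] a DataFrame, which is a likely cause of the "Cannot index with multidimensional key" error mentioned in the comments.

```python
import pandas as pd


def remove_outlier(df_in, col_name):
    """Return the rows of df_in whose col_name value lies inside the 1.5*IQR fences."""
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    return df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]


# Illustrative data (assumption): seven typical values and one extreme one.
df = pd.DataFrame({"Vol": [1200, 1210, 1220, 1230, 1240, 1250, 1260, 4000]})

cleaned = remove_outlier(df, "Vol")
```

The original DataFrame is left unchanged; cleaned is a new object containing only the rows inside the fences.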
Author: user1121201, Programmer

Updated on July 08, 2022

Comments

  • user1121201
    user1121201 almost 2 years

    I have a pandas data frame with few columns.

    Now I know that certain rows are outliers based on a certain column value.

    For instance

    column 'Vol' has all values around 12xx and one value is 4000 (outlier).

    Now I would like to exclude those rows that have Vol column like this.

    So, essentially I need to put a filter on the data frame such that we select all rows where the values of a certain column are within, say, 3 standard deviations from mean.

    What is an elegant way to achieve this?

    • chandni mirchandani
      chandni mirchandani about 2 years
      Did you get the solution?
  • Jeff
    Jeff about 10 years
    FYI, there is a DataFrame.abs(), and also DataFrame.clip()
  • CT Zhu
    CT Zhu about 10 years
    In the case of clip(), Jeff, the outliers are not removed: df.SOME_DATA.clip(-3std,+3std) assigns the outliers to either +3std or -3std
  • user1121201
    user1121201 about 10 years
    What if I need the same on a pd.Series?
  • CT Zhu
    CT Zhu about 10 years
    That is almost the same, @AMM
  • samthebrand
    samthebrand over 8 years
    Can you explain what this code is doing? And perhaps provide an idea how I might remove all rows that have an outlier in a single specified column? Would be helpful. Thanks.
  • rafaelvalle
    rafaelvalle almost 8 years
    For each column, it first computes the Z-score of each value in the column, relative to the column mean and standard deviation. Then it takes the absolute value of the Z-score, because the direction does not matter, only whether it is below the threshold. .all(axis=1) ensures that, for each row, all columns satisfy the constraint. Finally, the result of this condition is used to index the dataframe.
  • DreamerP
    DreamerP about 6 years
    How can we do the same thing if our pandas data frame has 100 columns?
  • user6903745
    user6903745 about 6 years
    This article gives a very good overview of outlier removal techniques machinelearningmastery.com/…
  • Imran Ahmad Ghazali
    Imran Ahmad Ghazali about 6 years
    I am getting error "ValueError: Cannot index with multidimensional key" in line " df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)] " Will you help
  • JE_Muc
    JE_Muc almost 6 years
    Awesome, thanks for that answer @CTZhu. @DreamerP you can just apply it to the whole DataFrame with: df_new = df[np.abs(df - df.mean()) <= (3 * df.std())]. But in contrast to applying it to a Series or single column, this will replace outliers with np.nan and keep the shape of the DataFrame, so interpolation might be needed to fill the missing values.
  • asimo
    asimo over 5 years
    How would you handle the situation when there are Nulls/Nans in the columns. How can we have them ignored ?
  • wordsforthewise
    wordsforthewise over 5 years
    trimboth was easiest for me.
  • RajeshM
    RajeshM over 5 years
    Can't make assumptions about why the OP wants to do something.
  • BCArg
    BCArg about 5 years
    Here you are selecting only data within the interquartile range (IQR), but keep in mind that there can be values outside this range that are not outliers.
  • ssp
    ssp about 5 years
    how do we deal with str columns for this solution? If some of the columns are non-numeric and we want to remove outliers based on all numeric columns.
  • Priyansh
    Priyansh about 5 years
    @rafaelvalle What is the significance of 3 in the code above, can you explain that?
  • rafaelvalle
    rafaelvalle about 5 years
    assuming distribution X with mean mu and standard deviation sigma, the z score measures how many sigmas a value is from mu. algebraically: z-score = (x - mu) / sigma. the 3 is the threshold in number of standard deviations away from the mean.
  • KeyMaker00
    KeyMaker00 almost 5 years
    Succinct and elegant for all of the dataset's attributes. I like it. I have taken the liberty to extend your answer (see below) to handle a data frame that might also contain non-numerical values. Hope it can help someone.
  • PascalVKooten
    PascalVKooten almost 5 years
    Choosing e.g. 0.1 and 0.9 would be pretty safe I think. Using between and the quantiles like this makes for pretty syntax.
  • sak
    sak almost 5 years
    Got error: "TypeError: unsupported operand type(s) for /: 'str' and 'int'"
  • Erfan
    Erfan over 4 years
    This should be le(3) since its removing outliers. This way you get True for the outliers. Besides that +1 and this answer should be higher up
  • RK1
    RK1 over 4 years
    Great solution! As a heads up reduce=False has been deprecated since pandas version 0.23.0
  • Sam Vanhoutte
    Sam Vanhoutte over 4 years
    @sak : that is because you are running this on all columns and rows in your dataset. so, it expects these values to be numeric. typically you can execute the above by column name, or first apply (one-hot?) encoding to make all your values numeric, before executing this.
  • Sam Vanhoutte
    Sam Vanhoutte over 4 years
    @tanemaki - I also use this code for the outliers, but I am now looking to get the list of columns that contain values that fall outside of the 3-sigma range. so I can use this to scan a dataset and get a good indication of which columns contain outliers.
  • indolentdeveloper
    indolentdeveloper over 4 years
    this might remove outliers only from upper bound.. not lower?
  • user6903745
    user6903745 over 4 years
    @indolentdeveloper you are right, just invert the inequality to remove lower outliers, or combine them with an OR operator.
  • indolentdeveloper
    indolentdeveloper over 4 years
    The idea of the comment was to update the answers ;). Since someone can miss this point.
  • Ekaba Bisong
    Ekaba Bisong over 4 years
    Substitute result_type='reduce' for reduce=False.
  • A.B
    A.B over 4 years
    @user6903745 AND statement or "OR"?
  • JOHN
    JOHN over 4 years
    @rafaelvalle. Should we use all or any?
  • user6903745
    user6903745 over 4 years
    @A.B yes that's an AND statement, mistake in my previous comment
  • Admin
    Admin about 4 years
    @user6903745 df_filtered = df[(df["col"] < q_hi) & (df["col"] > q_low)] I guess this statement is enough to remove both upper and lower outliers. I don't know why this isn't enough
  • user6903745
    user6903745 about 4 years
    @Ashwani Yes, the answer should already have been corrected accordingly.
  • bendl
    bendl about 4 years
    This fails in the event that an entire column has the same value - in these cases zscore returns NaN and therefore the < 3 check returns False for every row, so it drops every record.
  • enricw
    enricw almost 4 years
    I get "AttributeError: 'DataFrame' object has no attribute 'Data' ". Anyone know how to tackle this?
  • seralouk
    seralouk over 3 years
    It's better to explicitly state the axis: df[(np.abs(stats.zscore(df, axis=0)) < 3).all(axis=1)]
  • user140259
    user140259 about 3 years
    @tanemaki I made boxplot graphs before and after using that command (in jupyter notebook, it shows you q1, q3, upper fence, lower_fence). Checking my results after using that command, I have fewer records, but some of them would still be outliers because their values are higher than upper_fence or lower than lower_fence from my original boxplot. Do you know why that happens? Thank you.
  • taga
    taga almost 3 years
    Can you tell me what number 3 represents in df[(np.abs(stats.zscore(df)) < 3).all(axis=1)] ?
  • Keivan
    Keivan over 2 years
    The number 3 represent the 3 standard deviation. You can find more information about it here: sixsigmastudyguide.com/z-scores-z-table-z-transformations
  • Lorenzo Bassetti
    Lorenzo Bassetti over 2 years
    @sak : if some numbers are wrongly read as strings, you can try this: DF["column"] = pd.to_numeric(DF["column"]) . It will convert strings to numbers, provided they contain numbers, of course.
  • sak
    sak over 2 years
    @LorenzoBassetti Thanks Lorenzo, 2 yrs.
  • flashliquid
    flashliquid over 2 years
    @KeyMaker00 I'd really like to use this but I get the following error: ValueError: No axis named 1 for object type Series
  • kommradHomer
    kommradHomer over 2 years
    for those who have 10k of data and just a dozen outliers , quantile doesn't help. I'd suggest z-score
  • user6903745
    user6903745 over 2 years
    @kommradHomer could you elaborate please?
  • kommradHomer
    kommradHomer over 2 years
    @user6903745 when you have a few thousands of 0s and 1s in a series of 10.000 values and like only 20-30 values above 1 , you need quantiles like 0.9999 to see something different
  • user6903745
    user6903745 over 2 years
    @kommradHomer I agree, it might depend on the shape of the data distribution
  • till Kadabra
    till Kadabra over 2 years
    To avoid dropping rows with NaNs in non-numerical columns use df.dropna(how='any', subset=cols, inplace=True)
  • Aaditya Ura
    Aaditya Ura about 2 years
    Hi, could you take a look at this question stackoverflow.com/questions/70954791/…
  • nathan liang
    nathan liang almost 2 years
    @enricw That means your dataframe has no column named 'Data', you'll need to select the column with the right name for you.
  • user1259201
    user1259201 almost 2 years
    I believe np.logical_or should be np.logical_and to work properly (option 2)