How to replace outliers with the 5th and 95th percentile values in R

Solution 1

This would do it.

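fun <- function(x){
    # 5th and 95th percentiles; na.rm = TRUE so missing values do not cause an error
    quantiles <- quantile( x, c(.05, .95 ), na.rm = TRUE )
    # pull values below the 5th percentile up to it
    x[ x < quantiles[1] ] <- quantiles[1]
    # pull values above the 95th percentile down to it
    x[ x > quantiles[2] ] <- quantiles[2]
    x
}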
fun( yourdata )
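
To see what this kind of capping does on concrete data, here is a minimal sketch on a made-up vector. The name cap_percentiles and its lower/upper arguments are illustrative (they are not from the answer above), and the percentile values in the comments assume R's default quantile type.

```r
# Hypothetical helper: same idea as fun() above, with adjustable percentiles.
cap_percentiles <- function(x, lower = 0.05, upper = 0.95) {
  # unname() drops the "5%" / "95%" labels that quantile() attaches
  limits <- unname(quantile(x, c(lower, upper), na.rm = TRUE))
  # which() skips NA comparisons, so missing values are left as NA
  x[which(x < limits[1])] <- limits[1]
  x[which(x > limits[2])] <- limits[2]
  x
}

# Made-up data: two extreme values and one missing value
x <- c(-100, 1:18, 500, NA)
capped <- cap_percentiles(x)

# With the default quantile type, the limits here are about -4.05 and 42.1,
# so -100 and 500 are replaced by those values and 1..18 are unchanged.
range(capped, na.rm = TRUE)
```

The values between the limits are returned as they were, and the NA stays NA instead of being replaced.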

Solution 2

You can do it in one line of code using squish() from the scales package:

d2 <- squish(d, quantile(d, c(.05, .95)))



In the scales package, see ?squish and ?discard.

#--------------------------------
library(scales)

pr <- .95
q  <- quantile(d, c(1-pr, pr))
d2 <- squish(d, q)
#---------------------------------

# Note: depending on your needs, you may want to round the quantiles, i.e.:
q <- round(quantile(d, c(1-pr, pr)))

Example:

d <- 1:20
d
# [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20


d2 <- squish(d, round(quantile(d, c(.05, .95))))
d2
# [1]  2  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 19
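
If you would rather not depend on scales, the same clamping can be sketched in base R with pmin() and pmax(), as one of the comments below suggests. The helper name cap and its arguments are illustrative, not part of any package.

```r
# Hypothetical base-R stand-in for squish(): clamp x into [low, high].
cap <- function(x, low, high) {
  # pmax() lifts anything below `low`, pmin() lowers anything above `high`
  pmin(pmax(x, low), high)
}

d <- 1:20
# unname() drops the "5%" / "95%" labels so they cannot leak into the result
q <- unname(round(quantile(d, c(.05, .95))))
d2 <- cap(d, q[1], q[2])
d2
# [1]  2  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 19
```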

Solution 3

I used this code to get what you need:

qn = quantile(df$value, c(0.05, 0.95), na.rm = TRUE)
df = within(df, { value = ifelse(value < qn[1], qn[1], value)
                  value = ifelse(value > qn[2], qn[2], value)})

Here df is your data frame and value is the column that contains your data.
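
A small reproducible check of this approach, assuming a made-up data frame with one extreme value at each end and one missing value. The column names id and value and the data itself are only for illustration.

```r
# Toy data frame (hypothetical): value has two extremes and one NA
df <- data.frame(id = 1:21, value = c(-100, 1:18, 500, NA))

qn <- quantile(df$value, c(0.05, 0.95), na.rm = TRUE)
df <- within(df, {
  # ifelse() returns NA wherever the comparison is NA, so missing values survive
  value <- ifelse(value < qn[1], qn[1], value)
  value <- ifelse(value > qn[2], qn[2], value)
})

range(df$value, na.rm = TRUE)  # the extremes are now the 5th/95th percentiles
sum(is.na(df$value))           # the missing value is still missing
```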

Solution 4

There is a better way to solve this problem. An outlier is not simply any point above the 95th percentile or below the 5th percentile. Instead, a point is considered an outlier if it lies below the first quartile - 1.5·IQR or above the third quartile + 1.5·IQR.
This website explains it more thoroughly

To learn more about outlier treatment, refer here

capOutlier <- function(x){
   # quartiles define the outlier fences; the 5th/95th percentiles are the replacement values
   qnt <- quantile(x, probs=c(.25, .75), na.rm = TRUE)
   caps <- quantile(x, probs=c(.05, .95), na.rm = TRUE)
   H <- 1.5 * IQR(x, na.rm = TRUE)
   x[x < (qnt[1] - H)] <- caps[1]
   x[x > (qnt[2] + H)] <- caps[2]
   return(x)
}
df$colName=capOutlier(df$colName)

Repeat the line above for each column in your data frame.
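
Rather than repeating that line by hand, one option is to apply the function to every numeric column at once. This is a sketch on a made-up data frame: the columns a, b and label are illustrative, and capOutlier is restated (using which() for the indexing) so that the block runs on its own.

```r
# Restated from the answer above so the example is self-contained
capOutlier <- function(x) {
  qnt  <- quantile(x, probs = c(.25, .75), na.rm = TRUE)
  caps <- quantile(x, probs = c(.05, .95), na.rm = TRUE)
  H <- 1.5 * IQR(x, na.rm = TRUE)
  x[which(x < (qnt[1] - H))] <- caps[1]
  x[which(x > (qnt[2] + H))] <- caps[2]
  x
}

# Toy data (hypothetical): one high outlier in a, one low outlier in b
df <- data.frame(
  a = c(1:19, 1000),
  b = c(-500, 2:20),
  label = letters[1:20]
)

# Pick out the numeric columns and cap each one; other columns are untouched
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols], capOutlier)

df$a[20]  # was 1000, now the 95th percentile of a
df$b[1]   # was -500, now the 5th percentile of b
```

Restricting to numeric columns avoids calling quantile() on character or factor columns, where it would fail.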
Author by

Bobbo

Updated on August 05, 2022

Comments

  • Bobbo
    Bobbo almost 2 years

    I'd like to replace all values in my relatively large R dataset which take values above the 95th and below the 5th percentile, with those percentile values respectively. My aim is to avoid simply cropping these outliers from the data entirely.

    Any advice would be much appreciated, I can't find any information on how to do this anywhere else.

  • Bobbo
    Bobbo over 11 years
    Thank you, works like a dream. I'm new to this website, is there any way I can give you rep or something for this answer?
  • Bobbo
    Bobbo over 11 years
    thank you for your answer, both yours and the one above work perfectly
  • Romain Francois
    Romain Francois over 11 years
    you can up the answer(s) and accept it (you accepted it already). See stackoverflow.com/faq which will also give you a badge if you read them all
  • Bolaka
    Bolaka over 9 years
    The above snippet will also replace NAs (if any) by the quantile values!
  • ctbrown
    ctbrown over 5 years
    That is a rigid definition of an outlier. Whether you define the outlier definition at below 20% / above 80%+ (as you have defined) or below 5% / above 95%+ (as the OP) is arbitrary; what works will depend on your problem and data.
  • Kyle Peters
    Kyle Peters over 5 years
    I didn't define it as below 20% or above 80%. I used a common definition of an outlier that will probably be used in an introduction to statistics class. Anything less than the first quartile - 1.5 * the interquartile range or above the third quartile + 1.5 * the interquartile range is considered an outlier. The interquartile range (IQR) is the range between the first quartile and the third quartile (the middle 50% of the data).
  • ctbrown
    ctbrown over 5 years
    That is not a "common" definition of what an outlier is. It is an arbitrary one.
  • Kyle Peters
    Kyle Peters over 5 years
    If you take a 101 statistics class in college, they will give you this definition of what an outlier is. Check the website in my answer. There are other definitions of what an outlier is, but this is the most basic and most used one. And, the definition I posted is more accurate than the one given in the question. If you had the data (.99998,1,1,1,1,1,1,1,1.0001), then .99998 and 1.0001 would be classified wrongly as outliers if you used the outlier classification method described in the question.
  • Ben
    Ben over 4 years
    Nice. Or you could roll squish into your own function. cap <- function(x, low, high) pmin(high, pmax(low, x))
  • Jason Goal
    Jason Goal over 2 years
    check the .clip function from pandas pandas.pydata.org/docs/reference/api/… as well