How to replace outliers with the 5th and 95th percentile values in R

r dataset outliers quantile

Solution 1

This would do it.

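fun <- function(x){
    # 5th and 95th percentiles of the data
    quantiles <- quantile( x, c(.05, .95 ) )
    # cap values below the 5th percentile
    x[ x < quantiles[1] ] <- quantiles[1]
    # cap values above the 95th percentile
    x[ x > quantiles[2] ] <- quantiles[2]
    x
}
fun( yourdata )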
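
If the data may contain NA values, quantile() stops with an error unless na.rm = TRUE is set. Here is a minimal sketch of an NA-tolerant variant; the name cap_quantiles, the probs argument, and the sample vector are illustrative assumptions, not part of the answer above.

```r
# Hypothetical helper: cap values outside the given quantiles, leaving NAs as NA.
cap_quantiles <- function(x, probs = c(0.05, 0.95)) {
  # na.rm = TRUE ignores missing values when computing the limits;
  # names = FALSE drops the "5%" / "95%" labels from the result
  limits <- quantile(x, probs, na.rm = TRUE, names = FALSE)
  # which() returns only the positions where the test is TRUE, so NAs are skipped
  x[which(x < limits[1])] <- limits[1]
  x[which(x > limits[2])] <- limits[2]
  x
}

# Made-up sample data with one low extreme, one high extreme and one NA
x <- c(-100, 1:18, NA, 500)
capped <- cap_quantiles(x)
capped
```

The two extremes are pulled in to the 5th and 95th percentile of the non-missing values, and the NA stays where it was.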

Solution 2

You can do it in one line of code using squish() from the scales package:

d2 <- squish(d, quantile(d, c(.05, .95)))

squish() is part of the scales package; see ?squish and ?discard.

#--------------------------------
library(scales)

pr <- .95
q  <- quantile(d, c(1-pr, pr))
d2 <- squish(d, q)
#---------------------------------

# Note: depending on your needs, you may want to round off the quantiles, i.e.:
q <- round(quantile(d, c(1-pr, pr)))

Example:

d <- 1:20
d
# [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20


d2 <- squish(d, round(quantile(d, c(.05, .95))))
d2
# [1]  2  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 19

Solution 3

I used this code to get what you need:

qn = quantile(df$value, c(0.05, 0.95), na.rm = TRUE)
df = within(df, { value = ifelse(value < qn[1], qn[1], value)
                  value = ifelse(value > qn[2], qn[2], value)})

where df is your data.frame, and value the column that contains your data.

Solution 4

There is a better way to solve this problem. An outlier is not simply any point above the 95th percentile or below the 5th percentile. Instead, a point is considered an outlier if it is below the first quartile - 1.5·IQR or above the third quartile + 1.5·IQR.
This website explains it more thoroughly.

To learn more about outlier treatment, refer here.

capOutlier <- function(x){
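   # quartiles define the outlier fences; the 5th/95th percentiles are the replacement values
   qnt <- quantile(x, probs=c(.25, .75), na.rm = TRUE)
   caps <- quantile(x, probs=c(.05, .95), na.rm = TRUE)
   H <- 1.5 * IQR(x, na.rm = TRUE)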
   x[x < (qnt[1] - H)] <- caps[1]
   x[x > (qnt[2] + H)] <- caps[2]
   return(x)
}
df$colName <- capOutlier(df$colName)
Repeat the line above for each column in your data frame.
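
Rather than repeating that line by hand, the function can be applied to every numeric column in one step. This is a minimal sketch: the data frame and its column names (a, b, label) are made up for illustration, and capOutlier is repeated only so the snippet runs on its own.

```r
# Same function as in the answer, repeated so this snippet is self-contained
capOutlier <- function(x){
   qnt <- quantile(x, probs=c(.25, .75), na.rm = TRUE)
   caps <- quantile(x, probs=c(.05, .95), na.rm = TRUE)
   H <- 1.5 * IQR(x, na.rm = TRUE)
   x[x < (qnt[1] - H)] <- caps[1]
   x[x > (qnt[2] + H)] <- caps[2]
   return(x)
}

# Made-up example data: two numeric columns with one extreme value each,
# plus a non-numeric column that must be left alone
df <- data.frame(a = c(1:19, 1000),
                 b = c(-500, 2:20),
                 label = letters[1:20])

# Find the numeric columns, then cap each of them
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols], capOutlier)
df
```

Selecting with is.numeric keeps character and factor columns untouched, which matters because quantile() cannot be computed on them.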

Author by

Bobbo

Updated on August 05, 2022

Comments

Bobbo almost 2 years

I'd like to replace all values in my relatively large R dataset which take values above the 95th and below the 5th percentile, with those percentile values respectively. My aim is to avoid simply cropping these outliers from the data entirely.

Any advice would be much appreciated, I can't find any information on how to do this anywhere else.
Bobbo over 11 years

Thank you, works like a dream. I'm new to this website, is there any way I can give you rep or something for this answer?
Bobbo over 11 years

thank you for your answer, both yours and the one above work perfectly
Romain Francois over 11 years

You can upvote the answer(s) and accept one (you accepted it already). See stackoverflow.com/faq, which will also give you a badge if you read all of it.
Bolaka over 9 years

The above snippet will also replace NAs (if any) by the quantile values!
ctbrown over 5 years

That is a rigid definition of an outlier. Whether you set the cutoff at below 20% / above 80%+ (as you have defined) or below 5% / above 95%+ (as the OP) is arbitrary; what works will depend on your problem and data.
Kyle Peters over 5 years

I didn't define it as below 20% or above 80%. I used a common definition of an outlier that will probably be used in an introduction to statistics class. Anything less than the first quartile - 1.5 * the interquartile range or above the third quartile + 1.5 * the interquartile range is considered an outlier. The interquartile range (IQR) is the range between the first quartile and the third quartile (the middle 50% of the data).
ctbrown over 5 years

That is not a "common" definition of what an outlier is. It is an arbitrary one.
Kyle Peters over 5 years

If you take a 101 statistics class in college, they will give you this definition of what an outlier is. Check the website in my answer. There are other definitions of what an outlier is, but this is the most basic and most used one. And, the definition I posted is more accurate than the one given in the question. If you had the data (.99998,1,1,1,1,1,1,1,1.0001), then .99998 and 1.0001 would be classified wrongly as outliers if you used the outlier classification method described in the question.
Ben over 4 years

Nice. Or you could roll squish into your own function. cap <- function(x, low, high) pmin(high, pmax(low, x))
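
A minimal sketch of how that base R helper might be combined with quantile(), reusing the d <- 1:20 example from the squish() answer; passing the limits as q[1] and q[2] is an assumption about the intended usage.

```r
# Clamp x to the interval [low, high] with base R only (no scales package needed)
cap <- function(x, low, high) pmin(high, pmax(low, x))

d <- 1:20
# names = FALSE drops the "5%" / "95%" labels from the limits
q <- quantile(d, c(0.05, 0.95), names = FALSE)
d2 <- cap(d, q[1], q[2])
d2
```

pmax() raises anything below the lower limit and pmin() lowers anything above the upper limit; NA values pass through as NA.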
Jason Goal over 2 years

check the .clip function from pandas pandas.pydata.org/docs/reference/api/… as well