How to replace outliers with the 5th and 95th percentile values in R
Solution 1
This would do it.
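fun <- function(x){
  # 5th and 95th percentiles of the input
  quantiles <- quantile( x, c(.05, .95 ) )
  # pull values outside that range back to the nearest percentile
  x[ x < quantiles[1] ] <- quantiles[1]
  x[ x > quantiles[2] ] <- quantiles[2]
  x
}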
fun( yourdata )
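To see what this does on concrete numbers, here is a small self-contained sketch. The name cap_percentiles and the toy vector are made up for illustration; the body is the same logic as fun above, and the quoted values assume R's default quantile type.

```r
# Hypothetical name; same logic as the function in this answer.
cap_percentiles <- function(x) {
  q <- quantile(x, c(.05, .95))
  x[x < q[1]] <- q[1]   # raise the low tail to the 5th percentile
  x[x > q[2]] <- q[2]   # lower the high tail to the 95th percentile
  x
}

x <- 1:20                      # toy data
capped <- cap_percentiles(x)
range(capped)                  # roughly 1.95 and 19.05
```

Only the first and last values change here; the 18 values in between are already inside the percentile range.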
Solution 2
You can do it in one line of code using squish():
d2 <- squish(d, quantile(d, c(.05, .95)))
In the scales library, look at ?squish and ?discard
#--------------------------------
library(scales)
pr <- .95
q <- quantile(d, c(1-pr, pr))
d2 <- squish(d, q)
#---------------------------------
# Note: depending on your needs, you may want to round off the quantiles, i.e.:
q <- round(quantile(d, c(1-pr, pr)))
example:
d <- 1:20
d
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
d2 <- squish(d, round(quantile(d, c(.05, .95))))
d2
# [1] 2 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 19
Solution 3
I used this code to get what you need:
qn = quantile(df$value, c(0.05, 0.95), na.rm = TRUE)
df = within(df, {
  value = ifelse(value < qn[1], qn[1], value)
  value = ifelse(value > qn[2], qn[2], value)
})
where df is your data.frame, and value is the column that contains your data.
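A quick way to check this approach is to run it on a toy data frame. The sketch below uses made-up data with one NA added, to show that a missing value stays missing rather than being replaced; the quoted values assume R's default quantile type.

```r
# Toy data frame (made up), including one NA.
df <- data.frame(value = c(1:20, NA))

qn <- quantile(df$value, c(0.05, 0.95), na.rm = TRUE)
df <- within(df, {
  value <- ifelse(value < qn[1], qn[1], value)
  value <- ifelse(value > qn[2], qn[2], value)
})

range(df$value, na.rm = TRUE)  # roughly 1.95 and 19.05
sum(is.na(df$value))           # the NA is still there
```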
Solution 4
There is a better way to solve this problem. An outlier is not simply any point above the 95th percentile or below the 5th percentile. Instead, a point is considered an outlier if it lies below the first quartile - 1.5·IQR or above the third quartile + 1.5·IQR.
This website explains it more thoroughly.
To learn more about outlier treatment, refer here.
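As a rough illustration of the 1.5·IQR rule described above, the sketch below computes the two fences for a made-up vector and lists the values that fall outside them. The variable names are only for this example.

```r
# Toy data (made up): nine ordinary values and one extreme one.
x <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 50)

q <- quantile(x, c(.25, .75))   # first and third quartiles
h <- 1.5 * IQR(x)               # 1.5 times the interquartile range
lower <- q[[1]] - h             # lower fence
upper <- q[[2]] + h             # upper fence

x[x < lower | x > upper]        # values flagged by the 1.5 * IQR rule
```

With this data only 50 is flagged. Capping at the 5th and 95th percentiles would instead also move the smallest value (2), since that approach always adjusts both tails.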
capOutlier <- function(x){
  qnt <- quantile(x, probs = c(.25, .75), na.rm = TRUE)   # quartiles for the outlier test
  caps <- quantile(x, probs = c(.05, .95), na.rm = TRUE)  # replacement values
  H <- 1.5 * IQR(x, na.rm = TRUE)
  x[x < (qnt[1] - H)] <- caps[1]
  x[x > (qnt[2] + H)] <- caps[2]
  return(x)
}
df$colName <- capOutlier(df$colName)
Repeat the above line for each column in your data frame.
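If there are many columns, one possible shortcut is to apply the function to every numeric column in one go. This is only a sketch: the data frame and its column names are made up, and capOutlier is repeated so the block runs on its own.

```r
# Same function as in this answer, repeated so the sketch is self-contained.
capOutlier <- function(x) {
  qnt  <- quantile(x, probs = c(.25, .75), na.rm = TRUE)
  caps <- quantile(x, probs = c(.05, .95), na.rm = TRUE)
  H <- 1.5 * IQR(x, na.rm = TRUE)
  x[x < (qnt[1] - H)] <- caps[1]
  x[x > (qnt[2] + H)] <- caps[2]
  x
}

# Toy data frame (made up): two numeric columns and one character column.
df <- data.frame(
  a  = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 50),
  b  = c(-40, 1, 2, 3, 4, 5, 6, 7, 8, 9),
  id = letters[1:10]
)

# Cap only the numeric columns; everything else is left untouched.
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols], capOutlier)
```

Selecting columns with is.numeric is a design choice for this example; you may prefer to list the columns to cap explicitly.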
Bobbo
Updated on August 05, 2022
Comments
-
Bobbo almost 2 years
I'd like to replace all values in my relatively large R dataset which take values above the 95th and below the 5th percentile, with those percentile values respectively. My aim is to avoid simply cropping these outliers from the data entirely.
Any advice would be much appreciated; I can't find any information on how to do this anywhere else.
-
Bobbo over 11 years
Thank you, works like a dream. I'm new to this website; is there any way I can give you rep or something for this answer?
-
Bobbo over 11 years
Thank you for your answer. Both yours and the one above work perfectly.
-
Romain Francois over 11 years
You can upvote the answer(s) and accept one (you accepted it already). See stackoverflow.com/faq, which will also give you a badge if you read all of it.
-
Bolaka over 9 years
The above snippet will also replace NAs (if any) with the quantile values!
-
ctbrown over 5 years
That is a rigid definition of an outlier. Whether you set the outlier threshold at below 20% / above 80%+ (as you have defined) or below 5% / above 95%+ (as the OP did) is arbitrary; what works will depend on your problem and data.
-
Kyle Peters over 5 years
I didn't define it as below 20% or above 80%. I used a common definition of an outlier that will probably be used in an introduction to statistics class. Anything less than the first quartile - 1.5 * the interquartile range or above the third quartile + 1.5 * the interquartile range is considered an outlier. The interquartile range (IQR) is the range between the first quartile and the third quartile (the middle 50% of the data).
-
ctbrown over 5 years
That is not a "common" definition of what an outlier is. It is an arbitrary one.
-
Kyle Peters over 5 years
If you take a 101 statistics class in college, they will give you this definition of what an outlier is. Check the website in my answer. There are other definitions of what an outlier is, but this is the most basic and most used one. And, the definition I posted is more accurate than the one given in the question. If you had the data (.99998,1,1,1,1,1,1,1,1.0001), then .99998 and 1.0001 would be classified wrongly as outliers if you used the outlier classification method described in the question.
-
Ben over 4 years
Nice. Or you could roll squish into your own function.
cap <- function(x, low, high) pmin(high, pmax(low, x))
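Used with the percentile bounds from the question, that helper might be called as in the sketch below. It needs only base R; d is toy data, and the quoted values assume R's default quantile type.

```r
# One-liner from the comment above, repeated so the sketch runs on its own.
cap <- function(x, low, high) pmin(high, pmax(low, x))

d <- 1:20                                      # toy data
q <- quantile(d, c(.05, .95), names = FALSE)   # lower and upper bounds
d2 <- cap(d, q[1], q[2])
range(d2)                                      # roughly 1.95 and 19.05
```

Because pmin() and pmax() propagate missing values by default, an NA in the input stays NA in the result.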
-
Jason Goal over 2 years
Check the .clip function from pandas as well: pandas.pydata.org/docs/reference/api/…