Can scipy.stats identify and mask obvious outliers?

24,091

Solution 1

The statsmodels package has what you need. Look at this little code snippet and its output:

# Imports #
import statsmodels.api as smapi
import statsmodels.graphics as smgraphics
# Make data #
x = range(30)
y = [y*10 for y in x]
# Add outlier #
x.insert(6,15)
y.insert(6,220)
# Make graph #
regression = smapi.OLS(x, y).fit()
figure = smgraphics.regressionplots.plot_fit(regression, 0)
# Find outliers #
test = regression.outlier_test()
outliers = ((x[i],y[i]) for i,t in enumerate(test) if t[2] < 0.5)
print 'Outliers: ', list(outliers)

Example figure 1

Outliers: [(15, 220)]

Edit

With the newer version of statsmodels, things have changed a bit. Here is a new code snippet that shows the same type of outlier detection.

# Imports #
from random import random
import statsmodels.api as smapi
from statsmodels.formula.api import ols
import statsmodels.graphics as smgraphics
# Make data #
x = range(30)
y = [y*(10+random())+200 for y in x]
# Add outlier #
x.insert(6,15)
y.insert(6,220)
# Make fit #
regression = ols("data ~ x", data=dict(data=y, x=x)).fit()
# Find outliers #
test = regression.outlier_test()
outliers = ((x[i],y[i]) for i,t in enumerate(test.icol(2)) if t < 0.5)
print 'Outliers: ', list(outliers)
# Figure #
figure = smgraphics.regressionplots.plot_fit(regression, 1)
# Add line #
smgraphics.regressionplots.abline_plot(model_results=regression, ax=figure.axes[0])

Example figure 2

Outliers: [(15, 220)]

Solution 2

scipy.stats doesn't have anything directly for outliers, so as answer some links and advertising for statsmodels (which is a statistics complement for scipy.stats)

for identifying outliers

http://jpktd.blogspot.ca/2012/01/influence-and-outlier-measures-in.html

http://jpktd.blogspot.ca/2012/01/anscombe-and-diagnostic-statistics.html

http://statsmodels.sourceforge.net/devel/generated/statsmodels.stats.outliers_influence.OLSInfluence.html

instead of masking, a better approach is to use a robust estimator

http://statsmodels.sourceforge.net/devel/rlm.html

with examples, where unfortunately the plots are currently not displayed http://statsmodels.sourceforge.net/devel/examples/generated/tut_ols_rlm.html

RLM downweights outliers. The estimation results have a weights attribute, and for outliers the weights are smaller than 1. This can also be used for finding outliers. RLM is also more robust if the are several outliers.

Solution 3

More generally (i.e. programmatically) is there a way to identify and mask outliers?

Various outlier detection algorithms exist; scikit-learn implements a few of them.

[Disclaimer: I'm a scikit-learn contributor.]

Share:
24,091
a different ben
Author by

a different ben

A poor scientist struggling with data analysis and visualisation.

Updated on July 09, 2022

Comments

  • a different ben
    a different ben almost 2 years

    With scipy.stats.linregress I am performing a simple linear regression on some sets of highly correlated x,y experimental data, and initially visually inspecting each x,y scatter plot for outliers. More generally (i.e. programmatically) is there a way to identify and mask outliers?