Detecting Outliers When Doing PCA


One does not normally consider variables outliers, but data points. The idea is that data in a given variable come from a particular distribution, but occasionally there is a value that for some specific reason deviates strongly from that distribution. After detecting and removing such outliers, distributional assumptions made by analysis procedures may be better fulfilled. A variable, on the other hand, is not normally considered to come from a distribution (of variables). It therefore does not make sense to consider a variable an outlier.


What you have here is the case that 4 of your variables are strongly correlated, but the 5th is almost uncorrelated to the rest:

corrcoef(data) = 
    1.0000    0.9959    0.9955    0.9957   -0.0296
    0.9959    1.0000    0.9934    0.9951   -0.0283
    0.9955    0.9934    1.0000    0.9962   -0.0392
    0.9957    0.9951    0.9962    1.0000   -0.0593
   -0.0296   -0.0283   -0.0392   -0.0593    1.0000

If you do the PCA you find that your data can be represented with almost no loss in two principal components, accounting for more than 99% of the total variance.
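The 99% figure can be checked numerically from the eigenvalues of the covariance matrix. A minimal sketch, self-contained except for the data matrix from the question below (variable names here are illustrative):

```matlab
% Eigen-decomposition of the covariance matrix
[eigenVectors, eigenValues] = eig(cov(data));

% Sort eigenvalues (and the matching eigenvectors) in descending order
[eigenValues, order] = sort(diag(eigenValues), 'descend');
eigenVectors = eigenVectors(:, order);

% Fraction of the total variance carried by each principal component
explained = eigenValues / sum(eigenValues);

% Cumulative share: the first two entries together should exceed 0.99
cumsum(explained)
```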

What leads you to consider "social science" an outlier is your plot of the "principal components":

However, these axis labels are actually wrong. What you are plotting here are the coefficients of the first two eigenvectors of the covariance matrix:

eigenVectors(:, 1:2) = 
   -0.5091    0.0241
   -0.5013    0.0250
   -0.4885    0.0144
   -0.5000   -0.0038
    0.0300    0.9993

What these numbers and the resulting plot tell you is that the first 4 variables are mainly related to the first principal component, and in almost exactly the same way (coefficients approximately [-0.5 0]), while the 5th variable is almost identical to the second principal component (coefficients approximately [0 1]). This is why "social science" has its separate spot in your plot – but this doesn't mean that there is an "outlier".

Reading these coefficients columnwise (one eigenvector at a time) tells you that the first principal component can be obtained as negative twice the average of variables 1 through 4 ([-0.5 -0.5 -0.5 -0.5 0]), while the second principal component can be obtained by simply taking the 5th variable ([0 0 0 0 1]). These numbers are also called "loadings" of the original variables for the given principal component. The same numbers tell you how the respective principal component contributes to the original variables if these are to be reconstructed from the PCs. In this interpretation, the eigenvectors may be called "principal modes" in correspondence to "principal components".
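This two-way relation can be stated as code. A sketch, assuming `eigenVectors` holds all five sorted eigenvectors as computed in the code below: projecting the centered data onto the eigenvectors gives the PCs, and because the eigenvector matrix is orthogonal, multiplying by its transpose reconstructs the centered variables.

```matlab
% From variables to principal components (scores) ...
centered = data - mean(data);        % remove column means (implicit expansion, R2016b+)
PCs = centered * eigenVectors;

% ... and back: eigenVectors is orthogonal, so its transpose is its inverse
reconstructed = PCs * eigenVectors';

% The reconstruction matches the centered data up to rounding error
max(abs(reconstructed(:) - centered(:)))
```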

The principal components or principal variates are the original set of variables transformed into a new set of variables using the eigenvector coefficients (or "loadings"):

PCs = (data - mean(data)) * eigenVectors;   % center the data before projecting

Like your original variables, the principal components are functions of the student index:

subplot(2, 1, 1)
plot(PCs(:, 1), '.-')
ylabel('Principal Component 1')
subplot(2, 1, 2)
plot(PCs(:, 2), '.-')
ylabel('Principal Component 2')

In contrast to the original variables, the principal components are mutually uncorrelated, and, if the eigenvector matrix was sorted by descending eigenvalue, the resulting PCs are sorted in decreasing order of variance. The value of a principal component for a given data point is also called the component "score".
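Both properties can be checked numerically; a sketch, assuming `PCs` holds the full score matrix computed above:

```matlab
% Mutually uncorrelated: off-diagonal correlations are near zero
corrcoef(PCs)

% Decreasing variance: the score variances equal the sorted eigenvalues
var(PCs)
```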

Again, the first PC is practically identical to the common variation in variables 1 to 4, while the second PC is practically identical to variable 5, which also becomes apparent if you simply plot the original data:

plot(data, '.-')
legend(colNames)

user3221699

Updated on June 04, 2022

Comments

  • user3221699 almost 2 years

    I am new to data analysis and trying to better understand how to identify outliers when doing PCA. I have created a data matrix with 5 columns representing my variables Math, English, History, Physics, and Social Science; each row holds the final grades one student received in these classes. The fifth column of my data matrix is an outlier when I plot the scores for the first and second principal components. I would like a way to detect outliers mathematically, without having to plot the scores; any suggestions or ideas are greatly appreciated. Thanks in advance for your help. I have posted my code below.

    %Column names
    colNames = {'Math','English','History','Physics','Social Science'};
    
    %data matrix
    data = [75.23,74.65,77,73.04,72.11;
        88.50,89.43,86.23,88.50,50.97;
        66.12,65.12,67.45,66.02,66.54;
        89.23,90.43,88.21,88.23,71.21;
        72.35,72.43,73.56,74.32,63.51;
        50.23,52.34,51.78,52.32,59.85;
        58.79,58.79,58.79,58.79,91.08;
        86.08,86.08,86.08,86.08,71.49;
        73.67,73.67,73.67,73.67,94.38;
        56.34,57.63,58.23,58.32,69.55;
        67.05,69.42,66.32,65.32,88.45;
        81.23,80.36,80.32,79.89,69.83;
        59.68,59.58,60.32,59.02,90.42;
        87.34,86.92,85.23,86.01,87.75;
        63.21,62.14,62.03,62.32,68.86;
        95.87,94.54,95.65,96.12,60.80;
        64.34,63.45,63.45,63.45,89.52;
        89.32,87.54,88.27,88.01,97.46;
        59.65,58.23,60.32,59.43,66.37;
        63.98,64.37,65.01,64.01,83.56;
        56.34,55.35,53.98,54.25,71.93;
        79.98,78.81,78.01,77.99,91.67;
        84.16,85.021,83.99,84.87,88.44;
        78.38,77.32,76.98,77.56,58.36;
        71.28,72.98,71.99,71.56,93.09;];
    
    %Computing PCA
    covarianceMat=cov(data);
    [eigenVectors,eigenValues]=eigs(covarianceMat,5);
    
    %Sorting Eigen values (and reordering the eigenvectors to match)
    [eigenValues, I] = sort(diag(eigenValues),'descend');
    eigenVectors = eigenVectors(:,I);
    
    %Computing Variance Percentage of each Eigen value
    variancePercentage = (eigenValues ./ sum(eigenValues)) .*100;
    
    figure(2)
    plot(eigenVectors(:,1),eigenVectors(:,2),'*');
    xlabel('Principal Component 1');ylabel('Principal Component 2')
    for Loop = 1:length(colNames)
        text(eigenVectors(Loop,1),eigenVectors(Loop,2),colNames{Loop},'Color','r')
    end