Principal Component Analysis in R

Solution 1

Here is a reproducible example:

set.seed(1)
want <- sample(50, 40)
Iris <- iris[c(51:100, 101:150), ] ## only keep versicolor and virginica
## take our training and test sets
train <- droplevels(Iris[c((1:50)[want], (51:100)[want]), , drop = FALSE])
test <- droplevels(Iris[c((1:50)[-want], (51:100)[-want]), , drop = FALSE])

## fit the PCA
pc <- prcomp(train[, 1:4])

Now note that pc$x contains the rotated data (the PC scores). In the question you used X %*% pc$rotation (where X is the training data matrix) without centring the data first. Because prcomp() centres by default, the two differ only by a constant shift in each column, which the regression intercept absorbs, so the fits are equivalent; centred predictors are still generally preferable in the regression.
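
To see that concretely, a minimal check (not in the original answer) that manually centring and then rotating the training data reproduces pc$x; pc$center holds the column means that prcomp() subtracted:

X <- as.matrix(train[, 1:4])
scores <- sweep(X, 2, pc$center) %*% pc$rotation  ## centre, then rotate
all.equal(scores, pc$x, check.attributes = FALSE) ## TRUE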

## create data frame for logistic regression
mydata <- data.frame(Species = train[, "Species"], pc$x)
## ...and fit the model
mod <- glm(Species ~ PC1, data = mydata, family = binomial)

Predict the scores on PC1 for the test set data; that is, rotate the test set using the same rotation used to form the PCs of the training data. For that we can use the predict() method for class "prcomp":

test.p <- predict(pc, newdata = test[, 1:4])

Now use that to predict the class:

pred <- predict(mod, newdata = data.frame(test.p), type = "response")

> pred
         56          66          67          71          72 
0.080427399 0.393133104 0.092661480 0.395813527 0.048277608 
         74          76          82          87          95 
0.226191156 0.333553423 0.003860679 0.617977807 0.029469167 
        106         116         117         121         122 
0.999648054 0.922145431 0.924464339 0.989271655 0.318477762 
        124         126         132         137         145 
0.581235903 0.995224501 0.999770995 0.964825109 0.988121496 
> 1 - pred
          56           66           67           71           72 
0.9195726006 0.6068668957 0.9073385196 0.6041864731 0.9517223918 
          74           76           82           87           95 
0.7738088439 0.6664465767 0.9961393215 0.3820221934 0.9705308332 
         106          116          117          121          122 
0.0003519463 0.0778545688 0.0755356606 0.0107283449 0.6815222382 
         124          126          132          137          145 
0.4187640970 0.0047754987 0.0002290047 0.0351748912 0.0118785036

pred contains the predicted probability that each test observation is Iris virginica. Note that when the response passed to glm() is a factor (as in this example), the first level of that factor (here versicolor) is taken as failure, or 0, and the second and subsequent levels indicate success, or 1. As there are only two classes here, the model is parameterised in terms of virginica; 1 - pred therefore gives the predicted probability of versicolor.
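
You can confirm which level is the baseline by inspecting the factor levels; the first listed level is treated as failure:

> levels(mydata$Species)
[1] "versicolor" "virginica"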

I don't follow the error computation you included in the question, so I will leave that for you to work out. A cross-classification table of the model's performance can, however, be generated via:

> predSpecies <- factor(ifelse(pred >= 0.5, "virginica", "versicolor"))
> table(test$Species, predSpecies)
            predSpecies
             versicolor virginica
  versicolor          9         1
  virginica           1         9

indicating our model got two test set observations wrong.
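
If a single error number is wanted, the test-set misclassification rate follows directly from that table (a small addition beyond the original answer):

mean(predSpecies != test$Species)  ## 0.1, i.e. 2 of 20 test observations misclassified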

Solution 2

You need to split your data into train and test sets as the very first step: otherwise the test-set PC scores are far from independent of the training data, because information from the test rows leaks into the PCA rotation.

I.e. the PCA rotation is calculated from x[train, ] only!

The same rotation is then applied to x[test, ].

For everything else, as @Joran says, reproducible code is needed.
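
A rough sketch of that workflow for the question's setup; this is hypothetical, since it assumes auto exists with a 0/1 response in column 1 as in the question:

train <- sample(nrow(auto), 60000)
pc <- prcomp(auto[train, -1])       ## rotation computed from training rows only
train.df <- data.frame(y = auto[train, 1], pc$x[, 1:9])
fit <- glm(y ~ ., data = train.df, family = binomial)

## apply the same rotation to the held-out rows, then predict
test.p <- predict(pc, newdata = auto[-train, -1])[, 1:9]
p <- predict(fit, newdata = data.frame(test.p), type = "response")
mean((p > 0.5) != auto[-train, 1])  ## out-of-sample misclassification rate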


Comments

  • John Smith, almost 2 years ago:

    Using the prcomp function, how can I derive unsupervised principal components from a dataset and then use them on the same dataset split into train and test sets?

    train <- sample(1:nrow(auto), 60000)
    x <- as.matrix(auto[,-1])  ##Covariates
    y <- auto[,1]                   ##Response
    pc <- prcomp(x)             ##Find Principal Components
    
    data <- data.frame(y=y, (x %*% pc$rotation[,1:9]))
    fit <- glm(y ~ ., data=data[train,], family="binomial")   ##Train It
    
    prediction <- predict(fit, newdata=data) > 0  ##Prediction on Entire Data Set
    
    error <- mean(y[-train] != prediction[-train])  ##Mean out-of-sample error