Principal Component Analysis (PCA) is a technique for reducing the dimensions of a data matrix $X$. The first principal component ($Z_1$) is the linear combination of the predictors that captures the most variance in the data; the second principal component ($Z_2$) is the linear combination that captures the next most variance in the data, subject to the constraint that $Z_2$ is orthogonal to (uncorrelated with) $Z_1$; and so on.

The plot below shows an example of 2-dimensional data, with the lines demonstrating the directions of the principal components.

Image from Wikipedia
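
To make the figure concrete, here's a minimal sketch (simulated data, my own variable names) that recovers those two directions by eigendecomposing the covariance matrix, which is one standard way to compute principal components:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulate correlated 2-dimensional data
X = rng.multivariate_normal(mean=[0, 0], cov=[[3, 2], [2, 2]], size=500)

# Principal component directions are eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))

# eigh returns eigenvalues in ascending order; reverse so PC1 comes first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvecs[:, 0])                  # direction of PC1 (the long line in the plot)
print(eigvecs[:, 1])                  # direction of PC2
print(eigvecs[:, 0] @ eigvecs[:, 1])  # ~0: the two directions are orthogonal
```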

You can have as many principal components as there are predictors in a dataset, but one of the goals of PCA is to reduce the dimensions of the data, so we typically want fewer components than there are predictors.

Imagine we have 2-dimensional data. $Z_1$ would be the linear combination that, if we projected the 2-dimensional data onto a line, would maximize the variance of $Z_1$. We can describe this projection via weights (or loadings). For instance, to get $Z_1$, we could do:

$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2$$

where the $\phi$ values are the loadings.
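
As a sketch of this projection in code (scikit-learn's PCA on simulated 2-D data; variable names are my own), the rows of `components_` are exactly these loading vectors:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3, 2], [2, 2]], size=500)

pca = PCA(n_components=2).fit(X)
phi = pca.components_[0]   # (phi_11, phi_21): the loadings for Z_1
Z = pca.transform(X)       # PC scores for every observation

# Z_1 is just the centered data projected onto the loading vector
Z1_by_hand = (X - X.mean(axis=0)) @ phi
print(np.allclose(Z1_by_hand, Z[:, 0]))  # True
```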

We tend to use PCA when we have many highly-correlated predictors $X_1, X_2, \ldots, X_p$, and so reducing the number of predictors in the model to $M$ components, where $M < p$, can reduce the variance of the model.

One thing to keep in mind is that PCA is a dimension reduction technique, not a feature selection technique. This is because, when we construct the principal components, every feature in our dataset, $X_1, \ldots, X_p$, contributes to these components.

When estimating principal components, it’s helpful for all predictors in $X$ to be standardized, since the components are sensitive to the scale of the predictors.
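
A quick sketch of why (simulated data; `StandardScaler` is one way to standardize in scikit-learn): without standardization, a predictor measured on a large scale dominates the loadings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
x1 = rng.normal(0, 1, 500)
x2 = 100 * (x1 + rng.normal(0, 1, 500))  # correlated with x1, ~100x the scale
X = np.column_stack([x1, x2])

print(PCA(n_components=1).fit(X).components_)
# roughly [0, 1]: PC1 is essentially just the large-scale predictor x2

X_std = StandardScaler().fit_transform(X)
print(PCA(n_components=1).fit(X_std).components_)
# roughly [0.71, 0.71]: both predictors contribute once standardized
```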

Principal Components Regression

We could stop at just reducing the dimensions of $X$ and use PCA as an exploratory technique, but we often proceed to plug these PCs into some sort of model, such as a linear regression model. For instance, after we estimate the $Z_1, \ldots, Z_M$ values (i.e. the PC scores) as shown above, we can plug them into a linear regression, e.g.:

$$y = \theta_0 + \theta_1 Z_1 + \theta_2 Z_2 + \cdots + \theta_M Z_M + \epsilon$$
The logic behind principal components regression is that, often, only a few principal components can explain most of the variability in the data, and so we can get better results by fitting a model with fewer predictors.
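
Here's a minimal PCR sketch in scikit-learn, with a made-up response and an arbitrary choice of $M = 2$; in practice we'd choose $M$ by cross-validation:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n, p, M = 200, 10, 2
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)  # made-up response

# Standardize, extract the first M PCs, then regress y on Z_1 ... Z_M
pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())
pcr.fit(X, y)

lm = pcr.named_steps["linearregression"]
print(lm.intercept_, lm.coef_)  # theta_0 and theta_1 ... theta_M
```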

Not surprisingly, PCR performs best when the first few principal components account for most of the variance in the predictors. If we need lots of PCs to capture a sufficient share of the variance in $X$, then shrinkage methods like ridge regression or the LASSO will tend to work better.
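
One quick diagnostic for this (a sketch; the matrix here is a stand-in for a real predictor matrix) is the cumulative explained variance ratio. If it climbs quickly, a few PCs suffice and PCR is attractive; if it climbs slowly, as it will for these independent simulated predictors, ridge or the LASSO is likely the better bet.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))  # stand-in for a real predictor matrix

pca = PCA().fit(StandardScaler().fit_transform(X))  # keep all 10 components
print(np.cumsum(pca.explained_variance_ratio_))
# running share of the variance in X captured by the first 1, 2, ..., p PCs
```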