This lab on Principal Components Analysis in R is an adaptation of p. 401-404, 408-410 of “Introduction to Statistical Learning with Applications in R” by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Adapted by R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016), and re-implemented in Fall 2016 in tidyverse format by Amelia McNamara and R. Jordan Crouser at Smith College.

10.4 Lab 1: Principal Components Analysis

In this lab, we perform PCA on the USArrests data set, which ships with base R (in the datasets package). The rows of the data set contain the 50 states, in alphabetical order:

library(dplyr)
library(ggplot2)
states = row.names(USArrests)
states

The columns of the data set contain four variables relating to various crimes:

names(USArrests)

Let’s start by taking a quick look at the column means of the data. We can use the apply() function to apply a function - in this case, the mean() function - to each row or column of the data set. The second input denotes whether we wish to apply the function over the rows (1) or the columns (2):

apply(USArrests, 2, mean)
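
Since this lab otherwise works in the tidyverse, here’s an equivalent dplyr sketch (assuming dplyr 1.0 or later, which provides across()):

# Column means, tidyverse-style
USArrests %>%
  summarise(across(everything(), mean))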

We see right away that the data have vastly different means. We can also examine the variances of the four variables using the apply() function:

apply(USArrests, 2, var)

Not surprisingly, the variables also have vastly different variances: the UrbanPop variable measures the percentage of the population in each state living in an urban area, which is not comparable to the number of crimes committed in each state per 100,000 individuals. If we failed to scale the variables before performing PCA, then most of the principal components we observed would be driven by the Assault variable, since it has by far the largest mean and variance.
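
To see this concretely, we can run prcomp() without scaling and inspect the loadings; the first loading vector places almost all of its weight on Assault:

# Without scaling, the first component is dominated by Assault
prcomp(USArrests)$rotation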

Thus, it is important to standardize the variables to have mean zero and standard deviation 1 before performing PCA. In this lab, we’ll perform principal components analysis using the prcomp() function, which is one of several functions in R that perform PCA. By default, the prcomp() function centers the variables to have mean zero. By using the option scale = TRUE, we scale the variables to have standard deviation 1:

pr_out = prcomp(USArrests, scale = TRUE)

The output from prcomp() contains a number of useful quantities:

names(pr_out)

The center and scale components correspond to the means and standard deviations of the variables that were used for scaling prior to implementing PCA:

pr_out$center
pr_out$scale

The rotation matrix provides the principal component loadings; each column of pr_out$rotation contains the corresponding principal component loading vector:

pr_out$rotation

We see that there are four distinct principal components. This is to be expected because there are in general \(\min(n - 1, p)\) informative principal components in a data set with \(n\) observations and \(p\) variables.
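
As a quick sanity check, the number of components returned by prcomp() matches this count:

# n = 50 observations and p = 4 variables, so min(n - 1, p) = 4
length(pr_out$sdev)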

Using the prcomp() function, we do not need to explicitly multiply the data by the principal component loading vectors in order to obtain the principal component score vectors. Rather, the 50 × 4 matrix pr_out$x has as its columns the principal component score vectors. That is, the \(k^{th}\) column is the \(k^{th}\) principal component score vector. We’ll take a look at the first few states:

head(pr_out$x)
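
If we want to verify this, a minimal check: centering and scaling the data ourselves and multiplying by the rotation matrix reproduces pr_out$x (up to floating-point error):

# Reconstruct the score vectors by hand
scores_by_hand = scale(USArrests) %*% pr_out$rotation
all.equal(scores_by_hand, pr_out$x)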

We can plot the first two principal components using the biplot() function:

biplot(pr_out, scale = 0)

The scale = 0 argument to biplot() ensures that the arrows are scaled to represent the loadings; other values for scale give slightly different biplots with different interpretations.
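
One caveat: principal components are unique only up to a sign flip, so the mirror image of this biplot is an equally valid representation of the same data. A small sketch that negates the loadings and scores to produce it:

# Flipping the signs of both the loadings and the scores mirrors the biplot
pr_flipped = pr_out
pr_flipped$rotation = -pr_flipped$rotation
pr_flipped$x = -pr_flipped$x
biplot(pr_flipped, scale = 0)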

The prcomp() function also outputs the standard deviation of each principal component. We can access these standard deviations as follows:

pr_out$sdev

The variance explained by each principal component is obtained by squaring these:

pr_var = data.frame(varExp = pr_out$sdev^2)
pr_var

To compute the proportion of variance explained by each principal component, we simply divide the variance explained by each principal component by the total variance explained by all four principal components:

pr_var = pr_var %>%
  mutate(pve = varExp / sum(varExp))
pr_var

We see that the first principal component explains 62.0% of the variance in the data, the next principal component explains 24.7% of the variance, and so forth. We can plot the PVE of each component as follows:

ggplot(pr_var, aes(as.numeric(row.names(pr_var)), pve)) +
  geom_point() +
  xlab("Principal Component") +
  ylab("Proportion of Variance Explained")

We can also use the function cumsum(), which computes the cumulative sum of the elements of a numeric vector. For example:
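
cumsum(c(2, 4, 6, 8))

Applying cumsum() to the pve column gives the cumulative PVE, which we can then plot: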

ggplot(pr_var, aes(as.numeric(row.names(pr_var)), cumsum(pve))) +
  geom_point() +
  xlab("Principal Component") +
  ylab("Cumulative Proportion of Variance Explained")

10.6: NCI60 Data Example

Let’s return to the NCI60 cancer cell line microarray data, which consists of 6,830 gene expression measurements on 64 cancer cell lines:

library(ISLR)
nci_labs = NCI60$labs
nci_data = NCI60$data
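
Each row corresponds to a cell line and each column to a gene expression measurement; a quick dimension check confirms the 64 × 6,830 layout:

dim(nci_data)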

10.6.1 PCA on the NCI60 Data

We first perform PCA on the data after scaling the variables (genes) to have standard deviation one, although one could reasonably argue that it is better not to scale the genes:

pr_out_NCI = prcomp(nci_data, scale = TRUE)

We now plot the first few principal component score vectors, in order to visualize the data. The observations (cell lines) corresponding to a given cancer type will be plotted in the same color, so that we can see to what extent the observations within a cancer type are similar to each other.

components_NCI = data.frame(pr_out_NCI$x)

# This library allows us to view ggplot figures side by side
library(gridExtra)

# First and second Principal Components
plot1 = ggplot(components_NCI, aes(PC1,PC2)) +
  geom_point(aes(colour = factor(nci_labs)))

# First and third Principal Components
plot2 = ggplot(components_NCI, aes(PC1,PC3)) +
  geom_point(aes(colour = factor(nci_labs)))

grid.arrange(plot1, plot2, ncol=2)

On the whole, cell lines corresponding to a single cancer type do tend to have similar values on the first few principal component score vectors. This indicates that cell lines from the same cancer type tend to have pretty similar gene expression levels.

We can obtain a summary of the proportion of variance explained (PVE) of the first few principal components using the summary() method for a prcomp object:

summary(pr_out_NCI)
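
Base R also provides a shortcut here: the plot() method for prcomp objects produces a quick plot of the variance of each component:

# The default plot method for prcomp objects shows each component's variance
plot(pr_out_NCI)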

We can again plot the variance explained by the first few principal components:

pr_var_NCI = data.frame(varExp = pr_out_NCI$sdev^2)
pr_var_NCI = pr_var_NCI %>%
  mutate(pve = varExp / sum(varExp))

ggplot(pr_var_NCI, aes(as.numeric(row.names(pr_var_NCI)), varExp)) +
  geom_point() +
  xlab("Principal Component") +
  ylab("Variance Explained")

However, as we saw before, it is generally more informative to plot the PVE of each principal component (i.e. a scree plot) and the cumulative PVE of each principal component. This can be done with just a little tweaking:

plot1 = ggplot(pr_var_NCI, aes(as.numeric(row.names(pr_var_NCI)), pve)) +
  geom_point() +
  xlab("Principal Component") +
  ylab("Proportion of Variance Explained")

plot2 = ggplot(pr_var_NCI, aes(as.numeric(row.names(pr_var_NCI)), cumsum(pve))) +
  geom_point() +
  xlab("Principal Component") +
  ylab("Cumulative Proportion of Variance Explained")

grid.arrange(plot1, plot2, ncol=2)

We see that together, the first seven principal components explain around 40% of the variance in the data. This is not a huge amount of the variance. However, looking at the scree plot, we see that while each of the first seven principal components explains a substantial amount of variance, there is a marked decrease in the variance explained by further principal components. That is, there is an elbow in the plot after approximately the seventh principal component. This suggests that there may be little benefit to examining more than seven or so principal components (phew! even examining seven principal components may be difficult).