This lab on Subset Selection in R comes from p. 244-247 of "Introduction to Statistical Learning with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. It was re-implemented in Fall 2016 in tidyverse format by Amelia McNamara and R. Jordan Crouser at Smith College.

Want to follow along on your own machine? Download the R Markdown or Jupyter Notebook version.

6.5.1 Best Subset Selection

Here we apply the best subset selection approach to the Hitters data. We wish to predict a baseball player’s Salary on the basis of various statistics associated with performance in the previous year. Let's take a quick look:

library(ISLR)
library(dplyr)
head(Hitters)

First of all, we note that the Salary variable is missing for some of the players. The is.na() function can be used to identify the missing observations. It returns a vector of the same length as the input vector, with a TRUE value for any elements that are missing, and a FALSE value for non-missing elements. The sum() function can then be used to count all of the missing elements:

Hitters %>%
  select(Salary) %>%
  is.na() %>%
  sum()

We see that Salary is missing for 59 players. The na.omit() function removes all of the rows that have missing values in any variable:

# Print the dimensions of the original Hitters data (322 rows x 20 columns)
dim(Hitters)

# Drop any rows that contain missing values
Hitters = Hitters %>%
  na.omit()

# Print the dimensions of the modified Hitters data (263 rows x 20 columns)
dim(Hitters)

# One last check: should return 0
Hitters %>%
  is.na() %>%
  sum()

The regsubsets() function (part of the leaps library) performs best subset selection by identifying the best model that contains a given number of predictors, where best is quantified using RSS. The syntax is the same as for lm(). The summary() command outputs the best set of variables for each model size.

library(leaps)
regfit_full = regsubsets(Salary~., data = Hitters)
summary(regfit_full)

An asterisk ("*") indicates that a given variable is included in the corresponding model. For instance, this output indicates that the best two-variable model contains only Hits and CRBI. By default, regsubsets() only reports results up to the best eight-variable model, but the nvmax option can be used to return as many variables as desired. Here we fit up to a 19-variable model:

regfit_full = regsubsets(Salary~., data = Hitters, nvmax = 19)
reg_summary = summary(regfit_full)

Notice that rather than letting the results of our call to the summary() function print to the screen, we've saved the results to a variable called reg_summary. That way, we can access just the parts we need. Let's see what's in there:

names(reg_summary)

Excellent! In addition to the verbose output we get when we print the summary to the screen, the summary() function also returns $R^2$ (rsq), RSS (rss), adjusted $R^2$ (adjr2), $C_p$ (cp), and BIC (bic). We can examine these to try to select the best overall model. Let's start by looking at $R^2$:

reg_summary$rsq

We see that the $R^2$ statistic increases from 32% when only one variable is included in the model to almost 55% when all variables are included. As expected, the $R^2$ statistic increases monotonically as more variables are included.
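
If you'd like to verify this monotonicity directly (a quick sanity check of our own, not part of the original lab), you can confirm that each successive value of $R^2$ is at least as large as the one before it:

# Should return TRUE: R^2 never decreases as variables are added
all(diff(reg_summary$rsq) >= 0)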

Plotting RSS, adjusted $R^2$, $C_p$, and BIC for all of the models at once will help us decide which model to select. Note that the type = "l" option tells R to connect the plotted points with lines:

# Set up a 2x2 grid so we can look at 4 plots at once
par(mfrow = c(2,2))
plot(reg_summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
plot(reg_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")

# We will now plot a red dot to indicate the model with the largest adjusted R^2 statistic.
# The which.max() function can be used to identify the location of the maximum point of a vector
adj_r2_max = which.max(reg_summary$adjr2) # 11

# The points() command works like the plot() command, except that it puts points 
# on a plot that has already been created instead of creating a new plot
points(adj_r2_max, reg_summary$adjr2[adj_r2_max], col ="red", cex = 2, pch = 20)

# We'll do the same for C_p and BIC, this time looking for the models with the SMALLEST statistic
plot(reg_summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
cp_min = which.min(reg_summary$cp) # 10
points(cp_min, reg_summary$cp[cp_min], col = "red", cex = 2, pch = 20)

plot(reg_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
bic_min = which.min(reg_summary$bic) # 6
points(bic_min, reg_summary$bic[bic_min], col = "red", cex = 2, pch = 20)
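
One housekeeping note (our addition, not part of the original lab): par(mfrow = c(2,2)) changes the global plotting state, so if you make more plots later you may want to restore the default single-panel layout:

# Restore the default 1x1 plotting grid
par(mfrow = c(1,1))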

Recall that in the second step of our selection process, we narrowed the field down to just one model for each number of predictors $k \leq p$. We see that according to BIC, the best performer is the model with 6 variables. According to $C_p$, 10 variables. Adjusted $R^2$ suggests that 11 might be best. Again, no single measure is going to give us an entirely accurate picture... but they all agree that a model with 5 or fewer predictors is insufficient, and a model with more than 12 is overfitting.
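
To see these three choices side by side (a small convenience sketch of our own, reusing the which.max() and which.min() calls from above):

# Collect the model size favored by each criterion into one data frame
data.frame(
  criterion = c("Adjusted R^2", "Cp", "BIC"),
  best_size = c(which.max(reg_summary$adjr2),
                which.min(reg_summary$cp),
                which.min(reg_summary$bic))
)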

The regsubsets() function has a built-in plot() command which can be used to display the selected variables for the best model with a given number of predictors, ranked according to a chosen statistic. The top row of each plot contains a black square for each variable selected according to the optimal model associated with that statistic.

To find out more about this function, type ?plot.regsubsets.

plot(regfit_full, scale = "r2")

As expected, $R^2$ is maximized by the model that contains all 19 predictors.
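
Because $R^2$ increases monotonically, the maximum must occur at the full model; we can check this (our own quick check, not from the original lab) with:

# Should return 19: the full 19-variable model has the largest R^2
which.max(reg_summary$rsq)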

plot(regfit_full, scale = "adjr2")

Adjusted $R^2$ downselects to just 11 predictors. We can use the coef() function to see which predictors made the cut:

coef(regfit_full, 11)
plot(regfit_full, scale = "Cp")

$C_p$ downselects further, dropping the LeagueN predictor and bringing the number down to 10:

coef(regfit_full, 10)
plot(regfit_full, scale = "bic")

We see that several models share a BIC close to −150. However, the model with the lowest BIC is the six-variable model that contains only AtBat, Hits, Walks, CRBI, DivisionW, and PutOuts:

coef(regfit_full, 6)

6.5.2 Forward and Backward Stepwise Selection

We can also use the regsubsets() function to perform forward stepwise or backward stepwise selection, using the argument method="forward" or method="backward".

# Forward
regfit_fwd = regsubsets(Salary~., data = Hitters, nvmax = 19, method = "forward")
summary(regfit_fwd)
# Backward
regfit_bwd = regsubsets(Salary~., data = Hitters, nvmax = 19, method = "backward")
summary(regfit_bwd)

We see that using forward stepwise selection, the best one-variable model contains only CRBI, and the best two-variable model additionally includes Hits. For this data, the best one-variable through six-variable models are each identical for best subset and forward selection. However, the best seven-variable models identified by forward stepwise selection, backward stepwise selection, and best subset selection are different.

coef(regfit_full, 7)
coef(regfit_fwd, 7)
coef(regfit_bwd, 7)
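
If scanning the coefficient printouts by eye feels error-prone, one option (a sketch of our own, not from the original lab) is to compare the predictor names programmatically:

# Which predictors does best subset keep that forward stepwise does not, and vice versa?
setdiff(names(coef(regfit_full, 7)), names(coef(regfit_fwd, 7)))
setdiff(names(coef(regfit_fwd, 7)), names(coef(regfit_full, 7)))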

Getting credit

To get credit for this lab, please post to Moodle an example where you would choose to use each of the following:

  • Best subset
  • Forward selection
  • Backward selection