This lab on Subset Selection in R comes from p. 244-247 of "Introduction to Statistical Learning with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. It was re-implemented in Fall 2016 in tidyverse format by Amelia McNamara and R. Jordan Crouser at Smith College.

Want to follow along on your own machine? Download the R Markdown or Jupyter Notebook version.
Here we apply the best subset selection approach to the `Hitters` data. We wish to predict a baseball player's `Salary` on the basis of various statistics associated with performance in the previous year. Let's take a quick look:

```r
library(ISLR)
library(dplyr)
head(Hitters)
```
First of all, we note that the `Salary` variable is missing for some of the players. The `is.na()` function can be used to identify the missing observations. It returns a vector of the same length as the input vector, with a `TRUE` value for any elements that are missing, and a `FALSE` value for non-missing elements. The `sum()` function can then be used to count all of the missing elements:
```r
Hitters %>%
  select(Salary) %>%
  is.na() %>%
  sum()
```
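If the mechanics of this are unclear, the same pattern works on any vector. A minimal sketch on a made-up vector (not from the `Hitters` data):

```r
# Hypothetical salary vector with two missing values (illustration only)
x <- c(475, NA, 910, NA, 500)

is.na(x)       # FALSE  TRUE FALSE  TRUE FALSE
sum(is.na(x))  # TRUE counts as 1 and FALSE as 0, so this returns 2
```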
We see that `Salary` is missing for 59 players. The `na.omit()` function removes all of the rows that have missing values in any variable:
```r
# Print the dimensions of the original Hitters data (322 rows x 20 columns)
dim(Hitters)

# Drop any rows that contain missing values
Hitters = Hitters %>%
  na.omit()

# Print the dimensions of the modified Hitters data (263 rows x 20 columns)
dim(Hitters)

# One last check: should return 0
Hitters %>%
  is.na() %>%
  sum()
```
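Before dropping rows wholesale, it can also be worth checking which columns the missing values live in. One base-R way to do this, sketched here on a toy data frame rather than `Hitters`, is `colSums()` over `is.na()`:

```r
# Toy data frame with missing values (illustration only)
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))

colSums(is.na(df))  # per-column counts: a = 1, b = 2
sum(is.na(df))      # total count: 3
nrow(na.omit(df))   # only 1 complete row survives na.omit()
```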
The `regsubsets()` function (part of the `leaps` library) performs best subset selection by identifying the best model that contains a given number of predictors, where *best* is quantified using RSS. The syntax is the same as for `lm()`. The `summary()` command outputs the best set of variables for each model size.
```r
library(leaps)
regfit_full = regsubsets(Salary ~ ., data = Hitters)
summary(regfit_full)
```
An asterisk ("*") indicates that a given variable is included in the corresponding model. For instance, this output indicates that the best two-variable model contains only `Hits` and `CRBI`. By default, `regsubsets()` only reports results up to the best eight-variable model, but the `nvmax` option can be used to return as many variables as are desired. Here we fit up to a 19-variable model:
```r
regfit_full = regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
reg_summary = summary(regfit_full)
```
Notice that rather than letting the results of our call to the `summary()` function print to the screen, we've saved the results to a variable called `reg_summary`. That way, we can access just the parts we need. Let's see what's in there:

```r
names(reg_summary)
```
Excellent! In addition to the verbose output we get when we print the summary to the screen, the `summary()` function also returns $R^2$ (`rsq`), RSS, adjusted $R^2$, $C_p$, and BIC. We can examine these to try to select the best overall model. Let's start by looking at $R^2$:
```r
reg_summary$rsq
```
We see that the $R^2$ statistic increases from 32% when only one variable is included in the model to almost 55% when all variables are included. As expected, the $R^2$ statistic increases monotonically as more variables are included.
Plotting RSS, adjusted $R^2$, $C_p$, and BIC for all of the models at once will help us decide which model to select. Note that the `type = "l"` option tells R to connect the plotted points with lines:
```r
# Set up a 2x2 grid so we can look at 4 plots at once
par(mfrow = c(2, 2))
plot(reg_summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
plot(reg_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")

# We will now plot a red dot to indicate the model with the largest adjusted R^2 statistic.
# The which.max() function can be used to identify the location of the maximum point of a vector
adj_r2_max = which.max(reg_summary$adjr2) # 11

# The points() command works like the plot() command, except that it puts points
# on a plot that has already been created instead of creating a new plot
points(adj_r2_max, reg_summary$adjr2[adj_r2_max], col = "red", cex = 2, pch = 20)

# We'll do the same for C_p and BIC, this time looking for the models with the SMALLEST statistic
plot(reg_summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
cp_min = which.min(reg_summary$cp) # 10
points(cp_min, reg_summary$cp[cp_min], col = "red", cex = 2, pch = 20)

plot(reg_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
bic_min = which.min(reg_summary$bic) # 6
points(bic_min, reg_summary$bic[bic_min], col = "red", cex = 2, pch = 20)
```
Recall that in the second step of our selection process, we narrowed the field down to just one model for each number of predictors $k \le p$. We see that according to BIC, the best performer is the model with 6 variables. According to $C_p$, 10 variables. Adjusted $R^2$ suggests that 11 might be best. Again, no one measure is going to give us an entirely accurate picture... but they all agree that a model with 5 or fewer predictors is insufficient, and a model with more than 12 is overfitting.
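The `which.max()`/`which.min()` calls above generalize to a one-line comparison table. A minimal sketch with made-up criterion values (not the actual `Hitters` output) shows the pattern:

```r
# Hypothetical criterion values for models of size 1 through 5 (illustration only)
adjr2 <- c(0.40, 0.48, 0.52, 0.53, 0.51)  # bigger is better
cp    <- c(20.1, 9.5, 4.2, 3.8, 5.0)      # smaller is better
bic   <- c(-30, -45, -52, -50, -47)       # smaller is better

# One row per criterion, with the model size that criterion prefers
data.frame(criterion = c("Adjusted R^2", "Cp", "BIC"),
           best_size = c(which.max(adjr2), which.min(cp), which.min(bic)))
```

With the real fit, you would substitute `reg_summary$adjr2`, `reg_summary$cp`, and `reg_summary$bic` for the toy vectors.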
The `regsubsets()` function has a built-in `plot()` command which can be used to display the selected variables for the best model with a given number of predictors, ranked according to a chosen statistic. The top row of each plot contains a black square for each variable selected according to the optimal model associated with that statistic. To find out more about this function, type `?plot.regsubsets`.

```r
plot(regfit_full, scale = "r2")
```
As expected, $R^2$ is maximized by the model that contains all 19 predictors.
```r
plot(regfit_full, scale = "adjr2")
```
Adjusted $R^2$ downselects to just 11 predictors. We can use the `coef()` function to see which predictors made the cut:

```r
coef(regfit_full, 11)
```
```r
plot(regfit_full, scale = "Cp")
```
$C_p$ downselects further, dropping the `LeagueN` predictor and bringing the number down to 10:

```r
coef(regfit_full, 10)
```
```r
plot(regfit_full, scale = "bic")
```
We see that several models share a BIC close to −150. However, the model with the lowest BIC is the six-variable model that contains only `AtBat`, `Hits`, `Walks`, `CRBI`, `DivisionW`, and `PutOuts`:

```r
coef(regfit_full, 6)
```
We can also use the `regsubsets()` function to perform forward stepwise or backward stepwise selection, using the argument `method = "forward"` or `method = "backward"`.
```r
# Forward
regfit_fwd = regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
summary(regfit_fwd)

# Backward
regfit_bwd = regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
summary(regfit_bwd)
```
We see that using forward stepwise selection, the best one-variable model contains only `CRBI`, and the best two-variable model additionally includes `Hits`. For this data, the best one-variable through six-variable models are each identical for best subset and forward selection. However, the best seven-variable models identified by forward stepwise selection, backward stepwise selection, and best subset selection are different:
```r
coef(regfit_full, 7)
coef(regfit_fwd, 7)
coef(regfit_bwd, 7)
```
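To make the disagreement easy to read off, you can compare the names of the fitted coefficient vectors with `setdiff()`. A sketch on hypothetical coefficient vectors (the values and the variable split are made up for illustration; substitute the `coef()` results above):

```r
# Hypothetical 7-variable coefficient vectors (illustration only)
full <- c("(Intercept)" = 79.4, Hits = 1.3, Walks = 3.1, CAtBat = -0.4,
          CHits = 1.6, CHmRun = 1.4, DivisionW = -129.0, PutOuts = 0.24)
fwd  <- c("(Intercept)" = 110.0, AtBat = -1.9, Hits = 7.6, Walks = 4.7,
          CRBI = 0.85, CWalks = -0.3, DivisionW = -127.0, PutOuts = 0.25)

setdiff(names(full), names(fwd))  # in the first model but not the second
setdiff(names(fwd), names(full))  # in the second model but not the first
```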