This lab on Subset Selection is a Python adaptation of p. 244-247 of "Introduction to Statistical Learning with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Adapted by R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016).

Want to follow along on your own machine? Download the .py or Jupyter Notebook version.

```
%matplotlib inline
import pandas as pd
import numpy as np
import itertools
import time
import statsmodels.api as sm
import matplotlib.pyplot as plt
```

Here we apply the best subset selection approach to the Hitters data. We wish to predict a baseball playerâ€™s Salary on the basis of various statistics associated with performance in the previous year. Let's take a quick look:

```
hitters_df = pd.read_csv('Hitters.csv')
hitters_df.head()
```

`Salary`

variable is missing for some of the
players. The `isnull()`

function can be used to identify the missing observations. It returns a vector of the same length as the input vector, with a `TRUE`

value
for any elements that are missing, and a `FALSE`

value for non-missing elements.
The `sum()`

function can then be used to count all of the missing elements:

```
print("Number of null values:", hitters_df["Salary"].isnull().sum())
```

`Salary`

is missing for 59 players. The `dropna()`

function
removes all of the rows that have missing values in any variable:

```
# Print the dimensions of the original Hitters data (322 rows x 20 columns)
print("Dimensions of original data:", hitters_df.shape)
# Drop any rows the contain missing values, along with the player names
hitters_df_clean = hitters_df.dropna().drop('Player', axis=1)
# Print the dimensions of the modified Hitters data (263 rows x 20 columns)
print("Dimensions of modified data:", hitters_df_clean.shape)
# One last check: should return 0
print("Number of null values:", hitters_df_clean["Salary"].isnull().sum())
```

`pandas`

to generate dummy variables for them, separate out the response variable, and stick everything back together again:

```
dummies = pd.get_dummies(hitters_df_clean[['League', 'Division', 'NewLeague']])
y = hitters_df_clean.Salary
# Drop the column with the independent variable (Salary), and columns for which we created dummy variables
X_ = hitters_df_clean.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')
# Define the feature set X.
X = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)
```

**best** is quantified using RSS. We'll define a helper function to outputs the best set of variables for
each model size:

```
def processSubset(feature_set):
# Fit model on feature_set and calculate RSS
model = sm.OLS(y,X[list(feature_set)])
regr = model.fit()
RSS = ((regr.predict(X[list(feature_set)]) - y) ** 2).sum()
return {"model":regr, "RSS":RSS}
```

```
def getBest(k):
tic = time.time()
results = []
for combo in itertools.combinations(X.columns, k):
results.append(processSubset(combo))
# Wrap everything up in a nice dataframe
models = pd.DataFrame(results)
# Choose the model with the highest RSS
best_model = models.loc[models['RSS'].argmin()]
toc = time.time()
print("Processed", models.shape[0], "models on", k, "predictors in", (toc-tic), "seconds.")
# Return the best model, along with some other useful information about the model
return best_model
```

`DataFrame`

containing the best model that we generated, along with some extra information about the model. Now we want to call that function for each number of predictors $k$:

```
# Could take quite awhile to complete...
models_best = pd.DataFrame(columns=["RSS", "model"])
tic = time.time()
for i in range(1,8):
models_best.loc[i] = getBest(i)
toc = time.time()
print("Total elapsed time:", (toc-tic), "seconds.")
```

Now we have one big `DataFrame`

that contains the best models we've generated along with their RSS:

```
models_best
```

`summary()`

function:

```
print(models_best.loc[2, "model"].summary())
```

`Hits`

and `CRBI`

. To save time, we only generated results
up to the best 7-variable model. You can use the functions we defined above to explore as many variables as are desired.

```
# Show the best 19-variable model (there's actually only one)
print(getBest(19)["model"].summary())
```

`summary()`

function print to the screen, we can access just the parts we need using the model's attributes. For example, if we want the $R^2$ value:

```
models_best.loc[2, "model"].rsquared
```

`OLM`

also produced many other useful statistics such as adjusted $R^2$, AIC, and BIC. We can examine these to try to select the best overall model. Let's start by looking at $R^2$ across all our models:

```
# Gets the second element from each row ('model') and pulls out its rsquared attribute
models_best.apply(lambda row: row[1].rsquared, axis=1)
```

As expected, the $R^2$ statistic increases monotonically as more variables are included.

Plotting RSS, adjusted $R^2$, AIC, and BIC for all of the models at once will
help us decide which model to select. Note the `type="l"`

option tells `R`

to
connect the plotted points with lines:

```
plt.figure(figsize=(20,10))
plt.rcParams.update({'font.size': 18, 'lines.markersize': 10})
# Set up a 2x2 grid so we can look at 4 plots at once
plt.subplot(2, 2, 1)
# We will now plot a red dot to indicate the model with the largest adjusted R^2 statistic.
# The argmax() function can be used to identify the location of the maximum point of a vector
plt.plot(models_best["RSS"])
plt.xlabel('# Predictors')
plt.ylabel('RSS')
# We will now plot a red dot to indicate the model with the largest adjusted R^2 statistic.
# The argmax() function can be used to identify the location of the maximum point of a vector
rsquared_adj = models_best.apply(lambda row: row[1].rsquared_adj, axis=1)
plt.subplot(2, 2, 2)
plt.plot(rsquared_adj)
plt.plot(rsquared_adj.argmax(), rsquared_adj.max(), "or")
plt.xlabel('# Predictors')
plt.ylabel('adjusted rsquared')
# We'll do the same for AIC and BIC, this time looking for the models with the SMALLEST statistic
aic = models_best.apply(lambda row: row[1].aic, axis=1)
plt.subplot(2, 2, 3)
plt.plot(aic)
plt.plot(aic.argmin(), aic.min(), "or")
plt.xlabel('# Predictors')
plt.ylabel('AIC')
bic = models_best.apply(lambda row: row[1].bic, axis=1)
plt.subplot(2, 2, 4)
plt.plot(bic)
plt.plot(bic.argmin(), bic.min(), "or")
plt.xlabel('# Predictors')
plt.ylabel('BIC')
```

We can also use a similar approach to perform forward stepwise or backward stepwise selection, using a slight modification of the functions we defined above:

```
def forward(predictors):
# Pull out predictors we still need to process
remaining_predictors = [p for p in X.columns if p not in predictors]
tic = time.time()
results = []
for p in remaining_predictors:
results.append(processSubset(predictors+[p]))
# Wrap everything up in a nice dataframe
models = pd.DataFrame(results)
# Choose the model with the highest RSS
best_model = models.loc[models['RSS'].argmin()]
toc = time.time()
print("Processed ", models.shape[0], "models on", len(predictors)+1, "predictors in", (toc-tic), "seconds.")
# Return the best model, along with some other useful information about the model
return best_model
```

Now let's see how much faster it runs!

```
models_fwd = pd.DataFrame(columns=["RSS", "model"])
tic = time.time()
predictors = []
for i in range(1,len(X.columns)+1):
models_fwd.loc[i] = forward(predictors)
predictors = models_fwd.loc[i]["model"].model.exog_names
toc = time.time()
print("Total elapsed time:", (toc-tic), "seconds.")
```

Phew! That's a lot better. Let's take a look:

```
print(models_fwd.loc[1, "model"].summary())
print(models_fwd.loc[2, "model"].summary())
```

`Hits`

, and the best two-variable model additionally
includes `CRBI`

. Let's see how the models stack up against best subset selection:

```
print(models_best.loc[6, "model"].summary())
print(models_fwd.loc[6, "model"].summary())
```

For this data, the best one-variable through six-variable models are each identical for best subset and forward selection.

Not much has to change to implement backward selection... just looping through the predictors in reverse!

```
def backward(predictors):
tic = time.time()
results = []
for combo in itertools.combinations(predictors, len(predictors)-1):
results.append(processSubset(combo))
# Wrap everything up in a nice dataframe
models = pd.DataFrame(results)
# Choose the model with the highest RSS
best_model = models.loc[models['RSS'].argmin()]
toc = time.time()
print("Processed ", models.shape[0], "models on", len(predictors)-1, "predictors in", (toc-tic), "seconds.")
# Return the best model, along with some other useful information about the model
return best_model
```

```
models_bwd = pd.DataFrame(columns=["RSS", "model"], index = range(1,len(X.columns)))
tic = time.time()
predictors = X.columns
while(len(predictors) > 1):
models_bwd.loc[len(predictors)-1] = backward(predictors)
predictors = models_bwd.loc[len(predictors)-1]["model"].model.exog_names
toc = time.time()
print("Total elapsed time:", (toc-tic), "seconds.")
```

```
print("------------")
print("Best Subset:")
print("------------")
print(models_best.loc[7, "model"].params)
```

```
print("-----------------")
print("Foward Selection:")
print("-----------------")
print(models_fwd.loc[7, "model"].params)
```

```
print("-------------------")
print("Backward Selection:")
print("-------------------")
print(models_bwd.loc[7, "model"].params)
```