require(mosaic)
The book uses the First-Year GPA data set to illustrate automated procedures for variable selection in multiple regression. Since the book uses Minitab, we will work through some examples of how to do this in R.
First, we have to load the data.
GPA = read.csv("http://www.math.smith.edu/~bbaumer/mth247/FirstYearGPA.csv")
dim(GPA)
## [1] 220 10
head(GPA)
## GPA HSGPA SATV SATM Male HU SS FirstGen White CollegeBound
## 1 3.06 3.83 680 770 1 3.0 9.0 1 1 1
## 2 4.15 4.00 740 720 0 9.0 3.0 0 1 1
## 3 3.41 3.70 640 570 0 16.0 13.0 0 0 1
## 4 3.21 3.51 740 700 0 22.0 0.0 0 1 1
## 5 3.48 3.83 610 610 0 30.5 1.5 0 1 1
## 6 2.95 3.25 600 570 0 18.0 3.0 0 1 1
Keep in mind that there is no such thing as a “best” multiple regression model. Each model has its strengths and weaknesses, and it is your job as the data analyst to construct a model that is effective for your purposes. That could mean forgoing a model with a higher \( R^2 \) in favor of one that is conceptually simpler. Or it could mean adding polynomial terms that explain only a little bit of extra variation in the response.
With this caveat in mind, there is no way to fully automate a procedure that will find the best regression model. Instead, there are several algorithms that search for a model optimizing a given criterion. We will discuss four such procedures: all-subsets regression, Backwards Elimination, Forward Selection, and stepwise regression. Furthermore, each of these techniques can optimize a different criterion. There is no universally agreed-upon best criterion, but popular choices include \( R^2 \), adjusted \( R^2 \), Mallows' \( C_p \), AIC, and BIC.
While the book focuses on Mallows' \( C_p \), we will focus on AIC.
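To make AIC concrete: for a linear model, step() relies on extractAIC(), which computes \( n \log(\mathrm{RSS}/n) + 2\,\mathrm{edf} \) up to an additive constant, where edf is the number of estimated coefficients. A minimal sketch of this, using R's built-in mtcars data rather than the GPA data:

```r
# Fit a small linear model on R's built-in mtcars data
fm <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
rss <- sum(resid(fm)^2)
edf <- length(coef(fm))  # equivalent degrees of freedom: number of coefficients

# Recompute by hand the AIC that step() uses
manual <- n * log(rss / n) + 2 * edf
c(manual = manual, extractAIC = extractAIC(fm)[2])  # the two values agree
```

Note that extractAIC() and AIC() differ by a model-independent constant for a fixed data set, so they rank models identically.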
The most straightforward method for variable selection is simply to try all of the subsets. Unfortunately, if the number of predictors is \( k \), then there are \( 2^k \) subsets to try! The number of candidate models therefore grows exponentially in the number of predictors. [Computer scientists refer to such algorithms as having exponential running time, and consider them extremely inefficient.]
The following code performs all-subsets regression.
require(leaps)
## Loading required package: leaps
# Reports the two best models for each number of predictors
best = summary(regsubsets(GPA ~ ., data = GPA, nbest = 2))
data.frame(best$rsq, best$adjr2, best$cp, best$rss, best$outmat)
## best.rsq best.adjr2 best.cp best.rss HSGPA SATV SATM Male HU SS
## 1 ( 1 ) 0.19971 0.19602 42.181 37.80 *
## 1 ( 2 ) 0.09901 0.09486 74.541 42.56 *
## 2 ( 1 ) 0.26972 0.26295 21.683 34.49 * *
## 2 ( 2 ) 0.26808 0.26130 22.210 34.57 *
## 3 ( 1 ) 0.32309 0.31364 6.531 31.97 * *
## 3 ( 2 ) 0.30792 0.29826 11.407 32.69 * * *
## 4 ( 1 ) 0.33750 0.32512 3.900 31.29 * * *
## 4 ( 2 ) 0.32964 0.31711 6.426 31.66 * *
## 5 ( 1 ) 0.34375 0.32834 3.892 31.00 * * * *
## 5 ( 2 ) 0.34103 0.32557 4.764 31.13 * * * *
## 6 ( 1 ) 0.34699 0.32851 4.849 30.84 * * * * *
## 6 ( 2 ) 0.34627 0.32777 5.080 30.88 * * * *
## 7 ( 1 ) 0.34938 0.32780 6.081 30.73 * * * * *
## 7 ( 2 ) 0.34725 0.32560 6.766 30.83 * * * * *
## 8 ( 1 ) 0.34952 0.32474 8.036 30.72 * * * * *
## 8 ( 2 ) 0.34949 0.32471 8.046 30.73 * * * * * *
## FirstGen White CollegeBound
## 1 ( 1 )
## 1 ( 2 )
## 2 ( 1 )
## 2 ( 2 ) *
## 3 ( 1 ) *
## 3 ( 2 )
## 4 ( 1 ) *
## 4 ( 2 ) * *
## 5 ( 1 ) *
## 5 ( 2 ) *
## 6 ( 1 ) *
## 6 ( 2 ) * *
## 7 ( 1 ) * *
## 7 ( 2 ) * *
## 8 ( 1 ) * * *
## 8 ( 2 ) * *
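To pull a single winner out of this table programmatically, we can ask which row optimizes a given criterion. A sketch reusing the best object from above, minimizing Mallows' \( C_p \):

```r
# Find the row of the summary table with the smallest Mallows' Cp
best.row <- which.min(best$cp)
best$cp[best.row]        # the smallest Cp in the table
best$outmat[best.row, ]  # stars mark the predictors included in that model
```

From the table above, this picks out the first five-predictor model, with \( C_p \approx 3.89 \).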
Backwards Elimination is a simple algorithm that begins by putting all of the terms into the model, and then greedily removes, at each step, the term whose deletion most improves (lowers) the criterion — here, AIC — stopping when no removal leads to an improvement.
fm.full = lm(GPA ~ ., data = GPA)
step(fm.full, direction = "backward")
## Start: AIC=-410.2
## GPA ~ HSGPA + SATV + SATM + Male + HU + SS + FirstGen + White +
## CollegeBound
##
## Df Sum of Sq RSS AIC
## - SATM 1 0.01 30.7 -412
## - CollegeBound 1 0.01 30.7 -412
## - FirstGen 1 0.10 30.8 -411
## - Male 1 0.11 30.8 -411
## - SS 1 0.26 31.0 -410
## <none> 30.7 -410
## - SATV 1 0.33 31.0 -410
## - White 1 1.15 31.9 -404
## - HU 1 2.44 33.2 -395
## - HSGPA 1 6.43 37.2 -371
##
## Step: AIC=-412.1
## GPA ~ HSGPA + SATV + Male + HU + SS + FirstGen + White + CollegeBound
##
## Df Sum of Sq RSS AIC
## - CollegeBound 1 0.01 30.7 -414
## - FirstGen 1 0.11 30.8 -413
## - Male 1 0.14 30.9 -413
## - SS 1 0.25 31.0 -412
## <none> 30.7 -412
## - SATV 1 0.45 31.2 -411
## - White 1 1.18 31.9 -406
## - HU 1 2.46 33.2 -397
## - HSGPA 1 6.58 37.3 -372
##
## Step: AIC=-414.1
## GPA ~ HSGPA + SATV + Male + HU + SS + FirstGen + White
##
## Df Sum of Sq RSS AIC
## - FirstGen 1 0.11 30.8 -415
## - Male 1 0.15 30.9 -415
## - SS 1 0.25 31.0 -414
## <none> 30.7 -414
## - SATV 1 0.47 31.2 -413
## - White 1 1.17 31.9 -408
## - HU 1 2.45 33.2 -399
## - HSGPA 1 6.76 37.5 -373
##
## Step: AIC=-415.3
## GPA ~ HSGPA + SATV + Male + HU + SS + White
##
## Df Sum of Sq RSS AIC
## - Male 1 0.15 31.0 -416
## - SS 1 0.28 31.1 -415
## <none> 30.8 -415
## - SATV 1 0.59 31.4 -413
## - White 1 1.29 32.1 -408
## - HU 1 2.82 33.7 -398
## - HSGPA 1 6.64 37.5 -375
##
## Step: AIC=-416.2
## GPA ~ HSGPA + SATV + HU + SS + White
##
## Df Sum of Sq RSS AIC
## <none> 31.0 -416
## - SS 1 0.30 31.3 -416
## - SATV 1 0.70 31.7 -413
## - White 1 1.31 32.3 -409
## - HU 1 2.80 33.8 -399
## - HSGPA 1 6.50 37.5 -377
##
## Call:
## lm(formula = GPA ~ HSGPA + SATV + HU + SS + White, data = GPA)
##
## Coefficients:
## (Intercept) HSGPA SATV HU SS
## 0.568488 0.473998 0.000748 0.016745 0.007747
## White
## 0.206041
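In practice we usually want the winning model itself rather than the log of steps. step() returns the final fitted lm object, and setting trace = 0 suppresses the printed trace. A sketch reusing fm.full from above:

```r
# Run backward elimination quietly and keep the resulting model
fm.backward <- step(fm.full, direction = "backward", trace = 0)
coef(fm.backward)                   # matches the Coefficients shown above
summary(fm.backward)$adj.r.squared  # adjusted R^2 of the selected model
```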
Forward Selection is the mirror image of Backwards Elimination. Here, we begin with an empty model and then greedily add, at each step, the term whose inclusion most improves the criterion, stopping when no addition helps.
fm.empty = lm(GPA ~ 1, data = GPA)
step(fm.empty, scope = ~. + SATV + SATM + HSGPA + Male + HU + SS + FirstGen +
White + CollegeBound, direction = "forward")
## Start: AIC=-333.9
## GPA ~ 1
##
## Df Sum of Sq RSS AIC
## + HSGPA 1 9.43 37.8 -381
## + HU 1 4.68 42.6 -355
## + SATV 1 4.37 42.9 -353
## + White 1 3.75 43.5 -350
## + SATM 1 1.78 45.4 -340
## + FirstGen 1 1.16 46.1 -337
## <none> 47.2 -334
## + CollegeBound 1 0.19 47.0 -333
## + Male 1 0.13 47.1 -333
## + SS 1 0.00 47.2 -332
##
## Step: AIC=-380.7
## GPA ~ HSGPA
##
## Df Sum of Sq RSS AIC
## + HU 1 3.31 34.5 -399
## + White 1 3.23 34.6 -398
## + SATV 1 2.19 35.6 -392
## + FirstGen 1 1.63 36.2 -388
## + SATM 1 0.77 37.0 -383
## + Male 1 0.41 37.4 -381
## <none> 37.8 -381
## + CollegeBound 1 0.03 37.8 -379
## + SS 1 0.00 37.8 -379
##
## Step: AIC=-398.8
## GPA ~ HSGPA + HU
##
## Df Sum of Sq RSS AIC
## + White 1 2.521 32.0 -413
## + SATV 1 1.804 32.7 -409
## + SATM 1 0.860 33.6 -402
## + FirstGen 1 0.800 33.7 -402
## + Male 1 0.434 34.1 -400
## + SS 1 0.379 34.1 -399
## <none> 34.5 -399
## + CollegeBound 1 0.039 34.5 -397
##
## Step: AIC=-413.4
## GPA ~ HSGPA + HU + White
##
## Df Sum of Sq RSS AIC
## + SATV 1 0.681 31.3 -416
## + FirstGen 1 0.309 31.7 -414
## <none> 32.0 -413
## + SATM 1 0.282 31.7 -413
## + Male 1 0.279 31.7 -413
## + SS 1 0.275 31.7 -413
## + CollegeBound 1 0.049 31.9 -412
##
## Step: AIC=-416.1
## GPA ~ HSGPA + HU + White + SATV
##
## Df Sum of Sq RSS AIC
## + SS 1 0.2951 31.0 -416
## <none> 31.3 -416
## + Male 1 0.1670 31.1 -415
## + FirstGen 1 0.1560 31.1 -415
## + SATM 1 0.0269 31.3 -414
## + CollegeBound 1 0.0137 31.3 -414
##
## Step: AIC=-416.2
## GPA ~ HSGPA + HU + White + SATV + SS
##
## Df Sum of Sq RSS AIC
## <none> 31.0 -416
## + Male 1 0.1534 30.8 -415
## + FirstGen 1 0.1194 30.9 -415
## + SATM 1 0.0541 30.9 -415
## + CollegeBound 1 0.0188 31.0 -414
##
## Call:
## lm(formula = GPA ~ HSGPA + HU + White + SATV + SS, data = GPA)
##
## Coefficients:
## (Intercept) HSGPA HU White SATV
## 0.568488 0.473998 0.016745 0.206041 0.000748
## SS
## 0.007747
Stepwise regression combines the ideas of Backwards Elimination and Forward Selection to move in both directions.
step(fm.empty, scope = ~. + SATV + SATM + HSGPA + Male + HU + SS + FirstGen +
White + CollegeBound, direction = "both")
## Start: AIC=-333.9
## GPA ~ 1
##
## Df Sum of Sq RSS AIC
## + HSGPA 1 9.43 37.8 -381
## + HU 1 4.68 42.6 -355
## + SATV 1 4.37 42.9 -353
## + White 1 3.75 43.5 -350
## + SATM 1 1.78 45.4 -340
## + FirstGen 1 1.16 46.1 -337
## <none> 47.2 -334
## + CollegeBound 1 0.19 47.0 -333
## + Male 1 0.13 47.1 -333
## + SS 1 0.00 47.2 -332
##
## Step: AIC=-380.7
## GPA ~ HSGPA
##
## Df Sum of Sq RSS AIC
## + HU 1 3.31 34.5 -399
## + White 1 3.23 34.6 -398
## + SATV 1 2.19 35.6 -392
## + FirstGen 1 1.63 36.2 -388
## + SATM 1 0.77 37.0 -383
## + Male 1 0.41 37.4 -381
## <none> 37.8 -381
## + CollegeBound 1 0.03 37.8 -379
## + SS 1 0.00 37.8 -379
## - HSGPA 1 9.43 47.2 -334
##
## Step: AIC=-398.8
## GPA ~ HSGPA + HU
##
## Df Sum of Sq RSS AIC
## + White 1 2.52 32.0 -413
## + SATV 1 1.80 32.7 -409
## + SATM 1 0.86 33.6 -402
## + FirstGen 1 0.80 33.7 -402
## + Male 1 0.43 34.1 -400
## + SS 1 0.38 34.1 -399
## <none> 34.5 -399
## + CollegeBound 1 0.04 34.5 -397
## - HU 1 3.31 37.8 -381
## - HSGPA 1 8.06 42.6 -355
##
## Step: AIC=-413.4
## GPA ~ HSGPA + HU + White
##
## Df Sum of Sq RSS AIC
## + SATV 1 0.68 31.3 -416
## + FirstGen 1 0.31 31.7 -414
## <none> 32.0 -413
## + SATM 1 0.28 31.7 -413
## + Male 1 0.28 31.7 -413
## + SS 1 0.28 31.7 -413
## + CollegeBound 1 0.05 31.9 -412
## - White 1 2.52 34.5 -399
## - HU 1 2.60 34.6 -398
## - HSGPA 1 7.77 39.7 -368
##
## Step: AIC=-416.1
## GPA ~ HSGPA + HU + White + SATV
##
## Df Sum of Sq RSS AIC
## + SS 1 0.30 31.0 -416
## <none> 31.3 -416
## + Male 1 0.17 31.1 -415
## + FirstGen 1 0.16 31.1 -415
## + SATM 1 0.03 31.3 -414
## + CollegeBound 1 0.01 31.3 -414
## - SATV 1 0.68 32.0 -413
## - White 1 1.40 32.7 -409
## - HU 1 2.50 33.8 -401
## - HSGPA 1 6.56 37.9 -376
##
## Step: AIC=-416.2
## GPA ~ HSGPA + HU + White + SATV + SS
##
## Df Sum of Sq RSS AIC
## <none> 31.0 -416
## - SS 1 0.30 31.3 -416
## + Male 1 0.15 30.8 -415
## + FirstGen 1 0.12 30.9 -415
## + SATM 1 0.05 30.9 -415
## + CollegeBound 1 0.02 31.0 -414
## - SATV 1 0.70 31.7 -413
## - White 1 1.31 32.3 -409
## - HU 1 2.80 33.8 -399
## - HSGPA 1 6.50 37.5 -377
##
## Call:
## lm(formula = GPA ~ HSGPA + HU + White + SATV + SS, data = GPA)
##
## Coefficients:
## (Intercept) HSGPA HU White SATV
## 0.568488 0.473998 0.016745 0.206041 0.000748
## SS
## 0.007747
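Here all three procedures arrive at the same five-predictor model, but they are greedy searches and are not guaranteed to agree in general. When they disagree, the candidates can be compared directly. A sketch refitting the selected model and the full model and comparing their AICs:

```r
# Refit the selected model and the full model, then compare AICs
fm.step <- lm(GPA ~ HSGPA + SATV + HU + SS + White, data = GPA)
fm.full <- lm(GPA ~ ., data = GPA)
AIC(fm.step)  # the smaller AIC is preferred
AIC(fm.full)
```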