require(mosaic)
The book uses the First-Year GPA data set to illustrate automated procedures for variable selection in multiple regression. Since the book uses Minitab, we will work through some examples of how to do this in R.
First, we have to load the data.
GPA = read.csv("http://www.math.smith.edu/~bbaumer/mth247/FirstYearGPA.csv")
dim(GPA)
## [1] 220 10
head(GPA)
## GPA HSGPA SATV SATM Male HU SS FirstGen White CollegeBound
## 1 3.06 3.83 680 770 1 3.0 9.0 1 1 1
## 2 4.15 4.00 740 720 0 9.0 3.0 0 1 1
## 3 3.41 3.70 640 570 0 16.0 13.0 0 0 1
## 4 3.21 3.51 740 700 0 22.0 0.0 0 1 1
## 5 3.48 3.83 610 610 0 30.5 1.5 0 1 1
## 6 2.95 3.25 600 570 0 18.0 3.0 0 1 1
Keep in mind that there is no such thing as a “best” multiple regression model. Each model has its strengths and weaknesses, and it is your job as the data analyst to construct a model that is effective for your purposes. That could mean forgoing a model with a higher \( R^2 \) in favor of one that is conceptually simpler. Or it could mean adding polynomial terms that explain only a little bit of extra variation in the response.
With this caveat in mind, there is no way to fully automate a procedure that will find the best regression model. Instead, there are several algorithms that search for a model optimizing a given criterion. We will discuss four such procedures: all-subsets regression, Backwards Elimination, Forward Selection, and stepwise regression. Furthermore, each of these techniques can optimize a different criterion. There is no universally agreed-upon best criterion, but popular choices include \( R^2 \), adjusted \( R^2 \), Mallows' \( C_p \), AIC, and BIC.
While the book focuses on Mallows' \( C_p \), we will focus on AIC.
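To make AIC concrete: for a linear model, step() relies on extractAIC(), which computes \( n \log(\mathrm{RSS}/n) + 2\,\mathrm{edf} \) up to an additive constant, where edf is the number of estimated coefficients. A minimal sketch of this, using R's built-in mtcars data rather than the GPA data:

```r
# Fit a small linear model on R's built-in mtcars data
fm <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
rss <- sum(resid(fm)^2)
edf <- length(coef(fm))  # equivalent degrees of freedom: number of coefficients

# Recompute by hand the AIC that step() uses
manual <- n * log(rss / n) + 2 * edf
c(manual = manual, extractAIC = extractAIC(fm)[2])  # the two values agree
```

Note that extractAIC() and AIC() differ by a model-independent constant for a fixed data set, so they rank models identically.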
The most straightforward method for variable selection is simply to try all of the subsets. Unfortunately, if the number of predictors is \( k \), then there are \( 2^k \) subsets to try! The number of candidate models therefore grows exponentially in the number of predictors. [Computer scientists refer to such algorithms as having exponential running time, and consider them extremely inefficient.]
The following code performs all-subsets regression.
require(leaps)
## Loading required package: leaps
# Reports the two best models for each number of predictors
best = summary(regsubsets(GPA ~ ., data = GPA, nbest = 2))
data.frame(best$rsq, best$adjr2, best$cp, best$rss, best$outmat)
## best.rsq best.adjr2 best.cp best.rss HSGPA SATV SATM Male HU SS
## 1 ( 1 ) 0.19971 0.19602 42.181 37.80 *
## 1 ( 2 ) 0.09901 0.09486 74.541 42.56 *
## 2 ( 1 ) 0.26972 0.26295 21.683 34.49 * *
## 2 ( 2 ) 0.26808 0.26130 22.210 34.57 *
## 3 ( 1 ) 0.32309 0.31364 6.531 31.97 * *
## 3 ( 2 ) 0.30792 0.29826 11.407 32.69 * * *
## 4 ( 1 ) 0.33750 0.32512 3.900 31.29 * * *
## 4 ( 2 ) 0.32964 0.31711 6.426 31.66 * *
## 5 ( 1 ) 0.34375 0.32834 3.892 31.00 * * * *
## 5 ( 2 ) 0.34103 0.32557 4.764 31.13 * * * *
## 6 ( 1 ) 0.34699 0.32851 4.849 30.84 * * * * *
## 6 ( 2 ) 0.34627 0.32777 5.080 30.88 * * * *
## 7 ( 1 ) 0.34938 0.32780 6.081 30.73 * * * * *
## 7 ( 2 ) 0.34725 0.32560 6.766 30.83 * * * * *
## 8 ( 1 ) 0.34952 0.32474 8.036 30.72 * * * * *
## 8 ( 2 ) 0.34949 0.32471 8.046 30.73 * * * * * *
## FirstGen White CollegeBound
## 1 ( 1 )
## 1 ( 2 )
## 2 ( 1 )
## 2 ( 2 ) *
## 3 ( 1 ) *
## 3 ( 2 )
## 4 ( 1 ) *
## 4 ( 2 ) * *
## 5 ( 1 ) *
## 5 ( 2 ) *
## 6 ( 1 ) *
## 6 ( 2 ) * *
## 7 ( 1 ) * *
## 7 ( 2 ) * *
## 8 ( 1 ) * * *
## 8 ( 2 ) * *
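To pull a single winner out of this table programmatically, we can ask which row optimizes a given criterion. A sketch reusing the best object from above, minimizing Mallows' \( C_p \):

```r
# Find the row of the summary table with the smallest Mallows' Cp
best.row <- which.min(best$cp)
best$cp[best.row]        # the smallest Cp in the table
best$outmat[best.row, ]  # stars mark the predictors included in that model
```

From the table above, this picks out the first five-predictor model, with \( C_p \approx 3.89 \).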
Backwards Elimination is a simple algorithm that begins by putting all of the terms into the model, and then greedily removes, at each step, the term whose deletion most improves (lowers) the criterion — here, AIC — stopping when no removal leads to an improvement.
fm.full = lm(GPA ~ ., data = GPA)
step(fm.full, direction = "backward")
## Start: AIC=-410.2
## GPA ~ HSGPA + SATV + SATM + Male + HU + SS + FirstGen + White +
## CollegeBound
##
## Df Sum of Sq RSS AIC
## - SATM 1 0.01 30.7 -412
## - CollegeBound 1 0.01 30.7 -412
## - FirstGen 1 0.10 30.8 -411
## - Male 1 0.11 30.8 -411
## - SS 1 0.26 31.0 -410
## <none> 30.7 -410
## - SATV 1 0.33 31.0 -410
## - White 1 1.15 31.9 -404
## - HU 1 2.44 33.2 -395
## - HSGPA 1 6.43 37.2 -371
##
## Step: AIC=-412.1
## GPA ~ HSGPA + SATV + Male + HU + SS + FirstGen + White + CollegeBound
##
## Df Sum of Sq RSS AIC
## - CollegeBound 1 0.01 30.7 -414
## - FirstGen 1 0.11 30.8 -413
## - Male 1 0.14 30.9 -413
## - SS 1 0.25 31.0 -412
## <none> 30.7 -412
## - SATV 1 0.45 31.2 -411
## - White 1 1.18 31.9 -406
## - HU 1 2.46 33.2 -397
## - HSGPA 1 6.58 37.3 -372
##
## Step: AIC=-414.1
## GPA ~ HSGPA + SATV + Male + HU + SS + FirstGen + White
##
## Df Sum of Sq RSS AIC
## - FirstGen 1 0.11 30.8 -415
## - Male 1 0.15 30.9 -415
## - SS 1 0.25 31.0 -414
## <none> 30.7 -414
## - SATV 1 0.47 31.2 -413
## - White 1 1.17 31.9 -408
## - HU 1 2.45 33.2 -399
## - HSGPA 1 6.76 37.5 -373
##
## Step: AIC=-415.3
## GPA ~ HSGPA + SATV + Male + HU + SS + White
##
## Df Sum of Sq RSS AIC
## - Male 1 0.15 31.0 -416
## - SS 1 0.28 31.1 -415
## <none> 30.8 -415
## - SATV 1 0.59 31.4 -413
## - White 1 1.29 32.1 -408
## - HU 1 2.82 33.7 -398
## - HSGPA 1 6.64 37.5 -375
##
## Step: AIC=-416.2
## GPA ~ HSGPA + SATV + HU + SS + White
##
## Df Sum of Sq RSS AIC
## <none> 31.0 -416
## - SS 1 0.30 31.3 -416
## - SATV 1 0.70 31.7 -413
## - White 1 1.31 32.3 -409
## - HU 1 2.80 33.8 -399
## - HSGPA 1 6.50 37.5 -377
##
## Call:
## lm(formula = GPA ~ HSGPA + SATV + HU + SS + White, data = GPA)
##
## Coefficients:
## (Intercept) HSGPA SATV HU SS
## 0.568488 0.473998 0.000748 0.016745 0.007747
## White
## 0.206041
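In practice we usually want the winning model itself rather than the log of steps. step() returns the final fitted lm object, and setting trace = 0 suppresses the printed trace. A sketch reusing fm.full from above:

```r
# Run backward elimination quietly and keep the resulting model
fm.backward <- step(fm.full, direction = "backward", trace = 0)
coef(fm.backward)                   # matches the Coefficients shown above
summary(fm.backward)$adj.r.squared  # adjusted R^2 of the selected model
```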
Forward Selection is the mirror image of Backwards Elimination. Here, we begin with an empty model and then greedily add, at each step, the term whose inclusion most improves the criterion, stopping when no addition helps.
fm.empty = lm(GPA ~ 1, data = GPA)
step(fm.empty, scope = ~. + SATV + SATM + HSGPA + Male + HU + SS + FirstGen +
White + CollegeBound, direction = "forward")
## Start: AIC=-333.9
## GPA ~ 1
##
## Df Sum of Sq RSS AIC
## + HSGPA 1 9.43 37.8 -381
## + HU 1 4.68 42.6 -355
## + SATV 1 4.37 42.9 -353
## + White 1 3.75 43.5 -350
## + SATM 1 1.78 45.4 -340
## + FirstGen 1 1.16 46.1 -337
## <none> 47.2 -334
## + CollegeBound 1 0.19 47.0 -333
## + Male 1 0.13 47.1 -333
## + SS 1 0.00 47.2 -332
##
## Step: AIC=-380.7
## GPA ~ HSGPA
##
## Df Sum of Sq RSS AIC
## + HU 1 3.31 34.5 -399
## + White 1 3.23 34.6 -398
## + SATV 1 2.19 35.6 -392
## + FirstGen 1 1.63 36.2 -388
## + SATM 1 0.77 37.0 -383
## + Male 1 0.41 37.4 -381
## <none> 37.8 -381
## + CollegeBound 1 0.03 37.8 -379
## + SS 1 0.00 37.8 -379
##
## Step: AIC=-398.8
## GPA ~ HSGPA + HU
##
## Df Sum of Sq RSS AIC
## + White 1 2.521 32.0 -413
## + SATV 1 1.804 32.7 -409
## + SATM 1 0.860 33.6 -402
## + FirstGen 1 0.800 33.7 -402
## + Male 1 0.434 34.1 -400
## + SS 1 0.379 34.1 -399
## <none> 34.5 -399
## + CollegeBound 1 0.039 34.5 -397
##
## Step: AIC=-413.4
## GPA ~ HSGPA + HU + White
##
## Df Sum of Sq RSS AIC
## + SATV 1 0.681 31.3 -416
## + FirstGen 1 0.309 31.7 -414
## <none> 32.0 -413
## + SATM 1 0.282 31.7 -413
## + Male 1 0.279 31.7 -413
## + SS 1 0.275 31.7 -413
## + CollegeBound 1 0.049 31.9 -412
##
## Step: AIC=-416.1
## GPA ~ HSGPA + HU + White + SATV
##
## Df Sum of Sq RSS AIC
## + SS 1 0.2951 31.0 -416
## <none> 31.3 -416
## + Male 1 0.1670 31.1 -415
## + FirstGen 1 0.1560 31.1 -415
## + SATM 1 0.0269 31.3 -414
## + CollegeBound 1 0.0137 31.3 -414
##
## Step: AIC=-416.2
## GPA ~ HSGPA + HU + White + SATV + SS
##
## Df Sum of Sq RSS AIC
## <none> 31.0 -416
## + Male 1 0.1534 30.8 -415
## + FirstGen 1 0.1194 30.9 -415
## + SATM 1 0.0541 30.9 -415
## + CollegeBound 1 0.0188 31.0 -414
##
## Call:
## lm(formula = GPA ~ HSGPA + HU + White + SATV + SS, data = GPA)
##
## Coefficients:
## (Intercept) HSGPA HU White SATV
## 0.568488 0.473998 0.016745 0.206041 0.000748
## SS
## 0.007747
Stepwise regression combines the ideas of Backwards Elimination and Forward Selection to move in both directions.
step(fm.empty, scope = ~. + SATV + SATM + HSGPA + Male + HU + SS + FirstGen +
White + CollegeBound, direction = "both")
## Start: AIC=-333.9
## GPA ~ 1
##
## Df Sum of Sq RSS AIC
## + HSGPA 1 9.43 37.8 -381
## + HU 1 4.68 42.6 -355
## + SATV 1 4.37 42.9 -353
## + White 1 3.75 43.5 -350
## + SATM 1 1.78 45.4 -340
## + FirstGen 1 1.16 46.1 -337
## <none> 47.2 -334
## + CollegeBound 1 0.19 47.0 -333
## + Male 1 0.13 47.1 -333
## + SS 1 0.00 47.2 -332
##
## Step: AIC=-380.7
## GPA ~ HSGPA
##
## Df Sum of Sq RSS AIC
## + HU 1 3.31 34.5 -399
## + White 1 3.23 34.6 -398
## + SATV 1 2.19 35.6 -392
## + FirstGen 1 1.63 36.2 -388
## + SATM 1 0.77 37.0 -383
## + Male 1 0.41 37.4 -381
## <none> 37.8 -381
## + CollegeBound 1 0.03 37.8 -379
## + SS 1 0.00 37.8 -379
## - HSGPA 1 9.43 47.2 -334
##
## Step: AIC=-398.8
## GPA ~ HSGPA + HU
##
## Df Sum of Sq RSS AIC
## + White 1 2.52 32.0 -413
## + SATV 1 1.80 32.7 -409
## + SATM 1 0.86 33.6 -402
## + FirstGen 1 0.80 33.7 -402
## + Male 1 0.43 34.1 -400
## + SS 1 0.38 34.1 -399
## <none> 34.5 -399
## + CollegeBound 1 0.04 34.5 -397
## - HU 1 3.31 37.8 -381
## - HSGPA 1 8.06 42.6 -355
##
## Step: AIC=-413.4
## GPA ~ HSGPA + HU + White
##
## Df Sum of Sq RSS AIC
## + SATV 1 0.68 31.3 -416
## + FirstGen 1 0.31 31.7 -414
## <none> 32.0 -413
## + SATM 1 0.28 31.7 -413
## + Male 1 0.28 31.7 -413
## + SS 1 0.28 31.7 -413
## + CollegeBound 1 0.05 31.9 -412
## - White 1 2.52 34.5 -399
## - HU 1 2.60 34.6 -398
## - HSGPA 1 7.77 39.7 -368
##
## Step: AIC=-416.1
## GPA ~ HSGPA + HU + White + SATV
##
## Df Sum of Sq RSS AIC
## + SS 1 0.30 31.0 -416
## <none> 31.3 -416
## + Male 1 0.17 31.1 -415
## + FirstGen 1 0.16 31.1 -415
## + SATM 1 0.03 31.3 -414
## + CollegeBound 1 0.01 31.3 -414
## - SATV 1 0.68 32.0 -413
## - White 1 1.40 32.7 -409
## - HU 1 2.50 33.8 -401
## - HSGPA 1 6.56 37.9 -376
##
## Step: AIC=-416.2
## GPA ~ HSGPA + HU + White + SATV + SS
##
## Df Sum of Sq RSS AIC
## <none> 31.0 -416
## - SS 1 0.30 31.3 -416
## + Male 1 0.15 30.8 -415
## + FirstGen 1 0.12 30.9 -415
## + SATM 1 0.05 30.9 -415
## + CollegeBound 1 0.02 31.0 -414
## - SATV 1 0.70 31.7 -413
## - White 1 1.31 32.3 -409
## - HU 1 2.80 33.8 -399
## - HSGPA 1 6.50 37.5 -377
##
## Call:
## lm(formula = GPA ~ HSGPA + HU + White + SATV + SS, data = GPA)
##
## Coefficients:
## (Intercept) HSGPA HU White SATV
## 0.568488 0.473998 0.016745 0.206041 0.000748
## SS
## 0.007747
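Here all three procedures arrive at the same five-predictor model, but they are greedy searches and are not guaranteed to agree in general. When they disagree, the candidates can be compared directly. A sketch refitting the selected model and the full model and comparing their AICs:

```r
# Refit the selected model and the full model, then compare AICs
fm.step <- lm(GPA ~ HSGPA + SATV + HU + SS + White, data = GPA)
fm.full <- lm(GPA ~ ., data = GPA)
AIC(fm.step)  # the smaller AIC is preferred
AIC(fm.full)
```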