MTH247: Introduction to Regression Analysis

MLR: Randomization and the Bootstrap – Fall 2012 – Prof. Baumer

require(mosaic)

Randomization (Permutation) Tests

Recall that inference for linear regression relies on the assumption that the following three conditions are met:

  1. Linearity
  2. Constant Variance
  3. Normality of Residuals

We know how to assess whether these conditions are met, and we have learned a few techniques for correcting them when they are not (e.g. transformations). Today, we will learn a few techniques based on randomization for making non-parametric inferences. Such inferences do not rely on stringent assumptions about the distribution of the residuals.

In this example, we are trying once again to explain a college student's \( GPA \) as a function of her score on the verbal section of the SAT.

GPA = read.csv("http://www.math.smith.edu/~bbaumer/mth247/FirstYearGPA.csv")
# Use only the non-null entries
GPA = subset(GPA, !is.na(GPA))
# How big is our sample?
nrow(GPA)
## [1] 219

First, let's examine the relationship between these two variables graphically.

plotPoints(GPA ~ SATV, data = GPA, pch = 19, cex = 2, alpha = 0.3, type = c("p", 
    "r"))

[Figure: scatterplot of GPA vs. SATV with a fitted regression line]

There appears to be some linear association between these variables, but it is not particularly strong. We can quantify this relationship using the correlation coefficient.

cor.actual = with(GPA, cor(GPA, SATV))
cor.actual
## [1] 0.3043

In this case the value of the correlation coefficient is about 0.3, which is not large, but does appear to be significantly different from 0. Recall the \( t \)-test for correlation that we learned earlier:
\[ t = r \sqrt{\frac{n-2}{1 - r^2}}, \qquad n-2 \text{ d.f.} \]

with(GPA, cor.test(GPA, SATV))
## 
##  Pearson's product-moment correlation
## 
## data:  GPA and SATV 
## t = 4.706, df = 217, p-value = 4.499e-06
## alternative hypothesis: true correlation is not equal to 0 
## 95 percent confidence interval:
##  0.1790 0.4199 
## sample estimates:
##    cor 
## 0.3043
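
As a quick check, we can compute the test statistic by hand using the formula above (a minimal sketch in base R):

# compute the t statistic directly from r and n
r = cor.actual
n = nrow(GPA)
t.stat = r * sqrt((n - 2)/(1 - r^2))
t.stat
# two-sided p-value from the t distribution with n - 2 degrees of freedom
2 * pt(-abs(t.stat), df = n - 2)

These values should agree with the t statistic and p-value reported by cor.test() above.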

This test confirms that the correlation between college \( GPA \) and \( SATV \) is most likely not zero, but the validity of this test requires the assumptions for simple linear regression to be met (specifically, normality). Let's assume that in this case the assumptions are not met. Can we still feel confident that the correlation is non-zero?

If \( GPA \) and \( SATV \) are really correlated, then there is a real relationship binding the \( i^{th} \) value of \( GPA \) to the \( i^{th} \) value of \( SATV \). In this case it would not make sense to link the \( i^{th} \) value of \( GPA \) to some other value of \( SATV \). But if the correlation between these two variables were in fact zero, then it wouldn't matter how we matched up the entries in the variables!

The basic idea of the permutation test is to shuffle the mapping between the two variables many times (i.e. sample without replacement), and examine the distribution of the resulting correlation coefficient. If the actual value of the correlation coefficient is a rare member of that distribution, then we have evidence that the true correlation is non-zero.

GPA.permute = shuffle(GPA$GPA)
plotPoints(GPA.permute ~ SATV, data = GPA, pch = 19, cex = 2, alpha = 0.3, type = c("p", 
    "r"))

The procedure for the randomization test is simple. We simply shuffle the response variable and compute the resulting correlation coefficient with the explanatory variable. But we do this many times, until we have a sense of the distribution of that correlation coefficient. We then examine where the observed correlation falls in that distribution.

# Do this 1000 times
rtest = do(1000) * cor(shuffle(GPA$GPA), GPA$SATV)
densityplot(~result, data = rtest, xlim = c(-0.5, 0.5), xlab = "Correlation Coefficient")
ladd(panel.abline(v = cor.actual, col = "red", lwd = 3))

Of course, we can also explicitly find where in the distribution the observed correlation lies.

pdata(cor.actual, rtest$result)
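
Since pdata() reports the proportion of shuffled correlations at or below the observed value, a two-sided permutation p-value can be approximated by doubling the upper-tail proportion (a rough sketch, assuming the observed correlation sits in the upper tail):

# approximate two-sided permutation p-value
2 * (1 - pdata(cor.actual, rtest$result))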

Finally, we can find the middle 95% of the permutation distribution. The interpretation here is that correlation values in this interval would not be considered statistically significant.

qdata(c(0.025, 0.975), rtest$result)

The Bootstrap

A similar idea is the bootstrap. Here, we will repeatedly sample cases from our data set, with replacement, and construct the distribution of a statistic. The key idea here is that even though the solution to the regression model is deterministic, the data itself is assumed to be randomly sampled, and so all of the estimates that we make in a regression model are random. If the data changes, then the estimates change as well. The bootstrap gives us a non-parametric understanding of the distribution of those estimates. Once again, the advantage to this method is that we can construct meaningful confidence intervals for, say, the slope of the regression line, without having to assume that the residuals are normally distributed.

In this example we are examining the prices of Porsches. We want to estimate the \( Price \) as a function of the \( Mileage \). Our sample of 30 cars looks like this:

PP = read.csv("http://www.math.smith.edu/~bbaumer/mth247/PorschePrice.csv")
plotPoints(Price ~ Mileage, data = PP, pch = 19, cex = 2, alpha = 0.3, type = c("p", 
    "r"))

[Figure: scatterplot of Price vs. Mileage for the sample of 30 Porsches, with a fitted regression line]

To create a bootstrap sample, we select rows from the data frame uniformly at random, but with replacement.

idx = resample(1:nrow(PP))
rPP = PP[idx, ]
plotPoints(Price ~ Mileage, data = rPP, pch = 19, cex = 2, alpha = 0.3, type = c("p", 
    "r"))

The manipulate package (which requires RStudio) allows us to create bootstrap samples on the fly.

bslope = function(formula, data, n) {
    # Now do the bootstrap: refit the model to n resampled versions of the data
    bootstrap = do(n) * lm(formula, data = data[resample(1:nrow(data)), ])$coefficients
    xyplot(formula, data = data, panel = function(x, y, ...) {
        panel.xyplot(x, y, pch = 19, cex = 2, alpha = 0.3)
        fm = lm(formula, data = data)
        panel.abline(fm, col = "red", lwd = 3)
        # Add the bootstrap slopes
        for (i in 1:n) {
            panel.abline(a = bootstrap$"(Intercept)"[i], b = bootstrap$Mileage[i], col = "lightgray", lwd = 0.3)
        }
        panel.text(75, 80, paste("mean intercept\n", round(mean(bootstrap$"(Intercept)"), 
            6)))
        panel.text(75, 75, paste("sd intercept\n", round(sd(bootstrap$"(Intercept)"), 
            6)))
        panel.text(75, 70, paste("mean slope\n", round(mean(bootstrap$Mileage), 
            6)))
        panel.text(75, 65, paste("sd slope\n", round(sd(bootstrap$Mileage), 
            6)))
    })
}
# bslope(Price ~ Mileage, data=PP, 100)
require(manipulate)
manipulate(bslope(Price ~ Mileage, data = PP, n), n = slider(2, 500, initial = 100))

Confidence Intervals Using the Bootstrap

The primary advantage of the bootstrap is that it allows us to construct confidence intervals for the slope coefficient that are not nearly as dependent upon the conditions for linear regression being met.

The original confidence intervals for our SLR model depend upon the conditions being true.

fm = lm(Price ~ Mileage, data = PP)
confint(fm)
##               2.5 %  97.5 %
## (Intercept) 66.2360 75.9449
## Mileage     -0.7054 -0.4734

Now let's create a bootstrap distribution for the regression coefficients.

# I'm only doing 100 samples, but you should do more!
bootstrap = do(100) * lm(Price ~ Mileage, data = PP[resample(1:nrow(PP)), ])$coefficients
densityplot(~Mileage, data = bootstrap)
ladd(panel.abline(v = fm$coefficients["Mileage"], col = "red", lwd = 3))
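
We can also summarize the bootstrap distribution of the slope numerically; its standard deviation is the bootstrap standard error used in Method #1 below (a quick check using mosaic's favstats()):

# numerical summary of the bootstrapped slope coefficients
favstats(bootstrap$Mileage)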

There are three methods for constructing confidence intervals from the bootstrap.

Method #1

The first method assumes that the bootstrap distribution is normal. In this case we can use the standard deviation of the bootstrap distribution to construct a confidence interval for the slope coefficient.
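
In symbols, the interval is
\[ \hat{\beta}_1 \pm 1.96 \cdot SD_{boot}, \]
where \( SD_{boot} \) denotes the standard deviation of the bootstrapped slopes.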

c(fm$coefficients["Mileage"] - 1.96 * sd(bootstrap$Mileage), fm$coefficients["Mileage"] + 
    1.96 * sd(bootstrap$Mileage))

Method #2

The second method does not require that the bootstrap distribution be normal, but only works well when it is roughly symmetric. In this case we simply use the percentiles of the bootstrap distribution to build confidence intervals.

qdata(c(0.025, 0.975), bootstrap$Mileage)

Method #3

If the bootstrap distribution is skewed, then we need to modify the second method to work in reverse. This is because our estimate may already differ from the true population parameter.
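
Concretely, this "reverse" percentile interval reflects the bootstrap percentiles around the observed slope \( \hat{\beta}_1 \):
\[ \left( 2\hat{\beta}_1 - q_{0.975}, \; 2\hat{\beta}_1 - q_{0.025} \right), \]
where \( q_{0.025} \) and \( q_{0.975} \) are the percentiles of the bootstrap distribution. This is exactly what the code below computes.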

q.ul = qdata(c(0.025, 0.975), bootstrap$Mileage)
c(2 * fm$coefficients["Mileage"] - q.ul[2], 2 * fm$coefficients["Mileage"] - 
    q.ul[1])