This lab on Linear Regression in R comes from p. 109-119 of "Introduction to Statistical Learning with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.

Want to follow along on your own machine? Download the rMarkdown or Jupyter Notebook version.

3.6.1 Libraries

The ${\tt library()}$ function is used to load libraries, or groups of functions and data sets that are not included in the base R distribution. Basic functions that perform least squares linear regression and other simple analyses come standard with the base distribution, but more exotic functions require additional libraries. Here we load the ${\tt MASS}$ package, which is a very large collection of data sets and functions. We also load the ${\tt ISLR}$ package, which includes the data sets associated with this book.

library(MASS)
library(ISLR)

3.6.2 Simple Linear Regression

The ${\tt MASS}$ library contains the ${\tt Boston}$ data set, which records ${\tt medv}$ (median house value) for 506 neighborhoods around Boston. We will seek to predict ${\tt medv}$ using predictors such as ${\tt lstat}$ (percent of households with low socioeconomic status).

names(Boston)
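
To find out more about the data set, we can type:

?Boston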

The following call should produce an error, because R doesn't know where to look for the ${\tt medv}$ and ${\tt lstat}$ variables:

lm.fit=lm(medv~lstat)

We can tell R explicitly where to find the data using the ${\tt data}$ argument. Let's fit a simple linear model with ${\tt medv}$ as the response and ${\tt lstat}$ as the predictor:

lm.fit=lm(medv~lstat,data=Boston)

If we type ${\tt lm.fit}$, some basic information about the model is output. For more detailed information, we use ${\tt summary(lm.fit)}$:

lm.fit
summary(lm.fit)

We can use the ${\tt names()}$ function in order to find out what other pieces of information are stored in ${\tt lm.fit}$. Although we can extract these quantities by name (e.g. ${\tt lm.fit\$coefficients}$), it is safer to use extractor functions like ${\tt coef()}$ to access them.

names(lm.fit)
coef(lm.fit)

In order to obtain a confidence interval for the coefficient estimates, we can use the ${\tt confint()}$ command.

confint(lm.fit)

The ${\tt predict()}$ function can be used to produce both confidence intervals and prediction intervals for the prediction of ${\tt medv}$ for a given value of ${\tt lstat}$.

predict(lm.fit,data.frame(lstat=(c(5,10,15))), interval="confidence")
predict(lm.fit,data.frame(lstat=(c(5,10,15))), interval="prediction")

We will now plot ${\tt medv}$ and ${\tt lstat}$ along with the least squares regression line using the ${\tt plot()}$ and ${\tt abline()}$ functions. Note that we first ${\tt attach()}$ the ${\tt Boston}$ data frame so that we can refer to its variables by name.

attach(Boston) # make lstat and medv accessible by name
plot(lstat,medv)
abline(lm.fit)

The ${\tt abline()}$ function can be used to draw any line, not just the least squares regression line. To draw a line with intercept ${\tt a}$ and slope ${\tt b}$, we type ${\tt abline(a,b)}$. Below we experiment with some additional settings for plotting lines and points. The ${\tt lwd=3}$ command causes the width of the regression line to be increased by a factor of 3; this works for the ${\tt plot()}$ and ${\tt lines()}$ functions also. We can also use the ${\tt pch}$ option to create different plotting symbols.

plot(lstat,medv,col="red")
plot(lstat,medv,pch=20)
plot(lstat,medv,pch="+")
plot(1:20,1:20,pch=1:20)
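
As a quick illustration that ${\tt abline()}$ can draw arbitrary lines, here we overlay a line whose intercept and slope are chosen by hand (the values 35 and -1 are arbitrary, purely for demonstration):

plot(lstat,medv)
abline(35,-1,lwd=3,col="blue") # a line with intercept 35 and slope -1, not fitted to the data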

Four diagnostic plots are automatically produced by applying the ${\tt plot()}$ function directly to the output from ${\tt lm()}$. In general, this command will produce one plot at a time, and hitting Enter will generate the next plot. However, it is often convenient to view all four plots together.

We can achieve this by using the ${\tt par()}$ function, which tells R to split the display screen into separate panels so that multiple plots can be viewed simultaneously.

For example, ${\tt par(mfrow=c(2,2))}$ divides the plotting region into a 2 × 2 grid of panels.

par(mfrow=c(2,2))
plot(lm.fit)
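
When you are done, you can return to a single plotting panel by resetting the option:

par(mfrow=c(1,1))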

Alternatively, we can compute the residuals from a linear regression fit using the ${\tt residuals()}$ function. The function ${\tt rstudent()}$ will return the studentized residuals, and we can use this function to plot the residuals against the fitted values.

plot(predict(lm.fit), residuals(lm.fit))
plot(predict(lm.fit), rstudent(lm.fit))

On the basis of the residual plots, there is some evidence of non-linearity. Leverage statistics can be computed for any number of predictors using the ${\tt hatvalues()}$ function.

plot(hatvalues(lm.fit))

The ${\tt which.max()}$ function identifies the index of the largest element of a vector. In this case, it tells us which observation has the largest leverage statistic.

which.max(hatvalues(lm.fit))
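
If we want more than the index, we can use it to pull out the leverage value itself and the corresponding observation (a minimal sketch):

idx=which.max(hatvalues(lm.fit)) # index of the highest-leverage observation
hatvalues(lm.fit)[idx] # its leverage statistic
Boston[idx,] # the observation itself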

3.6.3 Multiple Linear Regression

In order to fit a multiple linear regression model using least squares, we again use the ${\tt lm()}$ function. The syntax ${\tt lm(y \sim x1+x2+x3)}$ is used to fit a model with three predictors, ${\tt x1}$, ${\tt x2}$, and ${\tt x3}$. The ${\tt summary()}$ function now outputs the regression coefficients for all the predictors.

lm.fit=lm(medv~lstat+age,data=Boston)
summary(lm.fit)

The ${\tt Boston}$ data set contains 13 predictors, and so it would be cumbersome to have to type all of them in order to perform a regression using all of the predictors. Instead, we can use the following short-hand:

lm.fit=lm(medv~.,data=Boston)
summary(lm.fit)
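
We can also access individual components of a summary object by name (type ${\tt ?summary.lm}$ to see what is available). For example:

summary(lm.fit)$r.sq # the R-squared
summary(lm.fit)$sigma # the residual standard error (RSE)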

The ${\tt vif()}$ function, part of the ${\tt car}$ package, can be used to compute variance inflation factors. Most VIFs are low to moderate for this data. The ${\tt car}$ package is not part of the base R installation, so it must be installed the first time you use it via the ${\tt install.packages()}$ function.
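
If you have not installed it yet, you can do so with:

install.packages("car")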

library(car)
vif(lm.fit)

What if we would like to perform a regression using all of the variables but one? For example, in the above regression output, ${\tt age}$ has a high p-value, so we may wish to run a regression excluding this predictor. The following syntax results in a regression using all predictors except ${\tt age}$:

lm.fit1=lm(medv~.-age,data=Boston)
summary(lm.fit1)

Alternatively, the ${\tt update()}$ function can be used.

lm.fit1=update(lm.fit, ~.-age)

3.6.4 Interaction Terms

It is easy to include interaction terms in a linear model using the ${\tt lm()}$ function. The syntax ${\tt lstat:black}$ tells R to include an interaction term between ${\tt lstat}$ and ${\tt black}$. The syntax ${\tt lstat*age}$ simultaneously includes ${\tt lstat}$, ${\tt age}$, and the interaction term ${\tt lstat \times age}$ as predictors; it is a shorthand for ${\tt lstat+age+lstat:age}$.

summary(lm(medv~lstat*age,data=Boston))

3.6.5 Non-linear Transformations of the Predictors

The ${\tt lm()}$ function can also accommodate non-linear transformations of the predictors. For instance, given a predictor $X$, we can create a predictor $X^2$ using ${\tt I(X^{\wedge}2)}$. The function ${\tt I()}$ is needed since the ^ has a special meaning in a formula; wrapping it in ${\tt I()}$ allows the standard usage in R, which is to raise $X$ to the power 2. We now perform a regression of ${\tt medv}$ onto ${\tt lstat}$ and ${\tt lstat^2}$.

lm.fit2=lm(medv~lstat+I(lstat^2),data=Boston)
summary(lm.fit2)

The near-zero p-value associated with the quadratic term suggests that it leads to an improved model. We use the ${\tt anova()}$ function to further quantify the extent to which the quadratic fit is superior to the linear fit.

lm.fit=lm(medv~lstat,data=Boston)
anova(lm.fit,lm.fit2)

Here Model 1 represents the linear submodel containing only one predictor, ${\tt lstat}$, while Model 2 corresponds to the larger quadratic model that has two predictors, ${\tt lstat}$ and ${\tt lstat^2}$. The ${\tt anova()}$ function performs a hypothesis test comparing the two models. The null hypothesis is that the two models fit the data equally well, and the alternative hypothesis is that the full model is superior.

Here the F-statistic is 135 and the associated p-value is virtually zero. This provides very clear evidence that the model containing the predictors ${\tt lstat}$ and ${\tt lstat^2}$ is far superior to the model that only contains the predictor ${\tt lstat}$. This is not surprising, since earlier we saw evidence for non-linearity in the relationship between ${\tt medv}$ and ${\tt lstat}$.

If we type:

par(mfrow=c(2,2))
plot(lm.fit2)

then we see that when the ${\tt lstat^2}$ term is included in the model, there is little discernible pattern in the residuals.

In order to create a cubic fit, we can include a predictor of the form ${\tt I(X^{\wedge}3)}$. However, this approach can start to get cumbersome for higher order polynomials. A better approach involves using the ${\tt poly()}$ function to create the polynomial within ${\tt lm()}$. For example, the following command produces a fifth-order polynomial fit:

lm.fit5=lm(medv~poly(lstat,5,raw=TRUE),data=Boston)
summary(lm.fit5)
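
A side note on the ${\tt raw=TRUE}$ argument: by default, ${\tt poly()}$ uses orthogonal polynomials rather than raw powers, which changes the individual coefficients but not the fitted values. We can check this (a quick sketch):

lm.fit5.orth=lm(medv~poly(lstat,5),data=Boston) # orthogonal polynomials (the default)
max(abs(fitted(lm.fit5)-fitted(lm.fit5.orth))) # should be essentially zero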

This suggests that including additional polynomial terms, up to fifth order, leads to an improvement in the model fit! However, further investigation of the data reveals that no polynomial terms beyond fifth order have significant p-values in a regression fit.

Of course, we are in no way restricted to using polynomial transformations of the predictors. Here we try a log transformation.

summary(lm(medv~log(rm),data=Boston))

3.6.6 Qualitative Predictors

We will now examine the ${\tt Carseats}$ data, which is part of the ${\tt ISLR}$ library. We will attempt to predict ${\tt Sales}$ (child car seat sales) in 400 locations based on a number of predictors.

fix(Carseats)
names(Carseats)

The ${\tt Carseats}$ data includes qualitative predictors such as ${\tt ShelveLoc}$, an indicator of the quality of the shelving location (that is, the space within a store in which the car seat is displayed) at each location. The predictor ${\tt ShelveLoc}$ takes on three possible values: ${\tt Bad}$, ${\tt Medium}$, and ${\tt Good}$.

Given a qualitative variable such as ${\tt ShelveLoc}$, R generates dummy variables automatically. Below we fit a multiple regression model that includes some interaction terms.

lm.fit=lm(Sales~.+Income:Advertising+Price:Age,data=Carseats)
summary(lm.fit)

The ${\tt contrasts()}$ function returns the coding that R uses for the dummy variables. Use ${\tt ?contrasts}$ to learn about other contrasts, and how to set them.

contrasts(Carseats$ShelveLoc)
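
To see the dummy variables that R actually constructs from this coding, we can inspect the model matrix (a minimal sketch):

head(model.matrix(~ShelveLoc,data=Carseats)) # one column per non-baseline level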

Getting Credit

To get credit for this lab, please reply to the prompt posted to the #lab2 Slack channel.