In this lab, we will work with bootstrap distributions to estimate the standard error of sample statistics and construct confidence intervals.

Setting a seed: We will take random samples and build bootstrap distributions in this lab, so you should set a seed at the top of each chunk using the command set.seed(x), where x is a number of your choosing.
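To see what setting a seed does, here is a tiny sketch (the seed value 1234 and the toy draw are just examples): running the same sampling command after the same set.seed() call reproduces the same random draw.

```r
set.seed(1234)                  # 1234 is an arbitrary example seed
first_draw <- sample(1:10, 3)   # a random draw of 3 values

set.seed(1234)                  # reset to the same seed...
second_draw <- sample(1:10, 3)  # ...and the same draw is reproduced

identical(first_draw, second_draw)  # TRUE
```

This is why setting a seed makes your lab report reproducible: anyone who re-runs your chunks gets the same samples you did.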

Load packages

In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package. We will also do some sampling with replacement using the mosaic package. The data can be found in the companion package for OpenIntro labs, oilabs. Finally, we will use a new package called infer.

Let’s load the packages.

library(dplyr)
library(ggplot2)
library(oilabs)
library(mosaic)
library(infer)

The data

Earlier in the semester, you used the ames dataset to investigate real estate prices, constructing a sampling distribution for the sample mean home price. Now, we will again consider real estate data from the city of Ames, Iowa, but this time we will think about constructing bootstrap distributions. The details of every real estate transaction in Ames are recorded by the City Assessor’s office.

First, consider all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest.

data(ames)

After running this chunk, the full population data (ames) will be loaded into your environment. Recall that you can run ?ames for more information about the variables.

?ames

There are quite a few variables in the data set, enough to do a very in-depth analysis. For this lab, we’ll restrict our attention to just a few of the variables: the above ground living area of the houses in square feet (area), the sale price (price), and the year the house was built (Year.Built).

We can explore the distribution of areas of homes in the population of home sales visually and with summary statistics. To obtain some summary statistics, you can use summarize() from the dplyr package, or favstats() from the mosaic package.

The summarize() function is much more flexible than favstats(). For example, recall that the quantile() function is used to calculate values corresponding to specific percentile cutoffs in a distribution. So, quantile(x, 0.025) will yield the cutoff value for the 2.5th percentile in the distribution of x, and quantile(x, 0.975) will yield the cutoff value for the 97.5th percentile. Finding these values is useful for describing the middle 95% of the distribution. We can, of course, also use quantile() to get the cutoff value for the 25th percentile (Q1) and the 75th percentile (Q3) in the distribution of x.
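As a quick illustration on toy data (here x is just the integers 1 through 100, chosen so the percentile cutoffs are easy to sanity-check):

```r
x <- 1:100                  # toy data standing in for a numeric variable
quantile(x, 0.025)          # cutoff for the 2.5th percentile
quantile(x, 0.975)          # cutoff for the 97.5th percentile
quantile(x, c(0.25, 0.75))  # Q1 and Q3 in one call
```

The same quantile() calls work inside summarize(), as the chunk below shows on the population data.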

ames %>%
  summarize(mu = mean(area), pop_med = median(area), 
            sigma = sd(area), pop_iqr = IQR(area),
            pop_min = min(area), pop_max = max(area),
            pop_Q1 = quantile(area, 0.25),  # first quartile, 25th percentile
            pop_Q3 = quantile(area, 0.75))  # third quartile, 75th percentile
  1. Describe the distribution of the area from the population data using an appropriate visualization and summary statistics. Do this for price as well. Make sure to include the plots and the summary statistics output in your report along with your narrative.

The unknown sampling distribution

As in the sampling distribution lab, we’ll start by taking one sample from the entire population. Here, it is nice that we have the entire population, but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or impossible. Because of this, we often take a sample of the population and use it to understand the properties of the population.

ames_sample <- ames %>%
  sample_n(100)

This command collects a simple random sample of size 100 from the ames dataset and assigns the result to ames_sample. This is like going into the City Assessor’s database and pulling up the files on 100 random home sales. Working with these 100 files would be considerably simpler than working with all 2930 home sales.

  1. Describe the distribution of area and price in this sample. How does it compare to the distributions of these variables in the population? As with the population data, create visualizations and summary statistics for each variable. Hint: Code you used in the previous exercise will also be helpful for visualizing and summarizing the sample; however, be careful not to label values mu and sigma anymore, since these are sample statistics, not population parameters. You can customize the labels of any of the statistics to indicate that they come from the sample (for example, x_bar and sd, and drop the pop prefix).

If we’re interested in estimating the average living area in homes in Ames using the sample, our best single guess is the sample mean. Depending on which 100 homes you selected, your estimate could be a bit above or a bit below the true population mean of 1499.69 square feet. In general, though, the sample mean turns out to be a pretty good estimate of the average living area, and we were able to get it by sampling less than 6% of the population. Using the following code, you can stay organized by creating a small dataset containing the values of the population parameters of interest and the sample statistics that estimate them.

para_stats <- data.frame(variable = c("area", "price"),
                         parameter_mu = c(mean(~area, data = ames),
                                          mean(~price, data = ames)), 
                         statistic_xbar = c(mean(~area, data = ames_sample),
                                            mean(~price, data = ames_sample)))
  1. Create a new variable in this tiny para_stats data set that measures how far off your sample estimates were from the population parameters they are estimating. Would you expect everyone in the class to have the same level of accuracy? Explain your reasoning. You might want to look over previous labs to remember how to create a new variable.

Confidence Intervals Using a Bootstrap Distribution

Next we want to find a 95% CI for the mean area of all homes sold in Ames. To do this, we can use our sample (ames_sample) to create a bootstrap distribution for the mean area. Recall that a bootstrap sample is constructed by taking a random sample with replacement from the original sample, using the same sample size as the original sample. Once we have taken a bootstrap sample, we calculate the bootstrap statistic, which in this case is a mean. We then repeat this many times to form a bootstrap distribution. Finally, we find the standard deviation of the bootstrap distribution, which approximates the standard error.
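The whole procedure can be sketched in base R on toy data (this is only an illustration of the idea; in the lab itself we will use resample() and do() from the mosaic package on ames_sample):

```r
set.seed(42)                             # arbitrary seed for reproducibility
x <- rnorm(100, mean = 1500, sd = 500)   # stand-in sample of size 100

# 1,000 bootstrap means: resample x with replacement (same size), take the mean
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))

sd(boot_means)   # SD of the bootstrap distribution approximates the SE
```

Note that sample(x, replace = TRUE) draws a sample the same size as x, which is exactly what a bootstrap sample requires.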

First, we can create one bootstrap sample using the resample() function from the mosaic package.

bstrap1 <- resample(ames_sample)

We can then use the group_by() function with the n() function inside summarize() to see which observations were repeated:

bstrap1 %>%
  group_by(PID) %>%
  summarise(reps = n())

Lastly, we can also use the distinct() function and nrow() to see how many different observations ended up in our bootstrap sample of 100.

bstrap1 %>%
  distinct() %>%
  nrow()
  1. How many different observations from the original sample ended up in your bootstrap sample? Should this value be the same across students? Explain.

  2. What is the sample mean area and price in your first bootstrap sample? How do they compare to the sample means from your original sample?

Of course, we do not want just one bootstrap sample mean. To get a bootstrap estimate of the standard error of the mean, we will want to draw many bootstrap samples and calculate many sample means from them. We can do that with the following code. Let us now focus exclusively on the area variable.

bstrap.mean.area <- do(1000) * mean(~area, data = resample(ames_sample))

Let’s break this down. The resample() command takes a bootstrap resample of whatever dataset you put within the parentheses. We then take the mean of the variable area from the resampled data. Finally, do(1000) tells R to do this 1,000 times and store the mean areas under the name bstrap.mean.area.

Next we want to take a look at the bootstrap distribution to make sure we will be able to use it to find our confidence interval:

ggplot(bstrap.mean.area, aes(x = mean)) +
  geom_histogram(bins = 30) +
  xlab("bootstrap distribution of mean area")

Finally, we will find the standard error, and the 2.5th and 97.5th percentiles.

bstrap.mean.area %>%
  summarise(bs_se=sd(mean), 
            bs_025 = quantile(mean, 0.025),
            bs_975 = quantile(mean, 0.975)) 
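With the standard error in hand, the SE method builds the 95% CI as the sample statistic plus or minus roughly two standard errors. The numbers below are hypothetical placeholders, just to show the arithmetic; plug in your own sample mean and bootstrap SE:

```r
x_bar <- 1500   # hypothetical sample mean area from ames_sample
bs_se <- 25     # hypothetical bootstrap SE of the mean

x_bar + c(-2, 2) * bs_se   # SE-method 95% CI: (x_bar - 2*SE, x_bar + 2*SE)
```

The percentile method, by contrast, reads the interval directly off the bs_025 and bs_975 cutoffs.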
  1. Use the above code to calculate your 95% CI for the mean area of the houses in two ways: the standard error method and the percentile method. Do the two confidence intervals seem similar enough? Do the two confidence intervals both contain the population mean of 1499.69 square feet?

  2. Next, use the ames_sample data to create a bootstrap distribution for the mean price. Calculate the 99% CI, and interpret it in context. Does your 99% CI contain the population mean of $180,796.10?

Confidence intervals are good for answering the question “how much”, but they don’t allow us to assess specific claims. For that, we need hypothesis tests.

Hypothesis Tests Using Bootstrapping

A hypothesis test allows us to use data from a sample to assess a claim about a population. The first hypothesis we want to test is that the average price of a home in Ames is different from 170,000 dollars.

  1. Write down the null and alternative hypotheses for this test, and be sure to use the proper notation.

In order to carry out the above hypothesis test, we will use our ames_sample to create a randomization distribution. Recall that in order to create a randomization distribution, we have to create a distribution of what the sample statistics would look like if the null hypothesis were true. For means (as opposed to proportions) this can be a little less straightforward.

Fortunately for us, one thing we know how to do is take a bootstrap sample for the mean. But if we were to take a bootstrap sample, our distribution would not be constructed assuming the null hypothesis is true. So here’s what we can do: we center our sample data on the null value, and then sample with replacement from the newly re-centered sample data.

null <- 170000   # null value

diff <- null - mean(~price, data = ames_sample)   # difference between null value and sample mean

rando.mean.price <- do(1000) * (mean(~price, data = resample(ames_sample)) + diff)   # randomization distribution centered on the null value

Now we have stored the randomization distribution, assuming that the null hypothesis is true. Our next step is to visualize the randomization distribution and our sample statistic.

ggplot(rando.mean.price, aes(x = result)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = mean(~price, data = ames_sample)) +
  xlab("randomization distribution of mean price")

In this plot, the vertical line represents our sample mean. We next need to calculate the proportion of our randomization distribution that is as or more extreme than our sample statistic. To do that we can use the code below to get the percentile of our statistic:

sum(rando.mean.price < mean(~price, data=ames_sample))/1000

This code calculates the proportion of the randomization distribution that is less than our sample statistic.
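Since the alternative hypothesis here is two-sided (“different from”), one common convention is to double the smaller tail proportion. A sketch with a hypothetical one-tail proportion (0.03 is just an example value, not your result):

```r
left_tail <- 0.03                             # hypothetical proportion below the statistic
p_value <- 2 * min(left_tail, 1 - left_tail)  # double the smaller tail for a two-tailed test
p_value
```

If your sample statistic falls in the upper tail instead, the same line still works, because min() picks whichever tail is smaller.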

  1. Calculate your p-value for this hypothesis test (paying attention to whether it is a one-tailed or two-tailed test), and make a conclusion in context about your hypotheses using \(\alpha=0.05\).

More Practice

So far, we have only focused on estimating the mean living area and the mean price of homes in Ames. Now we’ll work with the year the house was built (Year.Built).

  1. Using the full data set ames, create a new variable called house.age (how old the house is in 2018) using the mutate command. Find the population mean for house.age.

  2. Create a new random sample of 50 homes, and use that random sample to find the 95% CI for the average age for houses in Ames using the SE method. Interpret the CI in context.

  3. Create a new random sample of 200 homes, and use that random sample to find the 95% CI for the average age for houses in Ames using the SE method. How does this 95% CI compare to the CI from the previous question? Why do you think you observe this/these difference/s?

  4. Create a new random sample of 25 homes, and use that random sample to test the null hypothesis that the average age for houses in Ames is greater than 100 years. Be sure to write out your hypotheses, state your sample statistic, your p-value, and interpret your p-value in context.

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.