SDS/MTH 290 Homework 5

R Packages

R is an open-source programming language, meaning that users can contribute packages that make our lives easier, and we can use them for free. In this homework, and many others in the future, we will use the following R packages:

dplyr: for data wrangling
ggplot2: for data visualization
mosaic: for basic statistical computing

You need to load these packages in your working environment. We do this with the library() function. Run the following three lines in your console.

library(dplyr)
library(tidyr)
library(ggplot2)
library(mosaic)

Note that you only need to load then with the library() function each time you relaunch RStudio.

Using R Markdown

We will be using R Markdown to create reproducible homework reports. Going forward you should refrain from typing your code directly in the console, and instead type any code (final correct answer, or anything you’re just trying out) in the R Markdown file and run the chunk using either the Run button on the chunk (green sideways triangle) or by highlighting the code and clicking Run on the top right corner of the R Markdown editor. If at any point you need to start over, you can Run All Chunks above the chunk you’re working in by clicking on the down arrow in the code chunk.

Creating a reproducible lab report

Remember that we will be using R Markdown to create reproducible homework reports. See the following video describing how to get started with creating these reports for this homework, and all future homeworks involving R:

Basic R Markdown with an OpenIntro Lab

One-Way Basic Factorial Design

Let us start with the BF[1] example of plant trampling. Deputy director of the Pawnee Parks and Rec department, Leslie Knope, needs to know how resistant different vegetative types are to trampling so that the number of visitors can be controlled in sensitive areas. Twenty lanes of a park are established, each .5 m wide and 1.5 m long. These twenty lanes are randomly assigned to five treatments: 0, 25, 75, 200, or 500 walking passes. Each pass consists of a 70-kg individual wearing boots, walking in a natural gait. One year after trampling, the average height of the vegetation along the lanes are measured. The data from this experiment is entered into a data frame using the code below.

parks <- data.frame(passes = c(rep(0, 4), rep(25, 4), 
                            rep(75, 4), rep(200, 4), 
                            rep(500, 4)), 
                    height = c(20.7, 15.9, 17.8, 17.6, 
                            12.9, 13.4, 12.7, 9.0, 
                            11.8, 12.6, 11.4, 12.1, 
                            7.6, 9.5, 9.9, 9.0, 
                            7.8, 9.0, 8.5, 6.7)) %>%
  mutate(passes = as.factor(passes))

Parallel Dot Graph

Using the parks data frame created from the code above, we can make a parallel dot graph using the ggplot() function from the ggplot2 package.

ggplot(parks, aes(x = passes, y = height)) +
  geom_point()

What do you notice? For which treatment do the plants fair the best? We can also look at parallel dot graphs to get a sense of the validity of the S, same standard deviations, Fisher assumption.

Formal ANOVA

If we feel comfortable with that assumption, we can move forward with the formal ANOVA for BF[1] design. We will use the lm() function and save this model in an object, here we name that object mod. Note that inside of the lm() function, the model is given as response ~ factor using the formula notation, y ~ x. The data is also given by typing data = parks. The formula and the data are arguments passed to the lm() function. Next, to get the ANOVA decomposition, F-ratios, and p-values for those F-ratios by passing the anova() function our mod object.

mod <- lm(height ~ passes, data = parks)

anova(mod)

What can we conclude about the effect of foot traffic, passes, on the height of plants?

Residual Plot

After running our formal ANOVA, we should look at the model residuals to get a sense of how valid our N assumption is.

parks <- parks %>%
  mutate(residuals = residuals(mod),
         SD = sd(residuals))

Now that we have a variable saved in our data frame that contains the residuals, we can visualize these residuals with the ggplot() function. In a layer on top of our histogram, we want to add two vertical lines that cut off 1 SD above and below the mean (of zero). We can accomplish this with the geom_vline() function. The v stands for vertical. The geom_vline() function needs to know the xintercept aesthetic to know where to draw the line. We’ll make these lines blue by adding the color = "blue" argument to geom_vline().

ggplot(parks, aes(x = residuals)) +
  geom_histogram(bins = 10) +
  geom_vline(aes(xintercept = SD), color = "blue") +
  geom_vline(aes(xintercept = -SD), color = "blue")

Two-Way Basic Factorial Design

Creating a Data Frame in R

To create a data frame for data from a bf[2] design, we will want 3 variables: 1) the response variable observations, 2) the level for each observation for the first factor, and 3) the level for each observation on the second factor.

The code below creates a data frame for the piglet data. First, notice that observations on the response variable, gain are specified inside of the c() function. The c stands for concatenate, and this concatenate function makes lists in R. Next, levels of the antibiotic variable are created, again with the c() function, but also making use of the rep() function. The rep() function will repeat the first argument how ever many times you specify in the second argument. Levels of the B12 variable are created in the same way. Finally, all three of these variables are contain in a data frame with the data.frame() function, and the data frame is saved in an object called piglets.

piglets <- data.frame(gain = c(1.3, 1.19, 1.08, 
                               1.05, 1.0, 1.04,
                               1.26, 1.21, 1.19,
                               1.52, 1.56, 1.54), 
                      antibiotic = c(rep("0mg", 3), rep("40mg", 3), rep("0mg", 3), rep("40mg", 3)),
                      B12 = c(rep("0mg", 6), rep("5mg", 6)))

Explore the Data

First we get the cell averages. Note that for whichever variable you list after the |, you will get overall means by group.

favstats(gain ~ B12|antibiotic, data = piglets)

Informal Analysis

Then we look at it visually, creating a parallel dot graph for piglets data.

ggplot(piglets, aes(x = B12, y = gain, shape = antibiotic, color = antibiotic)) +
  geom_point(size = 2)

Or a parallel boxplot.

ggplot(piglets, aes(x = B12, color = antibiotic, y = gain)) + 
  geom_boxplot()

Interaction Graph

Finally, beacuse this is a bf[2] design, we will want to check for an interaction between factor 1 and factor 2. We can create an interaction graph starting with the same code above for the parallel dot graph. We will add a few more aesthetics to the ggplot() function including group = antibiotic and linetype = antibiotic. Once we have added these two aesthetics, we can add a new layer to the plot that will connect the means for each B12 group within each level of antibiotic. We do this with the geom_smooth(method = "lm", se = 0) code.

ggplot(piglets, aes(x = B12, y = new_gain, shape = antibiotic, color = antibiotic,
                    group = antibiotic, 
                    linetype = antibiotic)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = 0)

Bonus: Formal ANOVA

We can also conduct a formal two-way ANOVA for this bf[2] data using the code below. Note the antibiotic * B12 in the formula. This gives us F-ratios for the two main effects as well as the interaction.

mod2 <- lm(gain ~ antibiotic * B12, data = piglets)

anova(mod2)

Homework 5 Problems

For homework 5, complete the following exercises. When you are done, please knit your homework to an HTML file and submit the HTML file on Moodle.

Refer to the walking babies example on page 150 of your textbook for exercises 1-3. Run the code below to get the data into your R Studio environment. Note that an additional data point has been added to the control group 3 to make it a balanced design. For exercises 1-3, you will be conducting an analysis to test if there are differences in age in months when babies first walked depending on the group they were assigned to.

special <- c(9, 9.5, 9.75, 10, 13, 9.5)
control1 <- c(11, 10, 10, 11.75, 10.5, 15)
control2 <- c(11, 12, 9, 11.5, 13.25, 13)
control3 <- c(13.25, 11.5, 12, 13.5, 11.5, 12.5)

walking <- data.frame(special, control1, control2, control3) %>%
  gather(group, age, special:control3)

Create a parallel dot graph of the walking data. What do you notice? What kind of design is this? Also, using favstats(), check the standard deviations between groups (the S assumption). Is the largest SD >3 times as large as the smallest?
Conduct an ANOVA to test the research question. What do you conclude about these baby exercise programs?
Make a histogram of residuals for the model you ran in exercise 2, complete with vlines marking 1 SD above and below the mean. Using this histogram, discuss the validity of the N and Z assumptions.

Refer to the hornworm example on page 208 (problems 1 and 7) of your textbook for exercises 4-5.

Create a data frame in your R Studio environment called hornworms with the data given in problem 7 on page 209 of your textbook. For help, review the discussion above on how the piglet data was entered. What are the units for this data frame? How many observations are there?
Using the data you created in exercise 4, create an interaction graph. Refer to the interaction graph code for the piglet example above. based on your graph, is there a main effect of the first diet? Is there a main effect of the second diet? Is there an interaction of the diet given first and the diet given second?

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was adapted by Randi Garcia from labs originally written by Mark Hansen, further adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.

Resources for learning R and working in RStudio

That was a short introduction to R and RStudio, but we will provide you with more functions and a more complete sense of the language as the course progresses.

In this course we will be using R packages called dplyr for data wrangling and ggplot2 for data visualization. If you are googling for R code, make sure to also include these package names in your search query. For example, instead of googling “scatterplot in R”, google “scatterplot in R with ggplot2”.

These cheatsheets may come in handy throughout the semester:

Chester Ismay has put together a resource for new users of R, RStudio, and R Markdown here. It includes examples showing working with R Markdown files in RStudio recorded as GIFs.

Note that some of the code on these cheatsheets may be too advanced for this course, however majority of it will become useful throughout the semester.