R is an open-source programming language, meaning that users can contribute packages that make our lives easier, and we can use them for free. In this homework, and many others in the future, we will use the following R packages:

`dplyr`

: for data wrangling`ggplot2`

: for data visualization`mosaic`

: for basic statistical computing

You need to load these packages in your working environment. We do this with the `library()`

function. Run the following three lines in your console.

```
library(dplyr)
library(tidyr)
library(ggplot2)
library(mosaic)
```

Note that you only need to *load* then with the `library()`

function each time you relaunch RStudio.

We will be using R Markdown to create reproducible homework reports. Going forward you should refrain from typing your code directly in the console, and instead type any code (final correct answer, or anything you’re just trying out) in the R Markdown file and run the chunk using either the Run button on the chunk (green sideways triangle) or by highlighting the code and clicking Run on the top right corner of the R Markdown editor. If at any point you need to start over, you can Run All Chunks above the chunk you’re working in by clicking on the down arrow in the code chunk.

Remember that we will be using R Markdown to create reproducible homework reports. See the following video describing how to get started with creating these reports for this homework, and all future homeworks involving R:

**Basic R Markdown with an OpenIntro Lab**

Let us start with the BF[1] example of plant trampling. Deputy director of the Pawnee Parks and Rec department, Leslie Knope, needs to know how resistant different vegetative types are to trampling so that the number of visitors can be controlled in sensitive areas. Twenty lanes of a park are established, each .5 m wide and 1.5 m long. These twenty lanes are randomly assigned to five treatments: 0, 25, 75, 200, or 500 walking passes. Each pass consists of a 70-kg individual wearing boots, walking in a natural gait. One year after trampling, the average height of the vegetation along the lanes are measured. The data from this experiment is entered into a data frame using the code below.

```
parks <- data.frame(passes = c(rep(0, 4), rep(25, 4),
rep(75, 4), rep(200, 4),
rep(500, 4)),
height = c(20.7, 15.9, 17.8, 17.6,
12.9, 13.4, 12.7, 9.0,
11.8, 12.6, 11.4, 12.1,
7.6, 9.5, 9.9, 9.0,
7.8, 9.0, 8.5, 6.7)) %>%
mutate(passes = as.factor(passes))
```

Using the `parks`

data frame created from the code above, we can make a parallel dot graph using the `ggplot()`

function from the `ggplot2`

package.

```
ggplot(parks, aes(x = passes, y = height)) +
geom_point()
```

What do you notice? For which treatment do the plants fair the best? We can also look at parallel dot graphs to get a sense of the validity of the S, same standard deviations, Fisher assumption.

If we feel comfortable with that assumption, we can move forward with the formal ANOVA for BF[1] design. We will use the `lm()`

function and save this model in an object, here we name that object `mod`

. Note that inside of the `lm()`

function, the model is given as `response ~ factor`

using the formula notation, `y ~ x`

. The data is also given by typing `data = parks`

. The formula and the data are *arguments* passed to the `lm()`

function. Next, to get the ANOVA decomposition, F-ratios, and p-values for those F-ratios by passing the `anova()`

function our `mod`

object.

```
mod <- lm(height ~ passes, data = parks)
anova(mod)
```

What can we conclude about the effect of foot traffic, passes, on the height of plants?

After running our formal ANOVA, we should look at the model residuals to get a sense of how valid our N assumption is.

```
parks <- parks %>%
mutate(residuals = residuals(mod),
SD = sd(residuals))
```

Now that we have a variable saved in our data frame that contains the residuals, we can visualize these residuals with the `ggplot()`

function. In a layer on top of our histogram, we want to add two vertical lines that cut off 1 SD above and below the mean (of zero). We can accomplish this with the `geom_vline()`

function. The `v`

stands for vertical. The `geom_vline()`

function needs to know the `xintercept`

aesthetic to know where to draw the line. We’ll make these lines blue by adding the `color = "blue"`

argument to `geom_vline()`

.

```
ggplot(parks, aes(x = residuals)) +
geom_histogram(bins = 10) +
geom_vline(aes(xintercept = SD), color = "blue") +
geom_vline(aes(xintercept = -SD), color = "blue")
```

To create a data frame for data from a bf[2] design, we will want 3 variables: 1) the response variable observations, 2) the level for each observation for the first factor, and 3) the level for each observation on the second factor.

The code below creates a data frame for the piglet data. First, notice that observations on the response variable, `gain`

are specified inside of the `c()`

function. The `c`

stands for concatenate, and this concatenate function makes lists in R. Next, levels of the antibiotic variable are created, again with the `c()`

function, but also making use of the `rep()`

function. The `rep()`

function will repeat the first argument how ever many times you specify in the second argument. Levels of the B12 variable are created in the same way. Finally, all three of these variables are contain in a data frame with the `data.frame()`

function, and the data frame is saved in an object called `piglets`

.

```
piglets <- data.frame(gain = c(1.3, 1.19, 1.08,
1.05, 1.0, 1.04,
1.26, 1.21, 1.19,
1.52, 1.56, 1.54),
antibiotic = c(rep("0mg", 3), rep("40mg", 3), rep("0mg", 3), rep("40mg", 3)),
B12 = c(rep("0mg", 6), rep("5mg", 6)))
```

First we get the cell averages. Note that for whichever variable you list after the `|`

, you will get overall means by group.

`favstats(gain ~ B12|antibiotic, data = piglets)`

Then we look at it visually, creating a parallel dot graph for piglets data.

```
ggplot(piglets, aes(x = B12, y = gain, shape = antibiotic, color = antibiotic)) +
geom_point(size = 2)
```

Or a parallel boxplot.

```
ggplot(piglets, aes(x = B12, color = antibiotic, y = gain)) +
geom_boxplot()
```

Finally, beacuse this is a bf[2] design, we will want to check for an interaction between factor 1 and factor 2. We can create an interaction graph starting with the same code above for the parallel dot graph. We will add a few more aesthetics to the `ggplot()`

function including `group = antibiotic`

and `linetype = antibiotic`

. Once we have added these two aesthetics, we can add a new layer to the plot that will connect the means for each B12 group within each level of antibiotic. We do this with the `geom_smooth(method = "lm", se = 0)`

code.

```
ggplot(piglets, aes(x = B12, y = new_gain, shape = antibiotic, color = antibiotic,
group = antibiotic,
linetype = antibiotic)) +
geom_point(size = 2) +
geom_smooth(method = "lm", se = 0)
```

We can also conduct a formal two-way ANOVA for this bf[2] data using the code below. Note the `antibiotic * B12`

in the formula. This gives us F-ratios for the two main effects as well as the interaction.

```
mod2 <- lm(gain ~ antibiotic * B12, data = piglets)
anova(mod2)
```

For homework 5, complete the following exercises. When you are done, please knit your homework to an HTML file and submit the HTML file on Moodle.

Refer to the walking babies example on page 150 of your textbook for exercises 1-3. Run the code below to get the data into your R Studio environment. Note that an additional data point has been added to the control group 3 to make it a balanced design. For exercises 1-3, you will be conducting an analysis to test if there are differences in age in months when babies first walked depending on the group they were assigned to.

```
special <- c(9, 9.5, 9.75, 10, 13, 9.5)
control1 <- c(11, 10, 10, 11.75, 10.5, 15)
control2 <- c(11, 12, 9, 11.5, 13.25, 13)
control3 <- c(13.25, 11.5, 12, 13.5, 11.5, 12.5)
walking <- data.frame(special, control1, control2, control3) %>%
gather(group, age, special:control3)
```

Create a parallel dot graph of the walking data. What do you notice? What kind of design is this? Also, using

`favstats()`

, check the standard deviations between groups (the S assumption). Is the largest SD >3 times as large as the smallest?Conduct an ANOVA to test the research question. What do you conclude about these baby exercise programs?

Make a histogram of residuals for the model you ran in exercise 2, complete with

`vlines`

marking 1 SD above and below the mean. Using this histogram, discuss the validity of the N and Z assumptions.

Refer to the hornworm example on page 208 (problems 1 and 7) of your textbook for exercises 4-5.

Create a data frame in your R Studio environment called

`hornworms`

with the data given in problem 7 on page 209 of your textbook. For help, review the discussion above on how the`piglet`

data was entered. What are the units for this data frame? How many observations are there?Using the data you created in exercise 4, create an interaction graph. Refer to the interaction graph code for the

`piglet`

example above. based on your graph, is there a main effect of the first diet? Is there a main effect of the second diet? Is there an interaction of the diet given first and the diet given second?

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was adapted by Randi Garcia from labs originally written by Mark Hansen, further adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.

That was a short introduction to R and RStudio, but we will provide you with more functions and a more complete sense of the language as the course progresses.

In this course we will be using R packages called `dplyr`

for data wrangling and `ggplot2`

for data visualization. If you are googling for R code, make sure to also include these package names in your search query. For example, instead of googling “scatterplot in R”, google “scatterplot in R with ggplot2”.

These cheatsheets may come in handy throughout the semester:

Chester Ismay has put together a resource for new users of R, RStudio, and R Markdown here. It includes examples showing working with R Markdown files in RStudio recorded as GIFs.

Note that some of the code on these cheatsheets may be too advanced for this course, however majority of it will become useful throughout the semester.