SDS/MTH 290 Homework 3

The RStudio Interface

The goal of this lab is to introduce you to R and RStudio, which you’ll be using throughout the course both to learn the statistical concepts discussed in the course and to analyze real data and come to informed conclusions. To clarify which is which: R is the name of the programming language itself and RStudio is a convenient interface.

As the class progresses, you are encouraged to explore beyond what the homework dictates; a willingness to experiment will make you a much better programmer. Before we get to that stage, however, you need to build some basic fluency in R. Let’s begin with the fundamental building blocks of R and RStudio: the interface, reading in data, and basic commands.

Go ahead and launch RStudio. You should see a window that looks like the image shown below.

The panel on the lower left is where the action happens. It’s called the console. Everytime you launch RStudio, it will have the same text at the top of the console telling you the version of R that you’re running. Below that information is the prompt. As its name suggests, this prompt is really a request: a request for a command. Initially, interacting with R is all about typing commands and interpreting the output. These commands and their syntax have evolved over decades (literally) and now provide what many users feel is a fairly natural way to access data and organize, describe, and invoke statistical computations.

The panel in the upper right contains your workspace as well as a history of the commands that you’ve previously entered.

The lower right corner is where you can browse your files, access help, manage packages, etc.

R Packages

R is an open-source programming language, meaning that users can contribute packages that make our lives easier, and we can use them for free. In this homework, and many others in the future, we will use the following R packages:

dplyr: for data wrangling
ggplot2: for data visualization
oilabs: for data and custom functions with the OpenIntro labs

If these packages are not already available in your R environment, install them by typing the following lines of code into the console of your RStudio session, pressing the enter/return key after each one. Note that you can check to see which packages (and which versions) are installed by inspecting the Packages tab in the lower right panel of RStudio.

install.packages("dplyr")
install.packages("ggplot2")

#Installing oilabs takes a bit more code. Type the following 3 lines.
install.packages("devtools")
library(devtools)
install_github("OpenIntroOrg/oilabs")

Next, you need to load these packages in your working environment. We do this with the library() function. Run the following three lines in your console.

library(dplyr)
library(ggplot2)
library(oilabs)

Note that you only need to install packages once, but you need to load then with the library() function each time you relaunch RStudio.

Creating a reproducible lab report

We will be using R Markdown to create reproducible homework reports. See the following videos describing why and how:

Why use R Markdown for Lab Reports?

Using R Markdown for Lab Reports in RStudio

Going forward you should refrain from typing your code directly in the console, and instead type any code (final correct answer, or anything you’re just trying out) in the R Markdown file and run the chunk using either the Run button on the chunk (green sideways triangle) or by highlighting the code and clicking Run on the top right corner of the R Markdown editor. If at any point you need to start over, you can Run All Chunks above the chunk you’re working in by clicking on the down arrow in the code chunk.

Using R Markdown

Remember that we will be using R Markdown to create reproducible homework reports. See the following video describing how to get started with creating these reports for this homework, and all future homeworks involving R:

Basic R Markdown with an OpenIntro Lab

Informal ANOVA Example

The Yarn Breaks Data

To demonstrate informal ANOVA, we will use the yarn breaks data set, called warpbreaks. (To roll with the “knitting” theme!). Convieniently, this data is already loaded into R. It exists in the background. To look at it, simply run the following command.

View(warpbreaks)

There are two structural factors: wool type (2 levels: A and B), and tension (3 levels: L, M, and H). L, M, and H, corresponds to “low”, “medium”, and “high”, respectively. These two factors are crossed, creating 6 conditions, or cells. There are 9 looms for each of the 6 conditions, for a total of 54 looms tested. The response variable is called breaks. This is the number of breaks for each loom.

You can read more about the data set using the following command:

?warpbreaks

What kind of design is this? A two-way basic factorial design, BF[2].

If you would like to get it into your enviroment, use the data function. It will first show up as <Promise>, but as soon as you start using it, it will appear as a data frame.

data(warpbreaks)

Paralell Dot Graph

Paralell Dot Graphs are used to get a sense of the variability within and bewteen the six conditions. We will use the ggplot2 package to make our graph.

ggplot(warpbreaks, aes(x = tension, y = breaks, color = wool)) +
  geom_point()

Alternatively, you can make a side-by-side boxplot to view the within and between variability. Each type of visualization has its pros and cons.

ggplot(warpbreaks, aes(x = tension, y = breaks, color = wool)) +
  geom_boxplot()

As can be seen in both figures, the most breaks occur for the looms using low tension with wool type A. However, it also looks like there is more variability for this condition. This is a potential violation of the Fisher assumtion of the Same standard deviations (S). Further, the largest variation goes with the group that has the largest median.

Let’s compute some summary statistics to get a better idea of spreads for each of the six conditions.

 sum_stats <- warpbreaks %>%
  group_by(tension, wool) %>%
  summarise(mean = mean(breaks),
            sd = sd(breaks),
            min = min(breaks),
            max = max(breaks),
            range = max - min) %>%
  arrange(range)

 sum_stats

Next, we can check how many times bigger the largest spread is than the smallest spread. We can use the range, and standard deviation as our measures of spread with the following two lines of code, respectively.

sum_stats[6,7]/sum_stats[1,7]

sum_stats[6,4]/sum_stats[1,4]

The range is 3 times as big and the standard deviation is 3.70 times as big.

Homework 3 Problems

For homework 3, complete the following exercises. When you are done, please knit your homework to an HTML file and submit the HTML file on Moodle.

List the structural factors for the data set in Fig. 2.12 (above). No need to draw the factor diagram.
Create a parallel dot graph, and a side-by-side boxplot for the data in Fig 2.12 (run the code provided below to get the data into R).

hamster <- rep(c(4, 1, 7, 2, 3, 8, 5, 6), 2)
day_length <- c(rep("long", 4), rep("short", 4), rep("long", 4), rep("short", 4))
organ <- c(rep("heart", 8), rep("brain", 8))
enzyme <- c(246, 169, 186, 183, 468, 390, 314, 508, 394, 297, 216, 373, 216, 198, 194, 192)

hamsters <- data.frame(hamster, day_length, organ, enzyme)

Compute the average and standard deviation using the same four groups as in Section 3 (Table 2.2) and comment on what you find. Are the four ranges roughly equal? Does the range seem to be related to the average by a pattern?
I find it helps me remember the Fisher assumptions if I remember the nonsense word CA-SINZ, California SINZ. Each letter stands for an assumption: Constant effects, Additivity, Same standard deviation, Independence, Normality, and Zero average for errors. List the four letters for assumptions about residuals.
Explain how the assumption of constant effects (C) is part of the justification for computing averages.

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was adapted by Randi Garcia from labs originally written by Mark Hansen, further adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.

Resources for learning R and working in RStudio

That was a short introduction to R and RStudio, but we will provide you with more functions and a more complete sense of the language as the course progresses.

In this course we will be using R packages called dplyr for data wrangling and ggplot2 for data visualization. If you are googling for R code, make sure to also include these package names in your search query. For example, instead of googling “scatterplot in R”, google “scatterplot in R with ggplot2”.

These cheatsheets may come in handy throughout the semester:

Chester Ismay has put together a resource for new users of R, RStudio, and R Markdown here. It includes examples showing working with R Markdown files in RStudio recorded as GIFs.

Note that some of the code on these cheatsheets may be too advanced for this course, however majority of it will become useful throughout the semester.