Introduction

So far in lab and in lecture we have been using the qplot() function in the ggplot2 package to make visualizations. But qplot() should really only be used for making quick plots, as the name suggests. There is a far more flexible function in the ggplot2 package: ggplot(). Eventually you will want to learn how to use this function, because you can build all kinds of beautiful graphics with it.

Goal: by the end of this lab, you will be able to use ggplot2 to build several different data visualizations.

Setting up

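Remember: before we can use a package like ggplot2, we have to load it. Later in this lab we also use filter(), mutate(), and the pipe (%>%) from dplyr, so we load that package too:

library(ggplot2)
library(dplyr)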

Why use the ggplot2 package?

Advantages of ggplot2 over lattice graphics or base R graphics:

  • consistent underlying grammar of graphics (Wilkinson, 2005)
  • plot specification at a high level of abstraction
  • very flexible
  • theme system for polishing plot appearance (more on this later)
  • mature and complete graphics system
  • many users, active mailing list

What Is The Grammar Of Graphics?

The big idea: independently specify plot building blocks and combine them to create just about any kind of graphical display you want. Building blocks of a graph include:

  • data
  • aesthetic mappings
  • geometric objects
  • statistical transformations
  • scales
  • coordinate systems
  • position adjustments
  • faceting

Using the ggplot() function in the ggplot2 package, we can specify different parts of the plot, and combine them together using the + operator.

Example Data: Housing prices

Let’s start by taking a look at some data on housing prices:

housing <- read.csv("http://www.science.smith.edu/~jcrouser/SDS192/landdata-states.csv", header = TRUE, stringsAsFactors = FALSE)
head(housing[1:5])
##   State region    Date Home.Value Structure.Cost
## 1    AK   West 2010.25     224952         160599
## 2    AK   West 2010.50     225511         160252
## 3    AK   West 2009.75     225820         163791
## 4    AK   West 2010.00     224994         161787
## 5    AK   West 2008.00     234590         155400
## 6    AK   West 2008.25     233714         157458

(Data from https://www.lincolninst.edu/subcenters/land-values/land-prices-by-state.asp)

The ggplot() Function

Let’s start with an example. Say we want to make a scatterplot of the relationship between the cost of a structure and the value of the land it sits on. We might use the following qplot() code.

qplot(y = Structure.Cost, x = Land.Value, data = housing)

Now, we would like to make this same plot using the ggplot() function instead of qplot(). It’s helpful to first see what the ggplot() function produces on its own; then we will add to that empty plot.

ggplot(housing)

Notice that when you run that code, it produces a gray rectangle. That is exactly what it should produce! We have not yet told ggplot() which variables we’d like to map to which aesthetics, or which geometric objects we’d like it to draw.
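One way to confirm that the plot really is empty is to inspect the plot object itself. The sketch below uses a small made-up data frame (toy, a hypothetical stand-in for the housing data) and looks at the layers element of the plot object. Peeking at this element is only an illustration, not something you need in everyday plotting.

```r
library(ggplot2)

# Hypothetical toy data standing in for the housing data
toy <- data.frame(Land.Value = c(10, 20, 30),
                  Structure.Cost = c(15, 22, 41))

# Data only: no aesthetic mappings and no geoms yet
p_empty <- ggplot(toy)
n_layers_empty <- length(p_empty$layers)

# Adding a geom with + adds one layer to the plot
p_points <- p_empty +
  geom_point(aes(x = Land.Value, y = Structure.Cost))
n_layers_points <- length(p_points$layers)

n_layers_empty    # number of layers before adding a geom
n_layers_points   # number of layers after adding geom_point()
```

A plot with no layers has nothing to draw, which is why only the gray background appears.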

Aesthetic Mappings and Geometric Objects

Aesthetic Mapping (aes)

In ggplot-land, aesthetic means “something you can see”. And we want to map variables to these different aesthetics. Examples include:

  • position (i.e., on the x and y axes)
  • color (“outside” color)
  • fill (“inside” color)
  • shape (of points)
  • line type
  • size

For our example, we will map Structure.Cost to the y-axis and map Land.Value to the x-axis.

ggplot(housing, aes(x = Land.Value, y = Structure.Cost))

Now, ggplot() has all of the information that qplot() had, but there are no points! Why? qplot() guesses which geometric object you probably want to draw based on the types of the variables you are mapping, but you have to tell ggplot() what to draw.

Geometric Objects (geom)

Geometric objects or geoms are the actual marks we put on a plot. Examples include:

  • points (geom_point, for scatter plots, dot plots, etc)
  • lines (geom_line, for time series, trend lines, etc)
  • boxplot (geom_boxplot, for, well, boxplots!)
  • … and many more!

A plot must have at least one geom; there is no upper limit. You can add a geom to a plot using the + operator.

You can get a list of available geometric objects by simply typing geom_ in RStudio and waiting for the autocomplete list. Give it a try.

Finally, we can add (+) geom_point() to our ggplot() statement to reproduce what qplot(y = Structure.Cost, x = Land.Value, data = housing) gave us.

ggplot(housing, aes(x = Land.Value, y = Structure.Cost)) +
  geom_point()

Each type of geom accepts only a subset of all aesthetics—refer to the geom help pages to see what mappings each geom accepts. Aesthetic mappings are set with the aes() function.
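Besides the help pages, the geom objects themselves record which aesthetics they understand. The sketch below inspects GeomPoint, the object behind geom_point(). Reading these fields directly is just one possible way to explore; the help page (?geom_point) remains the authoritative list.

```r
library(ggplot2)

# Aesthetics that geom_point() cannot draw without
required_point <- GeomPoint$required_aes

# Optional aesthetics, each of which has a default value
optional_point <- names(GeomPoint$default_aes)

required_point
optional_point
```

Note that ggplot2 stores the color aesthetic under the spelling colour; both spellings are accepted when you write your own code.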

Points

Now you’re ready to make your first ggplot on your own: a scatterplot.

hp2013Q1 <- filter(housing, Date == 2013.25)

  1. Using the smaller hp2013Q1 data created with the code above, make a scatterplot with ggplot() showing the relationship between Structure.Cost (y-axis) and Land.Value (x-axis).

Lines

A plot constructed with ggplot() can have more than one geom. In that case, the mappings established in the ggplot() call are plot defaults that can be added to or overridden. For example, we could add a fitted regression line to our plot with the geom_smooth() function. We need to pass the method = "lm" and se = FALSE arguments to get the usual regression line without a confidence band.

ggplot(housing, aes(x = Land.Value, y = Structure.Cost)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Recall that we can get the fitted regression equation by running the following code.

mod <- lm(Structure.Cost ~ Land.Value, data = housing)

Now, as an alternative to geom_smooth(), we could save the fitted y-values (the y-hats) into the dataset with the predict() function, and then add the line manually with the geom_line() function.

housing <- housing %>%
  mutate(pred_SC = predict(mod))
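A caveat on this approach: predict(mod) with no other arguments returns one fitted value per row that lm() actually used, so it only lines up with the data when no rows were dropped for missing values. The sketch below uses a made-up data frame (toy, with one missing value) to show the difference, and passes newdata as one possible way to get a prediction for every row. Whether the housing data has missing values in these two columns is not checked here.

```r
# Hypothetical toy data with one missing Land.Value
toy <- data.frame(Land.Value = c(10, 20, NA, 40),
                  Structure.Cost = c(12, 19, 33, 41))

toy_mod <- lm(Structure.Cost ~ Land.Value, data = toy)

# lm() drops the incomplete row, so fewer fitted values come back
n_default <- length(predict(toy_mod))

# With newdata, predict() returns one value per row (NA where x is missing)
toy$pred_SC <- predict(toy_mod, newdata = toy)

# The predictions are simply intercept + slope * x
by_hand <- coef(toy_mod)[1] + coef(toy_mod)[2] * toy$Land.Value

n_default
toy
```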

ggplot(housing, aes(x = Land.Value, y = Structure.Cost)) + 
  geom_point() +
  geom_line(aes(y = pred_SC), color = "blue")

You can change colors in ggplot() simply by typing the name of another color; running colors() in R lists the built-in color names. Note that variables are mapped to aesthetics inside the aes() function, while fixed aesthetics are set outside the aes() call.
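The difference between mapping and setting an aesthetic can be sketched with a small made-up data frame (toy, with hypothetical values). layer_data() is used here only to peek at the colors ggplot2 ends up drawing.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(Land.Value = c(10, 20, 30, 40),
                  Structure.Cost = c(12, 19, 33, 41),
                  region = c("West", "West", "South", "South"))

base <- ggplot(toy, aes(x = Land.Value, y = Structure.Cost))

# Mapped: color varies with a variable, so it goes inside aes()
p_mapped <- base + geom_point(aes(color = region))

# Fixed: every point gets the same color, so it goes outside aes()
p_fixed <- base + geom_point(color = "blue")

colors_mapped <- unique(layer_data(p_mapped)$colour)  # one color per region
colors_fixed  <- unique(layer_data(p_fixed)$colour)   # a single fixed color
```

Writing aes(color = "blue") instead would treat "blue" as a variable with a single value, so ggplot2 would choose its own color for it and draw a legend.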

Lastly, we could use the geom_abline() function to add a line with a specific intercept and slope. The code coef(mod)[1] gives us the intercept of the linear model, and coef(mod)[2] gives us the slope. These estimates could also have been entered into geom_abline() manually.

ggplot(housing, aes(x = Land.Value, y = Structure.Cost)) + 
  geom_point() +
  geom_abline(intercept = coef(mod)[1], slope = coef(mod)[2])

  2. Using geom_smooth(), add a fitted regression line to the scatterplot you made in exercise 1. Next, change the color of the regression line to whatever you want by adding the color = "nameOfcolor" argument to the geom_smooth() function.

Text

Each geom accepts a particular set of mappings. For example, geom_text() accepts a label mapping.

ggplot(housing, aes(x = Land.Value, y = Structure.Cost, label = State)) +
    geom_text(size = 2)

In this case, we could have also specified our label mapping inside of the geom_text() function instead of inside of the ggplot() function.

ggplot(housing, aes(x = Land.Value, y = Structure.Cost)) +
    geom_text(aes(label = State), size = 2)

The Color and Shape Aesthetics

Other aesthetics are mapped in the same way as x and y in the previous example. Here we map a third variable, region, to color; it could be mapped to shape in the same way. This turns our bivariate scatterplot into a multivariate data visualization.

ggplot(housing, aes(x = Land.Value, y = Structure.Cost)) +
    geom_point(aes(color = region))

Piping into ggplot()

We could remove the missing values from the region guide by first filtering them out. Also notice in this example that we can pipe data directly into the ggplot() function, which means that we can use all of our dplyr functions to process the data before we create a visualization.

housing %>%
  filter(!is.na(region)) %>%
  ggplot(aes(x = Land.Value, y = Structure.Cost)) +
    geom_point(aes(color = region))

As an alternative, we can change the missing region information for DC to the value “DC” in the data.

housing %>%
  mutate(region = ifelse(is.na(region), "DC", region)) %>%
  ggplot(aes(x = Land.Value, y = Structure.Cost)) +
    geom_point(aes(color = region))
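The two ways of handling the missing region can be compared side by side on a small made-up data frame (toy, with hypothetical rows) before any plotting happens:

```r
library(dplyr)

# Hypothetical toy data: one row has a missing region, as DC does above
toy <- data.frame(State = c("AK", "CA", "DC", "TX"),
                  region = c("West", "West", NA, "South"),
                  stringsAsFactors = FALSE)

# Option 1: drop the rows whose region is missing
toy_dropped <- toy %>%
  filter(!is.na(region))

# Option 2: keep every row and recode the missing region as "DC"
toy_recoded <- toy %>%
  mutate(region = ifelse(is.na(region), "DC", region))

nrow(toy_dropped)    # one row fewer than toy
toy_recoded$region   # no missing values left
```

The ifelse() recode shown here assumes region is stored as character text, which is what stringsAsFactors = FALSE gave us when reading the data.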

If we map a continuous variable to color, we get a color scale.

ggplot(housing, aes(x = Land.Value, y = Structure.Cost)) +
    geom_point(aes(color = Home.Value))

  3. Create a plot with your smaller data set that uses the text geometric object (geom_text()) instead of points, and that maps region to the text color. Use size = 3 for your text and also remove the rows with a missing region. Hint: use the filter() function from dplyr to filter out the missing values and then pipe into ggplot().

Statistical Transformations

Some plot types (such as scatterplots) do not require transformations: each point is plotted at x and y coordinates equal to the original value. Other plots, such as boxplots, histograms, and prediction lines, require statistical transformations:

  • for a boxplot, the y values must be transformed to the median, the quartiles, and whiskers based on 1.5(IQR)
  • for a smoother, the y values must be transformed into predicted values

Each geom has a default statistic, but these can be changed. For example, the default statistic for geom_histogram() is stat_bin:

args(geom_histogram)
args(stat_bin)

Setting Statistical Transformation Arguments

Arguments to stat_ functions can be passed through geom_ functions. This can be slightly annoying because, to change a setting, you first have to determine which stat the geom uses and then find the arguments to that stat.

For example, here is the default histogram of Home.Value:

p2 <- ggplot(housing, aes(x = Home.Value))

p2 + geom_histogram()

The default binning looks reasonable, but we can change it by passing the bins argument (the number of bins) through geom_histogram() to the stat_bin function:

p2 + geom_histogram(stat = "bin", bins = 160)
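Bins can be controlled either by their number (bins) or by their width (binwidth). The sketch below compares the two on made-up values (toy, 200 hypothetical home values drawn at random) and uses layer_data() to look at the counts stat_bin computed. The specific numbers 5 and 10000 are arbitrary choices for illustration.

```r
library(ggplot2)

# Hypothetical toy data: 200 made-up home values
set.seed(1)
toy <- data.frame(Home.Value = runif(200, min = 50000, max = 500000))

p <- ggplot(toy, aes(x = Home.Value))

coarse <- p + geom_histogram(bins = 5)          # choose the number of bins
fine   <- p + geom_histogram(binwidth = 10000)  # or choose the width of each bin

# layer_data() returns one row per bin, with the count in each
coarse_bins <- layer_data(coarse)
fine_bins   <- layer_data(fine)

nrow(coarse_bins)
nrow(fine_bins)
```

Either way, every observation lands in exactly one bin, so the counts always add up to the number of rows in the data.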

Changing The Statistical Transformation

Sometimes the default statistical transformation is not what you need. This is often the case with pre-summarized data. For example, let’s find the average home value in each state:

housing_means <- housing %>%
  group_by(State) %>%
  summarise(mean_HV = mean(Home.Value))      

head(housing_means)
## # A tibble: 6 x 2
##   State   mean_HV
##   <chr>     <dbl>
## 1    AK 147385.14
## 2    AL  92545.22
## 3    AR  82076.84
## 4    AZ 140755.59
## 5    CA 282808.08
## 6    CO 158175.99

And now we plot!

ggplot(housing_means, aes(x = State)) + 
    geom_bar()

Uh oh… what went wrong?

In the above example, we took data that were already summarized and asked ggplot() to summarize them again (remember, geom_bar() defaults to stat = "count"), so every bar simply counts the one row per state. We can fix this by telling geom_bar() to use a different statistical transformation function:

ggplot(housing_means, aes(x = State, y = mean_HV)) + 
  geom_bar(stat = "identity")
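What each version computes can be checked on a small made-up summary table (toy_means, with hypothetical values). The sketch also shows geom_col(), which ggplot2 provides as shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)

# Hypothetical pre-summarized data: one row per state
toy_means <- data.frame(State = c("AK", "AL", "AR"),
                        mean_HV = c(150000, 90000, 80000))

# Default stat ("count") counts rows per state, so every bar has height 1
p_count <- ggplot(toy_means, aes(x = State)) +
  geom_bar()

# stat = "identity" uses the y values exactly as they are
p_identity <- ggplot(toy_means, aes(x = State, y = mean_HV)) +
  geom_bar(stat = "identity")

# geom_col() is shorthand for geom_bar(stat = "identity")
p_col <- ggplot(toy_means, aes(x = State, y = mean_HV)) +
  geom_col()

heights_count    <- layer_data(p_count)$count
heights_identity <- layer_data(p_identity)$y
heights_col      <- layer_data(p_col)$y
```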

Bonus: To rotate the State labels on the x-axis, we can use the axis.text.x argument inside of the theme() function. I found this information by googling “rotating x-axis labels ggplot2”.

ggplot(housing_means, aes(x = State, y = mean_HV)) + 
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90))

  4. Make a bar chart of average Land.Value by region. You should change the missing values on the region variable to “DC”. Hint: before piping into group_by(), pipe into the mutate() function, and use the ifelse() code from above.

housing %>%
  mutate(region = ifelse(is.na(region), "DC", region)) %>%
  group_by(region) %>%
  summarise(mean_LV = mean(Land.Value)) %>%
  ggplot(aes(x = region, y = mean_LV)) +
    geom_bar(stat = "identity")

Faceting

  • Faceting in ggplot2 creates separate graphs (small multiples) for subsets of the data
  • ggplot2 offers two functions for creating small multiples:
    • facet_wrap(): define subsets as the levels of a single grouping variable
    • facet_grid(): define subsets as the crossing of two grouping variables
  • Facilitates comparison among plots, not just of geoms within a plot

Example: what is the trend in housing prices in each state?

Let’s start by using a technique we already know—mapping State to color. We’ll do this for the states in the “West” region only.

West <- housing %>%
  filter(region == "West")
ggplot(West, aes(x = Date, y = Home.Value)) + 
  geom_line(aes(color = State))  

There are two problems here–there are too many states to distinguish each one by color, and the lines obscure one another.

Faceting to the rescue!

We can fix the previous plot by faceting by state rather than mapping state to color:

ggplot(West, aes(x = Date, y = Home.Value)) + 
  geom_line() +
  facet_wrap(~State)

There is also a facet_grid() function for faceting in two dimensions.
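As a sketch of two-dimensional faceting, the code below uses a made-up data frame (toy) with a hypothetical period variable that is not part of the housing data. The formula given to facet_grid() reads rows ~ columns.

```r
library(ggplot2)

# Hypothetical toy data: two states, each observed in two periods
toy <- data.frame(State = rep(c("AK", "CA"), each = 4),
                  period = rep(c("early", "late"), times = 4),
                  Date = rep(c(2000, 2010, 2001, 2011), times = 2),
                  Home.Value = c(100, 150, 110, 160, 200, 300, 220, 310))

# One row of panels per state, one column of panels per period
p_grid <- ggplot(toy, aes(x = Date, y = Home.Value)) +
  geom_line() +
  facet_grid(State ~ period)

# Each combination of State and period gets its own panel
n_panels <- length(unique(layer_data(p_grid)$PANEL))
n_panels
```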


On your own

  1. Starting with the plot you created in exercise 4, add a nice y-axis label and an x-axis label that capitalizes region. Also add the title “Mean Land Value by Region.” Try googling “adding axis labels ggplot2 stack overflow” and see the top answer.

  2. Start with the plot you created in exercise 3. Instead of mapping region to color, re-create this visualization with a separate facet for each region. Hint: use the facet_wrap() function.

This lab is based on the “Introduction to R Graphics with ggplot2” workshop, which is a product of the Data Science Services team at Harvard University. The original source is released under a Creative Commons Attribution-ShareAlike 4.0 Unported license. This lab was adapted for SDS192: Introduction to Data Science in Spring 2017 by R. Jordan Crouser at Smith College and then further adapted for SDS201: Statistical Methods for Undergraduate Research by Randi L. Garcia at Smith College.