Introduction

So far we have used the qplot() function in the ggplot2 package to make visualizations. But the package also has a far more flexible function, ggplot(). Eventually you will want to learn how to use this function, because you can build all kinds of beautiful graphics with it.

Goal: by the end of this lab, you will be able to use ggplot2 to build several different data visualizations.

Setting up

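Remember: before we can use a package like ggplot2, we have to load it. We also load dplyr, which provides the filter() function and the %>% pipe used later in this lab:

library(ggplot2)
library(dplyr)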

Why use the ggplot2 package?

  • consistent underlying grammar of graphics (Wilkinson, 2005)
  • plot specification at a high level of abstraction
  • very flexible
  • theme system for polishing plot appearance (more on this later)
  • mature and complete graphics system
  • many users, active mailing list

What Is The Grammar Of Graphics?

The big idea: independently specify plot building blocks and combine them to create just about any kind of graphical display you want. Building blocks of a graph include:

  • data
  • aesthetic mappings
  • geometric objects
  • statistical transformations
  • scales
  • coordinate systems
  • position adjustments
  • faceting

Using the ggplot() function in the ggplot2 package, we can specify different parts of the plot, and combine them together using the + operator.
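
To make the idea of combining building blocks concrete, here is a minimal sketch. It assumes a small made-up data frame called toy (the values are invented for illustration and are not from the housing data), and it picks a log scale and a facet just to show several blocks being added with +.

```r
library(ggplot2)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  Land.Value     = c(30000, 55000, 120000, 240000),
  Structure.Cost = c(110000, 130000, 150000, 190000),
  region         = c("South", "Midwest", "West", "West")
)

# Each building block is specified on its own and combined with +
p <- ggplot(toy, aes(x = Land.Value, y = Structure.Cost)) +  # data + aesthetic mappings
  geom_point() +                                             # geometric object
  scale_x_log10() +                                          # scale
  facet_wrap(~region)                                        # faceting

p
```

Removing or swapping any one line changes only that part of the plot, which is the main appeal of specifying the pieces independently.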

Example Data: Housing prices

Let’s start by taking a look at some data on housing prices:

housing <- read.csv("http://www.science.smith.edu/~jcrouser/SDS192/landdata-states.csv",
                    header = TRUE, stringsAsFactors = FALSE)
head(housing[1:5])
##   State region    Date Home.Value Structure.Cost
## 1    AK   West 2010.25     224952         160599
## 2    AK   West 2010.50     225511         160252
## 3    AK   West 2009.75     225820         163791
## 4    AK   West 2010.00     224994         161787
## 5    AK   West 2008.00     234590         155400
## 6    AK   West 2008.25     233714         157458

The ggplot() Function

Let’s start with an example. Say we want to make a scatterplot of the relationship between the cost of a structure and the value of the land it sits on. We might use the following qplot() code.

qplot(y = Structure.Cost, x = Land.Value, data = housing)

Now, we would like to make this same plot using the ggplot() function instead of qplot(). It’s helpful to first see what the ggplot() function produces on its own; we will then add to that empty plot.

ggplot(housing)

Notice that when you run that code, it produces a gray rectangle. That is exactly what it should produce! We have not yet told ggplot() which variables we’d like to map to which aesthetics, or which geometric objects we’d like it to draw.

Aesthetic Mappings and Geometric Objects

Aesthetic Mapping (aes)

In ggplot-land, aesthetic means “something you can see”. And we want to map variables to these different aesthetics. Examples include:

  • position (i.e., on the x and y axes)
  • color (“outside” color)
  • fill (“inside” color)
  • shape (of points)
  • line type
  • size

For our example, we will map Structure.Cost to the y-axis and map Land.Value to the x-axis.

ggplot(housing, aes(x = Land.Value, y = Structure.Cost))

Now, ggplot() has all of the information that qplot() had, but there are no points! Why? qplot() guesses which geometric object you probably want to draw based on the types of the variables you are mapping, but you have to tell ggplot() what to draw.
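
One way to see why nothing is drawn is to look inside the plot object. The sketch below assumes a small made-up data frame called toy (invented values) and inspects the layers element of the plot, which holds the geoms that have been added so far.

```r
library(ggplot2)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  Land.Value     = c(30000, 55000, 120000),
  Structure.Cost = c(110000, 130000, 150000)
)

# Data and mappings only: the plot has no layers, so there is nothing to draw
p_empty <- ggplot(toy, aes(x = Land.Value, y = Structure.Cost))
length(p_empty$layers)

# Adding a geom adds a layer, and the points appear
p_points <- p_empty + geom_point()
length(p_points$layers)
```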

Geometric Objects (geom)

Geometric objects or geoms are the actual marks we put on a plot. Examples include:

  • points (geom_point, for scatter plots, dot plots, etc)
  • lines (geom_line, for time series, trend lines, etc)
  • boxplot (geom_boxplot, for, well, boxplots!)
  • … and many more!

A plot must have at least one geom; there is no upper limit. You can add a geom to a plot using the + operator.
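
As a quick illustration of using more than one geom, the sketch below draws both lines and points from the same mappings. It assumes a small made-up data frame called toy; the values are invented and only the column names echo the housing data.

```r
library(ggplot2)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  Date       = c(2000, 2005, 2010, 2015),
  Home.Value = c(150000, 210000, 180000, 230000)
)

# Two geoms share the same data and aesthetic mappings
p <- ggplot(toy, aes(x = Date, y = Home.Value)) +
  geom_line() +   # connect the observations in order of Date
  geom_point()    # mark each individual observation

p
```

Geoms are drawn in the order they are added, so here the points sit on top of the line.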

You can get a list of available geometric objects by simply typing geom_ in RStudio and waiting for the autocomplete menu. Give it a try.
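
If you are not working in RStudio, one possible alternative is to list the geom functions from the console. This sketch only assumes that ggplot2 has been loaded with library(); the variable name geoms is just for illustration.

```r
library(ggplot2)

# List every object exported by ggplot2 whose name starts with "geom_"
geoms <- ls("package:ggplot2", pattern = "^geom_")

head(geoms)     # the first few geom functions
length(geoms)   # how many there are in your installed version
```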

Finally, we can add (+) geom_point() to our ggplot() statement to reproduce what qplot(y = Structure.Cost, x = Land.Value, data = housing) gave us.

ggplot(housing, aes(x = Land.Value, y = Structure.Cost)) +
  geom_point()

Each type of geom accepts only a subset of all aesthetics—refer to the geom help pages to see what mappings each geom accepts. Aesthetic mappings are set with the aes() function.
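
Besides the help pages, you can also ask a geom which aesthetics it understands. The sketch below uses the GeomPoint and GeomLine objects that ggplot2 exports; treat it as a convenience for exploring, not a replacement for reading ?geom_point.

```r
library(ggplot2)

# Aesthetics that geom_point() cannot work without
GeomPoint$required_aes

# Every aesthetic geom_point() understands
point_aes <- GeomPoint$aesthetics()
point_aes

# geom_line() understands a different set; compare the two vectors
line_aes <- GeomLine$aesthetics()
line_aes
```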

Points

Now, you’re ready to make your first ggplot: a scatterplot.

hp2013Q1 <- filter(housing, Date == 2013.25)

  1. Using the smaller hp2013Q1 data created with the code above, make a scatterplot with ggplot() showing the relationship between Structure.Cost (y-axis) and Land.Value (x-axis).

Adding a Fixed Color

You can set a fixed color for a geom by passing the color argument to the geom function, with the name of any allowed color in double quotes.

ggplot(housing, aes(x = Land.Value, y = Structure.Cost)) +
  geom_point(color = "red")

You can change colors in ggplot() simply by typing the name of another color. To see all of the color names R recognizes, run colors() in the console.
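
Here is a small sketch of how you might check a color name before using it. It assumes a made-up data frame called toy (invented values); "steelblue" and the hex code are arbitrary choices for illustration.

```r
library(ggplot2)

# All color names that base R recognizes
all_colors <- colors()
head(all_colors)

# Check that a name is valid before using it
"steelblue" %in% all_colors

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  Land.Value     = c(30000, 55000, 120000),
  Structure.Cost = c(110000, 130000, 150000)
)

# A named color
p_named <- ggplot(toy, aes(x = Land.Value, y = Structure.Cost)) +
  geom_point(color = "steelblue")

# Hex codes are accepted as well
p_hex <- ggplot(toy, aes(x = Land.Value, y = Structure.Cost)) +
  geom_point(color = "#D55E00")
```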

  2. Change the color of the points in the scatterplot you made in exercise 1 to whatever you want by adding the color = "nameOfcolor" argument to the geom_point() function.

Mapping Variables to Color

Other aesthetics are mapped in the same way as x and y in the previous example. We can map a third variable region to color. This makes our bivariate scatterplot into a multivariate data visualization.

ggplot(housing, aes(x = Land.Value, y = Structure.Cost, color = region)) +
    geom_point()
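
It can help to compare mapping a variable to color with setting a fixed color side by side. The sketch below assumes a small made-up data frame called toy (invented values). The key difference is whether color appears inside aes() or as a plain argument to the geom.

```r
library(ggplot2)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  Land.Value     = c(30000, 55000, 120000, 240000),
  Structure.Cost = c(110000, 130000, 150000, 190000),
  region         = c("South", "Midwest", "West", "West")
)

# Mapping: color depends on a variable, so it goes inside aes() and a legend is drawn
p_mapped <- ggplot(toy, aes(x = Land.Value, y = Structure.Cost, color = region)) +
  geom_point()

# Setting: color is one fixed value, so it goes outside aes() and there is no legend
p_fixed <- ggplot(toy, aes(x = Land.Value, y = Structure.Cost)) +
  geom_point(color = "red")

# ggplot2 stores the mapped aesthetics on the plot; note the spelling "colour"
names(p_mapped$mapping)
names(p_fixed$mapping)
```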

  3. Adding to the plot you made in exercise 1, create a plot with your smaller data set that maps region to color.

Faceting

  • Faceting in ggplot2 creates separate graphs for subsets of the data
  • ggplot2 offers two functions for creating these subsets:
    • facet_wrap(): define subsets as the levels of a single grouping variable
    • facet_grid(): define subsets as the crossing of two grouping variables
  • Facilitates comparison among plots, not just of geoms within a plot

Example: what is the trend in housing prices in each state?

Let’s start by using a technique we already know—mapping State to color. We’ll do this for the states in the “West” region only.

West <- housing %>%
  filter(region == "West")
ggplot(West, aes(x = Date, y = Home.Value, color = State)) + 
  geom_line()  

There are two problems here: there are too many states to distinguish each one by color, and the lines obscure one another.

Faceting to the rescue!

We can fix the previous plot by faceting by state rather than mapping state to color:

ggplot(West, aes(x = Date, y = Home.Value)) + 
  geom_line() +
  facet_wrap(~State)

There is also a facet_grid() function for faceting in two dimensions.
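
As a rough sketch of what two-dimensional faceting might look like, the example below uses a small made-up data frame called toy (invented values) and a hypothetical period column created just for this illustration. In facet_grid(), the variable on the left of ~ defines the rows and the variable on the right defines the columns.

```r
library(ggplot2)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  State      = rep(c("CA", "OR"), each = 4),
  Date       = rep(c(2000, 2005, 2010, 2015), times = 2),
  Home.Value = c(250000, 480000, 390000, 470000,
                 150000, 240000, 220000, 290000)
)

# A second grouping variable, invented for this example
toy$period <- ifelse(toy$Date < 2010, "before 2010", "2010 onward")

# Rows are states, columns are periods; each panel shows one combination
p <- ggplot(toy, aes(x = Date, y = Home.Value)) +
  geom_line() +
  facet_grid(State ~ period, scales = "free_x")

p
```

The scales = "free_x" argument lets each column keep only its own range of dates; leaving it out gives every panel the same x-axis.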

  4. Start with the plot you created in exercise 3. Instead of mapping region to color, re-create this visualization with a separate facet for each region. Hint: use the facet_wrap() function.

This lab is based on the “Introduction to R Graphics with ggplot2” workshop, which is a product of the Data Science Services team at Harvard University. The original source is released under a Creative Commons Attribution-ShareAlike 4.0 Unported license. This lab was adapted for SDS192: Introduction to Data Science in Spring 2017 by R. Jordan Crouser at Smith College and then further adapted for SDS201: Statistical Methods for Undergraduate Research by Randi L. Garcia at Smith College.