Introduction

So far in lab and in lecture we have been using the qplot() function in the ggplot2 package to make visualizations. But qplot() should really only be used for making quick plots, as the name suggests. There is a far more flexible function in the ggplot2 package: ggplot(). Eventually you will want to learn how to use this function, because you can build all kinds of beautiful graphics with it.

Goal: by the end of this lab, you will be able to use ggplot2 to build several different data visualizations.

Setting up

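Remember: before we can use a package like ggplot2, we have to load it. Later steps in this lab also use filter(), mutate(), and the %>% pipe from the dplyr package, so we load that too:

library(ggplot2)
library(dplyr)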

Why use the ggplot2 package?

ggplot2 has several advantages over lattice graphics and base R graphics:

  • consistent underlying grammar of graphics (Wilkinson, 2005)
  • plot specification at a high level of abstraction
  • very flexible
  • theme system for polishing plot appearance (more on this later)
  • mature and complete graphics system
  • many users, active mailing list

What Is The Grammar Of Graphics?

The big idea: independently specify plot building blocks and combine them to create just about any kind of graphical display you want. Building blocks of a graph include:

  • data
  • aesthetic mappings
  • geometric objects
  • statistical transformations
  • scales
  • coordinate systems
  • position adjustments
  • faceting

Using the ggplot() function in the ggplot2 package, we can specify different parts of the plot, and combine them together using the + operator.
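As a rough sketch of what "combining with +" looks like, the code below builds each piece separately and then adds them together. The toy data frame and the object names (toy, base, pts, x_scale) are made up for illustration and are not part of the housing data used in this lab.

```r
library(ggplot2)

# Made-up data for illustration only (not the housing data)
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 5, 8))

# Each building block can be created on its own...
base    <- ggplot(toy, aes(x = x, y = y))  # data + aesthetic mappings
pts     <- geom_point()                    # geometric object
x_scale <- scale_x_log10()                 # scale

# ...and + combines them into a single plot
p <- base + pts + x_scale

# The combined plot contains exactly one layer: the points
length(p$layers)
```

Printing p draws the plot. Storing the pieces in separate objects is optional; it is done here only to make the individual building blocks visible.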

Example Data: Housing prices

Let’s start by taking a look at some data on housing prices:

housing <- read.csv("http://www.science.smith.edu/~jcrouser/SDS192/landdata-states.csv", header = TRUE, stringsAsFactors = FALSE)
head(housing[1:5])
##   State region    Date Home.Value Structure.Cost
## 1    AK   West 2010.25     224952         160599
## 2    AK   West 2010.50     225511         160252
## 3    AK   West 2009.75     225820         163791
## 4    AK   West 2010.00     224994         161787
## 5    AK   West 2008.00     234590         155400
## 6    AK   West 2008.25     233714         157458

(Data from https://www.lincolninst.edu/subcenters/land-values/land-prices-by-state.asp)

The ggplot() Function

Let’s start with an example. Say we want to make a scatterplot of the relationship between the cost of a structure and the value of the land it sits on. We might use the following qplot() code.

qplot(y = Structure.Cost, x = Land.Value, data = housing)

Now, we would like to make this same plot using the ggplot() function instead of qplot(). It’s helpful to first see what the ggplot() function produces on its own; we will then add to that empty plot.

ggplot(housing)

Notice that when you run that code, it produces a gray rectangle. That is exactly what it should produce! We have not yet told ggplot() which variables we’d like to map to which aesthetics, or which geometric objects we’d like it to draw.

Aesthetic Mappings and Geometric Objects

Aesthetic Mapping (aes)

In ggplot-land, aesthetic means “something you can see”. We map variables in the data to these aesthetics. Examples include:

  • position (i.e., on the x and y axes)
  • color (“outside” color)
  • fill (“inside” color)
  • shape (of points)
  • line type
  • size

For our example, we will map Structure.Cost to the y-axis and map Land.Value to the x-axis.

ggplot(housing, aes(x = Land.Value, y = Structure.Cost))

Now, ggplot() has all of the information that qplot() had, but there are no points! Why? qplot() guesses which geometric object you probably want to draw based on the types of the variables you are mapping, but you have to tell ggplot() what to draw.
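One way to convince yourself of this is to look inside the plot object. The sketch below uses a small made-up data frame (toy_housing, with invented values) in place of the real housing data, and assumes the plot object's mapping and layers components can be read with $.

```r
library(ggplot2)

# Made-up stand-in for the housing data (values are illustrative only)
toy_housing <- data.frame(
  Land.Value     = c(50000, 80000, 120000),
  Structure.Cost = c(150000, 160000, 175000)
)

p <- ggplot(toy_housing, aes(x = Land.Value, y = Structure.Cost))

names(p$mapping)   # the aesthetics that have been mapped so far
length(p$layers)   # 0: no geom has been added, so nothing is drawn

# Adding a geom creates the first layer
p_points <- p + geom_point()
length(p_points$layers)
```

The mappings are already stored in p; what is missing is a layer that tells ggplot() how to draw them.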

Geometric Objects (geom)

Geometric objects or geoms are the actual marks we put on a plot. Examples include:

  • points (geom_point, for scatter plots, dot plots, etc)
  • lines (geom_line, for time series, trend lines, etc)
  • boxplot (geom_boxplot, for, well, boxplots!)
  • … and many more!

A plot must have at least one geom; there is no upper limit. You can add a geom to a plot using the + operator.

You can get a list of available geometric objects by simply typing geom_ in RStudio and waiting for the autocomplete menu. Give it a try.

Finally, we can add (+) geom_point() to our ggplot() statement to reproduce what qplot(y = Structure.Cost, x = Land.Value, data = housing) gave us.

ggplot(housing, aes(x = Land.Value, y = Structure.Cost)) +
  geom_point()

Each type of geom accepts only a subset of all aesthetics—refer to the geom help pages to see what mappings each geom accepts. Aesthetic mappings are set with the aes() function.
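It can help to see the difference between mapping an aesthetic to a variable (inside aes()) and setting it to a fixed value (outside aes()). The following is a minimal sketch with a made-up data frame; toy_housing and its values are invented, though region is a real column in the housing data.

```r
library(ggplot2)

# Made-up stand-in for the housing data (values are illustrative only)
toy_housing <- data.frame(
  Land.Value     = c(50000, 80000, 120000, 200000),
  Structure.Cost = c(150000, 160000, 175000, 190000),
  region         = c("West", "West", "South", "South")
)

# Mapping: color depends on a variable, so it goes inside aes()
p_mapped <- ggplot(toy_housing, aes(x = Land.Value, y = Structure.Cost)) +
  geom_point(aes(color = region))

# Setting: every point gets the same fixed color, so it goes outside aes()
p_fixed <- ggplot(toy_housing, aes(x = Land.Value, y = Structure.Cost)) +
  geom_point(color = "blue")

# layer_data() returns the values ggplot2 computed for a layer
unique(layer_data(p_mapped)$colour)  # one color per region
unique(layer_data(p_fixed)$colour)   # a single color for all points
```

Only the mapped version gets a legend, because only there does the color carry information about the data.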

Points

Now you’re ready to make your first ggplot on your own: a scatterplot.

# filter() here is dplyr's filter(), so dplyr must be loaded
hp2013Q1 <- filter(housing, Date == 2013.25)

  1. Using the smaller hp2013Q1 data created with the code above, make a scatterplot with ggplot() showing the relationship between Structure.Cost (y-axis) and Land.Value (x-axis).

Lines

A plot constructed with ggplot() can have more than one geom. In that case, the mappings established in the ggplot() call are plot defaults that can be added to or overridden. For example, we could add a fitted regression line to our plot with the geom_smooth() function. We will need to pass the method = "lm" and se = FALSE arguments to the function to get the usual regression line without a confidence band.

ggplot(housing, aes(x = Land.Value, y = Structure.Cost)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
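To illustrate how a layer can add to the plot defaults, here is a hedged sketch in which only the point layer gets an extra color mapping. The toy_housing data frame is made up for illustration; the real housing data would work the same way.

```r
library(ggplot2)

# Made-up stand-in for the housing data (values are illustrative only)
toy_housing <- data.frame(
  Land.Value     = c(50000, 80000, 120000, 200000, 90000, 150000),
  Structure.Cost = c(150000, 160000, 175000, 190000, 158000, 181000),
  region         = c("West", "West", "West", "South", "South", "South")
)

p <- ggplot(toy_housing, aes(x = Land.Value, y = Structure.Cost)) +
  geom_point(aes(color = region)) +        # adds color for this layer only
  geom_smooth(method = "lm", se = FALSE)   # inherits only the default x and y

# The smooth layer sees a single group, so one line is fitted to all points
length(unique(layer_data(p, 2)$group))
```

If color = region were moved into the ggplot() call instead, it would become a plot default, and geom_smooth() would fit a separate line for each region.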

Recall that we can get the fitted regression equation by running the following code.

mod <- lm(Structure.Cost ~ Land.Value, data = housing)
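Note that lm() only fits the model; the intercept and slope of the fitted equation can then be read off with coef(). A small sketch on made-up data (toy and toy_mod are illustrative names) where the points lie exactly on the line y = 1 + 2x:

```r
# Made-up data lying exactly on the line y = 1 + 2x
toy <- data.frame(x = c(1, 2, 3, 4), y = c(3, 5, 7, 9))

toy_mod <- lm(y ~ x, data = toy)

# Named vector: "(Intercept)" is the intercept, "x" is the slope
coef(toy_mod)
```

Here coef() recovers an intercept of 1 and a slope of 2 (up to floating-point rounding). Running coef(mod) on the housing model gives the corresponding values for Structure.Cost as a function of Land.Value.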

Now, as an alternative to geom_smooth(), we could save the fitted y-values (the y-hats) in the dataset with the predict() function, and then add the line manually with the geom_line() function.

housing <- housing %>%
  mutate(pred_SC = predict(mod))

ggplot(housing, aes(x = Land.Value, y = Structure.Cost)) + 
  geom_point() +
  geom_line(aes(y = pred_SC), color = "blue")
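The values that predict() returns are simply the intercept plus the slope times each x-value, which is why connecting them with geom_line() traces the regression line. The sketch below checks this on made-up data; toy, toy_mod, and pred_y are illustrative names, and the base-R assignment stands in for the mutate() step.

```r
# Made-up data for illustration only
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 5, 8))

toy_mod <- lm(y ~ x, data = toy)

# Base-R equivalent of the mutate() step: store the fitted values
toy$pred_y <- unname(predict(toy_mod))

# Recompute the fitted values by hand from the coefficients
b <- coef(toy_mod)
by_hand <- b[["(Intercept)"]] + b[["x"]] * toy$x

# The two agree
all.equal(toy$pred_y, by_hand)
```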