So far in lab and in lecture we have been using the qplot() function in the ggplot2 package to make visualizations. But qplot() should really only be used for making quick plots, as the name suggests. There is a far more flexible function in the ggplot2 package: ggplot(). Eventually you will want to learn how to use this function, because you can build all kinds of beautiful graphics with it.
Goal: by the end of this lab, you will be able to use ggplot2 to build several different data visualizations.
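Remember: before we can use a library like ggplot2, we have to load it. Several later steps also use the dplyr package (for filter(), mutate(), and the %>% pipe), so we load it here too:
library(ggplot2)
library(dplyr)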
Why use the ggplot2 package?
Advantages of ggplot2 over lattice graphics or base R graphics include its theme system for polishing plot appearance (more on this later).
The big idea: independently specify plot building blocks and combine them to create just about any kind of graphical display you want. The building blocks of a graph covered in this lab include the data, aesthetic mappings, geometric objects, statistical transformations, and facets.
Using the ggplot() function in the ggplot2 package, we can specify different parts of the plot, and combine them together using the + operator.
Housing prices
Let’s start by taking a look at some data on housing prices:
housing <- read.csv("http://www.science.smith.edu/~jcrouser/SDS192/landdata-states.csv", header = TRUE, stringsAsFactors = FALSE)
head(housing[1:5])
## State region Date Home.Value Structure.Cost
## 1 AK West 2010.25 224952 160599
## 2 AK West 2010.50 225511 160252
## 3 AK West 2009.75 225820 163791
## 4 AK West 2010.00 224994 161787
## 5 AK West 2008.00 234590 155400
## 6 AK West 2008.25 233714 157458
(Data from https://www.lincolninst.edu/subcenters/land-values/land-prices-by-state.asp)
The ggplot() Function
Let’s start with an example. Say we want to make a scatterplot of the relationship between the cost of a structure and the value of the land it sits on. We might use the following qplot() code.
qplot(y = Structure.Cost, x = Land.Value, data = housing)
Now, we would like to make this same plot using the ggplot() function. Instead of starting with qplot(), we will now be using ggplot(). It’s helpful to see what the ggplot() function produces on its own; then we will add to that empty plot.
ggplot(housing)
Notice that when you run that code, it produces a gray rectangle. That is exactly what it should produce! We have not yet told ggplot() which variables we’d like to map to which aesthetics, or which geometric objects we’d like it to draw.
Aesthetic Mapping (aes)
In ggplot-land, aesthetic means “something you can see”, and we want to map variables to these different aesthetics. Examples include position (the x- and y-axes), color, shape, size, and text labels.
For our example, we will map Structure.Cost to the y-axis and map Land.Value to the x-axis.
ggplot(housing, aes(x = Land.Value, y = Structure.Cost))
Now, ggplot() has all of the information that qplot() had, but there are no points! Why? qplot() guesses which geometric object you probably want to draw based on the variable types you are mapping, but you will have to tell ggplot() what to draw.
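To see why nothing is drawn yet, it can help to store the plot in an object and look inside it. The sketch below is only an illustration: it uses a tiny made-up data frame (toy, not the housing data) and assumes the plot object keeps its geoms in a list called layers, with one entry per geom added so far.

```r
library(ggplot2)

# Made-up values for illustration only (not the housing data)
toy <- data.frame(Land.Value = c(10, 20, 30),
                  Structure.Cost = c(15, 25, 40))

# Data and aesthetic mappings, but no geom yet
p <- ggplot(toy, aes(x = Land.Value, y = Structure.Cost))
length(p$layers)          # how many layers the plot has so far

# Adding a geom adds a layer, so now there is something to draw
p_points <- p + geom_point()
length(p_points$layers)
```

The first plot has no layers at all, which is why ggplot() can set up the axes but has no marks to put on them.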
Geometric Objects (geom)
Geometric objects, or geoms, are the actual marks we put on a plot. Examples include:
- points (geom_point, for scatter plots, dot plots, etc.)
- lines (geom_line, for time series, trend lines, etc.)
- boxplots (geom_boxplot, for, well, boxplots!)
A plot must have at least one geom; there is no upper limit. You can add a geom to a plot using the + operator.
You can get a list of available geometric objects by simply typing geom_ in RStudio and waiting for the autocomplete suggestions to appear. Give it a try.
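If you are not working in RStudio, one possible way to get a similar list is to search the names that ggplot2 exports. This is a small sketch using base R's getNamespaceExports() and grep(); the variable name geoms is our own.

```r
library(ggplot2)

# All functions exported by ggplot2 whose names start with "geom_"
geoms <- grep("^geom_", getNamespaceExports("ggplot2"), value = TRUE)

# Show the first few in alphabetical order
head(sort(geoms), 10)
```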
Finally, we can add (+) geom_point() to our ggplot() statement to reproduce what qplot(y = Structure.Cost, x = Land.Value, data = housing) gave us.
ggplot(housing, aes(x = Land.Value, y = Structure.Cost)) +
geom_point()
Each type of geom accepts only a subset of all aesthetics; refer to the geom help pages to see what mappings each geom accepts. Aesthetic mappings are set with the aes() function.
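Besides the help pages, we can peek at which aesthetics a geom cannot do without. The sketch below assumes that the geom objects ggplot2 exports (here GeomPoint and GeomText) carry a required_aes field, which is the case in the versions we are aware of; treat it as a quick check rather than a replacement for the documentation.

```r
library(ggplot2)

# Aesthetics that must be supplied for each geom
GeomPoint$required_aes
GeomText$required_aes

# geom_text() needs a label mapping, while geom_point() does not
"label" %in% GeomText$required_aes
"label" %in% GeomPoint$required_aes
```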
Now, you’re ready to make your first ggplot, a scatterplot, on your own.
Exercise 1: First, create a subset of the housing data using the code below.
hp2013Q1 <- filter(housing, Date == 2013.25)
Then use hp2013Q1 to create a scatterplot with ggplot() showing the relationship between Structure.Cost (y-axis) and Land.Value (x-axis).
A plot constructed with ggplot() can have more than one geom. In that case, the mappings established in the ggplot() call are plot defaults that can be added to or overridden. For example, we could add a fitted regression line to our plot with the geom_smooth() function. We will need to add the method = "lm" and se = 0 arguments to the function to get the usual regression line without a confidence band.
ggplot(housing, aes(x = Land.Value, y = Structure.Cost)) +
geom_point() +
geom_smooth(method = "lm", se = 0)
Recall that we can get the fitted regression equation by running the following code.
mod <- lm(Structure.Cost ~ Land.Value, data = housing)
Now, as an alternative to geom_smooth(), we could save the fitted y-values (the y-hats) into the dataset with the predict() function, and then add the line manually with the geom_line() function.
housing <- housing %>%
mutate(pred_SC = predict(mod))
ggplot(housing, aes(x = Land.Value, y = Structure.Cost)) +
geom_point() +
geom_line(aes(y = pred_SC), color = "blue")
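One caveat with this approach: predict(mod) returns one value per row that was actually used to fit the model. If any rows had missing values, the predictions would be shorter than the data and could not be stored as a new column. We have not checked whether the housing data has such rows, so the sketch below uses made-up data (toy) to show one possible remedy, fitting the model with na.action = na.exclude.

```r
# Made-up data with one missing response value (illustration only)
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, NA, 8.2, 9.8))

# Default behavior: the incomplete row is dropped before fitting,
# so there are fewer predictions than rows in toy
mod_default <- lm(y ~ x, data = toy)
length(predict(mod_default))

# With na.exclude, predictions are padded with NA for the dropped row,
# so they line up with the rows of toy and can be stored as a column
mod_padded <- lm(y ~ x, data = toy, na.action = na.exclude)
toy$pred_y <- predict(mod_padded)
toy
```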
You can change colors in ggplot() simply by typing the name of another color. Running colors() in R lists all of the built-in color names. Note that variables are mapped to aesthetics with the aes() function, while fixed aesthetics are set outside the aes() call.
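The difference between mapping and setting is easy to get wrong, so here is a small sketch on made-up data (toy). It uses layer_data() from ggplot2 to show the colors that would actually be drawn; the object names p_set and p_map are our own.

```r
library(ggplot2)

# Made-up values for illustration only
toy <- data.frame(x = 1:3, y = c(2, 4, 3))

# Setting: color is a fixed value, given outside aes()
p_set <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(color = "blue")

# Mapping: inside aes(), "blue" is treated as data (a category with one level),
# so ggplot2 picks a color from its default scale and adds a legend
p_map <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(color = "blue"))

# Colors that each plot would actually draw
unique(layer_data(p_set)$colour)
unique(layer_data(p_map)$colour)
```

Only the first version draws blue points. If a plot shows an unexpected color together with a legend entry named after the color you asked for, a fixed value inside aes() is a likely cause.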
Lastly, we could use the geom_abline() function to add a line with a specific intercept and slope. The code coef(mod)[1] gives us the intercept of the linear model, and coef(mod)[2] gives us the slope. These estimates could also have been entered into geom_abline() manually.
ggplot(housing, aes(x = Land.Value, y = Structure.Cost)) +
geom_point() +
geom_abline(intercept = coef(mod)[1], slope = coef(mod)[2])
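To make the connection between coef() and the fitted line concrete, here is a small sketch on made-up data that lies exactly on the line y = 2 + 3x. The names toy, toy_mod, intercept, and slope are our own; the coefficients are pulled out by name rather than by position, which is one way to guard against mixing them up.

```r
# Made-up data lying exactly on the line y = 2 + 3x (illustration only)
toy <- data.frame(x = c(0, 1, 2, 3))
toy$y <- 2 + 3 * toy$x

toy_mod <- lm(y ~ x, data = toy)

# coef() returns a named vector, so values can also be extracted by name
intercept <- coef(toy_mod)[["(Intercept)"]]
slope <- coef(toy_mod)[["x"]]

# The fitted line is intercept + slope * x, which matches predict()
all.equal(unname(predict(toy_mod)), intercept + slope * toy$x)
```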
Exercise 2: Using geom_smooth(), add a fitted regression line to the scatterplot you made in exercise 1. Next, change the color of the regression line to whatever you want by adding the color = "nameOfcolor" argument to the geom_smooth() function.
Each geom accepts a particular set of mappings; for example, geom_text() accepts a label mapping.
ggplot(housing, aes(x = Land.Value, y = Structure.Cost, label = State)) +
geom_text(size = 2)
In this case, we could have also specified our label mapping inside of the geom_text() function instead of inside of the ggplot() function.
ggplot(housing, aes(x = Land.Value, y = Structure.Cost)) +
geom_text(aes(label = State), size = 2)
Other aesthetics are mapped in the same way as x and y in the previous example. For instance, we can map a third variable, region, to color. This turns our bivariate scatterplot into a multivariate data visualization.
ggplot(housing, aes(x = Land.Value, y = Structure.Cost)) +
geom_point(aes(color = region))
Piping into ggplot()
We could remove the missing values from the region guide (the legend) by first filtering them out of the data. Also notice in this example that we can pipe data directly into the ggplot() function, which means that we can use all of our dplyr functions to process the data before we create a visualization.
housing %>%
filter(!is.na(region)) %>%
ggplot(aes(x = Land.Value, y = Structure.Cost)) +
geom_point(aes(color = region))
As an alternative, we can change the missing region information for DC to the value “DC” in the data.
housing %>%
mutate(region = ifelse(is.na(region), "DC", region)) %>%
ggplot(aes(x = Land.Value, y = Structure.Cost)) +
geom_point(aes(color = region))
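The two approaches differ in what happens to the rows with a missing region: filtering drops them, while recoding keeps them under a new label. The sketch below shows this on a few made-up rows (toy), where the NA stands in for a state with no region recorded; only the data-processing step is shown.

```r
library(dplyr)

# Made-up rows for illustration only
toy <- data.frame(State = c("AK", "CA", "DC"),
                  region = c("West", "West", NA),
                  stringsAsFactors = FALSE)

# Option 1: drop rows with a missing region
dropped <- toy %>%
  filter(!is.na(region))

# Option 2: keep every row and give the missing region its own label
relabeled <- toy %>%
  mutate(region = ifelse(is.na(region), "DC", region))

dropped
relabeled
```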
If we map a continuous variable to color, we get a color scale.
ggplot(housing, aes(x = Land.Value, y = Structure.Cost)) +
geom_point(aes(color = Home.Value))
Exercise 3: Create a scatterplot that uses text (geom_text()) instead of points, and that maps region to the text color. Use size = 3 for your text and also remove the rows with a missing region. Hint: use the filter() function from dplyr to filter out the missing values and then pipe into ggplot().
Some plot types (such as scatterplots) do not require transformations: each point is plotted at x and y coordinates equal to the original value. Other plots, such as boxplots, histograms, prediction lines, etc., require statistical transformations:
- for a boxplot, the y values must be transformed to the median, the quartiles, and whiskers based on 1.5(IQR)
- for a smoother, the y values must be transformed into predicted values
Each geom has a default statistic, but these can be changed. For example, the default statistic for geom_histogram() is stat_bin, as the argument lists show:
args(geom_histogram)
args(stat_bin)
Arguments to stat_ functions can be passed through geom_ functions. This can be slightly annoying because in order to change an argument you have to first determine which stat the geom uses, then determine the arguments to that stat.
For example, here is the default histogram of Home.Value:
p2 <- ggplot(housing, aes(x = Home.Value))
p2 + geom_histogram()
The default binning looks reasonable, but we can change it by passing the bins argument (the number of bins) through to the stat_bin function:
p2 + geom_histogram(stat = "bin", bins = 160)
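The binning can be controlled either by the number of bins (bins) or by the width of each bin (binwidth). The sketch below tries both on made-up values (toy) and uses layer_data() to show what stat_bin computed for each bar; the particular values 4 and 2 are arbitrary choices for illustration.

```r
library(ggplot2)

# Made-up values for illustration only
toy <- data.frame(value = c(1, 2, 2, 3, 3, 3, 4, 4, 5, 9))

# Control the number of bins...
p_bins <- ggplot(toy, aes(x = value)) +
  geom_histogram(bins = 4)

# ...or the width of each bin, in the units of the variable
p_width <- ggplot(toy, aes(x = value)) +
  geom_histogram(binwidth = 2)

# layer_data() returns the values stat_bin computed for each bar
layer_data(p_bins)[, c("xmin", "xmax", "count")]
layer_data(p_width)[, c("xmin", "xmax", "count")]
```

However the bins are chosen, the counts add up to the number of rows in the data, since each observation falls into exactly one bin.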
Sometimes the default statistical transformation is not what you need. This is often the case with pre-summarized data. For example, let’s find the average home value in each state:
housing_means <- housing %>%
group_by(State) %>%
summarise(mean_HV = mean(Home.Value))
head(housing_means)
## # A tibble: 6 x 2
## State mean_HV
## <chr> <dbl>
## 1 AK 147385.14
## 2 AL 92545.22
## 3 AR 82076.84
## 4 AZ 140755.59
## 5 CA 282808.08
## 6 CO 158175.99
And now we plot!
ggplot(housing_means, aes(x = State)) +
geom_bar()
Uh oh… what went wrong?
In the above example, we took binned and summarized data and asked ggplot() to bin and summarize it again (remember, geom_bar() defaults to stat = "count"), so every state got a bar of height 1, one for its single row in housing_means. Obviously this didn’t work like we wanted. We can fix it by telling geom_bar() to use a different statistical transformation function:
ggplot(housing_means, aes(x = State, y = mean_HV)) +
geom_bar(stat = "identity")
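ggplot2 also provides geom_col(), which draws bars directly from the values in the data and so behaves like geom_bar(stat = "identity"). The sketch below compares the two on made-up, already-summarized values (toy_means); the object names are our own.

```r
library(ggplot2)

# Made-up, already-summarized values (illustration only)
toy_means <- data.frame(group = c("a", "b", "c"),
                        mean_value = c(10, 25, 15))

p_identity <- ggplot(toy_means, aes(x = group, y = mean_value)) +
  geom_bar(stat = "identity")

p_col <- ggplot(toy_means, aes(x = group, y = mean_value)) +
  geom_col()

# Both draw bars whose heights are the values in mean_value
layer_data(p_identity)$y
layer_data(p_col)$y
```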
Bonus: To rotate the State labels on the x-axis, we can use the axis.text.x argument inside of the theme() function. I found this information by googling “rotating x-axis labels ggplot2”.
ggplot(housing_means, aes(x = State, y = mean_HV)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90))
Exercise 4: Create a bar plot of the mean Land.Value by region. You should change missing values on the region variable to “DC”. Hint: before piping into group_by(), pipe into the mutate() function, and use the ifelse() code from above.
housing %>%
mutate(region = ifelse(is.na(region), "DC", region)) %>%
group_by(region) %>%
summarise(mean_LV = mean(Land.Value)) %>%
ggplot(aes(x = region, y = mean_LV)) +
geom_bar(stat = "identity")
Faceting is how ggplot2 creates separate graphs for subsets of data, often called small multiples. This makes it easier to compare across plots, not just across geoms within a plot. ggplot2 offers two functions for creating small multiples:
- facet_wrap(): define subsets as the levels of a single grouping variable
- facet_grid(): define subsets as the crossing of two grouping variables
Let’s start by using a technique we already know: mapping State to color. We’ll do this for the states in the “West” region only.
West <- housing %>%
filter(region == "West")
ggplot(West, aes(x = Date, y = Home.Value)) +
geom_line(aes(color = State))
There are two problems here: there are too many states to distinguish each one by color, and the lines obscure one another.
We can fix the previous plot by faceting by State rather than mapping State to color:
ggplot(West, aes(x = Date, y = Home.Value)) +
geom_line() +
facet_wrap(~State)
There is also a facet_grid() function for faceting in two dimensions.
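As a rough sketch of what faceting in two dimensions looks like, the example below builds a small made-up data set (toy) with two grouping variables, region and type. These values and the type variable are invented for illustration and are not part of the housing data.

```r
library(ggplot2)

# Made-up data with two grouping variables (illustration only)
toy <- expand.grid(year = 1:3,
                   region = c("East", "West"),
                   type = c("urban", "rural"))
toy$value <- seq_len(nrow(toy))

# Rows of the grid are defined by type, columns by region
p_grid <- ggplot(toy, aes(x = year, y = value)) +
  geom_line() +
  facet_grid(type ~ region)

# Each combination of type and region gets its own panel
length(unique(layer_data(p_grid)$PANEL))
```

Printing p_grid draws a 2-by-2 grid of panels, one for each pairing of the two types with the two regions.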
Exercise 5: Starting with the plot you created in exercise 4, add a nice y-axis label and an x-axis label that capitalizes region. Also add the title “Mean Land Value by Region”. Try googling “adding axis labels ggplot2 stack overflow” and see the top answer.
Exercise 6: Start with the plot you created in exercise 3. Instead of mapping region to color, re-create this visualization with a separate facet for each region. Hint: use the facet_wrap() function.
This lab is based on the “Introduction to R Graphics with ggplot2” workshop, which is a product of the Data Science Services team at Harvard University. The original source is released under a Creative Commons Attribution-ShareAlike 4.0 Unported license. This lab was adapted for SDS192: Introduction to Data Science in Spring 2017 by R. Jordan Crouser at Smith College and then further adapted for SDS201: Statistical Methods for Undergraduate Research by Randi L. Garcia at Smith College.