This lab on the Introduction to R comes from "An Introduction to Statistical Learning with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. It was re-implemented in Python in Fall 2017 by R. Jordan Crouser at Smith College.
For most analyses, the first step involves importing a data set into Python. For this class, a lot of the data comes from the ISLR package. Unfortunately, this package isn't available for Python, so I've exported the data to CSV to make things easier. We can use the read_csv() function from the pandas library to import it.
We begin by loading in the Auto data set.
%matplotlib inline
import pandas as pd
Auto = pd.read_csv('Auto.csv')
Nothing happens when you run this, but now the data is available in your environment.
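As a rough sketch of what read_csv() does, the example below parses a tiny hand-made CSV from an in-memory string instead of Auto.csv. The three rows, the name sample, and the choice of "?" as a missing-value marker are all illustrative assumptions, not properties of the course file.

```python
import io
import pandas as pd

# Hypothetical miniature CSV; the values are made up for illustration.
csv_text = """mpg,cylinders,horsepower,name
18.0,8,130,car a
25.0,4,?,car b
31.5,4,71,car c
"""

# na_values tells read_csv which strings should be treated as missing.
sample = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Count how many horsepower entries were parsed as missing.
n_missing = int(sample["horsepower"].isna().sum())
print(sample)
print("missing horsepower values:", n_missing)
```

If a column that should be numeric shows up as text after import, an unrecognized missing-value marker is one possible cause worth checking.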
To view the data, we can either print the entire dataset by typing its name, or we can just look at the first few rows with the head() function.
Auto.head()
Now that we have the data, we can begin to learn things about it. For example, if we want to know how many rows and columns the DataFrame contains:
Auto.shape
This tells us that the data has 392 observations, or rows, and nine variables, or columns.
The ${\tt .dtypes}$ attribute tells us that most of the variables are numeric (floating-point or integer), although the ${\tt name}$ variable holds strings, which pandas typically reports as ${\tt object}$.
Auto.dtypes
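Once the types are known, it can be handy to split the columns by type. The sketch below uses a small made-up DataFrame (toy is a hypothetical stand-in for Auto) and the select_dtypes() method to separate numeric columns from the rest.

```python
import pandas as pd

# Hypothetical stand-in for the Auto data; values are made up.
toy = pd.DataFrame({
    "mpg": [18.0, 25.0, 31.5],
    "cylinders": [8, 4, 4],
    "name": ["car a", "car b", "car c"],
})

# Columns holding numbers (floats and integers).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else, here just the text column.
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(text_cols)
```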
Often, we want to know some basic things about the variables in our data. Calling the describe() method on a DataFrame produces a numerical summary of each quantitative variable, which gives you an idea of how those variables are distributed.
Auto.describe()
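The result of describe() is itself a DataFrame, so individual statistics can be looked up by row and column label. A minimal sketch on made-up data (toy and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data, not taken from Auto.csv.
toy = pd.DataFrame({
    "mpg": [10.0, 20.0, 30.0, 40.0],
    "weight": [3500, 3000, 2500, 2000],
})

summary = toy.describe()

# Rows are statistics, columns are variables.
mean_mpg = summary.loc["mean", "mpg"]
median_mpg = summary.loc["50%", "mpg"]
print(summary)
```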
The summary suggests that origin might be better thought of as a factor. It only seems to have three possible values: 1, 2 and 3. If we read the documentation about the data, we learn that these numbers correspond to where the car is from: 1 = American, 2 = European, 3 = Japanese. So let's cast that variable to a categorical type using the astype() method.
Auto["origin"] = Auto["origin"].astype('category')
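The numeric codes can also be swapped for readable labels. The sketch below applies the 1/2/3 mapping from the data documentation to a small made-up column using the rename_categories() accessor method; toy and labels are hypothetical names, and the lab itself keeps the numeric codes.

```python
import pandas as pd

# Hypothetical column of origin codes.
toy = pd.DataFrame({"origin": [1, 3, 2, 1]})

# Mapping taken from the data documentation quoted above.
labels = {1: "American", 2: "European", 3: "Japanese"}

# Convert to a categorical, then rename the categories to their labels.
toy["origin"] = toy["origin"].astype("category").cat.rename_categories(labels)

print(toy["origin"])
```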
If we want to include a summary of this variable when we call describe(), we need to let Python know we want ALL the variables (not just the quantitative ones):
Auto.describe(include='all')
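With include='all', categorical columns get their own rows (unique, top, freq) and are left empty for the numeric statistics. A small sketch on made-up data, where toy is a hypothetical stand-in for Auto:

```python
import pandas as pd

# Hypothetical data: one numeric and one categorical column.
toy = pd.DataFrame({
    "mpg": [18.0, 25.0, 31.5, 22.0],
    "origin": pd.Categorical(["American", "Japanese", "European", "American"]),
})

summary = toy.describe(include="all")

# Categorical statistics: number of distinct values, most common value,
# and how often that value occurs.
n_unique = summary.loc["unique", "origin"]
top_value = summary.loc["top", "origin"]
top_freq = summary.loc["freq", "origin"]
print(summary)
```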
Or we can look at just one particular statistic using mean(), std(), median(), and other functions from the numpy library:
import numpy as np
np.mean(Auto['displacement'])
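One detail worth knowing when mixing the two libraries: numpy's std() divides by n by default, while the pandas std() method divides by n - 1. The sketch below shows the difference on a made-up series (values is a hypothetical name, not a column of Auto).

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so the arithmetic is easy to follow.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean_value = values.mean()
median_value = values.median()

# pandas default: sample standard deviation (ddof=1).
sample_std = values.std()

# numpy default on a plain array: population standard deviation (ddof=0).
population_std = np.std(values.to_numpy())

print(mean_value, median_value, sample_std, population_std)
```

Passing ddof explicitly to either function is one way to keep results comparable.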
As in R, we can use the ggplot package to produce simple graphics. ggplot has a particular syntax, which looks like this:
from ggplot import *
ggplot(Auto, aes(x='cylinders', y='mpg')) + \
geom_point()
The basic idea is that you initialize a plot with ggplot() and then add "geoms" (short for geometric objects) to it. The ggplot package is based on the Grammar of Graphics, a famous book on data visualization theory. It provides a way to map attributes of your data (like variables) to "aesthetics" on the plot. The aes() function is short for aesthetic.
For more about the ggplot2 syntax, view the documentation using the help() function. There are also great online resources for ggplot2, like ggplot from ŷhat.
help(ggplot)
The cylinders variable is stored as a numeric column, so Python has treated it as quantitative. However, since there are only a small number of possible values for cylinders, one may prefer to treat it as a qualitative variable. We can turn it into a factor, again using an astype() call.
Auto["cylinders"] = Auto["cylinders"].astype('category')
To view the relationship between a categorical and a numeric variable, we might want to produce boxplots. As usual, a number of options can be specified in order to customize the plots.
ggplot(Auto, aes(x='cylinders', y='mpg')) + \
geom_boxplot() + \
xlab("Cylinders") + \
ylab("MPG")
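The numbers behind a boxplot can also be computed directly. As a sketch, the example below groups a made-up data set by cylinders and computes the quartiles of mpg within each group; toy and its values are assumptions, and the grouping column is left as integers to keep the example simple.

```python
import pandas as pd

# Hypothetical data: two cylinder groups with three cars each.
toy = pd.DataFrame({
    "cylinders": [4, 4, 4, 8, 8, 8],
    "mpg": [30.0, 32.0, 34.0, 14.0, 16.0, 18.0],
})

# Quartiles of mpg within each cylinder group; unstack() turns the
# quantile level into columns so each row is one group.
quartiles = (
    toy.groupby("cylinders")["mpg"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

print(quartiles)
```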
The geom_histogram() geom can be used to plot a histogram.
ggplot(Auto, aes(x='mpg')) + \
geom_histogram()
The function warns us that it used a default number of bins, so we should think more carefully about what value makes sense.
ggplot(Auto, aes(x='mpg')) + \
geom_histogram(binwidth=5)
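To see what a bin width of 5 means in terms of counts, the sketch below builds the bin edges by hand and tallies made-up values with numpy's histogram() function. Aligning the edges to multiples of the bin width is an assumption made here for readability; a plotting library may place its bins differently.

```python
import numpy as np

# Hypothetical mpg values, not taken from Auto.csv.
mpg = np.array([9.0, 14.0, 18.0, 22.5, 26.0, 31.0, 44.0])
binwidth = 5

# Round the data range outward to multiples of the bin width.
start = np.floor(mpg.min() / binwidth) * binwidth
stop = np.ceil(mpg.max() / binwidth) * binwidth

# Edges at 5, 10, ..., 45 give eight bins of width 5.
edges = np.arange(start, stop + binwidth, binwidth)

counts, _ = np.histogram(mpg, bins=edges)
print(edges)
print(counts)
```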
For small datasets, we might want to see all the bivariate relationships between the variables. The pandas plotting module has a scatter_matrix() function that can do just that. (Be patient: it takes a long time!)
pd.plotting.scatter_matrix(Auto, alpha=0.2, figsize=(10, 10))
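A quicker numeric companion to the scatter matrix is a correlation matrix, which summarizes each pairwise linear relationship in a single number. This is a sketch on made-up data, not part of the original lab; toy is a hypothetical stand-in for Auto.

```python
import pandas as pd

# Hypothetical data in which mpg falls linearly as weight rises.
toy = pd.DataFrame({
    "mpg": [30.0, 25.0, 20.0, 15.0],
    "weight": [2000, 2500, 3000, 3500],
    "name": ["car a", "car b", "car c", "car d"],
})

# Keep only the numeric columns, then compute pairwise Pearson correlations.
corr = toy.select_dtypes(include="number").corr()

print(corr)
```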
Sometimes, we might want to save a plot for use outside of our Jupyter notebook. To do this, we call the plot's save() function.
p = ggplot(Auto, aes(x='mpg')) + \
geom_histogram(binwidth=5)
p.save(filename="histogram.png")