This lab on the Introduction to R comes from "An Introduction to Statistical Learning with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. It was re-implemented in Python in Fall 2017 by R. Jordan Crouser at Smith College.

Loading Data

For most analyses, the first step is importing a data set into Python. In this class, much of the data comes from the ISLR package. Unfortunately, that package isn't available for Python, so I've exported the data to CSV to make things easier. We can use the read_csv() function from the pandas library to import it.

We begin by loading in the Auto data set.

In [ ]:
%matplotlib inline
import pandas as pd
Auto = pd.read_csv('Auto.csv')

This cell produces no output when you run it, but the data is now available in your environment.
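
Some CSV exports mark missing values with a placeholder string, and read_csv() can convert those while loading. Here is a minimal sketch, assuming a tiny in-memory CSV in which "?" marks a missing value (the rows are made up for illustration and are not taken from Auto.csv):

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk; "?" marks a missing value.
csv_text = """mpg,cylinders,horsepower,name
18.0,8,130,car a
25.0,4,?,car b
30.5,4,70,car c
"""

# na_values tells pandas which strings to treat as missing (NaN).
demo = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(demo.shape)                        # (number of rows, number of columns)
print(demo["horsepower"].isna().sum())   # how many horsepower values are missing
```

Without na_values, a column containing "?" would be read as text rather than as numbers, so it is worth checking for placeholders like this before computing summaries.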

To view the data, we can either print the entire dataset by typing its name, or we can look at just the first few rows with the head() method.

In [ ]:
Auto.head()

Now that we have the data, we can begin to learn things about it. For example, if we want to know how many rows and columns the DataFrame contains:

In [ ]:
Auto.shape

This tells us that the data has 392 observations, or rows, and nine variables, or columns.

The ${\tt .dtypes}$ attribute tells us that most of the variables are stored as floats or integers, although the ${\tt name}$ variable holds text (which pandas reports as ${\tt object}$).

In [ ]:
Auto.dtypes
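
Once we know the types, we can also select columns by type. The sketch below uses a hypothetical three-row frame (not the real Auto data) to show how select_dtypes() picks out the numeric columns:

```python
import pandas as pd

# Hypothetical miniature frame with the same kinds of columns as Auto.
demo = pd.DataFrame({
    "mpg": [18.0, 25.0, 30.5],
    "cylinders": [8, 4, 4],
    "name": ["car a", "car b", "car c"],
})

print(demo.dtypes)  # one dtype per column

# Keep only the names of the numeric columns (floats and integers).
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```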

Summary statistics

Often, we want to know some basic things about the variables in our data. Calling the ${\tt describe()}$ method on a DataFrame produces a numerical summary of each quantitative variable (count, mean, standard deviation, minimum, quartiles, and maximum), which gives a sense of how each variable is distributed.

In [ ]:
Auto.describe()

The summary suggests that origin might be better thought of as a categorical variable (a factor, in R terms). It takes only three values: 1, 2, and 3. The documentation for the data explains that these numbers indicate where the car is from: 1 = American, 2 = European, 3 = Japanese. So let's cast that variable to a categorical type using the astype() method.

In [ ]:
Auto["origin"] = Auto["origin"].astype('category')
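
The integer codes can also be replaced with readable labels. This is a minimal sketch on a hypothetical Series of origin codes, using the coding described above (1 = American, 2 = European, 3 = Japanese); it is one possible approach, not something the original lab does:

```python
import pandas as pd

# Hypothetical origin codes, cast to a categorical type as in the cell above.
origin = pd.Series([1, 3, 2, 1, 1]).astype("category")

# Replace the integer codes with readable labels.
origin = origin.cat.rename_categories({1: "American", 2: "European", 3: "Japanese"})

# Count how many cars fall into each category.
print(origin.value_counts())
```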

If we want to include a summary of this variable when we call .describe(), we need to tell pandas that we want ALL the variables (not just the quantitative ones):

In [ ]:
Auto.describe(include='all')

We can also compute a single statistic with functions such as mean(), std(), and median() from the numpy library:

In [ ]:
import numpy as np
np.mean(Auto['displacement'])
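
pandas columns also carry these statistics as methods, so numpy is not strictly required. A minimal sketch on hypothetical displacement values (not the real data):

```python
import pandas as pd

# Hypothetical displacement values for illustration only.
demo = pd.DataFrame({"displacement": [307.0, 350.0, 97.0, 140.0]})

mean_disp = demo["displacement"].mean()
median_disp = demo["displacement"].median()
std_disp = demo["displacement"].std()  # pandas default: sample standard deviation (ddof=1)

# Several statistics at once.
print(demo["displacement"].agg(["mean", "median", "std"]))
```

One thing to watch: the pandas std() method defaults to ddof=1, while numpy's np.std() defaults to ddof=0, so the two can give slightly different answers on the same data.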

Plotting

As in R, we can use the ggplot package to produce simple graphics. ggplot has a particular syntax, which looks like this:

In [ ]:
from ggplot import *
ggplot(Auto, aes(x='cylinders', y='mpg')) + \
    geom_point()

The basic idea is that you initialize a plot with ggplot() and then add "geoms" (short for geometric objects) to it. The ggplot package is based on The Grammar of Graphics, a famous book on data visualization theory. It provides a way to map attributes of your data (like variables) to "aesthetics" on the plot. The function name aes() is short for aesthetics.

For more about the ggplot syntax, view the documentation using the help() function. There are also great online resources, like ggplot from ŷhat.

In [ ]:
help(ggplot)

The cylinders variable is stored as integers, so pandas has treated it as quantitative. However, since cylinders takes only a small number of possible values, one may prefer to treat it as a qualitative variable. We can convert it to a categorical type, again using an astype() call.

In [ ]:
Auto["cylinders"] = Auto["cylinders"].astype('category')

To view the relationship between a categorical and a numeric variable, we might want to produce boxplots. As usual, a number of options can be specified in order to customize the plots.

In [ ]:
ggplot(Auto, aes(x='cylinders', y='mpg')) + \
    geom_boxplot() + \
    xlab("Cylinders") + \
    ylab("MPG")
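
A boxplot has a simple numerical counterpart: a summary of the numeric variable within each category. The sketch below assumes a hypothetical five-row frame (not the real Auto data) and computes the median mpg for each number of cylinders with groupby():

```python
import pandas as pd

# Hypothetical values for illustration only.
demo = pd.DataFrame({
    "cylinders": [4, 4, 6, 8, 8],
    "mpg": [30.0, 26.0, 20.0, 15.0, 13.0],
})
demo["cylinders"] = demo["cylinders"].astype("category")

# observed=True restricts the output to categories that appear in the data.
mpg_by_cyl = demo.groupby("cylinders", observed=True)["mpg"].median()
print(mpg_by_cyl)
```

The medians here correspond to the center lines that a boxplot would draw for each group.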

The geom geom_histogram() can be used to plot a histogram.

In [ ]:
ggplot(Auto, aes(x='mpg')) + \
    geom_histogram()

The function warns us that it used a default number of bins, so we should think more carefully about what value makes sense.

In [ ]:
ggplot(Auto, aes(x='mpg')) + \
    geom_histogram(binwidth=5) 
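
To see what a bin width of 5 means in terms of counts, we can build the bins by hand. This is a sketch using numpy on hypothetical mpg values; the way the edges are aligned here is an assumption for illustration and is not necessarily how ggplot places its bins:

```python
import numpy as np

# Hypothetical mpg values for illustration only.
mpg = np.array([9.0, 14.0, 18.5, 23.0, 27.5, 31.0, 36.0, 46.6])

binwidth = 5
# Align the edges to multiples of the bin width so they cover the whole range.
start = np.floor(mpg.min() / binwidth) * binwidth
stop = np.ceil(mpg.max() / binwidth) * binwidth
edges = np.arange(start, stop + binwidth, binwidth)

# counts[i] is the number of values falling between edges[i] and edges[i + 1].
counts, edges = np.histogram(mpg, bins=edges)
print(edges)
print(counts)
```

A narrower width gives more bins with fewer observations in each, and a wider one smooths over detail, so it is worth trying a few values.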

For small datasets, we might want to see all the bivariate relationships between the variables. The pandas plotting module has a scatter_matrix() function that can do just that. (Be patient: it takes a long time!)

In [ ]:
pd.plotting.scatter_matrix(Auto, alpha=0.2, figsize=(10, 10))
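
A correlation matrix is a quick numerical companion to the scatterplot matrix. A minimal sketch on a hypothetical frame (not the real Auto data), restricted to the numeric columns:

```python
import pandas as pd

# Hypothetical values for illustration only.
demo = pd.DataFrame({
    "mpg": [30.0, 26.0, 20.0, 15.0, 13.0],
    "weight": [2000, 2300, 3000, 3800, 4200],
    "name": ["car a", "car b", "car c", "car d", "car e"],
})

# Drop non-numeric columns, then compute pairwise Pearson correlations.
corr = demo.select_dtypes(include="number").corr()
print(corr)
```

Correlation only captures linear association, so the plots are still worth looking at for curved relationships.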

Sometimes, we might want to save a plot for use outside of our Jupyter notebook. To do this, we call the plot's save() method.

In [ ]:
p = ggplot(Auto, aes(x='mpg')) + \
    geom_histogram(binwidth=5)

p.save(filename="histogram.png")