This lab on K-Nearest Neighbors in R comes from p. 163-167 of "Introduction to Statistical Learning with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. It was re-implemented in Fall 2016 in tidyverse
format by Amelia McNamara and R. Jordan Crouser at Smith College.
Want to follow along on your own machine? Download the rMarkdown or Jupyter Notebook version.
In this lab, we will perform KNN clustering on the Smarket
dataset from ISLR
. This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the
beginning of 2001 until the end of 2005.
For each date, we have recorded the percentage returns for each of the five previous trading days (Lag1
through Lag5
).
We have also recorded:
Volume
(the number of shares traded on the previous day, in billions)Today
(the percentage return on the date in question) Direction
(whether the market was Up
or Down
on this date). We can use the head(...)
function to look at the first few rows:
library(ISLR)
head(Smarket)
Today we're going to try to predict Direction
using percentage returns from the previous two days (Lag1
and Lag2
).
We'll build our model using the knn()
function, which is part of the
class
library:
library(class)
?knn
This function works a bit differently from the other model-fitting
functions that we have encountered thus far. Rather than a two-step
approach in which we first fit the model and then we use the model to make
predictions, knn()
forms predictions using a single command. The function
requires four inputs.
train_Market
.test_Market
.train_Direction
.K
, the number of nearest neighbors to be used by the classifier.We'll first create two subsets of our data -- one containing the observations from 2001 through 2004, which we'll use to train the model and one with observations from 2005 on, for testing. Since we're only interested in Lag1
and Lag2
, we'll want to pull those out. To do this, we'll use the dplyr filter()
command and select()
commands:
library(dplyr)
train_Market = Smarket %>%
filter(Year < 2005) %>%
select(Lag1, Lag2)
test_Market = Smarket %>%
filter(Year >= 2005)%>%
select(Lag1, Lag2)
Now we just need to pull out the outcome variable for the training data. The knn()
function expects us to provide the class labels as a vector rather than a dataframe
, which we can specify by adding .$Direction
to the end of our dplyr
chain:
train_Direction = Smarket %>%
filter(Year < 2005) %>%
select(Direction) %>%
.$Direction
Now the knn()
function can be used to predict the market’s movement for
the dates in 2005. We set a random seed before we apply knn()
because
if several observations are tied as nearest neighbors, then R
will randomly
break the tie. Therefore, a seed must be set in order to ensure reproducibility
of results.
set.seed(1)
knn_pred = knn(train_Market, test_Market, train_Direction, k = 1)
The table()
function can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified.
# Pull out the true responses for the test data
test_Direction = Smarket %>%
filter(Year >= 2005) %>%
select(Direction) %>%
.$Direction
table(knn_pred, test_Direction)
mean(knn_pred == test_Direction)
The results using $K = 1$ are not very good, since only 50% of the observations are correctly predicted. Of course, it may be that $K = 1$ results in an overly flexible fit to the data. Below, we repeat the analysis using $K = 3$.
set.seed(1)
knn_pred = knn(train_Market, test_Market, train_Direction, k = 3)
table(knn_pred, test_Direction)
mean(knn_pred == test_Direction)
The results have improved slightly. Let's try a few other $K$ values to see if we get any further improvement:
for (k in 4:10) {
set.seed(1)
# Your code here
}
It looks like for classifying this dataset, KNN
might not be the right approach.
Let's see how the KNN
approach performs on the Caravan
data set, which is
part of the ISLR
library. This data set includes 85 predictors that measure demographic characteristics for 5,822 individuals.
The response variable is
Purchase
, which indicates whether or not a given individual purchases a
caravan insurance policy. In this data set, only 6% of people purchased
caravan insurance.
Caravan %>%
select(Purchase) %>%
summary()
Because the KNN
classifier predicts the class of a given test observation by
identifying the observations that are nearest to it, the scale of the variables
matters. Any variables that are on a large scale will have a much larger
effect on the distance between the observations, and hence on the KNN
classifier, than variables that are on a small scale.
For instance, imagine a
data set that contains two variables, salary and age (measured in dollars
and years, respectively). As far as KNN
is concerned, a difference of \$1,000
in salary is enormous compared to a difference of 50 years in age. Consequently,
salary will drive the KNN
classification results, and age will have
almost no effect.
This is contrary to our intuition that a salary difference
of \$1,000 is quite small compared to an age difference of 50 years. Furthermore,
the importance of scale to the KNN
classifier leads to another issue:
if we measured salary in Japanese yen, or if we measured age in minutes,
then we’d get quite different classification results from what we get if these
two variables are measured in dollars and years.
A good way to handle this problem is to standardize the data so that all
variables are given a mean of zero and a standard deviation of one. Then
all variables will be on a comparable scale. The scale()
function does just
this. In standardizing the data, we exclude the qualitative Purchase
variable.
standardized_Caravan = Caravan %>%
select(-Purchase) %>%
scale() %>%
data.frame() # The scale() funtion doesn't return a dataframe, so we need to do that manually
Now every column of standardized_Caravan
has a standard deviation of one and
a mean of zero.
We'll now split the observations into a test set, containing the first 1,000 observations, and a training set, containing the remaining observations.
test_Caravan = standardized_Caravan %>%
slice(1:1000)
train_Caravan = standardized_Caravan %>%
slice(1001:5822)
Purchase = Caravan %>%
select(Purchase)
test_Purchase = Purchase %>%
slice(1:1000) %>%
.$Purchase
train_Purchase = Purchase %>%
slice(1001:5822) %>%
.$Purchase
Let's fit a KNN
model on the training data using $K = 1$, and evaluate its
performance on the test data.
set.seed(1)
knn_pred = knn(train_Caravan, test_Caravan, train_Purchase, k=1)
mean(test_Purchase != knn_pred) # KNN error rate
mean(test_Purchase != "No") # Percent of people who purchase insurance
The KNN error rate on the 1,000 test observations is just under 12%. At first glance, this may appear to be fairly good. However, since only 6% of customers purchased insurance, we could get the error rate down to 6% by always predicting No
regardless of the values of the predictors!
Suppose that there is some non-trivial cost to trying to sell insurance to a given individual. For instance, perhaps a salesperson must visit each potential customer. If the company tries to sell insurance to a random selection of customers, then the success rate will be only 6%, which may be far too low given the costs involved.
Instead, the company would like to try to sell insurance only to customers who are likely to buy it. So the overall error rate is not of interest. Instead, the fraction of individuals that are correctly predicted to buy insurance is of interest.
It turns out that KNN
with $K = 1$ does far better than random guessing
among the customers that are predicted to buy insurance:
table(knn_pred, test_Purchase)
Among 77 such customers, 9, or 11.7%, actually do purchase insurance. This is double the rate that one would obtain from random guessing. Let's see if increasing $K$ helps! Try out a few different $K$ values below. Feeling adventurous? Write a function that figures out the best value for $K$.
set.seed(1)
# Your code here
It appears that KNN
is finding some real patterns in a difficult data set!
To get credit for this lab, post a response to the prompt posted to #lab3.