This lab on K-Nearest Neighbors in R comes from p. 163-167 of “Introduction to Statistical Learning with Applications in R” by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. It was re-implemented in Fall 2016 in tidyverse format by Amelia McNamara and R. Jordan Crouser at Smith College.

To follow along on your own machine, you can download this lab in R Markdown format here, or you can check out the Python version here.

4.6.5: K-Nearest Neighbors

In this lab, we will perform KNN on the Smarket dataset from ISLR. This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005. Let’s start by loading a few useful packages (remember that you may need to install them first by using install.packages()!)

library(ISLR)
library(tidyverse)
library(mosaic)

For each date, we have recorded the percentage returns for each of the five previous trading days, Lag1 through Lag5. We have also recorded Volume (the number of shares traded on the previous day, in billions), Today (the percentage return on the date in question) and Direction (whether the market was Up or Down on this date). We can use the slice() function to look at the first few rows:

data(Smarket)
Smarket %>% 
  slice(1:10)

##    Year   Lag1   Lag2   Lag3   Lag4   Lag5 Volume  Today Direction
## 1  2001  0.381 -0.192 -2.624 -1.055  5.010 1.1913  0.959        Up
## 2  2001  0.959  0.381 -0.192 -2.624 -1.055 1.2965  1.032        Up
## 3  2001  1.032  0.959  0.381 -0.192 -2.624 1.4112 -0.623      Down
## 4  2001 -0.623  1.032  0.959  0.381 -0.192 1.2760  0.614        Up
## 5  2001  0.614 -0.623  1.032  0.959  0.381 1.2057  0.213        Up
## 6  2001  0.213  0.614 -0.623  1.032  0.959 1.3491  1.392        Up
## 7  2001  1.392  0.213  0.614 -0.623  1.032 1.4450 -0.403      Down
## 8  2001 -0.403  1.392  0.213  0.614 -0.623 1.4078  0.027        Up
## 9  2001  0.027 -0.403  1.392  0.213  0.614 1.1640  1.303        Up
## 10 2001  1.303  0.027 -0.403  1.392  0.213 1.2326  0.287        Up

Today we’re going to try to predict Direction using percentage returns from the previous two days (Lag1 and Lag2). We’ll build our model using the knn() function, which is part of the class library:

library(class)

This function works rather differently from the other model-fitting functions that we have encountered thus far. Rather than a two-step approach in which we first fit the model and then we use the model to make predictions, knn() forms predictions using a single command. The function requires four inputs. 1. Training data (just the predictors). We’ll call this train_Market. 2. Testing data (just the predictors). We’ll call this test_Market. 3. Training data (our outcome variable, which is class labels in this case). We’ll call this train_Direction. 4. A value for K, the number of nearest neighbors to be used by the classifier.

We’ll first create two subsets of our data– one containing the observations from 2001 through 2004, which we’ll use to train the model and one with observations from 2005 on, for testing. To do this, we’ll use the dplyr filter() command.

train = Smarket %>%
  filter(Year < 2005)
test = Smarket %>%
  filter(Year >= 2005)

For the model, we need small datasets for train_Market and test_Market that only contain the predictors we want to use. Since we’re interested in Lag1 and Lag2, we need to pull those out. We can do this using select().

train_Market = train %>%
  select(Lag1, Lag2)
test_Market = test %>%
  select(Lag1, Lag2)

Because of the way the knn() function is written, we need to provide the class labels as a vector, which requires an additional step to the pipeline. By adding .$Direction to the end of our dplyr chain, we’re indicating that we want to turn the result of our select() (which is a dataframe, as dplyr makes all its outputs) into a vector.

train_Direction = train %>%
  select(Direction) %>%
  .$Direction

Now the knn() function can be used to predict the market’s movement for the dates in 2005. We set a random seed before we apply knn() because if several observations are tied as nearest neighbors, then R will randomly break the tie. Therefore, a seed must be set in order to ensure reproducibility of results. You can put in your favorite number here, or leave it as 1.

set.seed(1)
knn_pred = knn(train_Market, test_Market, train_Direction, k=1)

The table() function can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified.

test_Direction = test %>%
  select(Direction) %>%
  .$Direction
table(knn_pred, test_Direction)

##         test_Direction
## knn_pred Down Up
##     Down   43 58
##     Up     68 83

mean(knn_pred==test_Direction)

## [1] 0.5

The results using $K = 1$ are not very good, since only 50% of the observations are correctly predicted. Of course, it may be that $K = 1$ results in an overly flexible fit to the data. Below, we repeat the analysis using $K = 3$.

knn_pred3 = knn(train_Market, test_Market, train_Direction, k=3)
table(knn_pred3, test_Direction)

##          test_Direction
## knn_pred3 Down Up
##      Down   48 54
##      Up     63 87

mean(knn_pred3==test_Direction)

## [1] 0.5357143

The results have improved slightly. Let’s try a few other $K$ values to see if we get any further improvement. To do this, we’re going to use a few more of the tidyverse packages, tidyr and purrr. These are automically loaded by the convenience package tidyverse, so we don’t need to load them explicitly. But, it’s still good to know where functions come from so you can search for help effectively!

set.seed(1)

predictions = data_frame(k = 1:5) %>%
  unnest(prediction = map(k, ~ knn(train_Market, test_Market, train_Direction, k = .))) %>%
  mutate(oracle = rep_along(prediction, test_Direction))

predictions %>%
  group_by(k) %>%
  summarize(accuracy = mean(prediction == oracle))

## Source: local data frame [5 x 2]
## 
##       k  accuracy
##   (int)     (dbl)
## 1     1 0.5000000
## 2     2 0.5198413
## 3     3 0.5317460
## 4     4 0.5277778
## 5     5 0.4841270

It looks like for classifying this dataset, KNN might not be the right approach.

4.6.6: An Application to Caravan Insurance Data

Let’s see how the KNN approach performs on the Caravan data set, which is part of the ISLR package. This data set includes 85 predictors that measure demographic characteristics for 5,822 individuals. The response variable is Purchase, which indicates whether or not a given individual purchases a caravan insurance policy. In this data set, only 6% of people purchased caravan insurance.

tally(~Purchase, data=Caravan, format = "percent")

## 
##        No       Yes 
## 94.022673  5.977327

Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Any variables that are on a large scale will have a much larger effect on the distance between the observations, and hence on the KNN classifier, than variables that are on a small scale.

For instance, imagine a data set that contains two variables, salary and age (measured in dollars and years, respectively). As far as KNN is concerned, a difference of $1,000 in salary is enormous compared to a difference of 50 years in age. Consequently, salary will drive the KNN classification results, and age will have almost no effect.

This is contrary to our intuition that a salary difference of $1,000 is quite small compared to an age difference of 50 years. Furthermore, the importance of scale to the KNN classifier leads to another issue: if we measured salary in Japanese yen, or if we measured age in minutes, then we’d get quite different classification results from what we get if these two variables are measured in dollars and years.

A good way to handle this problem is to standardize the data so that all variables are given a mean of zero and a standard deviation of one. Then all variables will be on a comparable scale. The scale() function does just this. In standardizing the data, we exclude the qualitative Purchase variable.

standardized_Caravan = Caravan %>%
  select(-Purchase) %>%
  scale() %>%
  data.frame() # This is an artifact of using the base scale() function. There must be a tidyier way!

var(~MOSTYPE, data=Caravan)

## [1] 165.0378

var(~MAANTHUI, data=Caravan)

## [1] 0.1647078

var(~MOSTYPE, data=standardized_Caravan)

## [1] 1

var(~MAANTHUI, data=standardized_Caravan)

## [1] 1

Now every column of standardized_Caravan has a standard deviation of one and a mean of zero.

We’ll now split the observations into a test set, containing the first 1,000 observations, and a training set, containing the remaining observations.

test_Caravan = standardized_Caravan %>%
  slice(1:1000)
train_Caravan = standardized_Caravan %>%
  slice(1001:5822)

Purchase = Caravan %>%
  select(Purchase)

test_Purchase = Purchase %>%
  slice(1:1000) %>%
  .$Purchase

train_Purchase = Purchase %>%
  slice(1001:5822) %>%
  .$Purchase

Let’s fit a KNN model on the training data using $K = 1$, and evaluate its performance on the test data.

set.seed(1)
knn_pred = knn(train_Caravan, test_Caravan, train_Purchase, k=1)
mean(test_Purchase != knn_pred) # KNN error rate

## [1] 0.118

mean(test_Purchase != "No")     # Percent of people who purchase insurance

## [1] 0.059

The KNN error rate on the 1,000 test observations is just under 12%. At first glance, this may appear to be fairly good. However, since only 6% of customers purchased insurance, we could get the error rate down to 6% by always predicting No regardless of the values of the predictors!

Suppose that there is some non-trivial cost to trying to sell insurance to a given individual. For instance, perhaps a salesperson must visit each potential customer. If the company tries to sell insurance to a random selection of customers, then the success rate will be only 6%, which may be far too low given the costs involved.

Instead, the company would like to try to sell insurance only to customers who are likely to buy it. So the overall error rate is not of interest. Instead, the fraction of individuals that are correctly predicted to buy insurance is of interest.

It turns out that KNN with $K = 1$ does far better than random guessing among the customers that are predicted to buy insurance:

table(knn_pred, test_Purchase)

##         test_Purchase
## knn_pred  No Yes
##      No  873  50
##      Yes  68   9

Among 77 such customers, 9, or 11.7%, actually do purchase insurance. This is double the rate that one would obtain from random guessing. Let’s see if increasing $K$ helps! Try out a few different $K$ values below. Feeling adventurous? Write some code that figures out the best value for $K$.

set.seed(1)
# Your code here

It appears that KNN is finding some real patterns in a difficult data set! To get credit for this lab, post a response to the Piazza prompt available at:

https://piazza.com/class/isuhyl9uxya5vm?cid=10