Introduction
Over the past two weeks, we have considered four different classification approaches: K-nearest neighbors (KNN), logistic regression, linear discriminant analysis (LDA), and quadratic discriminant analysis (QDA). In this lab, we'll consider the types of scenarios in which one approach might dominate the others. You'll work in teams of 2-4 people on this lab.
Conceptual (30 minutes)
We'll start by comparing and contrasting some theoretical scenarios. We'll assume for simplicity that each data set contains just two predictors, X1 and X2:
Scenario 1: There are 20 training observations in each of the two classes. X1 and X2 are uncorrelated random normal variables, with a different mean in each class.
Scenario 2: The details are the same as in Scenario 1, except that within each class X1 and X2 have a correlation of -0.5.
Scenario 3: The data were generated from a t distribution, and we have 50 observations per class. The t distribution has a similar shape to the normal (Gaussian) distribution, but it has a tendency to yield more extreme points, that is, more points that are far from the mean.
Scenario 4: The data were generated from a normal distribution, with a correlation of 0.5 between the predictors in the first class, and correlation of -0.5 between the predictors in the second class.
Scenario 5: Within each class, the observations were generated from a normal distribution with uncorrelated predictors. The responses were sampled from the logistic function using X1^2, X2^2, and X1 × X2 as predictors.
Scenario 6: Details are the same as in Scenario 5, but the responses were sampled from a more complicated non-linear function.
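If it helps to see what these scenarios look like, here is a minimal sketch of how data for some of them could be simulated with NumPy. The function names, the unit variances, the size of the mean shift, the sample sizes not stated above, and the coefficients in the quadratic logistic model are all assumptions made for illustration; the scenarios themselves do not specify them. The second function is only one possible reading of Scenario 5.

```python
import numpy as np


def simulate_two_classes(n_per_class, rho0=0.0, rho1=0.0, shift=1.0, seed=0):
    """Draw bivariate normal predictors (X1, X2) for two classes.

    rho0 and rho1 are the within-class correlations. `shift` is an assumed
    value that moves the class-1 mean away from the class-0 mean (the origin).
    """
    rng = np.random.default_rng(seed)
    blocks = []
    for mean, rho in (([0.0, 0.0], rho0), ([shift, shift], rho1)):
        cov = [[1.0, rho], [rho, 1.0]]  # unit variances, correlation rho
        blocks.append(rng.multivariate_normal(mean, cov, size=n_per_class))
    X = np.vstack(blocks)
    y = np.repeat([0, 1], n_per_class)  # class labels, n_per_class of each
    return X, y


def simulate_quadratic_logistic(n, beta=(1.0, -1.0, 1.0), seed=0):
    """One reading of Scenario 5: labels drawn from a logistic model whose
    predictors are X1^2, X2^2 and X1*X2. The coefficients are assumed."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, 2))  # uncorrelated normal predictors
    eta = beta[0] * X[:, 0] ** 2 + beta[1] * X[:, 1] ** 2 + beta[2] * X[:, 0] * X[:, 1]
    p = 1.0 / (1.0 + np.exp(-eta))  # logistic function
    y = rng.binomial(1, p)  # one Bernoulli draw per observation
    return X, y


# Scenario 1: uncorrelated predictors, 20 observations per class.
X_s1, y_s1 = simulate_two_classes(20)
# Scenario 2: correlation of -0.5 within each class.
X_s2, y_s2 = simulate_two_classes(20, rho0=-0.5, rho1=-0.5)
# Scenario 4: correlation 0.5 in the first class, -0.5 in the second
# (sample size assumed, since the scenario does not state one).
X_s4, y_s4 = simulate_two_classes(20, rho0=0.5, rho1=-0.5)
```

Plotting X1 against X2 coloured by class for each scenario is a quick way to check your intuition about the shape of the decision boundary before you discuss the methods.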
In each scenario, discuss with your team how you would expect each of the four classification methods (KNN, logistic regression, LDA, and QDA) to perform on the data that's described.
A few things to consider:
What assumptions does each method make?
Does the data described in the scenario break any of those assumptions?
What does that mean for the bias/variance tradeoff?
Is there one method that you think would perform best?
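Once your team has made its predictions, one way to check them is a small simulation. The sketch below assumes scikit-learn is available and compares the four methods on Gaussian data like that in Scenarios 1, 2 and 4. The helper names, the mean shift, the choice of k = 5 for KNN, the test set size and the number of repetitions are all assumptions for illustration, not part of the scenarios.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier


def draw(n_per_class, rho0, rho1, rng, shift=1.0):
    """Two Gaussian classes with within-class correlations rho0 and rho1."""
    parts = []
    for mean, rho in (([0.0, 0.0], rho0), ([shift, shift], rho1)):
        cov = [[1.0, rho], [rho, 1.0]]
        parts.append(rng.multivariate_normal(mean, cov, size=n_per_class))
    return np.vstack(parts), np.repeat([0, 1], n_per_class)


def mean_test_error(rho0, rho1, n_train=20, n_test=1000, reps=20, seed=0):
    """Average test error of each method over `reps` simulated training sets."""
    rng = np.random.default_rng(seed)
    factories = {
        "KNN (k=5)": lambda: KNeighborsClassifier(n_neighbors=5),
        # A large C makes scikit-learn's default L2 penalty negligible.
        "Logistic regression": lambda: LogisticRegression(C=1e6, max_iter=1000),
        "LDA": lambda: LinearDiscriminantAnalysis(),
        "QDA": lambda: QuadraticDiscriminantAnalysis(),
    }
    errors = {name: [] for name in factories}
    for _ in range(reps):
        X_tr, y_tr = draw(n_train, rho0, rho1, rng)
        X_te, y_te = draw(n_test, rho0, rho1, rng)
        for name, make in factories.items():
            clf = make().fit(X_tr, y_tr)
            errors[name].append(1.0 - clf.score(X_te, y_te))
    return {name: float(np.mean(errs)) for name, errs in errors.items()}


# Same correlation in both classes (as in Scenario 2) versus
# different correlations in each class (as in Scenario 4).
for label, (r0, r1) in {"equal rho": (-0.5, -0.5), "unequal rho": (0.5, -0.5)}.items():
    print(label, mean_test_error(r0, r1))
```

The exact numbers depend on the assumed settings, so treat the output as a prompt for discussion: which methods' assumptions hold in each case, and does the ranking match what you expected?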
Note: these scenarios are excerpted from pages 153-154 of ISLR. I strongly recommend that you talk through the scenarios first, and then use the book to check your intuitions.
Applied (30 minutes)
If you have time after completing the conceptual part of this lab, let's leave the realm of the hypothetical and take these methods for a test drive on some real data! There are over 300 datasets available through the UCI Machine Learning Repository, many of them well suited to testing classification methods.
Each dataset's page includes a brief summary of some basic metadata (number of instances, number of attributes, whether they're numerical or categorical, whether there are missing values, etc.) as well as a text description of each attribute. For example, if we take a look at the E. coli dataset:


Choose a dataset that is of interest to you and your team, and see what you can classify!
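As a possible starting point, here is a minimal sketch in Python with scikit-learn that runs all four methods with 5-fold cross-validation. It uses the copy of the UCI Wine data that ships with scikit-learn as a stand-in, so nothing needs to be downloaded; swap in whichever dataset your team picks. The function name, the choice of k = 5, the number of folds, and the decision to standardize the predictors are assumptions for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cross_validated_accuracy(X, y, folds=5):
    """Mean cross-validated accuracy for each of the four classifiers."""
    models = {
        "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
        "Logistic regression": LogisticRegression(max_iter=1000),
        "LDA": LinearDiscriminantAnalysis(),
        "QDA": QuadraticDiscriminantAnalysis(),
    }
    results = {}
    for name, model in models.items():
        # Standardize inside the pipeline so each fold is scaled using
        # only its own training data; this matters most for KNN.
        pipeline = make_pipeline(StandardScaler(), model)
        scores = cross_val_score(pipeline, X, y, cv=folds)
        results[name] = float(scores.mean())
    return results


# Stand-in dataset: 13 numeric attributes, 3 classes.
X, y = load_wine(return_X_y=True)
for name, acc in cross_validated_accuracy(X, y).items():
    print(f"{name}: {acc:.3f}")
```

For a dataset with categorical attributes or missing values, you would need to encode or impute them first; the dataset's metadata summary tells you whether that applies.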
Deliverable
To get credit for this lab, submit your team's ideas at: