SDS 293 - Machine Learning
Lab 6: Comparing Classification Methods


Course Number: SDS 293
Semester: Fall 2016
Hours: TH 1:00-2:20
Location: Burton 209

Instructor: R. Jordan Crouser
Email: jcrouser at smith (dot) edu
Office: Ford 344
Office Hours: W 10am-noon & by appointment



Introduction

Over the past two weeks, we have considered four different classification approaches: K-nearest neighbors (KNN), logistic regression, linear discriminant analysis (LDA), and quadratic discriminant analysis (QDA). In this lab, we'll consider the types of scenarios in which one approach might dominate the others. You'll work in teams of 2-4 people on this lab.



Conceptual (30 minutes)

We'll start by comparing and contrasting some theoretical scenarios. We'll assume for simplicity that each data set contains just two predictors, X1 and X2:

Scenario 1: There are 20 training observations in each of the two classes. X1 and X2 are uncorrelated random normal variables, with a different mean in each class.
Scenario 2: The details are the same as in Scenario 1, except that within each class X1 and X2 have a correlation of -0.5.
Scenario 3: The data were generated from a t distribution, and we have 50 observations per class. The t distribution has a similar shape to the normal (Gaussian) distribution, but it tends to yield more extreme points, that is, more points that are far from the mean.
Scenario 4: The data were generated from a normal distribution, with a correlation of 0.5 between the predictors in the first class, and correlation of -0.5 between the predictors in the second class.
Scenario 5: Within each class, the observations were generated from a normal distribution with uncorrelated predictors. The responses were sampled from the logistic function using X1^2, X2^2, and X1×X2 as predictors.
Scenario 6: The details are the same as in Scenario 5, but the responses were sampled from a more complicated non-linear function.
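
To make Scenario 5 concrete, here is a minimal sketch in Python with NumPy (the lab does not prescribe a language, so that choice is an assumption). It draws two uncorrelated normal predictors and samples a 0/1 response whose probability is the logistic function of X1^2, X2^2, and X1×X2. The function name sample_scenario5 and the coefficient values are hypothetical and chosen only for illustration; they do not come from ISLR.

```python
import numpy as np

def sample_scenario5(n, beta=(-1.0, 1.0, 1.0, 1.0), seed=0):
    """Draw n observations whose class probabilities are quadratic-logistic.

    beta = (intercept, coef on X1^2, coef on X2^2, coef on X1*X2).
    These values are illustrative assumptions, not taken from ISLR.
    """
    rng = np.random.default_rng(seed)        # fixed seed for repeatability
    X = rng.standard_normal((n, 2))          # uncorrelated normal predictors
    x1, x2 = X[:, 0], X[:, 1]
    # Linear predictor built from the quadratic and interaction terms
    eta = beta[0] + beta[1] * x1**2 + beta[2] * x2**2 + beta[3] * x1 * x2
    p = 1.0 / (1.0 + np.exp(-eta))           # logistic function
    y = rng.binomial(1, p)                   # one Bernoulli draw per row
    return X, y, p

X, y, p = sample_scenario5(200)
print("share of class 1:", y.mean())
```

Because the log-odds are quadratic in X1 and X2, the true decision boundary is a curve rather than a straight line, which is worth keeping in mind when you discuss this scenario.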

In each scenario, discuss with your team how you would expect each of the four classification methods (KNN, logistic regression, LDA, and QDA) to perform on the data that's described.

A few things to consider:

Note: these scenarios are excerpted from pages 153-154 of ISLR. I strongly recommend that you talk through the scenarios first, and then use the book to check your intuitions.



Applied (30 minutes)

If you have time after completing the conceptual part of this lab, let's leave the realm of the hypothetical and take these methods out for a drive on some real data! There are over 300 datasets available through the UCI Machine Learning Repository, many of them well suited to testing out classification methods.


Each dataset's page contains a brief summary of some basic metadata (# instances, # attributes, whether they're numerical or categorical, whether there are missing values, etc.) as well as a text description of each attribute. The E. Coli dataset is one example.






Choose a dataset that is of interest to you and your team, and see what you can classify!
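
As a possible starting point, the sketch below compares the four methods using 5-fold cross-validation in Python with scikit-learn (an assumption; the lab does not prescribe a language). So that it runs without a download, it uses the Iris data bundled with scikit-learn, which is also hosted in the UCI repository, as a stand-in for whichever dataset your team picks. The choices of k=5 neighbors, 5 folds, and standardizing the predictors for KNN and logistic regression are illustrative, not requirements of the lab.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: replace with the predictors (X) and labels (y) of your dataset
X, y = load_iris(return_X_y=True)

models = {
    # KNN is distance-based, so the predictors are standardized first
    "KNN (k=5)": make_pipeline(StandardScaler(),
                               KNeighborsClassifier(n_neighbors=5)),
    "Logistic": make_pipeline(StandardScaler(),
                              LogisticRegression(max_iter=1000)),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
}

cv_accuracy = {}
for name, model in models.items():
    # cross_val_score refits the model on each training fold and
    # returns one accuracy score per held-out fold
    scores = cross_val_score(model, X, y, cv=5)
    cv_accuracy[name] = float(scores.mean())
    print(f"{name:10s} mean CV accuracy = {cv_accuracy[name]:.3f}")
```

For a dataset with categorical attributes or missing values, you would need to encode or clean those columns before fitting any of these models.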


Deliverable

To get credit for this lab, submit your team's ideas at:


http://www.science.smith.edu/~jcrouser/SDS293/labs/lab6-responses.html