SDS 293 - Machine Learning

Introduction

Over the past two weeks, we have considered four different classification approaches: K-nearest neighbors (KNN), logistic regression, linear discriminant analysis (LDA), and quadratic discriminant analysis (QDA). In this lab, we'll consider the types of scenarios in which one approach might dominate the others. You'll work in teams of 2-4 people on this lab.

Conceptual (30 minutes)

We'll start by comparing and contrasting some theoretical scenarios. We'll assume for simplicity that each data set contains just two predictors, X₁ and X₂:

Scenario 1: There are 20 training observations in each of the two classes. X₁ and X₂ are uncorrelated random normal variables, with a different mean in each class.

Scenario 2: The details are the same as in Scenario 1, except that within each class X₁ and X₂ have a correlation of -0.5.

Scenario 3: The data were generated from a t distribution, and we have 50 observations per class. The t distribution has a similar shape to the normal (Gaussian) distribution, but it tends to yield more extreme points, that is, more points that are far from the mean.

Scenario 4: The data were generated from a normal distribution, with a correlation of 0.5 between the predictors in the first class, and a correlation of -0.5 between the predictors in the second class.

Scenario 5: Within each class, the observations were generated from a normal distribution with uncorrelated predictors. The responses were sampled from the logistic function using X₁², X₂², and X₁×X₂ as predictors.

Scenario 6: The details are the same as in Scenario 5, but the responses were sampled from a more complicated non-linear function.
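To make the Gaussian setups in Scenarios 1, 2, and 4 more concrete, here is a minimal sketch of how such data could be generated using only the Python standard library. It is not part of the original lab. The class means of (0, 0) and (1, 1), the unit variances, the 20 observations per class in Scenario 4, and the helper names (sample_class, make_scenario, sample_corr) are all assumptions made for illustration.

```python
import math
import random


def sample_class(n, mean, rho, rng):
    """Draw n points (x1, x2) with unit variances, the given mean, and correlation rho."""
    points = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        x1 = mean[0] + z1
        # Mixing z1 into x2 gives corr(x1, x2) = rho while keeping var(x2) = 1.
        x2 = mean[1] + rho * z1 + math.sqrt(1.0 - rho ** 2) * z2
        points.append((x1, x2))
    return points


def make_scenario(n_per_class, rho_class0, rho_class1, seed=0):
    """Return (X, y) for two classes; the class means are illustrative assumptions."""
    rng = random.Random(seed)
    class0 = sample_class(n_per_class, (0.0, 0.0), rho_class0, rng)
    class1 = sample_class(n_per_class, (1.0, 1.0), rho_class1, rng)
    X = class0 + class1
    y = [0] * n_per_class + [1] * n_per_class
    return X, y


def sample_corr(points):
    """Sample (Pearson) correlation between the two coordinates."""
    n = len(points)
    m1 = sum(p[0] for p in points) / n
    m2 = sum(p[1] for p in points) / n
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in points)
    s11 = sum((p[0] - m1) ** 2 for p in points)
    s22 = sum((p[1] - m2) ** 2 for p in points)
    return s12 / math.sqrt(s11 * s22)


# Scenario 1: uncorrelated predictors, different mean in each class.
scenario1 = make_scenario(20, 0.0, 0.0)
# Scenario 2: correlation of -0.5 within each class.
scenario2 = make_scenario(20, -0.5, -0.5)
# Scenario 4: correlation of 0.5 in the first class and -0.5 in the second
# (the per-class sample size of 20 is an assumption).
scenario4 = make_scenario(20, 0.5, -0.5)
```

With only 20 points per class the sample correlation can land far from its target, which is worth keeping in mind when thinking about how well each method can estimate its parameters.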

In each scenario, discuss with your team how you would expect each of the four classification methods (KNN, logistic regression, LDA, and QDA) to perform on the data that's described.

A few things to consider:

What assumptions does each method make?
Does the data described in the scenario break any of those assumptions?
What does that mean for the bias/variance tradeoff?
Is there one method that you think would perform best?
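As a reference point for the first question, the standard modeling assumptions of the three parametric methods can be written as follows for the two-predictor case. This summary is not part of the original lab. The notation is an assumption on our part: p(x) denotes Pr(Y = 1 | X = x), π_k the prior probability of class k, μ_k the class mean, and Σ (or Σ_k) the covariance matrix. KNN has no comparable formula, because it makes no parametric assumption about the shape of the decision boundary.

```latex
\begin{align*}
\text{Logistic regression:}\quad
  & \log\frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \\
\text{LDA:}\quad
  & X \mid Y = k \sim N(\mu_k, \Sigma), \quad
    \delta_k(x) = x^\top \Sigma^{-1} \mu_k
      - \tfrac{1}{2}\,\mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k \\
\text{QDA:}\quad
  & X \mid Y = k \sim N(\mu_k, \Sigma_k), \quad
    \delta_k(x) = -\tfrac{1}{2}\,(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k)
      - \tfrac{1}{2}\log\lvert\Sigma_k\rvert + \log \pi_k
\end{align*}
```

Logistic regression and LDA both produce boundaries that are linear in x, while the class-specific Σ_k in QDA makes its boundary quadratic in x.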

Note: these scenarios are excerpted from pages 153-154 of ISLR. I strongly recommend that you talk through the scenarios first, and then use the book to check your intuitions.
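Once you have talked through the scenarios, a small simulation is one way to test your intuitions empirically. The sketch below is not part of the original lab. It assumes Python with NumPy and scikit-learn. The class means of (0, 0) and (1, 1), the unit variances, K = 5 for KNN, the number of repetitions, and the function names (simulate, make_models, compare) are illustrative choices.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier


def simulate(n_per_class, rho0, rho1, rng):
    """Two Gaussian classes with unit variances; the means are assumptions."""
    cov0 = [[1.0, rho0], [rho0, 1.0]]
    cov1 = [[1.0, rho1], [rho1, 1.0]]
    X0 = rng.multivariate_normal([0.0, 0.0], cov0, size=n_per_class)
    X1 = rng.multivariate_normal([1.0, 1.0], cov1, size=n_per_class)
    X = np.vstack([X0, X1])
    y = np.repeat([0, 1], n_per_class)
    return X, y


def make_models():
    """Fresh, unfitted instances of the four classifiers."""
    return {
        "KNN (K=5)": KNeighborsClassifier(n_neighbors=5),
        # A large C makes scikit-learn's default L2 penalty negligible.
        "Logistic regression": LogisticRegression(C=1e6, max_iter=1000),
        "LDA": LinearDiscriminantAnalysis(),
        "QDA": QuadraticDiscriminantAnalysis(),
    }


def compare(rho0, rho1, n_train=20, n_test=2000, n_reps=50, seed=0):
    """Average test error rate of each method over repeated simulated data sets."""
    rng = np.random.default_rng(seed)
    errors = {name: [] for name in make_models()}
    for _ in range(n_reps):
        X_train, y_train = simulate(n_train, rho0, rho1, rng)
        X_test, y_test = simulate(n_test, rho0, rho1, rng)
        for name, model in make_models().items():
            model.fit(X_train, y_train)
            errors[name].append(1.0 - model.score(X_test, y_test))
    return {name: float(np.mean(errs)) for name, errs in errors.items()}


if __name__ == "__main__":
    # Scenario 4 style setup: correlation 0.5 in one class, -0.5 in the other.
    for name, err in compare(0.5, -0.5).items():
        print(f"{name}: mean test error {err:.3f}")
```

Calling compare(0.0, 0.0) or compare(-0.5, -0.5) gives setups in the spirit of Scenarios 1 and 2. Scenarios 3, 5, and 6 would need a different simulate function.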

Applied (30 minutes)

If you have time after completing the conceptual part of this lab, let's leave the realm of the hypothetical and take these methods out for a drive on some real data! There are over 300 datasets available through the UCI Machine Learning Repository, many of them well suited to testing out classification methods.

Each dataset's page contains a brief summary of some basic metadata (# instances, # attributes, whether they're numerical or categorical, whether there are missing values, etc.) as well as a text description of each attribute. The page for the E. Coli dataset is one example.

Choose a dataset that is of interest to you and your team, and see what you can classify!
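If your team wants a starting point, the following is one possible workflow, sketched under several assumptions. It uses Python with scikit-learn, which the lab does not prescribe. As a stand-in for whatever dataset you choose, it uses the Wine data bundled with scikit-learn (a copy of a UCI dataset). The choice of 5-fold cross-validation, K = 5 for KNN, and standardizing the predictors are illustrative choices.

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (an assumption): replace with the data your team picked.
X, y = load_wine(return_X_y=True)

models = {
    "KNN (K=5)": KNeighborsClassifier(n_neighbors=5),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
}

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_accuracy = {}
for name, model in models.items():
    # Scaling inside the pipeline means it is re-fit on each training fold,
    # so no information from the held-out fold leaks into the fit.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=cv)
    cv_accuracy[name] = float(scores.mean())
    print(f"{name}: mean CV accuracy {cv_accuracy[name]:.3f}")
```

Standardizing matters most for KNN, whose distance calculations are otherwise dominated by the predictors with the largest scale.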

Deliverable

To get credit for this lab, submit your team's ideas at:

http://www.science.smith.edu/~jcrouser/SDS293/labs/lab6-responses.html

Course Number	SDS 293
Semester	Fall 2016
Hours	TH 1:00-2:20
Location	Burton 209

Instructor	R. Jordan Crouser
email	jcrouser at smith (dot) edu
Office	Ford 344
Office Hours	W 10am-noon & by appointment

SDS 293 - Machine Learning
Lab 6: Comparing Classification Methods