This lab on Multiclass Support Vector Machines in R is an adaptation of p. 366-368 of “Introduction to Statistical Learning with Applications in R” by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Adapted by R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016) and re-implemented in Fall 2016 in tidyverse format by Amelia McNamara and R. Jordan Crouser at Smith College.

To follow along on your own machine, you can download this lab in R Markdown format here, or you can check out the Python version here.

library(e1071)

Below is the dataset we generated during the previous lab:

set.seed(1)
x = matrix(rnorm(200*2), ncol = 2)
x[1:100,] = x[1:100,]+2
x[101:150,] = x[101:150,]-2
class = c(rep(1,150),rep(2,50))

library(ggplot2)
ggplot(data.frame(x, class = factor(class)), aes(X1, X2, colour = class)) +
  geom_point()

9.6.4 SVM with Multiple Classes

If the response is a factor containing more than two levels, then the svm() function will perform multi-class classification using the one-versus-one approach. We explore that setting here by generating a third class of observations:

x = rbind(x, matrix(rnorm(50*2), ncol = 2))
class = c(class, rep(0,50))
x[class == 0,2] = x[class == 0,2]+2
data_3_classes = data.frame(x = x, class = as.factor(class))

ggplot(data_3_classes, aes(x.1, x.2, colour = class)) +
  geom_point()

Fitting an SVM to multiclass data uses identical syntax to fitting a simple two-class model:

svmfit = svm(class~., data = data_3_classes, kernel = "radial", cost = 10, gamma = 1)
plot(svmfit, data_3_classes)
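
The plot shows the fitted decision regions. To see how the model labels each training observation, one quick check (the same table() pattern we use on the Khan data below) is to cross-tabulate the fitted classes against the true classes:

table(svmfit$fitted, data_3_classes$class)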

The e1071 library can also be used to perform support vector regression: if the response vector passed to svm() is numeric rather than a factor, svm() fits a regression model instead of a classifier.
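
For instance, here is a minimal regression sketch on made-up data (x_reg, y_reg, and svr_fit are illustrative names, not objects used elsewhere in this lab); because the response is numeric, svm() defaults to eps-regression:

set.seed(3)
x_reg = runif(100)
y_reg = sin(2 * pi * x_reg) + rnorm(100, sd = 0.2)
reg_data = data.frame(x = x_reg, y = y_reg)

# numeric response, so svm() fits eps-regression rather than C-classification
svr_fit = svm(y ~ x, data = reg_data, kernel = "radial")
summary(svr_fit)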

9.6.5 Application to Gene Expression Data

We now examine the Khan dataset from the ISLR library, which consists of a number of tissue samples corresponding to four distinct types of small round blue cell tumors. For each tissue sample, gene expression measurements are available.

The data set consists of training data, xtrain and ytrain, and testing data, xtest and ytest:

library(ISLR)
library(dplyr)
names(Khan)

Let’s take a look at the dimensions of this dataset:

dim(Khan$xtrain)
dim(Khan$xtest)

This data set consists of expression measurements for 2,308 genes. The training and test sets consist of 63 and 20 observations respectively. Let’s see how the classes compare:

table(Khan$ytrain)
table(Khan$ytest)

We will use a support vector approach to predict cancer subtype using gene expression measurements. In this dataset, there are a very large number of features relative to the number of observations. This suggests that we should use a linear kernel, because the additional flexibility that will result from using a polynomial or radial kernel is unnecessary.

khan_train = data.frame(x = Khan$xtrain, y = as.factor(Khan$ytrain))
khan_svm = svm(y~., data = khan_train, kernel = "linear", cost = 10)
table(khan_svm$fitted, khan_train$y)

We see that there are no training errors. In fact, this is not surprising, because the large number of variables relative to the number of observations implies that it is easy to find hyperplanes that fully separate the classes. We are most interested not in the support vector classifier’s performance on the training observations, but rather its performance on the test observations:

khan_test = data.frame(x = Khan$xtest, y = as.factor(Khan$ytest))
pred = predict(khan_svm, newdata = khan_test)
table(pred, khan_test$y)

We see that using cost = 10 yields two test set errors on this data.
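
If you prefer a single summary number, the same comparison can be expressed as a test misclassification rate, using the pred and khan_test objects created above:

mean(pred != khan_test$y)

Note that cost = 10 was chosen by hand here. As a rough sketch (tune_khan is just an illustrative name, and the cost grid below is arbitrary), you could instead let cross-validation choose cost with tune(); this can be slow with 2,308 predictors:

tune_khan = tune(svm, y ~ ., data = khan_train, kernel = "linear",
                 ranges = list(cost = c(0.01, 0.1, 1, 5, 10)))
summary(tune_khan)
tune_khan$best.parameters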

Problem 9.7.8

Now it’s your turn! In this section of the lab, we’ll explore the OJ dataset from the ISLR package. The data contain 1,070 purchases in which the customer purchased either Citrus Hill or Minute Maid Orange Juice. A number of characteristics of the customer and product are recorded:

summary(OJ)

Let’s start by splitting the dataset into a training set containing a random sample of 800 observations, and a test set containing the remaining observations:

set.seed(1)

OJ_train = OJ %>%
  sample_n(800)

OJ_test = OJ %>%
  setdiff(OJ_train)

In the space below, fit a support vector classifier to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics, and describe the results obtained:

# Your code here:
svm_linear  = 

The code below will generate confusion matrices so we can see how your model does on the training data:

table(OJ_train$Purchase, predict(svm_linear, OJ_train))

And the test data:

table(OJ_test$Purchase, predict(svm_linear, OJ_test))

Now try using the tune() function to select an optimal value for cost, and refit the model using that value. Consider values in the range 0.01 to 10:

# Your code here
tune_out  =  
svm_linear_tuned  = 

# Performance check
table(OJ_test$Purchase, predict(svm_linear_tuned, OJ_test))

Now try fitting an SVM with kernel = "radial", using the default value for gamma and cross-validation to find the best value for cost:

# Your code here
tune_out2  =  
svm_radial_tuned  = 

# Performance check
table(OJ_test$Purchase, predict(svm_radial_tuned, OJ_test))

And now try kernel = "polynomial" with degree = 2:

# Your code here
tune_out3  =  
svm_quadratic_tuned  = 

# Performance check
table(OJ_test$Purchase, predict(svm_quadratic_tuned, OJ_test))

To get credit for this lab, post about your best-performing model on the OJ dataset to Piazza (https://piazza.com/class/isuhyl9uxya5vm?cid=41), answering the following:

- Which model performed best on the training data? With which parameters?
- Which model performed best on the test data? With which parameters?
- What does all this tell you about the dataset?