This post demonstrates how classification can be done using a Support Vector Machine (SVM). SVM is a supervised learning algorithm that classifies a training dataset by finding an optimal hyperplane separating the class labels. The purposes of the training, validating and test datasets in the course of model development are also discussed. The dataset comes from the Machine Learning online course material offered by Dr Andrew Ng through Coursera. In that course we implemented SVM using Octave; here I show how to implement it in R, as well as the importance of the three datasets.
Before implementing SVM in R, I briefly explain the theory behind the hypothesis function, the cost function and the decision boundary. Then I show, step by step, how to implement SVM in R.
Hypothesis Function: The hypothesis function $h_\theta(x)$ is used to estimate the output (y) from the input data (X). The hypothesis function for SVM is: $$ h_\theta(x)= \frac{1}{1+e^{-z}}$$
$$z=\theta^Tx$$
Where $\theta$ is a vector of weight parameters estimated from the objective function (see below), and $\theta^T$ stands for the transpose of $\theta$. X (capital) is the matrix of predictor variables or features, while x (small) is the vector of predictor values for a single observation. Unlike logistic regression, $h_\theta(x)$ is not interpreted as the probability that the output is 1 or 0; rather, it outputs either 0 or 1 directly: $$h_\theta(x)=\begin{cases}1 & \mbox{if } \theta^Tx\geq0\\0 & \mbox{otherwise}\end{cases}$$
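As a quick illustration (the numbers below are made up and are not from the course data), the following R snippet evaluates the hypothesis for a single observation:
theta = c(-1, 2, 0.5)         # hypothetical weight vector; the first entry is the intercept
x = c(1, 0.3, 1.2)            # hypothetical observation; the leading 1 matches the intercept
z = sum(theta * x)            # theta^T x
ifelse(z >= 0, 1, 0)          # h_theta(x): predict 1 if theta^T x >= 0, otherwise 0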
Cost Function for SVM: The cost function measures the accuracy of the hypothesis function; that is, it measures how much the predicted output deviates from the actual value. The parameters $\theta$ are estimated in such a way that the cost function is minimized. The cost function for SVM with regularization is: $$J(\theta)=C\sum_{i=1}^m\left[-y^{(i)}\log\left(h_\theta(x^{(i)})\right)-(1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right]+ \frac{1}{2}\sum_{j=1}^n\theta_j^2$$
$$J(\theta)=C\sum_{i=1}^m\left[y^{(i)}\left(-\log\frac{1}{1+e^{-\theta^Tx^{(i)}}}\right)+ (1-y^{(i)})\left(-\log\left(1-\frac{1}{1+e^{-\theta^Tx^{(i)}}}\right)\right)\right] + \frac{1}{2}\sum_{j=1}^n\theta_j^2$$
$$J(\theta)=C\sum_{i=1}^m\left[y^{(i)}\,cost_1(\theta^Tx^{(i)})+ (1-y^{(i)})\,cost_0(\theta^Tx^{(i)})\right] + \frac{1}{2}\sum_{j=1}^n\theta_j^2$$
Where $J(\theta)$ is the cost function we want to minimize, $y^{(i)}$ is the actual class label of the $i^{th}$ observation, C is the regularization parameter used to control underfitting and overfitting, $cost_1(\theta^Tx^{(i)})$ is the cost when $y=1$, and $cost_0(\theta^Tx^{(i)})$ is the cost when $y=0$.
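To make the pieces of this formula concrete, here is a small R sketch of $J(\theta)$ built directly from the definitions of $cost_1$ and $cost_0$ above; the data, theta and C values are made up for illustration only:
sigmoid = function(z) 1 / (1 + exp(-z))
cost1 = function(z) -log(sigmoid(z))        # cost contribution when y = 1
cost0 = function(z) -log(1 - sigmoid(z))    # cost contribution when y = 0
svm_cost = function(theta, X, y, C) {
  z = as.vector(X %*% theta)                         # theta^T x for every observation
  fit = C * sum(y * cost1(z) + (1 - y) * cost0(z))   # misclassification part of the cost
  reg = 0.5 * sum(theta[-1]^2)                       # regularization term (intercept not penalized)
  fit + reg
}
X = cbind(1, c(0.2, 1.5, -0.7), c(1.1, -0.3, 0.8))   # 3 hypothetical observations: intercept column plus 2 features
y = c(1, 0, 1)
svm_cost(theta = c(0.1, 0.5, -0.2), X = X, y = y, C = 1)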
Decision Boundary: The decision boundary, or hyperplane, in SVM is created in such a way that the distances from the boundary to its closest data points are as large as possible. The data points closest to the decision boundary are called support vectors, and they determine where the optimal boundary lies. The distance from the boundary to the support vectors is called the margin, and SVM maximizes this margin around the separating hyperplane. The regularization parameter (C) has to be very large in order to achieve a large margin. When C is large, the term $\sum_{i=1}^m\left[y^{(i)}\,cost_1(\theta^Tx^{(i)})+ (1-y^{(i)})\,cost_0(\theta^Tx^{(i)})\right]$ in the cost function has to be close to zero for the cost to be minimal. Thus, the cost function reduces to: $$J(\theta)=C\cdot0 + \frac{1}{2}\sum_{j=1}^n\theta_j^2$$ $$J(\theta)=\frac{1}{2}\sum_{j=1}^n\theta_j^2=\frac{1}{2}||\theta||^2$$
and the objective is to minimize this cost function, $$\min_\theta J(\theta)=\min_\theta\frac{1}{2}||\theta||^2$$ subject to the constraints:
$\theta^Tx^{(i)}\geq 1$ if $y^{(i)}=1$ and
$\theta^Tx^{(i)}\leq -1$ if $y^{(i)}=0$
We can rewrite $\theta^Tx^{(i)}$ as an inner product: $\theta^Tx^{(i)}=p^{(i)}||\theta||=\theta_1x_1^{(i)}+\theta_2x_2^{(i)}+...$, where $p^{(i)}$ is the projection of $x^{(i)}$ onto $\theta$ and represents the (signed) distance from the decision boundary to $x^{(i)}$, while $\theta$ is perpendicular to the decision boundary. Thus, the constraints become:
$p^{(i)}||\theta||\geq1$ if $y^{(i)}=1$ and
$p^{(i)}||\theta||\leq-1$ if $y^{(i)}=0$
The above constraints can be satisfied with a small $||\theta||$ only when the absolute value of the projection $p^{(i)}$ is large. In particular, for a support vector the constraint holds with equality, $|p^{(i)}|\,||\theta||=1$, so its distance from the boundary is $|p^{(i)}|=1/||\theta||$; minimizing $||\theta||$ therefore maximizes that distance. This is why SVM maximizes the margin around the separating hyperplane. For datasets with a non-linear decision boundary we use a kernel function; in this example, we use the Gaussian (RBF) kernel to find the decision boundary.
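For reference, here is a minimal sketch of the Gaussian kernel; the two points and the sigma value below are arbitrary. In the e1071 package used later, the "radial" kernel computes exp(-gamma * ||x1 - x2||^2), so gamma plays the role of 1/(2*sigma^2).
gaussian_kernel = function(x1, x2, sigma) {
  exp(-sum((x1 - x2)^2) / (2 * sigma^2))   # similarity decays with the squared distance between the points
}
gaussian_kernel(x1 = c(1, 2, 1), x2 = c(0, 4, -1), sigma = 2)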
library(ggplot2)    # for visualization
library(e1071)      # to perform SVM
library(plot3D)     # for 3D plots
library(R.matlab)   # to read .mat files in R
library(gridExtra)  # to arrange multiple ggplot graphs on a page
mydata = readMat("ex6data2.mat")
mydata = as.data.frame(mydata)  # convert the data into a data frame
mydata$y = as.factor(mydata$y)
head(mydata)  # look at the first few observations
Once the data was imported into R, it was split into training, validating and test datasets in a ratio of 60:20:20, respectively.
set.seed(1)  # make the random sampling reproducible
tr_size = round(nrow(mydata) * 0.6)   # the size of the training dataset (60%)
vsize = nrow(mydata) - tr_size        # the combined size of the validating and test datasets
tr_sample = sample(nrow(mydata), tr_size, replace = FALSE)  # randomly select the training dataset
train_data = mydata[tr_sample, ]
# Again split the remaining data into validating and test datasets
new_data = mydata[-tr_sample, ]
set.seed(1)
val_sample = sample(nrow(new_data), ceiling(vsize * 0.5), replace = FALSE)
val_data = new_data[val_sample, ]
test_data = new_data[-val_sample, ]
# check the samples generated for each dataset
nrow(test_data)   # number of observations in the test dataset
nrow(val_data)    # number of observations in the validating dataset
nrow(train_data)  # number of observations in the training dataset
nrow(mydata)      # number of observations in the whole dataset
# sets the height and width of the plots
options(repr.plot.width=5)
options(repr.plot.height=4)
# plot the complete dataset
ggplot(data = mydata, aes(x = X.1, y = X.2)) + geom_point(aes(shape = y)) + geom_point(aes(color = y)) + scale_shape_manual(values = c(1, 3)) + ggtitle("Scatter plot of the complete dataset")
# plot the training dataset
ggplot(data = train_data, aes(x = X.1, y = X.2)) + geom_point(aes(shape = y)) + geom_point(aes(color = y)) + scale_shape_manual(values = c(1, 3)) + ggtitle("Scatter plot of the training dataset")
# plot the validating dataset
ggplot(data = val_data, aes(x = X.1, y = X.2)) + geom_point(aes(shape = y)) + geom_point(aes(color = y)) + scale_shape_manual(values = c(1, 3)) + ggtitle("Scatter plot of the validating dataset")
# plot the test dataset
ggplot(data = test_data, aes(x = X.1, y = X.2)) + geom_point(aes(shape = y)) + geom_point(aes(color = y)) + scale_shape_manual(values = c(1, 3)) + ggtitle("Scatter plot of the test dataset")
I used the training dataset to develop a classification model. The built-in svm() function from the “e1071” library was used to train the model. Two of its tuning arguments, cost and gamma, have to be set by the user. After fitting models on the training dataset with different combinations of cost and gamma values, the class labels of the validating dataset were predicted with each model. Then, the deviation of the predictions from the actual values was computed for each model. The model with the smallest error on the validating dataset is chosen as the best model because it generalizes best to a new dataset.
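To make the tuning step explicit, a manual version of this grid search could look roughly like the sketch below (shown for illustration only; it uses the same cost and gamma grid as the tune() call that follows):
# illustration: fit a model for every cost/gamma pair and score it on the validating dataset
grid = expand.grid(gamma = c(0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30),
                   cost  = c(0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30))
grid$error = NA
for (k in seq_len(nrow(grid))) {
  fit  = svm(y ~ ., data = train_data, type = "C-classification", kernel = "radial",
             cost = grid$cost[k], gamma = grid$gamma[k])
  pred = predict(fit, val_data[, -3])
  grid$error[k] = mean(pred != val_data$y)   # misclassification error on the validating dataset
}
grid[which.min(grid$error), ]                # the best-performing cost/gamma combination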
The tune() function from the “e1071” library can do this job as follows:
parm = tune(svm, train.x = train_data[, -3], train.y = as.factor(train_data$y), validation.x = val_data[, -3], validation.y = as.factor(val_data$y), ranges = list(gamma = c(0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30), cost = c(0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30)), tunecontrol = tune.control(sampling = "fix", fix = 1))
#summary(parm) # shows the performance
The above function compares the performance of 64 models (all combinations of the 8 gamma values and 8 cost values {0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30}).
# contour plot of the tuning parameters
plot(parm)
# 3D plot of performance over the gamma/cost grid
#---------------------------------------------------------
gamma_values = unique(parm$performances$gamma)
cost_values = unique(parm$performances$cost)
persp3D(x = gamma_values, y = cost_values, z = matrix(parm$performances$error, nrow = length(gamma_values), ncol = length(cost_values)), theta = -45, phi = 0, expand = 0.6, ticktype = 'detailed', col = 'blue', shade = 0.75, ltheta = 120, xlab = "gamma", ylab = "cost", zlab = 'Error', main = 'Performance of SVM for different gamma and cost values', cex.main = 0.9)
The values of gamma and cost that gave the best performance on the validating dataset among the 64 combinations are:
parm$best.parameters
# choose the best parameters and use them to make a model
best_cost=parm$best.parameters$cost
best_gamma=parm$best.parameters$gamma
best_model= svm(y ~ . ,data =train_data, type= "C-classification", kernel= "radial", cost=best_cost , gamma=best_gamma)
summary(best_model) # the summary of the best model
x1plot = seq(min(train_data$X.1), max(train_data$X.1), length= 100)
x2plot = seq(min(train_data$X.2), max(train_data$X.2), length= 100)
m = mesh(x=x1plot, y=x2plot)
vals = matrix(0, nrow=nrow(m$x), ncol = ncol(m$x)) # place holder
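# predict the class at every grid point, one mesh column at a time; contour() below
# only needs where these values change in order to draw the decision boundary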
for(i in 1: ncol(m$x))
{
this_X = cbind(m$x[,i], m$y[, i])
vals[,i] = predict(best_model, this_X)
}
#-------------------------------------------------------------------
# Plot the SVM boundary
#-------------------------------------------------------------------
posw=which(mydata$y%in%1) # index for y=1
negw=which(mydata$y%in%0) # index for y=0
plot( X.2~ X.1, data=mydata[posw,-3], col="cyan3", pch=18, main='The complete dataset with non-linear decision boundary learned by SVM',cex.main=0.8)
points( X.2 ~ X.1, data=mydata[negw,-3], col="coral1", pch=20)
legend('topright',title='y', c('0','1'),cex=1, pch=c(20,18), col=c('coral1','cyan3'), y.intersp=1.5)
par(new=TRUE) # to add the boundary line to the original data set
contour(x=m$x[,1], y=m$y[1,], z=vals, nlevels = 10, col='black', drawlabels = FALSE) # the boundary line
The next plot shows two decision boundaries: one computed from the best model (shown in black) and one from another model with cost=0.1 and gamma=3 (shown in yellow). As the plot shows, the best model predicts the class labels of y better than the other model.
model2=svm(y ~ . ,data =train_data, type= "C-classification", kernel= "radial", cost=0.1 , gamma=3)
x1plot2 = seq(min(train_data$X.1), max(train_data$X.1), length= 100)
x2plot2 = seq(min(train_data$X.2), max(train_data$X.2), length= 100)
m2 = mesh(x=x1plot2, y=x2plot2)
vals2 = matrix(0, nrow=nrow(m2$x), ncol = ncol(m2$x)) # place holder
for(i in 1: ncol(m2$x))
{
this_X2 = cbind(m2$x[,i], m2$y[, i])
vals2[,i] = predict(model2, this_X2)
}
#-------------------------------------------------------------------
# Plot the SVM boundary
#-------------------------------------------------------------------
plot( X.2~ X.1, data=mydata[posw,-3], col="cyan3", pch=18, main='The complete data with non-linear decision boundary learned by SVM',cex.main=0.8)
points( X.2 ~ X.1, data=mydata[negw,-3], col="coral1", pch=20)
legend('topright',title='y', c('0','1'), pch=c(20,18), col=c('coral1','cyan3'), y.intersp=1.5)
par(new=TRUE) # to add the boundary line to the original data set
contour(x=m2$x[,1], y=m2$y[1,], z=vals2, nlevels = 10, col = "yellow", drawlabels = FALSE) # the boundary line
par(new=TRUE)
contour(x=m$x[,1], y=m$y[1,], z=vals, nlevels = 10, col = "black", drawlabels = FALSE) # the boundary line
The test dataset is used here to check the generalizability of the model to a new dataset. The test observations are highlighted in blue in the following plot, with two different shapes indicating the actual class label of each observation. We can easily see that the majority of the test observations are correctly classified by the decision boundary.
# add the test data to visualize how well the decision boundary separates its class labels. This shows the generalizability of the model
#-----------------------------------------------------------------------------
pos=which(test_data$y%in%1)
neg=which(test_data$y%in%0)
plot(X.2~ X.1, data=mydata[posw,-3], col="cyan3", pch=18, main="The class labels of the test dataset as separated by the decision boundary", cex.main=0.8)
points( X.2 ~ X.1, data=mydata[negw,-3], col="coral1" ,pch=20)
par(new=TRUE) # to add the boundary line to the original data set
contour(x=m$x[,1], y=m$y[1,], z=vals, nlevels = 10, col='black', drawlabels = FALSE)
points( X.2~ X.1, data=test_data[pos,], col="blue", pch=8)
points( X.2 ~ X.1, data=test_data[neg,], col="blue", pch=17)
legend('topright',title='Test_data', c('0','1'), pch=c(17,8), col=c('blue','blue'), pt.cex=1, cex=.8, y.intersp=1.5,bty='n', inset=c(0.03,0.02))
The following code produces the confusion matrices for the training, validating and test datasets.
# compute the performance of the model
# compare the actual and predicted class labels for all datasets (training, validating and test)
# using confusion matrix
compare = function(dataset)
{
  pred = predict(best_model, dataset[, -3])
  result = as.data.frame(pred)
  result['actual'] = dataset$y
  names(result) = c('predicted', 'actual')
  table(result)
}
confusion_matrix_train= compare(train_data)
paste('Confusion matrix for training dataset')
confusion_matrix_train
confusion_matrix_valid=compare(val_data)
paste('Confusion matrix for validating dataset')
confusion_matrix_valid
confusion_matrix_test=compare(test_data)
paste('Confusion matrix for test dataset')
confusion_matrix_test
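These matrices can be summarized with a few standard measures. Writing TP, TN, FP and FN for the counts of true positives, true negatives, false positives and false negatives: $$\text{error}=\frac{FP+FN}{TP+TN+FP+FN},\quad \text{accuracy}=1-\text{error},\quad \text{recall}=\frac{TP}{TP+FN},\quad \text{precision}=\frac{TP}{TP+FP}$$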
Performance measures such as error, accuracy, recall and precision are then computed for each dataset as follows:
# compute the error, accuracy, precision and recall of the model predictions for all three datasets
performance = function(confusion_matrix)
{
  # rows of the confusion matrix are predicted labels, columns are actual labels (see compare() above)
  error = (confusion_matrix[1,2] + confusion_matrix[2,1]) / sum(confusion_matrix)
  accuracy = 1 - error
  recall = confusion_matrix[2,2] / sum(confusion_matrix[2,2], confusion_matrix[1,2])      # TP / (TP + FN)
  precision = confusion_matrix[2,2] / sum(confusion_matrix[2,2], confusion_matrix[2,1])   # TP / (TP + FP)
  return(c("error"=error, "accuracy"=accuracy, "recall"=recall, "precision"=precision))
}
t(data.frame(Performance_train=performance(confusion_matrix_train),Performance_valid=performance(confusion_matrix_valid), Performance_test=performance(confusion_matrix_test)))
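As a quick cross-check of the table above (not part of the original exercise), the test-set accuracy can also be computed directly from the predictions:
mean(predict(best_model, test_data[, -3]) == test_data$y)   # proportion of correctly classified test observations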
The support vector machine is an efficient classification method, especially for complex, non-linear classification problems using kernel functions. I hope the step-by-step implementation of SVM in R, along with the brief theoretical explanations in this post, provides adequate insight.