## Friday, April 10, 2015

### Machine Learning: Support Vector Machine

Have been finding time to understand the concept behind support vector machine (SVM) and finally implemented the learning algorithm in Python. I'll write this post to review my understanding so far.

Support vector machine is in its simplest form is a type of linear classifier. The intuition behind SVM is usually demonstrated in various machine learning books by asking readers this question: amongst several correct linear classifiers, which one would you think is the best? And it is highlighted that we will tend to prefer the one which has bigger margin between the points closest to the classifier hyper plane. This is then formalised mathematically, which in the end reducible as a type of convex optimisation in particular solvable using a quadratic programming algorithm.

Given (x, y) a data point, let $$f(x) = g(w^T x + b)$$ be the classifier function ( $$g(x) = 1$$ if $$x \geq 0$$ and -1 otherwise), where $$w$$ is a weight vector, $$x$$ is input vector, $$y$$ is a classification (either +1 or -1) and $$b$$ is the bias. Define functional margin $$M$$ as $$M = y(w^Tx + b)$$. If $$w$$ and $$b$$ is scaled by $$c$$, M will also be scaled by $$c$$. Hence functional margin is affected by scaling of $$w$$ and $$b$$.

Next notice that $$w$$ is perpendicular to the boundary hyperplane $$w^Tx + b = 0$$. Given a point $$x$$ at the positive region, let M' be the distance between $$x$$ and the boundary plane. We have $$x - M'w = x'$$ where $$x'$$ is on the plane $$w^Tx+b = 0$$. That means $$w^Tx'+b=0$$, hence

\begin{align*} w^T(x-M'w)+b &= 0 \\ w^Tx+b &= M'w^Tw \\ M' &= \frac{w^Tx}{\| w \|} + \frac{b}{\| w \|} \end{align*}

We call M' the geometrical margin. Notice M' is not affected by scaling of $$w$$ and $$b$$.

M' and M is related by $$\| w \|$$, when $$\| w \| = 1$$, M = M'.

Our goal is to maximise the minimum M'. Hence one formulation could be

$$\begin{array}{cc} \text{max} & M = min_{i} M^{(i)} \\ \text{such that} & y^{(i)}(w^Tx^{(i)}+b) \geq M^{(i)} \\ & \| w\| = 1 \end{array}$$

But the constraint $$\| w \|$$ is not linear. So let go of the $$\| w\| = 1$$ condition. Instead we will fix $$M = 1$$. Doing so does not lead to any loss of generality, since once we do the required scaling on $$w$$ and $$b$$, the geometrical margin $$M'^{(i)}$$ are not affected, hence we still maximising the same M' but with the new condition that $$M' = \frac{1}{\| w \|}$$. Then notice that maximising $$\frac{1}{\| w\|}$$ is equivalent to minimising $$\frac{1}{2} w^Tw$$, where the coefficient half is just for make the derivation more convenient later on. Hence the optimisation problem can be stated as

$$\begin{array}{cc} \text{min} & \frac{1}{2} w^Tw \\ \text{such that} & y^{(i)}(w^Tx^{(i)}+b) \geq 1 \\ \end{array}$$

Let's have the Lagrangian function:
$$L(w,b,\alpha) = \frac{1}{2} w^Tw + \sum_i \left[ \alpha_i y_i(w^Tx^{(i)}+b)-1 \right]$$
with $$\alpha \geq 0$$.

Computing the Lagrangian stationarity:
$$\frac{\partial L(w,b, \alpha) } { \partial w } = 0$$
$$\frac{\partial L(w,b, \alpha) } { \partial b } = 0$$

$$w = \sum_i \alpha_i y_i x^{(i)}$$
and
$$b = \sum_i \alpha_i y_i$$

Substitute those expression back into the Lagrangian, we have
$$L (w, b, \alpha) = L_D(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x^{(i)})^Tx^{(j)}$$
which is the dual form of the Lagrangian.

And if primal feasibility ( $$1 - y^{(i)}(w^Tx^{(i)}+b) \leq 0$$ ),  dual feasibility ( $$\alpha_i \geq 0$$ ), complementary slackness ( $$\alpha_i y_i (w^Tx^{(i)}+b) = 0$$ ), and Lagrangian stationarity are satisfied, KKT conditions are satisfied and the dual optimal will be of primal optimal.

Hence the SVM reduces to the following optimisation problem:

$$\begin{array}{cc} \text{min} & L_D(\alpha) \\ \text{such that} & \alpha \geq 0 & \sum_i \alpha_i y_i = 0 \\ \end{array}$$

This is called the hard classifier version of the SVM, which assumes that the whole sample is linearly separable. The generalisation to that is the soft classifier SVM which allows some error in the classification by allowing some points to have a geometrical margin of less than the desired minimum margin. This is done by setting $$w^Tx+b \geq M(1-e)$$ which in the end leads to a slightly more verbose Lagrangian function, which after careful algebraic manipulation will lead to the following optimisation problem

$$\begin{array}{cc} \text{min} & L_D(\alpha) \\ \text{such that} & 0 \leq \alpha \leq C & \sum_i \alpha_i y_i = 0 \\ \end{array}$$

Where C can be seen as the cost of error, or also can be seen as the regularisation term of SVM.

Finally to what makes SVM really loved by people is its special power to make use of Kernel, especially the radial basis function (RBF) kernel $$e^{|-\gamma (x' - x)|}$$ which allows us to transform the linear SVM to an infinite dimensional space without incurring the cost of computation. This idea is what makes SVM really powerful, and the effect of kernel does not affect the optimisation problem at all.

In order to make SVM reach its optimal point, quadratic programming packages are usually used. However there is an equally, or maybe more attractive optimisation algorithm called the sequential minimal optimisation (SMO) which employs the idea of coordinate ascent algorithm in order to minimise $$L_D(\alpha)$$. It is done by choosing two alphas at a time, and by working out the algebra the update rule turns out to be very simple and sweet. I have implemented the algorithm using Python, which I will put below.

# Implementation of Support Vector Machine with RBF Kernel
# Making use of Sequential Minimal Optimization

from numpy import *
from pylab import *
from numpy import random

class SVM:
EPS = 1e-12
def __init__(self):
self.C = 1e100
self.error = 1e-3
self.nonBoundedList = []
self.kernel = RbfKernel();

def setTrainingData(self, inputData, targetData):
self.inputData = inputData
self.targetData = targetData
self.numOfData, self.inputDimension = shape(inputData)

def setKernel(self, kernel):
self.kernel = kernel

def setConstraint(C):
self.C = C

def setTolerance(error):
self.error = error

def train(self):
self.alpha = zeros((self.numOfData, 1))
self.E = zeros((self.numOfData, 1))
self.isCached = [False for i in xrange(self.numOfData)]
self.bias = 0
# Heuristic for choosing first example
self.kktViolated = False
self.kktViolated = False
for i in xrange(self.numOfData):
if self.examineExample(i):
self.kktViolated = True
else:
for i in xrange(self.numOfData):
if (self.isUnbounded(self.alpha[i])):
if self.examineExample(i):
self.kktViolated = True
elif not self.kktViolated:

def examineExample(self, i):
E = self.computeError(i);
r = E*self.targetData[i];
if (r < -self.error and self.alpha[i] < self.C) or (r > self.error and self.alpha[i] > 0):
# Heuristic for choosing second example

# First heuristic: find the maximum |E1 - E2|
j = -1
dE = 0
for k in xrange(self.numOfData):
if fabs(self.computeError(k) - self.computeError(i)) > dE and i != k:
dE = fabs(self.computeError(k) - self.computeError(i))
j = k
if j != -1:
if self.takeStep(i, j):
return True

# Second heuristic: amongst unbounded, try taking step
# randomize
index = [k for k in xrange(self.numOfData)]
random.shuffle(index)
for k in index:
if self.isUnbounded(self.alpha[k]):
if self.takeStep(i, k):
return True

random.shuffle(index)
for k in index:
if self.takeStep(i, k):
return True
return False

def computeError(self, i):
error = self.E[i] if self.isCached[i] else self.computeMargin(self.inputData[i]) - self.targetData[i]
self.E[i] = error
self.isCached[i] = True
return error

def isUnbounded(self, alpha):
return 0 < alpha and alpha < self.C

def takeStep(self, i, j):
if i == j:
return False

s = self.targetData[i] * self.targetData[j]
L = max(0, self.alpha[j] + s * self.alpha[i] - (1+s)/2 * self.C)
H = min(self.C, s * self.alpha[i] + self.alpha[j] + (1-s)/2 * self.C)
if L == H:
return False
E1 = self.computeError(i)
E2 = self.computeError(j)
if fabs(E1-E2) < SVM.EPS:
return False

k11 = self.kernel.compute(self.inputData[i], self.inputData[i])
k22 = self.kernel.compute(self.inputData[j], self.inputData[j])
k12 = self.kernel.compute(self.inputData[i], self.inputData[j])
eta = 2 * k12 - k11 - k22

if (eta < 0):
a2 = (E2 - E1) * self.targetData[j] / eta + self.alpha[j]
else:
c1 = eta/2
c2 = self.targetData[j] * (E1 - E2) - eta * self.alpha[j]
Lobj = c1 * L * L + c2 * L
Hobj = c1 * H * H + c2 * H
a2 = L if Lobj > Hobj + SVM.EPS else self.alpha[j]
a2 = H if Hobj > Lobj + SVM.EPS else a2

a2 = a2 if a2 > L else L
a2 = a2 if a2 < H else H
a1 = self.alpha[i] + s * self.alpha[j] - s * a2

if fabs(self.alpha[j]) < SVM.EPS:
self.alpha[j] = 0

if fabs(self.alpha[j] - self.C) < SVM.EPS:
self.alpha[j] = self.C

if fabs(a2 - self.alpha[j]) < SVM.EPS * (self.alpha[j] + a2 + SVM.EPS):
return False

b1 = E1 + self.targetData[i] * (a1 - self.alpha[i]) * k11 + self.targetData[j] * (a2 - self.alpha[j]) * k12 + self.bias;
b2 = E2 + self.targetData[i] * (a1 - self.alpha[i]) * k12 + self.targetData[j] * (a2 - self.alpha[j]) * k22 + self.bias;
newBias = (b1+b2)/2
self.E[i] = self.E[i] + self.targetData[i] * (a1 - self.alpha[i]) * k11 + self.targetData[j] * (a2 - self.alpha[j]) * k12 + self.bias - newBias
self.E[j] = self.E[j] + self.targetData[i] * (a1 - self.alpha[i]) * k12 + self.targetData[j] * (a2 - self.alpha[j]) * k22 + self.bias - newBias
self.bias = newBias
self.alpha[i] = a1
self.alpha[j] = a2
return True

def computeMargin(self, input):
ret = 0
for i in xrange(self.numOfData):
prod = self.kernel.compute(self.inputData[i], input)
ret += self.alpha[i] * self.targetData[i] * prod
return ret

def classify(self, input):
ret = self.computeMargin(input)
return 1 if ret - self.bias >= 0 else -1

def getNumberOfSV(self):
ret = 0
for i in xrange(self.numOfData):
ret += 1 if (self.isUnbounded(self.alpha[i])) else 0
return ret

class RbfKernel:
def __init__(self):
self.gamma = 1
pass

def setGamma(self, gamma):
self.gamma = gamma

def compute(self, x1, x2):
return exp(-linalg.norm(x1-x2) * self.gamma)