Support vector machine is in its simplest form is a type of linear classifier. The intuition behind SVM is usually demonstrated in various machine learning books by asking readers this question: amongst several correct linear classifiers, which one would you think is the best? And it is highlighted that we will tend to prefer the one which has bigger margin between the points closest to the classifier hyper plane. This is then formalised mathematically, which in the end reducible as a type of convex optimisation in particular solvable using a quadratic programming algorithm.

Given (x, y) a data point, let \( f(x) = g(w^T x + b) \) be the classifier function ( \(g(x) = 1\) if \(x \geq 0\) and -1 otherwise), where \(w\) is a weight vector, \(x\) is input vector, \(y\) is a classification (either +1 or -1) and \( b\) is the bias. Define functional margin \(M\) as \(M = y(w^Tx + b) \). If \(w\) and \(b\) is scaled by \(c\), M will also be scaled by \(c\). Hence functional margin is affected by scaling of \(w\) and \(b\).

Next notice that \(w\) is perpendicular to the boundary hyperplane \(w^Tx + b = 0\). Given a point \(x\) at the positive region, let M' be the distance between \(x\) and the boundary plane. We have \(x - M'w = x'\) where \(x'\) is on the plane \(w^Tx+b = 0\). That means \(w^Tx'+b=0\), hence

\(

\begin{align*}

w^T(x-M'w)+b &= 0

\\

w^Tx+b &= M'w^Tw

\\

M' &= \frac{w^Tx}{\| w \|} + \frac{b}{\| w \|}

\end{align*}

\)

We call M' the geometrical margin. Notice M' is not affected by scaling of \(w\) and \(b\).

M' and M is related by \( \| w \| \), when \( \| w \| = 1 \), M = M'.

Our goal is to maximise the minimum M'. Hence one formulation could be

\(

\begin{array}{cc}

\text{max} & M = min_{i} M^{(i)}

\\

\text{such that} & y^{(i)}(w^Tx^{(i)}+b) \geq M^{(i)}

\\

& \| w\| = 1

\end{array}

\)

But the constraint \( \| w \| \) is not linear. So let go of the \(\| w\| = 1\) condition. Instead we will fix \(M = 1\). Doing so does not lead to any loss of generality, since once we do the required scaling on \(w\) and \(b\), the geometrical margin \( M'^{(i)} \) are not affected, hence we still maximising the same M' but with the new condition that \(M' = \frac{1}{\| w \|} \). Then notice that maximising \( \frac{1}{\| w\|} \) is equivalent to minimising \( \frac{1}{2} w^Tw \), where the coefficient half is just for make the derivation more convenient later on. Hence the optimisation problem can be stated as

\(

\begin{array}{cc}

\text{min} & \frac{1}{2} w^Tw

\\

\text{such that} & y^{(i)}(w^Tx^{(i)}+b) \geq 1

\\

\end{array}

\)

Let's have the Lagrangian function:

\( L(w,b,\alpha) = \frac{1}{2} w^Tw + \sum_i \left[ \alpha_i y_i(w^Tx^{(i)}+b)-1 \right] \)

with \(\alpha \geq 0\).

Computing the Lagrangian stationarity:

\( \frac{\partial L(w,b, \alpha) } { \partial w } = 0 \)

\( \frac{\partial L(w,b, \alpha) } { \partial b } = 0 \)

Which leads to

\( w = \sum_i \alpha_i y_i x^{(i)}\)

and

\( b = \sum_i \alpha_i y_i \)

Substitute those expression back into the Lagrangian, we have

\( L (w, b, \alpha) = L_D(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x^{(i)})^Tx^{(j)} \)

which is the dual form of the Lagrangian.

And if primal feasibility ( \( 1 - y^{(i)}(w^Tx^{(i)}+b) \leq 0 \) ), dual feasibility ( \( \alpha_i \geq 0 \) ), complementary slackness ( \( \alpha_i y_i (w^Tx^{(i)}+b) = 0 \) ), and Lagrangian stationarity are satisfied, KKT conditions are satisfied and the dual optimal will be of primal optimal.

Hence the SVM reduces to the following optimisation problem:

\begin{array}{cc}

\text{min} & L_D(\alpha)

\\

\text{such that} & \alpha \geq 0

& \sum_i \alpha_i y_i = 0

\\

\end{array}

\)

This is called the hard classifier version of the SVM, which assumes that the whole sample is linearly separable. The generalisation to that is the soft classifier SVM which allows some error in the classification by allowing some points to have a geometrical margin of less than the desired minimum margin. This is done by setting \(w^Tx+b \geq M(1-e) \) which in the end leads to a slightly more verbose Lagrangian function, which after careful algebraic manipulation will lead to the following optimisation problem

\(

\begin{array}{cc}

\text{min} & L_D(\alpha)

\\

\text{such that} & 0 \leq \alpha \leq C

& \sum_i \alpha_i y_i = 0

\\

\end{array}

\)

Where C can be seen as the cost of error, or also can be seen as the regularisation term of SVM.

Finally to what makes SVM really loved by people is its special power to make use of Kernel, especially the radial basis function (RBF) kernel \(e^{|-\gamma (x' - x)|} \) which allows us to transform the linear SVM to an infinite dimensional space without incurring the cost of computation. This idea is what makes SVM really powerful, and the effect of kernel does not affect the optimisation problem at all.

In order to make SVM reach its optimal point, quadratic programming packages are usually used. However there is an equally, or maybe more attractive optimisation algorithm called the sequential minimal optimisation (SMO) which employs the idea of coordinate ascent algorithm in order to minimise \(L_D(\alpha)\). It is done by choosing two alphas at a time, and by working out the algebra the update rule turns out to be very simple and sweet. I have implemented the algorithm using Python, which I will put below.

# Implementation of Support Vector Machine with RBF Kernel # Making use of Sequential Minimal Optimization from numpy import * from pylab import * from numpy import random class SVM: EPS = 1e-12 def __init__(self): self.C = 1e100 self.error = 1e-3 self.nonBoundedList = [] self.kernel = RbfKernel(); def setTrainingData(self, inputData, targetData): self.inputData = inputData self.targetData = targetData self.numOfData, self.inputDimension = shape(inputData) def setKernel(self, kernel): self.kernel = kernel def setConstraint(C): self.C = C def setTolerance(error): self.error = error def train(self): self.alpha = zeros((self.numOfData, 1)) self.E = zeros((self.numOfData, 1)) self.isCached = [False for i in xrange(self.numOfData)] self.bias = 0 # Heuristic for choosing first example self.readAll = True self.kktViolated = False while self.kktViolated or self.readAll: self.kktViolated = False if self.readAll: for i in xrange(self.numOfData): if self.examineExample(i): self.kktViolated = True else: for i in xrange(self.numOfData): if (self.isUnbounded(self.alpha[i])): if self.examineExample(i): self.kktViolated = True print "readAll: ", self.readAll, " kktviolated: ", self.kktViolated if self.readAll: self.readAll = False elif not self.kktViolated: self.readAll = True def examineExample(self, i): E = self.computeError(i); r = E*self.targetData[i]; if (r < -self.error and self.alpha[i] < self.C) or (r > self.error and self.alpha[i] > 0): # Heuristic for choosing second example # First heuristic: find the maximum |E1 - E2| j = -1 dE = 0 for k in xrange(self.numOfData): if fabs(self.computeError(k) - self.computeError(i)) > dE and i != k: dE = fabs(self.computeError(k) - self.computeError(i)) j = k if j != -1: if self.takeStep(i, j): return True # Second heuristic: amongst unbounded, try taking step # randomize index = [k for k in xrange(self.numOfData)] random.shuffle(index) for k in index: if self.isUnbounded(self.alpha[k]): if self.takeStep(i, k): return True random.shuffle(index) for k in index: if self.takeStep(i, k): return True return False def computeError(self, i): error = self.E[i] if self.isCached[i] else self.computeMargin(self.inputData[i]) - self.targetData[i] self.E[i] = error self.isCached[i] = True return error def isUnbounded(self, alpha): return 0 < alpha and alpha < self.C def takeStep(self, i, j): if i == j: return False s = self.targetData[i] * self.targetData[j] L = max(0, self.alpha[j] + s * self.alpha[i] - (1+s)/2 * self.C) H = min(self.C, s * self.alpha[i] + self.alpha[j] + (1-s)/2 * self.C) if L == H: return False E1 = self.computeError(i) E2 = self.computeError(j) if fabs(E1-E2) < SVM.EPS: return False k11 = self.kernel.compute(self.inputData[i], self.inputData[i]) k22 = self.kernel.compute(self.inputData[j], self.inputData[j]) k12 = self.kernel.compute(self.inputData[i], self.inputData[j]) eta = 2 * k12 - k11 - k22 if (eta < 0): a2 = (E2 - E1) * self.targetData[j] / eta + self.alpha[j] else: c1 = eta/2 c2 = self.targetData[j] * (E1 - E2) - eta * self.alpha[j] Lobj = c1 * L * L + c2 * L Hobj = c1 * H * H + c2 * H a2 = L if Lobj > Hobj + SVM.EPS else self.alpha[j] a2 = H if Hobj > Lobj + SVM.EPS else a2 a2 = a2 if a2 > L else L a2 = a2 if a2 < H else H a1 = self.alpha[i] + s * self.alpha[j] - s * a2 if fabs(self.alpha[j]) < SVM.EPS: self.alpha[j] = 0 if fabs(self.alpha[j] - self.C) < SVM.EPS: self.alpha[j] = self.C if fabs(a2 - self.alpha[j]) < SVM.EPS * (self.alpha[j] + a2 + SVM.EPS): return False b1 = E1 + self.targetData[i] * (a1 - self.alpha[i]) * k11 + self.targetData[j] * (a2 - self.alpha[j]) * k12 + self.bias; b2 = E2 + self.targetData[i] * (a1 - self.alpha[i]) * k12 + self.targetData[j] * (a2 - self.alpha[j]) * k22 + self.bias; newBias = (b1+b2)/2 self.E[i] = self.E[i] + self.targetData[i] * (a1 - self.alpha[i]) * k11 + self.targetData[j] * (a2 - self.alpha[j]) * k12 + self.bias - newBias self.E[j] = self.E[j] + self.targetData[i] * (a1 - self.alpha[i]) * k12 + self.targetData[j] * (a2 - self.alpha[j]) * k22 + self.bias - newBias self.bias = newBias self.alpha[i] = a1 self.alpha[j] = a2 return True def computeMargin(self, input): ret = 0 for i in xrange(self.numOfData): prod = self.kernel.compute(self.inputData[i], input) ret += self.alpha[i] * self.targetData[i] * prod return ret def classify(self, input): ret = self.computeMargin(input) return 1 if ret - self.bias >= 0 else -1 def getNumberOfSV(self): ret = 0 for i in xrange(self.numOfData): ret += 1 if (self.isUnbounded(self.alpha[i])) else 0 return ret class RbfKernel: def __init__(self): self.gamma = 1 pass def setGamma(self, gamma): self.gamma = gamma def compute(self, x1, x2): return exp(-linalg.norm(x1-x2) * self.gamma)