Friday, April 10, 2015

Machine Learning: Support Vector Machine

Have been finding time to understand the concept behind support vector machine (SVM) and finally implemented the learning algorithm in Python. I'll write this post to review my understanding so far.

Support vector machine is in its simplest form is a type of linear classifier. The intuition behind SVM is usually demonstrated in various machine learning books by asking readers this question: amongst several correct linear classifiers, which one would you think is the best? And it is highlighted that we will tend to prefer the one which has bigger margin between the points closest to the classifier hyper plane. This is then formalised mathematically, which in the end reducible as a type of convex optimisation in particular solvable using a quadratic programming algorithm.



Given (x, y) a data point, let \( f(x) = g(w^T x + b) \) be the classifier function ( \(g(x) = 1\) if \(x \geq 0\) and -1 otherwise), where \(w\) is a weight vector, \(x\) is input vector, \(y\) is a classification (either +1 or -1) and \( b\) is the bias. Define functional margin \(M\) as \(M = y(w^Tx + b) \). If \(w\) and \(b\) is scaled by \(c\), M will also be scaled by \(c\). Hence functional margin is affected by scaling of \(w\) and \(b\).

Next notice that \(w\) is perpendicular to the boundary hyperplane \(w^Tx + b = 0\). Given a point \(x\) at the positive region, let M' be the distance between \(x\) and the boundary plane. We have \(x - M'w = x'\) where \(x'\) is on the plane \(w^Tx+b = 0\). That means \(w^Tx'+b=0\), hence

\(
\begin{align*}
w^T(x-M'w)+b &= 0
\\
w^Tx+b &= M'w^Tw
\\
M' &= \frac{w^Tx}{\| w \|} + \frac{b}{\| w \|}
\end{align*}
\)

We call M' the geometrical margin. Notice M' is not affected by scaling of \(w\) and \(b\).

M' and M is related by \( \| w \| \), when \( \| w \|  = 1 \), M = M'.

Our goal is to maximise the minimum M'. Hence one formulation could be

\(
\begin{array}{cc}
\text{max} & M = min_{i} M^{(i)}
\\
\text{such that} & y^{(i)}(w^Tx^{(i)}+b) \geq M^{(i)}
\\
& \| w\| = 1
\end{array}
\)

But the constraint \( \| w \| \) is not linear. So let go of the \(\| w\| = 1\) condition. Instead we will fix \(M = 1\). Doing so does not lead to any loss of generality, since once we do the required scaling on \(w\) and \(b\), the geometrical margin \( M'^{(i)} \) are not affected, hence we still maximising the same M' but with the new condition that \(M' = \frac{1}{\| w \|} \). Then notice that maximising \( \frac{1}{\| w\|} \) is equivalent to minimising \( \frac{1}{2} w^Tw \), where the coefficient half is just for make the derivation more convenient later on. Hence the optimisation problem can be stated as

\(
\begin{array}{cc}
\text{min} & \frac{1}{2} w^Tw
\\
\text{such that} & y^{(i)}(w^Tx^{(i)}+b) \geq 1
\\
\end{array}
\)

Let's have the Lagrangian function:
\( L(w,b,\alpha) = \frac{1}{2} w^Tw + \sum_i \left[ \alpha_i y_i(w^Tx^{(i)}+b)-1 \right] \)
with \(\alpha \geq 0\).

Computing the Lagrangian stationarity: 
\( \frac{\partial L(w,b, \alpha) } { \partial w } = 0 \)
\( \frac{\partial L(w,b, \alpha) } { \partial b } = 0 \)

Which leads to
\( w = \sum_i \alpha_i y_i x^{(i)}\)
and
\( b = \sum_i \alpha_i y_i \)

Substitute those expression back into the Lagrangian, we have
\( L (w, b, \alpha) = L_D(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x^{(i)})^Tx^{(j)} \)
which is the dual form of the Lagrangian.

And if primal feasibility ( \( 1 - y^{(i)}(w^Tx^{(i)}+b) \leq 0 \) ),  dual feasibility ( \( \alpha_i \geq 0 \) ), complementary slackness ( \( \alpha_i y_i (w^Tx^{(i)}+b) = 0 \) ), and Lagrangian stationarity are satisfied, KKT conditions are satisfied and the dual optimal will be of primal optimal.

Hence the SVM reduces to the following optimisation problem:

\(
\begin{array}{cc}
\text{min} & L_D(\alpha)
\\
\text{such that} & \alpha \geq 0
& \sum_i \alpha_i y_i = 0
\\
\end{array}
\)

This is called the hard classifier version of the SVM, which assumes that the whole sample is linearly separable. The generalisation to that is the soft classifier SVM which allows some error in the classification by allowing some points to have a geometrical margin of less than the desired minimum margin. This is done by setting \(w^Tx+b \geq M(1-e) \) which in the end leads to a slightly more verbose Lagrangian function, which after careful algebraic manipulation will lead to the following optimisation problem

\(
\begin{array}{cc}
\text{min} & L_D(\alpha)
\\
\text{such that} &  0 \leq \alpha \leq C
& \sum_i \alpha_i y_i = 0
\\
\end{array}
\)

Where C can be seen as the cost of error, or also can be seen as the regularisation term of SVM.

Finally to what makes SVM really loved by people is its special power to make use of Kernel, especially the radial basis function (RBF) kernel \(e^{|-\gamma (x' - x)|} \) which allows us to transform the linear SVM to an infinite dimensional space without incurring the cost of computation. This idea is what makes SVM really powerful, and the effect of kernel does not affect the optimisation problem at all.

In order to make SVM reach its optimal point, quadratic programming packages are usually used. However there is an equally, or maybe more attractive optimisation algorithm called the sequential minimal optimisation (SMO) which employs the idea of coordinate ascent algorithm in order to minimise \(L_D(\alpha)\). It is done by choosing two alphas at a time, and by working out the algebra the update rule turns out to be very simple and sweet. I have implemented the algorithm using Python, which I will put below.



# Implementation of Support Vector Machine with RBF Kernel
# Making use of Sequential Minimal Optimization

from numpy import *
from pylab import *
from numpy import random

class SVM:
    EPS = 1e-12
    def __init__(self):
        self.C = 1e100
        self.error = 1e-3
        self.nonBoundedList = []
        self.kernel = RbfKernel();

    def setTrainingData(self, inputData, targetData):
        self.inputData = inputData
        self.targetData = targetData
        self.numOfData, self.inputDimension = shape(inputData)

    def setKernel(self, kernel):
        self.kernel = kernel

    def setConstraint(C):
        self.C = C

    def setTolerance(error):
        self.error = error

    def train(self):
        self.alpha = zeros((self.numOfData, 1))
        self.E = zeros((self.numOfData, 1))
        self.isCached = [False for i in xrange(self.numOfData)]
        self.bias = 0
        # Heuristic for choosing first example
        self.readAll = True
        self.kktViolated = False
        while self.kktViolated or self.readAll:
            self.kktViolated = False
            if self.readAll:
                for i in xrange(self.numOfData):
                    if self.examineExample(i):
                        self.kktViolated = True
            else:
                for i in xrange(self.numOfData):
                    if (self.isUnbounded(self.alpha[i])):
                        if self.examineExample(i):
                            self.kktViolated = True
            print "readAll: ", self.readAll, " kktviolated: ", self.kktViolated
            if self.readAll:
                self.readAll = False
            elif not self.kktViolated:
                self.readAll = True

    def examineExample(self, i):
        E = self.computeError(i);
        r = E*self.targetData[i];
        if (r < -self.error and self.alpha[i] < self.C) or (r > self.error and self.alpha[i] > 0):
            # Heuristic for choosing second example

            # First heuristic: find the maximum |E1 - E2|
            j = -1
            dE = 0
            for k in xrange(self.numOfData):
                if fabs(self.computeError(k) - self.computeError(i)) > dE and i != k:
                    dE = fabs(self.computeError(k) - self.computeError(i))
                    j = k
            if j != -1:
                if self.takeStep(i, j):
                    return True

            # Second heuristic: amongst unbounded, try taking step
            # randomize
            index = [k for k in xrange(self.numOfData)]
            random.shuffle(index)
            for k in index:
                if self.isUnbounded(self.alpha[k]):
                    if self.takeStep(i, k):
                        return True

            random.shuffle(index)
            for k in index:
                if self.takeStep(i, k):
                    return True
        return False

    def computeError(self, i):
        error = self.E[i] if self.isCached[i] else self.computeMargin(self.inputData[i]) - self.targetData[i]
        self.E[i] = error
        self.isCached[i] = True
        return error
    
    def isUnbounded(self, alpha):
        return 0 < alpha and alpha < self.C

    def takeStep(self, i, j):
        if i == j:
            return False

        s = self.targetData[i] * self.targetData[j]
        L = max(0, self.alpha[j] + s * self.alpha[i] - (1+s)/2 * self.C)
        H = min(self.C, s * self.alpha[i] + self.alpha[j] + (1-s)/2 * self.C)
        if L == H:
            return False
        E1 = self.computeError(i)
        E2 = self.computeError(j)
        if fabs(E1-E2) < SVM.EPS:
            return False
        
        k11 = self.kernel.compute(self.inputData[i], self.inputData[i])
        k22 = self.kernel.compute(self.inputData[j], self.inputData[j])
        k12 = self.kernel.compute(self.inputData[i], self.inputData[j])
        eta = 2 * k12 - k11 - k22

        if (eta < 0):
            a2 = (E2 - E1) * self.targetData[j] / eta + self.alpha[j]
        else:
            c1 = eta/2
            c2 = self.targetData[j] * (E1 - E2) - eta * self.alpha[j]
            Lobj = c1 * L * L + c2 * L
            Hobj = c1 * H * H + c2 * H
            a2 = L if Lobj > Hobj + SVM.EPS else self.alpha[j]
            a2 = H if Hobj > Lobj + SVM.EPS else a2

        a2 = a2 if a2 > L else L
        a2 = a2 if a2 < H else H
        a1 = self.alpha[i] + s * self.alpha[j] - s * a2

        if fabs(self.alpha[j]) < SVM.EPS:
            self.alpha[j] = 0

        if fabs(self.alpha[j] - self.C) < SVM.EPS:
            self.alpha[j] = self.C

        if fabs(a2 - self.alpha[j]) < SVM.EPS * (self.alpha[j] + a2 + SVM.EPS):
            return False

        b1 = E1 + self.targetData[i] * (a1 - self.alpha[i]) * k11 + self.targetData[j] * (a2 - self.alpha[j]) * k12 + self.bias;
        b2 = E2 + self.targetData[i] * (a1 - self.alpha[i]) * k12 + self.targetData[j] * (a2 - self.alpha[j]) * k22 + self.bias;
        newBias = (b1+b2)/2
        self.E[i] = self.E[i] + self.targetData[i] * (a1 - self.alpha[i]) * k11 + self.targetData[j] * (a2 - self.alpha[j]) * k12 + self.bias - newBias
        self.E[j] = self.E[j] + self.targetData[i] * (a1 - self.alpha[i]) * k12 + self.targetData[j] * (a2 - self.alpha[j]) * k22 + self.bias - newBias
        self.bias = newBias
        self.alpha[i] = a1
        self.alpha[j] = a2
        return True

    def computeMargin(self, input):
        ret = 0
        for i in xrange(self.numOfData):
            prod = self.kernel.compute(self.inputData[i], input)
            ret += self.alpha[i] * self.targetData[i] * prod
        return ret

    def classify(self, input):
        ret = self.computeMargin(input)
        return 1 if ret - self.bias >= 0 else -1


    def getNumberOfSV(self):
        ret = 0
        for i in xrange(self.numOfData):
            ret += 1 if (self.isUnbounded(self.alpha[i])) else 0
        return ret


class RbfKernel:
    def __init__(self):
        self.gamma = 1
        pass

    def setGamma(self, gamma):
        self.gamma = gamma

    def compute(self, x1, x2):
        return exp(-linalg.norm(x1-x2) * self.gamma)