Saturday, May 2, 2015

Machine Learning: Logistic Regression for Classification

Well, today I spent significant amount of time teaching my Mac to distinguish 0 from 1. The motivation was mostly to check my own understanding of logistic regression.

Logistic regression model for two classes can be summarised as follows:
Let \(P(Y=1 | X=x) = h_w(x) \) where \(h_w(x) = g(w^Tx)\) and \(g(x) = \frac{1}{1+e^{-x}}\). The function \(g(x)\) is called a sigmoid function which takes value between 0 and 1. Naturally we have \(P(Y=0|X=x) = 1-h_w(x)\), since we are assuming only two classes. It can be used to map a real value to a probability space, and it has many characteristics that is good for this purpose.



Suppose we are given a set of training sample \(x^{(i)}\). The likelihood function \(L(w)\) is given by \(L(w) = \prod_i [ P(Y=y^{(i)}|X=x^{(i)}) ] \), which can be written as \( L(w) = \prod_i [ h_w(x^{(i)})^y (1 - h_w(x^{(i)}))^{1-y} ] \) where y is either 0 or 1. Our goal is to maximise L(w). If we define the log-likelihood function \(l(w) = log L(w)\), maximising \(L(w)\) is the same as maximising \(l(w)\), but \(l(w) = \sum_i [ y \log{(h_w(x^{(i)})} + (1-y)\log{(1-h_w(x^{(i)}))}] \) is easier to optimise. By differentiating \(l(w)\) with \(w_j\), we get the direction of maximum gradient \(\frac{ \partial l(w) } { \partial w_j } = \sum_i -x^{(i)}_j(y - h_w(x^{(i)})) \). Hence gradient descent can be used to solve for \(w\) that maximises the log-likelihood.

A batch gradient descent implementation of the logistic regression can be found below:


from numpy import *
from numpy import random

# Implementation of Logistic Regression
# Optimization using Gradient Descent

class LogisticRegression:
    def __init__(self):
        self.learningRate = 0.1
        self.convergenceMargin = 0.001

    def setSample(self, data, target):
        # number of data n
        n = shape(data)[0]
        # append 1 as the bias
        self.data = append(data, ones((n, 1)), axis=1)
        self.target = target
        self.numOfData, self.dim = shape(self.data)
        self.weight = zeros((self.dim, 1))

    def regress(self, x):
        x = x.reshape(1, self.dim-1)
        x = append(x, ones((1,1)), axis=1)
        pred = 0
        for j in xrange(self.dim):
            pred += self.weight[j] * x[0][j]
        if pred < -40:
            return 0
        if pred > 40:
            return 1
        return 1.0 / (1.0 + e**(-pred))

    def computeLogLikehood(self, data, target):
        numOfData = shape(data)[0]
        error = 0
        for i in xrange(numOfData):
            h = self.regress(data[i])
            y = target[i]
            logh = log(h) if abs(h) > 1e-12 else -1e10
            log1_h = log(1-h) if abs(1-h) > 1e-12 else -1e10
            error += y * logh + (1-y) * log1_h
        return -error

    def train(self):
        prevError = 1e308
        converged = False
        epoch = 0
        while not converged:
            print "epoch: %d" % epoch
            epoch += 1
            y = zeros((self.numOfData, 1))
            for i in xrange(self.numOfData):
                y[i] = self.regress(self.data[i,:-1])
            dw = zeros((self.dim, 1))
            for j in xrange(self.dim):
                for i in xrange(self.numOfData):
                    dw[j] += (y[i] - self.target[i]) * self.data[i][j]
            for j in xrange(self.dim):
                self.weight[j] -= self.learningRate * dw[j]
            curError = self.computeLogLikehood(self.data[:,:-1], self.target)
            if abs(curError - prevError) < self.convergenceMargin:
                converged = True
            prevError = curError