## Saturday, May 2, 2015

### Machine Learning: Logistic Regression for Classification

Well, today I spent a significant amount of time teaching my Mac to distinguish 0 from 1. The motivation was mostly to check my own understanding of logistic regression.

The logistic regression model for two classes can be summarised as follows:
Let $$P(Y=1 | X=x) = h_w(x)$$ where $$h_w(x) = g(w^Tx)$$ and $$g(z) = \frac{1}{1+e^{-z}}$$. The function $$g(z)$$ is called the sigmoid function; it maps any real value into the interval between 0 and 1, which makes it a natural choice for turning the real-valued score $$w^Tx$$ into a probability. Since we assume only two classes, we naturally have $$P(Y=0|X=x) = 1-h_w(x)$$.
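To make the model concrete, here is a minimal numpy sketch of the sigmoid and of $$h_w(x)$$; the names `sigmoid` and `predict_proba` and the example numbers are my own, not part of the class further below.

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + exp(-z)), maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    # h_w(x) = g(w^T x): the model's estimate of P(Y=1 | X=x)
    return sigmoid(np.dot(w, x))

w = np.array([2.0, -1.0, 0.5])   # example weights; the last entry acts as the bias
x = np.array([1.0, 3.0, 1.0])    # example input with a constant 1 appended
print(predict_proba(w, x))       # a value strictly between 0 and 1
```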

Suppose we are given a set of training samples $$(x^{(i)}, y^{(i)})$$ with $$y^{(i)} \in \{0, 1\}$$. The likelihood function $$L(w)$$ is given by $$L(w) = \prod_i P(Y=y^{(i)}|X=x^{(i)})$$, which can be written as $$L(w) = \prod_i h_w(x^{(i)})^{y^{(i)}} (1 - h_w(x^{(i)}))^{1-y^{(i)}}$$. Our goal is to maximise $$L(w)$$. If we define the log-likelihood function $$l(w) = \log L(w)$$, maximising $$L(w)$$ is the same as maximising $$l(w)$$, but $$l(w) = \sum_i \left[ y^{(i)} \log h_w(x^{(i)}) + (1-y^{(i)})\log\left(1-h_w(x^{(i)})\right) \right]$$ is easier to optimise. Differentiating the negative log-likelihood with respect to $$w_j$$ gives $$\frac{ \partial (-l(w)) } { \partial w_j } = \sum_i x^{(i)}_j \left( h_w(x^{(i)}) - y^{(i)} \right)$$. Hence gradient descent on $$-l(w)$$ (equivalently, gradient ascent on $$l(w)$$) can be used to solve for the $$w$$ that maximises the log-likelihood.
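Just to make the derivative concrete, below is a small vectorised sketch of the negative log-likelihood and its gradient (plain numpy, separate from the class below); the function names `neg_log_likelihood` and `gradient` and the toy data are my own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(w, X, y):
    # -l(w) = -sum_i [ y_i log h_i + (1 - y_i) log(1 - h_i) ]
    h = sigmoid(X.dot(w))
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(w, X, y):
    # d(-l)/dw_j = sum_i x_ij (h_i - y_i), written as one matrix product
    h = sigmoid(X.dot(w))
    return X.T.dot(h - y)

# one batch gradient-descent step on a toy data set
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])  # second column is the bias term
y = np.array([0.0, 0.0, 1.0, 1.0])
w = np.zeros(2)
w -= 0.1 * gradient(w, X, y)
print(neg_log_likelihood(w, X, y))
```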

A batch gradient descent implementation of logistic regression can be found below:

```python
from numpy import *
from numpy import random

# Implementation of Logistic Regression
# Optimization using Gradient Descent

class LogisticRegression:
    def __init__(self):
        self.learningRate = 0.1
        self.convergenceMargin = 0.001

    def setSample(self, data, target):
        # number of training samples n
        n = shape(data)[0]
        # append a constant 1 to each sample as the bias term
        self.data = append(data, ones((n, 1)), axis=1)
        self.target = target
        self.numOfData, self.dim = shape(self.data)
        self.weight = zeros((self.dim, 1))

    def regress(self, x):
        # compute h_w(x) = sigmoid(w^T x) for a single sample (without bias column)
        x = append(x.reshape(1, self.dim - 1), ones((1, 1)), axis=1)
        pred = 0
        for j in xrange(self.dim):
            pred += self.weight[j] * x[0, j]
        # clamp the score to avoid overflow in exp()
        if pred < -40:
            return 0
        if pred > 40:
            return 1
        return 1.0 / (1.0 + e ** (-pred))

    def computeLogLikelihood(self, data, target):
        # returns the negative log-likelihood of the current weights
        numOfData = shape(data)[0]
        error = 0
        for i in xrange(numOfData):
            h = self.regress(data[i])
            y = target[i]
            # guard against log(0) when h saturates at 0 or 1
            logh = log(h) if abs(h) > 1e-12 else -1e10
            log1_h = log(1 - h) if abs(1 - h) > 1e-12 else -1e10
            error += y * logh + (1 - y) * log1_h
        return -error

    def train(self):
        prevError = 1e308
        converged = False
        epoch = 0
        while not converged:
            print "epoch: %d" % epoch
            epoch += 1
            # current predictions h_w(x) for every training sample
            y = zeros((self.numOfData, 1))
            for i in xrange(self.numOfData):
                y[i] = self.regress(self.data[i, :-1])
            # batch gradient of the negative log-likelihood
            dw = zeros((self.dim, 1))
            for j in xrange(self.dim):
                for i in xrange(self.numOfData):
                    dw[j] += (y[i] - self.target[i]) * self.data[i][j]
            # gradient descent step
            for j in xrange(self.dim):
                self.weight[j] -= self.learningRate * dw[j]
            curError = self.computeLogLikelihood(self.data[:, :-1], self.target)
            if abs(curError - prevError) < self.convergenceMargin:
                converged = True
            prevError = curError
```
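
To try the class on the 0-versus-1 task mentioned above, something like the following should work; the synthetic one-dimensional data here is just a toy example I made up for illustration, and I have not tuned the learning rate or convergence margin for it.

```python
from numpy import array

# toy data: points near 0 are labelled 0, points near 1 are labelled 1
data = array([[0.05], [0.1], [0.2], [0.8], [0.9], [0.95]])
target = array([[0], [0], [0], [1], [1], [1]])

model = LogisticRegression()
model.setSample(data, target)
model.train()

print model.regress(array([0.15]))  # should be close to 0
print model.regress(array([0.85]))  # should be close to 1
```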