Logistic regression model for two classes can be summarised as follows:

Let \(P(Y=1 | X=x) = h_w(x) \) where \(h_w(x) = g(w^Tx)\) and \(g(x) = \frac{1}{1+e^{-x}}\). The function \(g(x)\) is called a sigmoid function which takes value between 0 and 1. Naturally we have \(P(Y=0|X=x) = 1-h_w(x)\), since we are assuming only two classes. It can be used to map a real value to a probability space, and it has many characteristics that is good for this purpose.

Suppose we are given a set of training sample \(x^{(i)}\). The likelihood function \(L(w)\) is given by \(L(w) = \prod_i [ P(Y=y^{(i)}|X=x^{(i)}) ] \), which can be written as \( L(w) = \prod_i [ h_w(x^{(i)})^y (1 - h_w(x^{(i)}))^{1-y} ] \) where y is either 0 or 1. Our goal is to maximise L(w). If we define the log-likelihood function \(l(w) = log L(w)\), maximising \(L(w)\) is the same as maximising \(l(w)\), but \(l(w) = \sum_i [ y \log{(h_w(x^{(i)})} + (1-y)\log{(1-h_w(x^{(i)}))}] \) is easier to optimise. By differentiating \(l(w)\) with \(w_j\), we get the direction of maximum gradient \(\frac{ \partial l(w) } { \partial w_j } = \sum_i -x^{(i)}_j(y - h_w(x^{(i)})) \). Hence gradient descent can be used to solve for \(w\) that maximises the log-likelihood.

A batch gradient descent implementation of the logistic regression can be found below:

from numpy import * from numpy import random # Implementation of Logistic Regression # Optimization using Gradient Descent class LogisticRegression: def __init__(self): self.learningRate = 0.1 self.convergenceMargin = 0.001 def setSample(self, data, target): # number of data n n = shape(data)[0] # append 1 as the bias self.data = append(data, ones((n, 1)), axis=1) self.target = target self.numOfData, self.dim = shape(self.data) self.weight = zeros((self.dim, 1)) def regress(self, x): x = x.reshape(1, self.dim-1) x = append(x, ones((1,1)), axis=1) pred = 0 for j in xrange(self.dim): pred += self.weight[j] * x[0][j] if pred < -40: return 0 if pred > 40: return 1 return 1.0 / (1.0 + e**(-pred)) def computeLogLikehood(self, data, target): numOfData = shape(data)[0] error = 0 for i in xrange(numOfData): h = self.regress(data[i]) y = target[i] logh = log(h) if abs(h) > 1e-12 else -1e10 log1_h = log(1-h) if abs(1-h) > 1e-12 else -1e10 error += y * logh + (1-y) * log1_h return -error def train(self): prevError = 1e308 converged = False epoch = 0 while not converged: print "epoch: %d" % epoch epoch += 1 y = zeros((self.numOfData, 1)) for i in xrange(self.numOfData): y[i] = self.regress(self.data[i,:-1]) dw = zeros((self.dim, 1)) for j in xrange(self.dim): for i in xrange(self.numOfData): dw[j] += (y[i] - self.target[i]) * self.data[i][j] for j in xrange(self.dim): self.weight[j] -= self.learningRate * dw[j] curError = self.computeLogLikehood(self.data[:,:-1], self.target) if abs(curError - prevError) < self.convergenceMargin: converged = True prevError = curError