## Tuesday, March 31, 2015

### Machine Learning - Neural Networks

I was spending the whole night yesterday to understand and implement fully in principle the underlying mechanism of neural networks, one of the techniques in machine learning. And not without significance, because the introduction of neural networks was the turning point in the history which got machine learning popular. While now it is not as preferred a method as compared to support vector machines (SVM) with kernel, radial base functions (RBF), and other newer methods, I would say that the attractiveness of this model is in the simplicity of which it can be implemented, and also in the kind of intuition that it offers. This post will server as a checking point for me, in which I will narrate and describe mathematically what I understand about the mechanism.

Neural networks learning model was initially heavily influenced by the biological neurons. The idea is simply that when a neuron receive a signal, it will decide whether to respond to this signal by firing another signal to other neighbouring neurons. This sequences of firing (or not firing) will then translate to a complex behaviour. Neural networks model tries to pin this down mathematically using layers of nodes to represent these neurons. Between two consecutive layers $$i$$ and $$j$$, edges are drawn between node $$x_i$$ with nodes $$x_j$$, called a weight $$w_{ij}$$. These weights only exist between consecutive layers, hence we do not draw weights amongst nodes in a common layer, nor we draw weights between layers that are far a part from each other. This simplification of the mathematical model turns out to have a great deal of interesting qualities that allow a machine to learn.

To proceed, we need to define the notations required. Let $$x^{(l)}_{i}$$ be the $$i$$-th node inside layer $$l$$. We specially assign layer  $$l = 0$$ as the input layer. Let $$d^{(l)}$$ be the dimensionality of the layer $$l$$, the number of nodes inside this layer. Let $$w^{(l)}_{ij}$$ be the weight between node $$x^{(l-1)}_{i}$$ to $$x^{(l)}_{j}$$. Let the target output corresponding to a particular input be $$t$$.

Furthermore, let $$h^{(l)}_{j}$$ be the weighted sum of all signals coming from nodes in layer $$l-1$$ arriving to node $$x^{(l)}_j$$. In other words, we have $$h^{(l)}_{j} = \sum_{i=0}^{d^{(l-1)}} x^{(l-1)}_i w^{(l)}_{ij}$$.

Given the weighted sum of signals $$h^{(l)}_j$$, what should be the mechanism to determine whether node $$x^{(l)}_j$$ fires (or rather, what should be its response signal?). This question can be very complex, but the model chooses to distill much of the complexity by introducing what is called the activation function g(x). This activation function is actually not an entirely foreign idea, since we saw it in perceptron learning where g(x) = 1 if x < threshold, and 0 (or -1) otherwise. We also see this in linear regression, where g(x) = x. For neural networks, we would want to have both of these characteristics especially in cases where we want to learn classification problems. Hence we usually have 3 more activation functions that can be used as required: logistic function, hyperbolic function, and softmax function. For this post I will use the hyperbolic version, which is $$g(x) = \tanh {(x)}$$, having the first derivative in the form $$g'(x) = 1 - (g(x))^2$$, which is very beneficial as we'll see. The $$\tanh$$ function is also special as it saturates to round 1 and -1 for large |x| magnitude, but varies linearly for small magnitude of x.

Hence having the activation function, we can compute $$x^{(l)}_j$$ as simply $$g(h^{(l)}_j$$. In fact, performing this computation from input layer to the last layer forms the first phase of the learning, which is usually called forward phase.

Let us now focus on the last layer $$l = L$$. The values $$x^{(L)}_j$$ will some agree, and some will disagree with the target value $$t_j$$. As for any supervised learning, we would like to assign an error function to minimise. In order to do so, for our choice of activation function, we can use the following error function: $$E = \frac{1}{2} \sum_{k}^{d^{L}} (t_{k} - x^{(L)}_k)^2$$ (mean squared error). We will minimise $$E$$ using gradient descent optimisation algorithm.

The next question to ask is, how much each weight $$w^{(l)}_{ij}$$ is responsible for $$E$$? In fact this is the most essential core problem which solution allows the networks to learn. To answer this question we will resort to calculus.

We would like to know $$\frac{\partial E}{\partial w^{(l)}_{ij}}$$, for all layer $$l$$. To do so, we will develop a technique called back-propagation, which is essentially similar to dynamic programming as we are caching values to speed up the calculations. Let's start with the last layer $$L$$.

We know that  $$\frac{\partial E}{\partial w^{(L)}_{ij}} = \frac{\partial E}{\partial h^{(L)}_j} \frac{\partial h^{(L)}_j}{ \partial w^{(L)}_{ij} }$$. The second partial derivative is easy to evaluate, you will get $$\frac{\partial h^{(L)}_j}{ \partial w^{(L)}_{ij} } = x^{(L-1)}_i$$. Next we focus on the first partial derivative.

Let's define another new notation: let $$s^{(l)}_j = \frac{\partial E}{\partial h^{(l)}_j}$$. Hence here we have:
$$s^{(L)}_j = \frac{\partial E}{\partial h^{(L)}_j} = \frac{\partial}{\partial h^{(L)}_j} \frac{1}{2} \sum_k {(g(h^{(L)}_j) - t_j)^2}$$, which simplifies to
$$s^{(L)}_j = (g(h^{(L)}) - t_j) g'(h^{(L)}_j)$$, and we know that $$g'(h^{(L)}_j) = 1 - (g(h^{(L)}))^2$$.

Hence overall we have $$\frac{\partial E}{\partial w^{(L)}_{ij}} = s^{(L)}_j x^{(L-1)}_i$$. We then compute all respective values in the last layer and store the results of the computation for later use. This values will form the base case of the back-propagation.

Next, we define the back-propagation for layers other than $$L$$, and we will compute these values in consecutive manner from $$l$$ equals to $$L-1, L-2,$$ and so on. We want to find what is $$\frac{\partial E}{\partial w^{(l)}_{ij}}$$, and here the computation is a bit tougher:
$$\frac{\partial E}{\partial w^{(l)}_{ij}} = \frac{\partial E}{\partial h^{(l)}_j} \frac{\partial h^{(l)}_j}{\partial w^{(l)}_{ij}}$$
The second partial derivative is easy to solve, we will get $$\frac{ \partial h^{(l)}_j}{\partial w^{(l)}_{ij}} = x^{(l-1)}_i$$, similar to the base case.
The first partial derivative, is then:
$$\frac{\partial E}{\partial h^{(l)}_j} = \sum_k \frac{\partial E}{\partial h^{(l+1)}_k} \frac{\partial h^{(l+1)}_k}{\partial h^{(l)}_j}$$
Since $$\frac{\partial E}{\partial h^{(l+1)}_k} = s^{(l+1)}_k$$, we can get this value from the table of computed $$s$$. Furthermore, we have $$\frac{\partial h^{(l+1)}_k}{\partial h^{(l)}_j} = \frac{\partial h^{(l+1)}_k}{x^{(l)}_j} \frac{x^{(l)}_j}{\partial h^{(l)}_j} = w^{(l+1)}_{jk} g'(h^{(l)}_j)$$. Therefore we have $$\frac{\partial E}{\partial h^{(l)}_j} = s^{(l)}_j = \sum_k s^{(l+1)}_k w^{(l+1)}_{jk} g'(h^{(l)}_j)$$.

Overall,  we can write $$\frac{\partial E}{\partial w^{(l)}_{ij}} = x^{(l-1)}_i s^{(l)}_j$$.

Since we can compute the partial derivatives on each weights, we can now establish the update rule, which is the typical usual day gradient descent: $$w^{(l)}_{ij} := w^{(l)}_{ij} - \eta \frac{\partial E}{\partial w^{(l)}_{ij}}$$.

With this all, we are ready to implement the neural networks. I did so using python (with a few convenience of NumPy and PyLab). A point of note: The error function may differs between different choice of activation function, which is natural. For softmax and logistic functions, it is preferable to minimise the log likelihood function instead : $$E = - \sum_k t_k \log { (x^{(L)}_k ) }$$. Another side note is that the neural network I've described is of the stochastic version, for which we do an update whenever we introduce an input, as opposed to the batch version. This may or may not have the characteristics that you want. The following is a skeleton implementation of the neural networks, which can be helpful for future references.

from numpy import *
from pylab import *
from numpy import random

def g(x, type):
if type == 'linear':
return x
elif type == 'tanh':
return  ( exp(x) - exp(-x) ) / ( exp(x) + exp(-x) )
elif type == 'softmax':
if (x >= 700):
return 1e300
return exp(x)

def dg(x, type):
if type == 'linear':
return 1
elif type == 'tanh':
t = g(x, type)
return 1 - t*t
elif type == 'softmax':
return 1

def train(data, target, layers, activation='tanh', iter=4000, eta=0.2, valid_input=None, valid_output=None, error_margin=0.001, print_error=False):
(data_size, input_dimension) = shape(data)
(target_size, target_dimension) = shape(target)
L = len(layers)
assert (data_size == target_size)
f = vectorize(g)
df = vectorize(dg)
w = []
h = []
s = []
x = []
x.append(zeros((input_dimension, 1)))
h.append(zeros((1, 1)))
w.append(zeros((1, 1)))
s.append(zeros((1, 1)))
for i in xrange(len(layers)):
(prev_row, prev_col) = shape(x[i])
cur_dim = layers[i]
w.append(random.rand(prev_row, cur_dim) * 0.1 - 0.05)
s.append(zeros((cur_dim, 1)))
h.append(zeros((cur_dim, 1)))
x.append(zeros((cur_dim+1, 1)))
x[i+1][cur_dim] = 1;                # initializing the bias value

seed = list(range(data_size))

validated = False
prev_error_rate = 1e300
trail_error_rate = 2e300
if valid_input is not None:
iter = 50000000     # something big
validated = True

# stochastic version of MLP
for iteration in xrange(iter):
random.shuffle(seed)
data = data[seed, :]
target = target[seed, :]
for k in xrange(data_size):
# initialize input nodes
x[0] = data[k].reshape(input_dimension, 1)

# forward pass
for layer in xrange(L):
next = layer+1
dnext, dump = shape(x[next])
h[next] = dot(transpose(x[layer]), w[next]).reshape(dnext-1, 1)
if activation == 'softmax':
normalizer = sum(exp(x[next]))
x[next] = append(f(h[next], activation) / normalizer, array([[1]]), axis=0)
else:
x[next] = append(f(h[next], activation), array([[1]]), axis=0)

# backward propagation
# initialize s of last layer
for i in xrange(target_dimension):
s[L][i] = (target[k][i] - x[L][i]) * dg(h[L][i], activation)
for i in xrange(layers[L-2]+1):
for j in xrange(target_dimension):
w[L][i][j] += eta * s[L][j] * x[L-1][i]
for layer in reversed(xrange(L)):
for j in xrange(layers[layer-1]):
s[layer][j] = 0
for k in xrange(layers[layer]):
s[layer][j] += s[layer+1][k] * w[layer+1][j][k] * dg(h[layer][j], activation)
if layer == 1:
break
for i in xrange(layers[layer-1]+1):
for j in xrange(layers[layer]):
w[layer][i][j] += eta * s[layer][j] * x[layer-1][i]

for i in xrange(input_dimension):
for j in xrange(layers[0]):
w[1][i][j] += eta * s[1][j] * x[0][i]
if validated:
current_error_rate = validate(valid_input, valid_output, w, layers, activation=activation)
if (print_error):
print current_error_rate
if ((prev_error_rate - current_error_rate) < error_margin) and ((trail_error_rate - prev_error_rate) < error_margin):
break
trail_error_rate = prev_error_rate
prev_error_rate = current_error_rate
return w

def compute(input, w, layers, activation='tanh'):
(data_size, input_dimension) = shape(input)
f = vectorize(g)
L = len(layers)
x = []
x.append(zeros((input_dimension, 1)))
for i in xrange(len(layers)):
curDim = layers[i]
x.append(zeros((curDim+1, 1)))
x[i+1][curDim] = 1

x[0] = input.reshape(input_dimension, 1)

for layer in xrange(L):
next = layer+1
dnext, dump = shape(x[next])
h = dot(transpose(x[layer]), w[next]).reshape(dnext-1, 1)
if activation == 'softmax':
normalizer = sum(exp(x[next]))
x[next] = append(f(h, activation) / normalizer, array([[1]]), axis=0)
else:
x[next] = append(f(h, activation), array([[1]]), axis=0)
return x[L][:-1]

def validate(valid_input, valid_output, w, layers, activation='tanh'):
error = 0
size, dim = shape(valid_input)
output_dim = shape(valid_output)[1]
for i in xrange(size):
y = compute(valid_input[i].reshape(1, dim), w, layers, activation)
for k in xrange(output_dim):
if activation == 'softmax':
error += -1.0 * valid_output[i][k] * log(y[k])
else:
error += 0.5*((y[k] - valid_output[i][k])**2)
return error

def confusion_matrix(test, test_target, w, layers, activation='tanh'):
output_dimension = shape(test_target)[1]
matrix = zeros((output_dimension, output_dimension))
size, input_dimension = shape(test)
percentage_error = 0;
for i in xrange(size):
y = compute(test[i].reshape(1, input_dimension), w, layers, activation=activation)
max_val = -1e300
index = -1
correct = -1
for j in xrange(output_dimension):
if y[j] > max_val:
max_val = y[j]
index = j
if test_target[i][j] == 1:
correct = j
matrix[correct][index] += 1;
if correct != index:
percentage_error += 1
percentage_error = 1.0 * percentage_error / size * 100
print matrix
print 'percentage error: ',percentage_error, '%'
return matrix