Wednesday, April 22, 2015

Machine Learning: Linear Regression with Regularization)

In this post I would like to make a mental note on linear regressions, and the key focus will be on the algorithm for fitting the regularised version of linear regression.

Fundamentally, linear regression seeks to fit a set of data sample by determining the hyperplane (affine) that minimises the mean squared error (MSE) between the data points and the plane. In statistics we seek to find this particular parametric regression \( r(X=x) = E(Y|X = x) \), and if \( Y|X \) follows normal distribution, it can be shown that the minimum MSE is indeed the maximum likelihood estimator (MLE) of the regression.

More concisely, given a set of data points \( (x^{(i)}, y^{(i)}) \) where \(x\) is a vector and \(y \in \mathbb{R} \), we seek to find \(w\) that minimises \(\text{MSE} = \frac{1}{2} \sum_i (y^{(i)} - w^T x^{(i)}) \). If we represent all \( x^{(i)}, y^{(i)} \) into a matrix \(X\) and \(Y\) respectively, there is a closed form solution to \(w\): \(w = (X^TX)^{-1}X^TY\).

However the above formulation has a problem of overfitting, since it will tend to fit the noise as well. Hence a model selection method called Ridge and Lasso can be used to regulate (i.e., a regularization) the regressor to penalise overfitting. 

Ridge is done by placing an extra penalising term \( \lambda w^Tw \) in the MSE: \(\text{MSE} = \frac{1}{2} \sum_i (y^{(i)} - w^T x^{(i)}) + \lambda \sum_j w_j^2 \). This is also called the L2 Penalised Regression Model. There is a closed form solution to this: \(w = (X^TX + \lambda I)^{-1}X^TY \).

Lasso is slightly different, it adds the sum of absolute value of \(w_j\) as the penalising term: \(\text{MSE} = \frac{1}{2} \sum_i (y^{(i)} - w^T x^{(i)}) + \lambda \sum_j |w_j| \). This is also called the L1 Penalised Regression Model. In general there is no closed form formula for Lasso, but there is one when X is orthonormal.

Ridge and Lasso have differing behaviour since their search space are inherently different. Usually it is beneficial to combine the two, called an elastic net:
\(\text{MSE} = \frac{1}{2}\sum_i (y^{(i)} - w^T x^{(i)}) + \lambda ( \alpha  \sum_j |w_j|  + (1 - \alpha) \sum_j w_j^2) \). \(\alpha\) acts as the modifier that helps us to tune the behaviour of the regressor between Ridge-like and Lasso-like.

To fit the model, I will describe a method called the Coordinate Descent, which is applicable to convex models such as this. In fact this method is also used to fit support vector machine (SVM), for which it is called the sequential minimum optimisation (SMO) method. I think coordinate descent is the main idea behind the glmnet package. 

Firstly, let all \(w_i\) be fixed, except for an index \(j\). Then we have:
\( \frac{ \partial \text{MSE} } { \partial w_j } = \sum_i -x^{(i)}{j} (y^{(i)} - w_{-j}^T x^{(i)}_{-j}) + w_j\sum_i [x^{(i)}_j ]^2 + \lambda ( \alpha \frac{ \partial } {\partial w_j} |w_j| + (1 - \alpha) w_j ) \)
(I know it is a blasphemy to differentiate \(|w_j| \), but it makes sense in terms of sub differentials)
Here, the notation \(v_{-j}\) means the whole vector v but leaving out index j only.
The MSE is minimised when \( \frac{ \partial \text{MSE} } { \partial w_j } = 0 \).
Simplifying, we get:  \(w_j = \frac{ \sum_i x^{(i)}{j} (y^{(i)} - w_{-j}^T x^{(i)}_{-j}) - \lambda \alpha \frac{ \partial } {\partial w_j} |w_j| }{ \sum_i [x^{(i)}_j ]^2  + \lambda (1-\alpha)} \)

Then depending on the following cases:
Case 1: if \(w_j > 0\) and \( \sum_i x^{(i)}{j} (y^{(i)} - w_{-j}^T x^{(i)}_{-j}) > \lambda \alpha \) we have \(w_j = \frac{ \sum_i x^{(i)}{j} (y^{(i)} - w_{-j}^T x^{(i)}_{-j}) - \lambda \alpha }{ \sum_i [x^{(i)}_j ]^2  + \lambda (1-\alpha)} \)
Case 2: if \(w_j < 0\) and \( \sum_i x^{(i)}{j} (y^{(i)} - w_{-j}^T x^{(i)}_{-j}) < -\lambda \alpha \) we have \(w_j = \frac{ \sum_i x^{(i)}{j} (y^{(i)} - w_{-j}^T x^{(i)}_{-j}) + \lambda \alpha }{ \sum_i [x^{(i)}_j ]^2  + \lambda (1-\alpha)} \)
Case 3: Otherwise, \(-\lambda \alpha < \sum_i x^{(i)}{j} (y^{(i)} - w_{-j}^T x^{(i)}_{-j}) < \lambda \alpha \) and \(w_j = 0\).

By the above update rules for each \(w_j\), the coordinate descent algorithm is ready and we iterate those update rules until convergence.

Following is the implementation in Python (not optimised in any way yet):

from numpy import *
from pylab import *

class LinearRegression:
    def __init__(self, alpha, regulator):
        self.alpha = alpha
        self.regulator = regulator
        self.convergenceRate = 0.001

    def setSample(self, data, target): = data = target
        dim = shape(data)[1]
        self.weight = zeros((dim, 1))

    def updateError(self):
        (size, dim) = shape(
        self.error = 0
        for i in xrange(size):
            t = 0
            for j in xrange(dim):
                t += self.weight[j] *[i][j]
            dy = ([i] - t)
            self.error += dy * dy

    def train(self):
        (size, dim) = shape(
        converged = False
        j = 0
        while not converged:
            Sj = 0
            Mj = 0
            prevError = self.error
            for i in xrange(size):
                t = 0
                for k in xrange(dim):
                    if k == j:
                    t += self.weight[k] *[i][k]
                Sj +=[i][j] * ([i] - t)
                Mj +=[i][j] *[i][j]
            r = Mj + self.regulator * (1 - self.alpha)
            u = self.regulator * self.alpha
            if Sj > u:
                self.weight[j] = (Sj - u) / r
            elif Sj < -u:
                self.weight[j] = (Sj + u) / r
                self.weight[j] = 0

            curError = self.error
            if abs(curError - prevError) < self.convergenceRate:
                converged = True
            j = (j+1) % dim

    def regress(self, x):
        dim = shape(self.weight)[0]
        t = 0
        for i in xrange(dim):
            t += self.weight[i] *  x[i]
        return t