In logistic regression, \(P(G=k|X=x)\) is modelled as follows:

For k = 1 to K-1,

\[ P(G=k|X=x) = \frac{ \exp \left\{ \beta_k^Tx \right\} } {1 + \sum_{i=1}^{K-1} \exp \left\{ \beta_i^Tx \right\} } \]

and for the final class,

\[ P(G=K|X=x) = \frac{ 1 } {1 + \sum_{i=1}^{K-1} \exp \left\{ \beta_i^Tx \right\} } \]

Here \(x\) and the \(\beta_k\) are all \((d+1)\)-vectors, where \(d\) is the dimension of the input data; the additional coordinate is a constant 1 appended to \(x\), paired with the intercept (or bias) term in each \(\beta_k\).
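To make the two formulas concrete, here is a minimal sketch in C++ (the function `class_probs` and its conventions are illustrative, not from any library): `beta` holds the \(K-1\) coefficient vectors, and `x` carries the leading 1 for the intercept.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// p_k(x) for k = 1..K, following the two formulas above.
// beta: K-1 vectors of length d+1; x: input with a leading 1 appended.
std::vector<double> class_probs(const std::vector<std::vector<double>>& beta,
                                const std::vector<double>& x) {
    const std::size_t K1 = beta.size();   // K - 1
    std::vector<double> p(K1 + 1, 1.0);   // p[K1] keeps the numerator 1 of class K
    double denom = 1.0;                   // the leading 1 in the denominator
    for (std::size_t k = 0; k < K1; ++k) {
        double dot = 0.0;
        for (std::size_t j = 0; j < x.size(); ++j) dot += beta[k][j] * x[j];
        p[k] = std::exp(dot);
        denom += p[k];
    }
    for (double& v : p) v /= denom;       // now p sums to 1
    return p;
}
```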

Let \( \beta \) be an array which combines all the \( \beta_k \), i.e.

\[ \beta^T = \{ \beta_1^T, \beta_2^T, \ldots, \beta_{K-1}^T \} \]

Furthermore let \(p_k(x) = P(G=k|X=x) \). We define the log-likelihood function \(l(\beta)\) as follows

\[ l(\beta) = \sum_{i=1}^N \log \left \{ p_{g_i}(x_i) \right \} \]

where \(g_i\) is the label of data point \(x_i\).
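Expressed directly in code, \(l(\beta)\) can be accumulated per data point using \(\log p_k(x) = \beta_k^T x - \log ( 1 + \sum_{i=1}^{K-1} \exp \{ \beta_i^T x \} )\), with the first term vanishing for class \(K\). A minimal sketch, with illustrative names:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// l(beta) = sum_i log p_{g_i}(x_i), where
// log p_k(x) = beta_k^T x - log(1 + sum_j exp(beta_j^T x)) for k < K,
// and the beta_k^T x term is absent for class K.
// X: rows x_i with a leading 1; g: 1-based labels in 1..K.
double log_likelihood(const std::vector<std::vector<double>>& beta,
                      const std::vector<std::vector<double>>& X,
                      const std::vector<int>& g) {
    double ll = 0.0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        double denom = 1.0;  // the leading 1 in the denominator
        double num = 0.0;    // beta_{g_i}^T x_i, which is 0 for class K
        for (std::size_t k = 0; k < beta.size(); ++k) {
            double dot = 0.0;
            for (std::size_t j = 0; j < X[i].size(); ++j) dot += beta[k][j] * X[i][j];
            denom += std::exp(dot);
            if (static_cast<int>(k) + 1 == g[i]) num = dot;
        }
        ll += num - std::log(denom);
    }
    return ll;
}
```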

To maximise \(l(\beta)\), we use the Newton-Raphson method (applied to the score equation \(\partial l / \partial \beta = 0\)), which can be stated simply as:

\[ \beta^{\text{new}} = \beta^{\text{old}} - \left[ \frac{\partial^2 l(\beta)}{\partial \beta \partial \beta^T} \right] ^{-1} \frac{\partial l(\beta)}{\partial \beta} \]

It is a fun exercise to derive the first-order partial derivatives and the Hessian matrix of \(l(\beta)\). The result of the labour will be:

\[ \frac{ \partial l(\beta) }{ \partial \beta_k } = \sum_{i=1}^N x_i (\mathbb{1}(g_i = k) - p_k(x_i)) \]

for k = 1 to K-1, where \(\mathbb{1}(g_i = k) \) is the indicator function, which equals 1 if \(g_i = k\) and 0 otherwise.
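In code, the gradient block for a single class \(k\) might look like this (a sketch; `grad_beta_k` is an illustrative name, and the probabilities \(p_k(x_i)\) are assumed precomputed):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// dl/dbeta_k = sum_i x_i (1(g_i = k) - p_k(x_i)), for one fixed class k.
// X: rows x_i with a leading 1; g: 1-based labels; pk[i] = p_k(x_i).
std::vector<double> grad_beta_k(const std::vector<std::vector<double>>& X,
                                const std::vector<int>& g,
                                const std::vector<double>& pk,
                                int k) {
    std::vector<double> grad(X[0].size(), 0.0);
    for (std::size_t i = 0; i < X.size(); ++i) {
        const double resid = (g[i] == k ? 1.0 : 0.0) - pk[i];  // 1(g_i = k) - p_k(x_i)
        for (std::size_t q = 0; q < grad.size(); ++q)
            grad[q] += X[i][q] * resid;
    }
    return grad;
}
```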

Also,

\[ \frac{\partial^2 l(\beta)}{\partial \beta_{jq} \partial \beta_{kp} } = - \sum_{i=1}^N x_{ip}x_{iq} p_k(x_i) (\mathbb{1}(j=k) - p_j(x_i)) \]

for every combination of (j, q) and (k, p), where j, k = 1..K-1 and q, p = 1..d+1.

Note that if we let the Hessian matrix H be

\[H = \frac{\partial^2 l(\beta)}{\partial \beta \partial \beta^T} \]

then we have the following relationship:

\[ \left[ (j-1)(d+1)+q,\; (k-1)(d+1)+p \right] \text{-entry of } H = \frac{\partial^2 l(\beta)}{\partial \beta_{jq} \partial \beta_{kp} } \]

with j, k, q, p taken from the 1-based ranges above, so that each \(\beta_j\) block occupies rows \((j-1)(d+1)+1\) through \(j(d+1)\) of H.
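Since j, k and q, p are 1-based, the row for the pair \((j, q)\) sits at position \((j-1)(d+1)+q\) in the flattened Hessian. A tiny helper (name illustrative) makes the mapping explicit:

```cpp
#include <cassert>
#include <cstddef>

// 1-based flat position of coefficient q of beta_j, for j = 1..K-1
// and q = 1..d+1, in the ((K-1)(d+1)) x ((K-1)(d+1)) Hessian.
std::size_t flat_index(std::size_t j, std::size_t q, std::size_t d) {
    return (j - 1) * (d + 1) + q;
}
```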

Using all these results, we can implement a simple Newton-Raphson algorithm for the learning phase of logistic regression. The above equations can be expressed in neat matrix form, which can be found in TESL. Below is one possible implementation using C++.
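A self-contained sketch of such an implementation (hand-rolled, with no linear-algebra library; every name and convention here is illustrative) assembles the gradient and Hessian from the formulas above, flattens \(\beta\) into a single vector, and solves the Newton system by Gaussian elimination rather than forming \(H^{-1}\) explicitly:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// p_k(x) for k = 1..K; beta holds the K-1 coefficient vectors,
// x carries a leading 1 for the intercept.
Vec probs(const Mat& beta, const Vec& x) {
    const std::size_t K1 = beta.size();
    Vec p(K1 + 1, 1.0);
    double denom = 1.0;
    for (std::size_t k = 0; k < K1; ++k) {
        double dot = 0.0;
        for (std::size_t j = 0; j < x.size(); ++j) dot += beta[k][j] * x[j];
        p[k] = std::exp(dot);
        denom += p[k];
    }
    for (double& v : p) v /= denom;
    return p;
}

// Solve A s = b by Gaussian elimination with partial pivoting
// (avoids computing the inverse Hessian explicitly).
Vec solve(Mat A, Vec b) {
    const std::size_t n = b.size();
    for (std::size_t c = 0; c < n; ++c) {
        std::size_t piv = c;
        for (std::size_t r = c + 1; r < n; ++r)
            if (std::fabs(A[r][c]) > std::fabs(A[piv][c])) piv = r;
        std::swap(A[c], A[piv]);
        std::swap(b[c], b[piv]);
        for (std::size_t r = c + 1; r < n; ++r) {
            const double f = A[r][c] / A[c][c];
            for (std::size_t cc = c; cc < n; ++cc) A[r][cc] -= f * A[c][cc];
            b[r] -= f * b[c];
        }
    }
    Vec s(n);
    for (std::size_t r = n; r-- > 0;) {
        double acc = b[r];
        for (std::size_t c = r + 1; c < n; ++c) acc -= A[r][c] * s[c];
        s[r] = acc / A[r][r];
    }
    return s;
}

// Newton-Raphson for multinomial logistic regression.
// X: rows x_i with a leading 1; y: 1-based labels in 1..K.
Mat fit_logistic(const Mat& X, const std::vector<int>& y,
                 std::size_t K, int iters = 20) {
    const std::size_t d1 = X[0].size(), K1 = K - 1, n = K1 * d1;
    Mat beta(K1, Vec(d1, 0.0));
    for (int it = 0; it < iters; ++it) {
        Vec grad(n, 0.0);
        Mat H(n, Vec(n, 0.0));
        for (std::size_t i = 0; i < X.size(); ++i) {
            const Vec p = probs(beta, X[i]);
            for (std::size_t k = 0; k < K1; ++k) {
                // dl/dbeta_k = sum_i x_i (1(g_i = k) - p_k(x_i))
                const double resid =
                    (y[i] == static_cast<int>(k) + 1 ? 1.0 : 0.0) - p[k];
                for (std::size_t q = 0; q < d1; ++q)
                    grad[k * d1 + q] += X[i][q] * resid;
                // Hessian entries: -x_ip x_iq p_k(x_i) (1(j = k) - p_j(x_i))
                for (std::size_t j = 0; j < K1; ++j) {
                    const double w = p[k] * ((j == k ? 1.0 : 0.0) - p[j]);
                    for (std::size_t q = 0; q < d1; ++q)
                        for (std::size_t pp = 0; pp < d1; ++pp)
                            H[j * d1 + q][k * d1 + pp] -= X[i][pp] * X[i][q] * w;
                }
            }
        }
        // beta_new = beta_old - H^{-1} grad
        const Vec step = solve(H, grad);
        for (std::size_t k = 0; k < K1; ++k)
            for (std::size_t q = 0; q < d1; ++q)
                beta[k][q] -= step[k * d1 + q];
    }
    return beta;
}
```

A fixed iteration count stands in for a proper convergence test on \(\| \text{step} \|\), just to keep the sketch short.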

One implementation issue concerns the numerical stability of the exp and log functions. To tame the explosive growth of exp, there is in fact a standard technique (or idiom) called the log-sum-exp trick, which simply makes use of the fact that we can factor the largest exponential term out and deal with it separately (usually by taking its log, or by cancelling it between numerator and denominator).
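In code, the trick amounts to the following (a sketch; the name `log_sum_exp` is mine): to evaluate \(\log \sum_i \exp\{a_i\}\) safely, factor out \(m = \max_i a_i\) so that every remaining exponent is at most 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// log(sum_i exp(a_i)) = m + log(sum_i exp(a_i - m)),  m = max_i a_i.
// Each exp() argument is <= 0, so nothing overflows.
double log_sum_exp(const std::vector<double>& a) {
    const double m = *std::max_element(a.begin(), a.end());
    double s = 0.0;
    for (double v : a) s += std::exp(v - m);
    return m + std::log(s);
}
```

For the denominators above, the implicit leading 1 can be folded in as an extra term \(a_0 = 0\) before calling this, since \(1 = \exp\{0\}\).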

It is remarkable how fast Newton-Raphson converges. For linearly separable training problems, logistic regression will give results similar to separating-hyperplane methods such as the SVM, according to TESL. It also makes fewer assumptions than its close cousin LDA (linear discriminant analysis, which assumes that the class-conditional distributions of the input data are Gaussian).