Showing posts with label Machine Learning.

Friday, July 10, 2015

Machine Learning: Logistic Regression

Logistic regression is a type of linear classifier. Here I would like to describe the process of implementing a multi-class logistic regression learning algorithm.

In logistic regression, \(P(G=k|X=x)\) is modelled as follows:
For \(k = 1, \ldots, K-1\),
\[  P(G=k|X=x) = \frac{ \exp \left\{ \beta_k^Tx  \right\} } {1 + \sum_{i=1}^{K-1} \exp \left\{ \beta_i^Tx \right\} }   \]
and for the final class,
\[ P(G=K|X=x) = \frac{ 1 } {1 + \sum_{i=1}^{K-1} \exp \left\{ \beta_i^Tx \right\} } \]
Here \(x\) and the \(\beta_k\) are all \((d+1)\)-vectors, where \(d\) is the dimension of the input data and the additional component accounts for the intercept term (or bias).

Let \( \beta \) be an array which combines all the \( \beta_k \), i.e.
\[ \beta^T = \{ \beta_1^T, \beta_2^T, \ldots, \beta_{K-1}^T   \}   \]
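As a quick illustration, here is a minimal sketch (in Python with NumPy) of how the class probabilities above could be computed; the function name and the choice to store \(\beta\) as a \((K-1)\times(d+1)\) matrix, one row per \(\beta_k\), are just conventions for this sketch.

```python
import numpy as np

def class_probabilities(beta, x):
    """P(G=k | X=x) for k = 1..K under the model above.

    beta : (K-1, d+1) array, row k holds beta_k (intercept included)
    x    : (d+1,) input vector with a leading 1 for the intercept
    Returns a length-K array of probabilities that sums to 1.
    """
    scores = np.exp(beta @ x)              # exp(beta_k^T x), k = 1..K-1
    denom = 1.0 + scores.sum()             # 1 + sum_i exp(beta_i^T x)
    return np.append(scores, 1.0) / denom  # last entry is class K

# with all weights zero, every class gets probability 1/K
beta = np.zeros((2, 3))                    # K = 3 classes, d = 2 features
x = np.array([1.0, 0.5, -1.2])             # leading 1 = intercept
print(class_probabilities(beta, x))        # [1/3, 1/3, 1/3]
```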

Saturday, May 2, 2015

Machine Learning: Logistic Regression for Classification

Well, today I spent a significant amount of time teaching my Mac to distinguish 0 from 1. The motivation was mostly to check my own understanding of logistic regression.

The logistic regression model for two classes can be summarised as follows:
Let \(P(Y=1 | X=x) = h_w(x) \) where \(h_w(x) = g(w^Tx)\) and \(g(x) = \frac{1}{1+e^{-x}}\). The function \(g(x)\) is called the sigmoid function and takes values between 0 and 1, so it can be used to map a real value to a probability; it has many other characteristics that make it well suited for this purpose. Naturally we have \(P(Y=0|X=x) = 1-h_w(x)\), since we are assuming only two classes.
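In code the model amounts to very little; here is a minimal sketch in Python (the helper names and the 0.5 decision threshold are just my own choices for the sketch):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}), mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(w, x):
    """h_w(x) = g(w^T x): the modelled probability P(Y=1 | X=x)."""
    return sigmoid(w @ x)

def predict(w, x):
    """Classify as 1 when P(Y=1 | X=x) >= 0.5, otherwise 0."""
    return int(h(w, x) >= 0.5)
```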

Wednesday, April 22, 2015

Machine Learning: Linear Regression with Regularization

In this post I would like to make a mental note on linear regression; the key focus will be the algorithm for fitting the regularised version of linear regression.

Fundamentally, linear regression seeks to fit a set of data samples by determining the (affine) hyperplane that minimises the mean squared error (MSE) between the data points and the plane. In statistics we seek the regression function \( r(x) = E(Y|X = x) \), and if \( Y|X \) follows a normal distribution, it can be shown that the minimiser of the MSE is indeed the maximum likelihood estimator (MLE) of the regression.
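To make the regularised version concrete, here is a minimal sketch of ridge (L2-penalised) linear regression fitted via its closed form; this is a generic sketch rather than the exact algorithm derived later in the post, and the penalty strength `lam` and the choice not to penalise the intercept are my own assumptions:

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Ridge regression: minimise ||X w - y||^2 + lam * ||w||^2.
    The intercept column of ones is assumed to be the first column
    of X and is left unpenalised here."""
    n_features = X.shape[1]
    penalty = lam * np.eye(n_features)
    penalty[0, 0] = 0.0               # do not shrink the intercept
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

# toy usage: y ~ 2 + 3x with a little noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 2 + 3 * x + 0.1 * rng.standard_normal(50)
print(fit_ridge(X, y, lam=0.1))       # roughly [2, 3]
```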

Friday, April 10, 2015

Machine Learning: Support Vector Machine

I have been finding time to understand the concept behind the support vector machine (SVM) and finally implemented the learning algorithm in Python. I'll write this post to review my understanding so far.

The support vector machine, in its simplest form, is a type of linear classifier. The intuition behind SVM is usually demonstrated in machine learning books by asking readers this question: amongst several correct linear classifiers, which one would you think is the best? We tend to prefer the one with the bigger margin between the classifier hyperplane and the points closest to it. This preference is then formalised mathematically, and in the end it reduces to a convex optimisation problem, in particular one solvable using a quadratic programming algorithm.
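To illustrate that reduction, here is a minimal sketch of a hard-margin linear SVM trained by handing the dual QP to the cvxopt solver; this is not necessarily how my own implementation works, and the toy data and the \(10^{-6}\) threshold for picking support vectors are assumptions of the sketch:

```python
import numpy as np
from cvxopt import matrix, solvers

def train_hard_margin_svm(X, y):
    """Solve the dual: max sum(a) - 1/2 sum_ij a_i a_j y_i y_j <x_i, x_j>
    subject to a_i >= 0 and sum_i a_i y_i = 0, with labels y in {-1, +1}."""
    n = X.shape[0]
    K = X @ X.T                                # Gram matrix of dot products
    P = matrix(np.outer(y, y) * K)             # quadratic term of the QP
    q = matrix(-np.ones(n))                    # minimise -sum(a)
    G = matrix(-np.eye(n))                     # -a_i <= 0, i.e. a_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1).astype(float)) # equality: sum_i a_i y_i = 0
    b = matrix(0.0)
    solvers.options["show_progress"] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])

    w = (alpha * y) @ X                        # w = sum_i a_i y_i x_i
    sv = alpha > 1e-6                          # support vectors have a_i > 0
    bias = np.mean(y[sv] - X[sv] @ w)          # b recovered from support vectors
    return w, bias

# toy linearly separable data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_hard_margin_svm(X, y)
print(np.sign(X @ w + b))                      # recovers the labels
```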

Tuesday, March 31, 2015

Machine Learning - Neural Networks

I spent the whole night yesterday trying to understand and implement fully, in principle, the underlying mechanism of neural networks, one of the techniques in machine learning. And not without significance, because the introduction of neural networks was the turning point in history that made machine learning popular. While it is now not as preferred a method compared to support vector machines (SVM) with kernels, radial basis functions (RBF), and other newer methods, I would say that the attractiveness of this model lies in the simplicity with which it can be implemented, and also in the kind of intuition that it offers. This post will serve as a checkpoint for me, in which I will narrate and describe mathematically what I understand about the mechanism.

Saturday, November 8, 2014

Machine Learning: Closed form of Linear Regression

A follow up from the previous Machine Learning Post: Machine Learning: Linear Regression

(Credit: Thanks to Stanford University for their online course and Prof. Andrew Ng for great materials and discussions. All credit goes to them. I am merely posting my thoughts for future reference)

In linear regression, we want to find \(\theta\) such that the least squares function \(J(\theta) = \frac{1}{2} \sum_i (h_{\theta}(x_i) - y_i)^2 \) is minimized. One way to go is by implementing the gradient descent algorithm:
for each j:
     update \(\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \)
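Since this post is about the closed form, here is a minimal sketch of the normal-equation solution \( \theta = (X^TX)^{-1}X^Ty \), which minimises \(J(\theta)\) without any iteration; the toy data and the function name are just illustrations:

```python
import numpy as np

def closed_form_theta(X, y):
    """theta = (X^T X)^{-1} X^T y, the closed-form minimiser of J(theta).
    X is the design matrix with a leading column of ones for theta_0."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# toy usage: recover theta = [1, 2] from y = 1 + 2x
x = np.linspace(0, 1, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x
print(closed_form_theta(X, y))   # approximately [1.0, 2.0]
```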

Tuesday, November 4, 2014

Machine Learning: Linear Regression

A small piece of machine learning: linear regression.
(More like a note to myself hehe)
(Prof. Andrew Ng of Stanford University is damn awesome!)

In linear regression, we want to come up with a hypothesis \(h_{\theta} ( \vec{x} ) = \sum_{i=0}^n \theta_i x_i \) which minimizes the least squares function \(J(\theta) = \frac{1}{2} \sum_{i=1}^{m} (h_{\theta}(\vec{x}^i) - y^i)^2 \), where the sum runs over the \(m\) training examples. To do so, we employ an algorithm called gradient descent, which seeks to minimize \(J(\theta)\) by incrementally updating \(\theta\) in the direction of steepest descent. For each \(j\), we update
\(\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \)
until \(J(\theta)\) converges. \(\alpha\) is called the learning rate and determines the rate of convergence (somewhat; if \(\alpha\) is too big, the algorithm becomes unstable and diverges). The algorithm is usually run for several (up to thousands of) iterations.
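For reference, here is a minimal sketch of that update in Python with NumPy; the vectorised gradient \(X^T(X\theta - y)\) follows from differentiating \(J(\theta)\), and the toy data and the values of \(\alpha\) and the iteration count are just illustrative choices:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=5000):
    """Batch gradient descent on J(theta) = 1/2 * sum_i (h_theta(x^i) - y^i)^2.
    X is the m x (n+1) design matrix with x_0 = 1 for the intercept."""
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        error = X @ theta - y        # h_theta(x^i) - y^i for every example
        gradient = X.T @ error       # partial dJ/dtheta_j for every j
        theta -= alpha * gradient    # simultaneous update of all theta_j
    return theta

# toy usage: y = 1 + 2x, expect theta close to [1, 2]
x = np.linspace(0, 1, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x
print(gradient_descent(X, y, alpha=0.05, iterations=5000))
```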