
Wednesday, April 22, 2015

Machine Learning: Linear Regression with Regularization

In this post I would like to make a mental note on linear regression, with a key focus on the algorithm for fitting its regularised version.

Fundamentally, linear regression seeks to fit a set of data samples by determining the (affine) hyperplane that minimises the mean squared error (MSE) between the data points and the plane. In statistics we seek to find the parametric regression function \( r(x) = E(Y \mid X = x) \), and if \( Y \mid X \) follows a normal distribution, it can be shown that the minimum-MSE estimator is indeed the maximum likelihood estimator (MLE) of the regression.
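The regularised (ridge) version penalises large weights by adding an L2 term \( \lambda \|\theta\|^2 \) to the MSE objective, which has the closed-form solution \( \theta = (X^T X + \lambda I)^{-1} X^T y \). A minimal NumPy sketch of this idea (the function name and the choice to leave the intercept unpenalised are my own illustration, not from the course):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: solve (X^T X + lam*I) theta = X^T y.

    A column of ones is prepended so the intercept theta_0 is fitted,
    and (as is conventional) the intercept is not penalised.
    """
    n = X.shape[0]
    Xb = np.hstack([np.ones((n, 1)), X])   # bias column for theta_0
    I = np.eye(Xb.shape[1])
    I[0, 0] = 0.0                          # do not penalise the intercept
    return np.linalg.solve(Xb.T @ Xb + lam * I, Xb.T @ y)
```

With \( \lambda = 0 \) this reduces to ordinary least squares; increasing \( \lambda \) shrinks the slope coefficients toward zero.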

Saturday, November 8, 2014

Machine Learning: Closed form of Linear Regression

A follow up from the previous Machine Learning Post: Machine Learning: Linear Regression

(Credit: Thanks to Stanford University for their online course and Prof. Andrew Ng for great materials and discussions. All credit goes to them. I am merely posting my thoughts for future references)

In linear regression, we want to find \(\theta\) such that the least squares function \(J(\theta) = \frac{1}{2} \sum_i (h_{\theta}(x_i) - y_i)^2 \) is minimized. One way to go is by implementing the gradient descent algorithm:
for each j:
     update \(\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \)
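The closed form this post is about sidesteps the iterative updates entirely: setting the gradient of \(J(\theta)\) to zero gives the normal equations \( X^T X \theta = X^T y \). A minimal NumPy sketch (the function name is my own):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least-squares fit: solve X^T X theta = X^T y.

    Gives the same minimiser that gradient descent converges to,
    but directly -- no learning rate, no iterations.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # bias column for theta_0
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
```

Using `np.linalg.solve` rather than explicitly inverting \(X^T X\) is both faster and numerically more stable.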

Tuesday, November 4, 2014

Machine Learning: Linear Regression

A small piece of machine learning: linear regression.
(More like a note to myself hehe)
(Prof. Andrew Ng of Stanford University is damn awesome!)

In linear regression, we want to come up with a hypothesis \(h_{\theta} ( \vec{x} ) = \sum_{i=0}^n \theta_i x_i \) which minimizes the least squares function \(J(\theta) = \frac{1}{2} \sum_{i=1}^{m} (h_{\theta}(\vec{x}^i) - y^i)^2 \). To do so, we employ the gradient descent algorithm, which seeks to minimize \(J(\theta)\) by incrementally updating \(\theta\) in the direction of steepest descent. For each \(j\), we update
\(\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \)
until \(J(\theta)\) converges. Here \(\alpha\) is called the learning rate, which (roughly) controls the rate of convergence; if \(\alpha\) is too large, the algorithm becomes unstable and diverges. The algorithm is usually run for several to thousands of iterations.
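The update rule above can be sketched in a few lines of NumPy; this is a batch version (the whole gradient is computed each step), with illustrative function and parameter names of my own choosing:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, iterations=2000):
    """Batch gradient descent for J(theta) = 1/2 * sum_i (h(x^i) - y^i)^2.

    Each step applies, for every j simultaneously:
        theta_j <- theta_j - alpha * sum_i (h_theta(x^i) - y^i) * x_j^i
    """
    m = X.shape[0]
    Xb = np.hstack([np.ones((m, 1)), X])    # x_0 = 1 gives the intercept
    theta = np.zeros(Xb.shape[1])
    for _ in range(iterations):
        residual = Xb @ theta - y           # h_theta(x^i) - y^i for all i
        theta -= alpha * (Xb.T @ residual)  # gradient of J w.r.t. theta
    return theta
```

Note the learning-rate caveat from above: on this unscaled objective, a larger \(\alpha\) (say 0.2 on a handful of samples) already makes the iterates blow up.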