## Tuesday, November 4, 2014

### Machine Learning: Linear Regression

A small piece of machine learning: linear regression.
(More like a note to myself hehe)
(Prof. Andrew Ng of Stanford University is damn awesome!)

In linear regression, we want to come up with hypothesis $$h_{\theta} ( \vec{x} ) = \sum_{i=0}^n \theta_i x_i$$ which minimizes the least square function $$J(\theta) = \frac{1}{2} \sum_{i=0}^{m} (h_{\theta}(\vec{x}^i) - y^i)^2$$. To do so, we employ an algorithm called the gradient descent algorithm, which seeks to minimizes $$J(\theta)$$ by incrementally updating $$\theta$$ by going down the steepest gradient. For each $$j$$, we update
$$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
until $$J(\theta)$$ converges. $$\alpha$$ is called the learning rate which determines the rate of convergence (somewhat, but if $$\alpha$$ is too big, the algorithm becomes very unstable and diverges). The algorithm is usually repeated in several (to thousands) of iterations.

If we operate the partial derivatives, we get
$$\theta_j = \theta_j - \alpha \sum_{i=0}^m (h_{\theta}(x^i) - y^i)x_j^i$$
From here we have two choices of implementing the algorithm: by batch gradient descent or by stochastic gradient descent, depending on whether we want to iterate through all m data in the training set, or we want to process each one at a time.
I think choosing the appropriate $$\alpha$$ and number of iteration is the key for this algorithm to run and output satisfactory hypothesis function.