A small piece of machine learning: linear regression.

(More like a note to myself hehe)

(Prof. Andrew Ng of Stanford University is damn awesome!)

In linear regression, we want to come up with hypothesis \(h_{\theta} ( \vec{x} ) = \sum_{i=0}^n \theta_i x_i \) which minimizes the least square function \(J(\theta) = \frac{1}{2} \sum_{i=0}^{m} (h_{\theta}(\vec{x}^i) - y^i)^2 \). To do so, we employ an algorithm called the gradient descent algorithm, which seeks to minimizes \(J(\theta)\) by incrementally updating \(\theta\) by going down the steepest gradient. For each \(j\), we update

\(\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \)

until \(J(\theta)\) converges. \(\alpha\) is called the learning rate which determines the rate of convergence (somewhat, but if \(\alpha\) is too big, the algorithm becomes very unstable and diverges). The algorithm is usually repeated in several (to thousands) of iterations.

If we operate the partial derivatives, we get

\(\theta_j = \theta_j - \alpha \sum_{i=0}^m (h_{\theta}(x^i) - y^i)x_j^i \)

From here we have two choices of implementing the algorithm: by batch gradient descent or by stochastic gradient descent, depending on whether we want to iterate through all m data in the training set, or we want to process each one at a time.

I think choosing the appropriate \(\alpha\) and number of iteration is the key for this algorithm to run and output satisfactory hypothesis function.

Follow up post:

Machine Learning: Closed form Linear Regression