Gradient Descent
2022-01-12 . 3 min read
What is Gradient Descent?
Gradient descent is an optimization algorithm used widely in machine learning. By optimization, we mean finding the maximum or minimum value of a function; in this case, we are descending to find the minimum.
But which function are we optimizing?
First, let's see how we train an ML model. Training an ML model means finding the values of its parameters that give correct predictions on our data.
But what are these parameters?
For the sake of simplicity, let's take a very simple example: linear regression with a single variable. Say we want to predict weight given height as input, and we gathered some data on people's heights and weights in some unit, which when plotted looks like:
Judging from the plot, we can assume that the weight is almost a linear function of height.
Our hypothesis looks like this:

$$h_\theta(x) = \theta_0 + \theta_1 x$$
What we want to do is find the value of the parameter vector $\theta$ that gives us the best fit to our data. For this case, we want to find a straight line that can approximately fit all of our data and predict values for new inputs. Below, we can see two straight lines, one of which fits the points much better than the other.
The red and green lines correspond to different values of the parameter vector $\theta$. We can clearly see that the red line fits the data well, but how do we train our model to tell which value of $\theta$ fits our data with less error?
Here, we already have the answer: we calculate the error we get in our prediction. A prediction error is a number that tells us how far our predicted value is from the actual one, the difference. Averaging the squared errors over all training examples gives us the Mean Squared Error (MSE) cost function:

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

- $m$ - number of training examples
- $h_\theta(x^{(i)})$ - model prediction for the $i^{th}$ training example
- $y^{(i)}$ - label, or the actual value, for the $i^{th}$ training example
Now, we use gradient descent to minimize this cost function, that is, to find the $\theta$ that minimizes $J(\theta)$. What we do in gradient descent is iterate, updating $\theta$ on every iteration until $J(\theta)$ converges to a small acceptable value:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
Here, $\alpha$ is the learning rate or the step size; it tells us by how much we should step during the descent. The partial derivative gives us the direction along the $\theta_j$ axis in which the cost increases, so we should move in the opposite direction, and thus we subtract it from the previous value of $\theta_j$ to get its new value. For an in-depth understanding of this, see my blog post on Calculus Behind Gradient Descent.
$\theta$ is a vector of $n+1$ dimensions, $n$ being the number of features of our data. In the above example, we have only one feature, the height, to predict the weight, which means $\theta$ is a 2-dimensional vector, and on every iteration of gradient descent, we update it as:

$$\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$$

$$\theta_1 := \theta_1 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}$$
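These simultaneous updates can be sketched in a few lines of NumPy. The heights, weights, learning rate, and iteration count below are made-up illustrations, not the values behind the plots in this post:

```python
import numpy as np

# Toy height -> weight data (made-up values, just for illustration)
X = np.array([150.0, 160.0, 165.0, 170.0, 180.0, 190.0])  # heights
y = np.array([50.0, 56.0, 60.0, 63.0, 72.0, 80.0])        # weights

def batch_gradient_descent(X, y, alpha=1e-5, iterations=1000):
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        error = (theta0 + theta1 * X) - y      # h_theta(x) - y for every example
        # Simultaneous update of both parameters
        theta0 -= alpha * error.sum() / m
        theta1 -= alpha * (error * X).sum() / m
    return theta0, theta1

theta0, theta1 = batch_gradient_descent(X, y)
```

Note that both gradients are computed from the *old* parameter values before either parameter is updated; updating $\theta_0$ first and then reusing it for $\theta_1$ would be a subtle bug.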
For those who are not yet confident in calculus: finding the partial derivative of the cost function above uses the power rule and the chain rule. Please refer to the power rule and chain rule material on Khan Academy.
Stochastic Gradient Descent
The above is an example of batch gradient descent, where on every update we sum over all the training examples. For large data sets, this can be very computationally expensive. So when we have millions of training examples, we can use a different version of gradient descent where we focus on only one example per iteration. For a single example $(x^{(i)}, y^{(i)})$, the cost is:

$$\text{cost}\left(\theta, (x^{(i)}, y^{(i)})\right) = \frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
Now, we iterate through every training example and update $\theta$ on every iteration:

$$\theta_j := \theta_j - \alpha\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$
What stochastic gradient descent does differently than batch gradient descent is that, instead of trying to fit all the examples a little bit better on every iteration of the algorithm, it tries to fit a single training example at a time, and does the same for every training example.
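That per-example loop can be sketched as below, reusing the same made-up toy data. Shuffling the visit order each epoch is a common refinement I've added, not something the text above specifies:

```python
import numpy as np

# Same made-up toy data as before
X = np.array([150.0, 160.0, 165.0, 170.0, 180.0, 190.0])
y = np.array([50.0, 56.0, 60.0, 63.0, 72.0, 80.0])

def stochastic_gradient_descent(X, y, alpha=1e-5, epochs=50):
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(m):                  # visit examples in random order
            error = (theta0 + theta1 * X[i]) - y[i]   # error on this one example
            theta0 -= alpha * error
            theta1 -= alpha * error * X[i]
    return theta0, theta1
```

Because each update looks at only one example, the cost bounces around rather than decreasing smoothly, but on average it still heads toward the minimum.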
Mini-Batch Gradient Descent
It sits between batch gradient descent and stochastic gradient descent. Here, instead of summing over all the training examples or using just one, we sum over the training examples of some batch size $b$. For a batch size of 10:
for i = 1, 11, 21, ..., m-9
    for j = 0:n
        update theta_j using the batch-gradient-descent rule,
        summing over the 10 examples in the current batch
    end
end
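The same loop in NumPy might look like the sketch below, on the same made-up toy data. I use a batch size of 2 here so the tiny dataset splits evenly, whereas the pseudocode above uses b = 10:

```python
import numpy as np

# Same made-up toy data as before
X = np.array([150.0, 160.0, 165.0, 170.0, 180.0, 190.0])
y = np.array([50.0, 56.0, 60.0, 63.0, 72.0, 80.0])

def minibatch_gradient_descent(X, y, alpha=1e-5, epochs=200, b=2):
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(epochs):
        for start in range(0, m, b):            # i = 1, 1+b, 1+2b, ... in the pseudocode
            Xb, yb = X[start:start + b], y[start:start + b]
            error = (theta0 + theta1 * Xb) - yb
            # Batch-gradient-descent rule, averaged over just this mini-batch
            theta0 -= alpha * error.sum() / len(Xb)
            theta1 -= alpha * (error * Xb).sum() / len(Xb)
    return theta0, theta1
```

Dividing by `len(Xb)` rather than `b` keeps the update correct when the last batch comes up short of a full `b` examples.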
Checking the implementation
Now, if our implementation of gradient descent is correct, the cost should decrease on every iteration of the algorithm. So, when plotting $J(\theta)$ vs the number of iterations, we should get a plot that looks something like this:
Here, we see that the cost function converges in only about 40-50 iterations, but this depends on the application, the algorithm, and the learning rate $\alpha$. For the plot above, we can say we chose a good value of $\alpha$, not just because $J(\theta)$ converges but also because it converges very fast. If we choose a very big value of $\alpha$, instead of decreasing, the cost might increase or oscillate continuously. But if $\alpha$ is very small, gradient descent might take too long to converge.
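One way to run this check without a plot is to record $J(\theta)$ on every iteration and confirm it never increases. A minimal sketch on the same made-up toy data:

```python
import numpy as np

# Same made-up toy data as before
X = np.array([150.0, 160.0, 165.0, 170.0, 180.0, 190.0])
y = np.array([50.0, 56.0, 60.0, 63.0, 72.0, 80.0])

def gradient_descent_with_history(X, y, alpha=1e-5, iterations=100):
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    costs = []
    for _ in range(iterations):
        error = (theta0 + theta1 * X) - y
        costs.append((error ** 2).sum() / (2 * m))   # J(theta) before this update
        theta0 -= alpha * error.sum() / m
        theta1 -= alpha * (error * X).sum() / m
    return costs

costs = gradient_descent_with_history(X, y)
```

If `costs[i+1] > costs[i]` anywhere, that's a sign $\alpha$ is too large or the gradient computation has a bug.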
References
- Neural Networks and Deep Learning Book - Michael Nielsen
- Neural Network Video Series - 3b1b
- The Internet
Hey, I assume you finished reading! I would love to hear your feedback, and if you found any error or mistake in this blog post, please do not hesitate to reach out to me.