Gradient Descent

2022-01-12 · 3 min read

Descend Down the Paraboloid

What is Gradient Descent?


Gradient descent is an optimization algorithm that we use in machine learning. By optimization, we mean finding the maximum or minimum value of a function; in this case, we are descending to find the minimum value.

But which function are we optimizing?

First, let's see how we train our ML model. Training an ML model means finding the values of the parameters that give correct predictions on our data.

But what are these parameters?

For the sake of simplicity, I am taking a very simple example: linear regression with a single variable. Say we want to predict weight given height as input, and say we gathered some data on people's height and weight in some unit, which when plotted looks like:

Scatter Plot: Weight vs Height

Judging from the plot, we can assume that weight is almost a linear function of height.

Our hypothesis looks like this:

h(x) = \theta_{0} + \theta_{1} x_{1}

What we want to do is find the value of the parameter vector theta (\theta) that gives us the best fit to our data. For this case, we want to find a straight line that can approximately fit all of our data and predict values for new inputs. Below, we can see two straight lines, where one fits the points better than the other.

Straight lines fitting the data

The parameter vector theta for the red line is \begin{bmatrix}1 \\ 2 \end{bmatrix} and for the green line is \begin{bmatrix}-2 \\ 3 \end{bmatrix}. We can clearly see that the red line fits the data well, but how do we train our model to tell which value of the parameter theta fits our data with less error?

Here, we already have the answer: we calculate the error we get in our prediction. A prediction error is a number that tells us how far our predicted value is from the actual one, i.e., the difference. A cost function, sometimes called the error function or the loss function, is used to measure this error, or, we can say, the penalty we have to pay when using a certain set of theta. In the above example, using \begin{bmatrix}-2 \\ 3 \end{bmatrix} gives us a larger error than using \begin{bmatrix}1 \\ 2 \end{bmatrix}. Here, I will be using the mean square cost function, denoted by the letter J, which gives us the Mean Squared Error (MSE).

J(\theta) = \dfrac{1}{2m} \sum_{i=1}^{m} \left( h(x)^{(i)} - y^{(i)} \right)^2
  • m - the number of training examples
  • h(x)^{(i)} - the model prediction for the i^{th} training example
  • y^{(i)} - the label, or actual value, for the i^{th} training example
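To make this concrete, here is a minimal NumPy sketch of the hypothesis and the cost function above. The height/weight numbers are made up purely for illustration:

```python
import numpy as np

# Toy height/weight data, made up purely for illustration.
heights = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
weights = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

def h(x, theta):
    """Hypothesis h(x) = theta_0 + theta_1 * x_1."""
    return theta[0] + theta[1] * x

def cost(theta, x, y):
    """Mean squared error cost J(theta) = (1/2m) * sum((h(x)^(i) - y^(i))^2)."""
    m = len(y)
    return np.sum((h(x, theta) - y) ** 2) / (2 * m)

print(cost(np.array([1.0, 2.0]), heights, weights))   # red line: small error
print(cost(np.array([-2.0, 3.0]), heights, weights))  # green line: larger error
```

On this toy data, the first call prints a much smaller cost than the second, matching the red-vs-green comparison above.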

Now, we use gradient descent to optimize this cost function J(\theta), i.e., to find the \theta that minimizes J. What we do in gradient descent is iterate, updating \theta on every iteration, until J(\theta) converges to a small acceptable value.

\theta_{k} := \theta_{k} - \alpha \dfrac{\partial}{\partial \theta_{k}} J(\theta)

Here, \alpha is the learning rate, or step size; it tells us how big a step to take during the descent. The partial derivative gives us the direction along the \theta_{k} axis in which the cost increases, so we should move in the opposite direction, and thus we subtract it from the previous value of \theta_{k} to get its new value. For an in-depth understanding of this, see my blog post on Calculus Behind Gradient Descent.

\theta is a vector of n+1 dimensions, n being the number of features in our data. In the above example, we have only one feature, the height, to predict the weight, which means \theta is a 2-dimensional vector, and on every iteration of gradient descent we update \theta_{0} and \theta_{1} as:

\theta_{0} := \theta_{0} - \dfrac{\alpha}{m} \sum_{i=1}^{m} \left( h(x)^{(i)} - y^{(i)} \right)
\theta_{1} := \theta_{1} - \dfrac{\alpha}{m} \sum_{i=1}^{m} \left( h(x)^{(i)} - y^{(i)} \right) x_{1}^{(i)}
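As a sketch, these two update rules look like this in code (reusing h and the toy data from the snippet above; the values of alpha and the iteration count are arbitrary choices, not from the derivation):

```python
def batch_gradient_descent(x, y, alpha=0.05, iterations=1000):
    """Batch gradient descent for single-variable linear regression."""
    m = len(y)
    theta = np.zeros(2)                      # theta = [theta_0, theta_1]
    for _ in range(iterations):
        error = h(x, theta) - y              # h(x)^(i) - y^(i) for all i at once
        grad0 = np.sum(error) / m            # partial derivative w.r.t. theta_0
        grad1 = np.sum(error * x) / m        # partial derivative w.r.t. theta_1
        theta = theta - alpha * np.array([grad0, grad1])  # simultaneous update
    return theta

theta = batch_gradient_descent(heights, weights)
print(theta)  # roughly [1, 2] on the toy data, the red line's parameters
```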

For those who are not confident in calculus: the power rule and the chain rule are used to find the partial derivative of the cost function above. Please refer to this and this content on Khan Academy.

Stochastic Gradient Descent


The above is an example of batch gradient descent, where on every update we sum over all m training examples. For large data sets this can be very computationally expensive, so when we have millions of training examples we can use a different version of gradient descent that focuses on only one example per update. For a single example i, the cost is:

J(\theta) = \dfrac{1}{2} \left( h(x)^{(i)} - y^{(i)} \right)^2

Now, we iterate through every training example and update theta on every iteration.

\theta_{k} := \theta_{k} - \alpha \left( h(x)^{(i)} - y^{(i)} \right) x_{k}^{(i)}

What stochastic gradient descent does differently from batch gradient descent is that instead of trying to fit all the examples a little better on every iteration of the algorithm, it tries to fit a single training example at a time, and does the same for every training example.
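A minimal sketch of this under the same setup (again reusing h and the toy data; shuffling the examples each pass is a common convention I am adding, not something stated above):

```python
def stochastic_gradient_descent(x, y, alpha=0.01, epochs=100):
    """SGD: update theta using one training example at a time."""
    m = len(y)
    theta = np.zeros(2)
    for _ in range(epochs):
        for i in np.random.permutation(m):   # visit examples in random order
            error = h(x[i], theta) - y[i]    # error on the single example i
            # gradient for one example: [error * 1, error * x^(i)]
            theta = theta - alpha * error * np.array([1.0, x[i]])
    return theta
```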

Mini-Batch Gradient Descent


It sits between batch gradient descent and stochastic gradient descent. Here, instead of summing over all m training examples or using just one, we sum over the training examples of some batch size b. For a batch size of 10:

for i = 1, 11, 21, ..., m-9
	for k = 0:n
		update theta_k using the mini-batch rule below
	end
end
\theta_{k} := \theta_{k} - \dfrac{\alpha}{b} \sum_{j=i}^{i+b-1} \left( h(x)^{(j)} - y^{(j)} \right) x_{k}^{(j)}
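A sketch of the loop above (reusing h from earlier; the slicing takes care of a final batch smaller than b when m is not a multiple of the batch size):

```python
def mini_batch_gradient_descent(x, y, alpha=0.01, b=10, epochs=100):
    """Mini-batch gradient descent: average the gradient over b examples."""
    m = len(y)
    theta = np.zeros(2)
    for _ in range(epochs):
        for start in range(0, m, b):         # i = 1, 11, 21, ... in the pseudocode
            xb = x[start:start + b]          # current batch of up to b examples
            yb = y[start:start + b]
            error = h(xb, theta) - yb
            grad0 = np.sum(error) / len(yb)
            grad1 = np.sum(error * xb) / len(yb)
            theta = theta - alpha * np.array([grad0, grad1])
    return theta
```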

Checking the Implementation


Now, if our implementation of gradient descent is correct, the cost should decrease on every iteration of the algorithm. So, when plotting J(\theta) vs the number of iterations, we should get a plot that looks something like this:

J (in 100s) vs iterations

Here, we see that the cost function converges in only about 40-50 iterations, but this depends on the application, the algorithm, and the learning rate \alpha. For the plot above, we can say we chose a good value of alpha, not just because J(\theta) converges but because it converges very fast. If we choose a very large value of alpha, instead of decreasing, the cost might increase or oscillate continuously:

Large Alpha: J vs iterations

But if \alpha is very small, then gradient descent might take too long to converge.
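One simple way to check all of this is to record J(\theta) on every iteration and plot the curves for a few learning rates. A sketch, again reusing h and the toy data from earlier (the alpha values are arbitrary):

```python
import matplotlib.pyplot as plt

def gradient_descent_with_history(x, y, alpha, iterations=100):
    """Batch gradient descent that also records J(theta) on every iteration."""
    m = len(y)
    theta = np.zeros(2)
    history = []
    for _ in range(iterations):
        error = h(x, theta) - y
        history.append(np.sum(error ** 2) / (2 * m))  # J(theta) before the update
        theta = theta - alpha * np.array([np.sum(error) / m,
                                          np.sum(error * x) / m])
    return theta, history

# Compare convergence for a few (arbitrary) learning rates.
for alpha in (0.001, 0.01, 0.1):
    _, history = gradient_descent_with_history(heights, weights, alpha)
    plt.plot(history, label=f"alpha = {alpha}")
plt.xlabel("iterations")
plt.ylabel("J")
plt.legend()
plt.show()
```

A curve that flattens out quickly indicates a well-chosen alpha; one that grows or oscillates indicates alpha is too large, and one that creeps down slowly indicates it is too small.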


Hey, I assume you finished reading. I would love to hear your feedback, and if you found any error or mistake in this blog post, please do not hesitate to reach out to me.