Calculus Behind Gradient Descent

2022-02-12 · 6 min read

Derivative and Partial Derivative Intuition


We’ve always heard in college lectures that the derivative gives us the rate of change, or the slope of a function, or rise over run. These all mean the same thing: they tell us how fast or slow the output changes when the input changes.

Derivative of Function of Single Variable


A straight line with the equation y = 2x has a constant slope of 2, which means that at any point on the line, changing the value of x by some amount changes y by twice that amount. This is a rather simple example, so let's dive into a more complicated one. Consider the equation of a parabola, y = x^2, whose plot looks like:

Parabola: y = x ^ 2

Now, while you continue reading, imagine you are standing at the base, or the minima, of the parabola at x = 0, where the derivative dy/dx = 0. This means that even if you step a little bit towards the right or left, i.e. in the x direction, you would not have moved much in the y direction. Notice that I wrote not moved much and not zero, even though the change in y given by the derivative dy/dx = 2x at x = 0 must be zero. This is because the definition of the derivative from first principles is the limiting value of the rise, f(x + h) - f(x), over the run, h, as h → 0.

\dfrac{dy}{dx} = \lim_{h \to 0} \dfrac{f(x + h) - f(x)}{h}

To be more clear, say you are still at x = 0 and you move a little bit towards the right by an amount 0.01, so the change in x is 0.01. The change in y is f(0 + 0.01) - f(0) = 0.0001 - 0 = 0.0001. Hence the derivative, rise over run, is 0.0001/0.01 = 0.01.

Now, as we decrease h, i.e. the change in x, the value of the derivative moves closer and closer to 0. Let's take another example with a smaller value, h = 0.0001: the change in y is f(0 + 0.0001) - f(0) = 0.00000001, so rise over run is 0.00000001/0.0001 = 0.0001. Thus we see that the value of rise/run gets closer and closer to 0 as we decrease the change in x. So we have to understand that the derivative is the limiting value of the ratio rise/run as we decrease the run, or change in x.
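We can watch this limit play out numerically. Below is a small illustrative sketch (not from the original post) that estimates rise over run for y = x^2 at x = 0 with shrinking values of h; the estimate shrinks towards the true derivative, 0.

```python
# Finite-difference estimate of dy/dx for y = x^2 at x = 0.
# As h shrinks, rise/run gets closer and closer to the true derivative, 0.
def f(x):
    return x ** 2

for h in [0.01, 0.0001, 0.000001]:
    rise = f(0 + h) - f(0)
    print(h, rise / h)  # rise/run equals h itself for this function at x = 0
```

Each printed estimate is exactly h here, since f(0 + h) - f(0) = h², which is why the ratio heads to 0 as h does.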

Now, let's assume that you climbed up the curve and reached the point x = 2. The curve is a little steeper at this point than it was at x = 0 and has a slope of 4, which means if you take a small step from here, say h = 0.001, then you rise by an amount of roughly 4 × 0.001 = 0.004. As you move up, the steepness increases, and even a small step in the x direction causes you to rise by a significant amount in the y direction.

Derivative of Function of Multiple Variables


Let's jump to functions of multiple variables, and since we were walking on a parabola before, I thought it would be interesting to walk on a paraboloid now. Below we can see the shape of a paraboloid with equation f(x, y) = x^2 + y^2.

Plot of: f(x, y) = x^2 + y^2

Now, how do we calculate the derivative of multivariable functions?

Here, partial derivatives come to our rescue. For a function of multiple variables, we differentiate the function with respect to each of the variables, pretending all the other variables are constant. So for a function of n variables, we get n partial derivatives. Hence, for the paraboloid, which is a function of 2 variables, we get 2 partial derivatives: one with respect to x and the other with respect to y.

\dfrac{\partial}{\partial x} (x^2 + y^2) = 2x \hspace{1cm} \dfrac{\partial}{\partial y} (x^2 + y^2) = 2y

Understanding Partial Derivatives


Khan Academy has brilliant content on this topic here.

Understand that the derivative of a function of a single variable is the building block of the derivative of a function of multiple variables. From first principles, the partial derivatives can be defined as:

\dfrac{\partial}{\partial x} f(x, y) = \lim_{\Delta x \to 0} \dfrac{f(x + \Delta x, y) - f(x, y)}{\Delta x}

\dfrac{\partial}{\partial y} f(x, y) = \lim_{\Delta y \to 0} \dfrac{f(x, y + \Delta y) - f(x, y)}{\Delta y}
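These first-principles definitions translate directly into code. Here is a minimal sketch (my addition, not from the post) that estimates both partial derivatives of f(x, y) = x^2 + y^2 by nudging one variable while holding the other fixed; `partial_x` and `partial_y` are hypothetical helper names.

```python
# Numerical partial derivatives of f(x, y) = x^2 + y^2 from first principles:
# nudge one variable by a small amount, hold the other one fixed.
def f(x, y):
    return x ** 2 + y ** 2

def partial_x(f, x, y, dx=1e-6):
    return (f(x + dx, y) - f(x, y)) / dx

def partial_y(f, x, y, dy=1e-6):
    return (f(x, y + dy) - f(x, y)) / dy

# Analytically, the partials are 2x and 2y; at (1, 2) that is 2 and 4.
print(partial_x(f, 1, 2), partial_y(f, 1, 2))
```

The printed estimates sit within a rounding error of the analytical values 2x = 2 and 2y = 4.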

To understand partial derivatives graphically, when we say pretend y is a constant while differentiating the function with respect to x, imagine slicing the plot of the function with a plane that is parallel to the x-axis and perpendicular to the y-axis, so that for every point on the plane, y has the same constant value. Now, increase your visualizing capacity and imagine what we would see on the cross-section when this plane slices the paraboloid 🤔. We would see a parabola. If the plane passes through y = 0, then every point on that parabola has y = 0. If we instead sliced the paraboloid with a plane parallel to the x-axis intersecting the y-axis at y = 6, then every point on the sliced parabola would be of the form (x, 6). So the slope of this parabola depends only on the changing values of x, and even if the curve had a mind of its own and wanted its slope to depend on y, it never could, because the sliced parabola is stuck with only one value of y.

Now, the derivative of the function with respect to all the variables is:

\dfrac{df(x, y)}{d(x, y)} = \dfrac{\partial f}{\partial x} + \dfrac{\partial f}{\partial y}

or more formally, the change in function with small changes in the variables is:

df(x, y) = \dfrac{\partial f}{\partial x} dx + \dfrac{\partial f}{\partial y} dy

For the paraboloid above, at the point (1, 2), the value of the function is f(1, 2) = 1^2 + 2^2 = 5. For small changes dx = 0.001, dy = 0.001, the change in the function is df = f(1.001, 2.001) - f(1, 2) = 5.006002 - 5 = 0.006002. Now, the change predicted by the partial derivatives is df = 2x dx + 2y dy = 2*1*0.001 + 2*2*0.001 = 0.006 ≈ 0.006002.
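We can verify this worked example directly. The sketch below (an illustrative addition) compares the actual change in the function against the total-differential approximation at (1, 2):

```python
# Check the total-differential approximation df ≈ 2x dx + 2y dy
# for f(x, y) = x^2 + y^2 at the point (1, 2) with dx = dy = 0.001.
def f(x, y):
    return x ** 2 + y ** 2

x, y, dx, dy = 1, 2, 0.001, 0.001
actual_change = f(x + dx, y + dy) - f(x, y)   # 0.006002
approx_change = 2 * x * dx + 2 * y * dy       # 0.006
print(actual_change, approx_change)
```

The tiny leftover, 0.000002, is the dx² + dy² term that the linear approximation ignores; it vanishes faster than dx and dy as the steps shrink.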

Gradient of a function


I won’t be going into much depth about directional derivatives and vector fields, but those who want to understand them in a more intuitive manner can check out this → Directional Derivative on Khan Academy.

As Grant Sanderson says in this video on Khan Academy:

Gradient is a way of packing all the partial derivative information of a function.

We use nabla, the upside-down delta symbol ∇, to represent the gradient operator. The gradient, when applied to a function f, outputs a vector of the partial derivatives of the function with respect to all its variables. Let's work with a function of 2 variables for now.

\nabla f(x, y) = \begin{bmatrix} \dfrac{\partial}{\partial x} f(x, y) \\ \dfrac{\partial}{\partial y} f(x, y) \end{bmatrix}

For a function of n variables, the gradient outputs a vector of n dimensions. We know we can imagine a vector as an arrow pointing in a specific direction, with its length giving us its magnitude, but where does the vector given by the gradient point?

This is where things get interesting: the gradient vector points in the direction of the maximum change in the function from the point (x, y) where the gradient was calculated, i.e. the direction of steepest ascent. Let me help you imagine this.

Consider the following function: f(x, y) = x^2 - y^2.

Plot of: f(x, y) = x^2 - y^2

The gradient of this function is:

\nabla f(x, y) = \begin{bmatrix} 2x \\ -2y \end{bmatrix}

The gradient of the function at (x, y) = (1, 2) is:

\nabla f(1, 2) = \begin{bmatrix} 2 \\ -4 \end{bmatrix}

Below we can see a contour plot of the above function and the gradient vector at the point (x, y) = (1, 2). For a function of 2 variables, a contour plot shows the cross-sections of the function when sliced by planes parallel to the x-y plane; every point (x, y) on a specific contour line gives us the same value of the function. For better intuition and further understanding, this video is recommended.

Contour plot showing the gradient vector at (1, 2)

So we can say that each element in the gradient vector tells us the direction and distance to move along the axis of that component. For the above example, the value of the gradient vector at (1, 2) was \begin{bmatrix} 2 \\ -4 \end{bmatrix}, so from the point (1, 2), we move 2 units in the positive x direction and 4 units in the negative y direction to ascend by the maximum amount.
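As a quick sanity check, here is an illustrative sketch (not part of the original post) that estimates the gradient of f(x, y) = x^2 - y^2 at (1, 2) with finite differences; `gradient` is a hypothetical helper name.

```python
# Numerical gradient of f(x, y) = x^2 - y^2 at (1, 2);
# analytically it is [2x, -2y] = [2, -4].
def f(x, y):
    return x ** 2 - y ** 2

def gradient(f, x, y, h=1e-6):
    gx = (f(x + h, y) - f(x, y)) / h   # partial with respect to x
    gy = (f(x, y + h) - f(x, y)) / h   # partial with respect to y
    return gx, gy

print(gradient(f, 1, 2))
```

The two components come out within rounding error of 2 and -4, matching the analytical gradient above.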

If you are still confused by all this, consider a function of a single variable: the parabola y = x^2, which has derivative f'(x) = 2x. For positive and negative values of x, calculate the value of the derivative and notice in which direction along the x-axis the gradient would move you to increase the function.

Vector Fields


We know that the gradient is a vector of the partial derivatives, and when we draw the downscaled gradient vector for every point on the function, we get a vector field where each vector points in the direction of steepest ascent. Below, we can see the gradient vector field for the function f(x, y) = xy.

Gradient Vector Field for f(x,y) = xy

The above image is clipped from here
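To compute the arrows in such a field ourselves, we only need the gradient at each grid point. A minimal sketch (my addition), using the fact that for f(x, y) = xy the gradient is (y, x):

```python
# Gradient vector field for f(x, y) = x*y: the partials are
# df/dx = y and df/dy = x, so the arrow at each point is (y, x).
def grad_xy(x, y):
    return (y, x)

# Gradient arrows over a small integer grid, keyed by grid point.
field = {(x, y): grad_xy(x, y) for x in range(-2, 3) for y in range(-2, 3)}
print(field[(1, 2)])   # (2, 1)
```

In a real plot these arrows would be downscaled so neighbouring vectors don't overlap, but their directions would match the image above.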

Directional Derivative


Suppose you are climbing a hill. At a certain arbitrary point on the hill, there are many directions you could take to climb, and for every different direction you take, you ascend by a different amount. We can calculate this amount of change along any direction using the directional derivative, but among all these directions, there is one specific direction that gives us the maximum change. As already stated, this is the direction pointed out by the gradient vector. The directional derivative is calculated as the dot product of the gradient and a unit direction vector. To understand why the gradient actually points in the direction of maximum change, or steepest ascent, this content from Khan Academy is recommended.

Applying it all to ML


In ML, we use an optimization objective which we call the cost function. The cost function is a function of the learning parameters θ, or the weights. A simple cost function used to calculate the cost of a model's predictions for a value of θ is the mean squared error cost function:

J(\theta) = \dfrac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2

We want the cost we pay for our predictions, i.e. the measure of the difference between our predicted values and the actual values, to be as small as possible. To minimize this cost function, we can use the same theory of the gradient, which gives us the direction of maximum ascent, but instead of ascending we go in the opposite direction and descend towards the minima of the function. So we iterate a number of times; on each iteration we find the gradient, descend, and update θ with a new set of values until the cost converges to an optimum value, the ideal goal being to reach the minima where the cost is zero. Gradient descent is applied as:

\theta_{k} := \theta_{k} - \alpha \dfrac{\partial}{\partial \theta_{k}} J(\theta)

We know the gradient gives a vector pointing in the direction of the maximum increase of the function, so each component of the vector must also point in the direction of maximum change along its axis. The cost function is a function of the parameters θ, so in gradient descent, for every parameter θ_k, consider the locus of the cost function along the axis of θ_k; for a paraboloid cost function, it is a parabola.

Parabola: J = theta^2

The derivative is 2θ, so for positive θ the derivative tells us to increase θ to increase J, and for negative θ it tells us to decrease θ to increase J. But what we want is to decrease J, so notice the "-" sign in the implementation of gradient descent above: it means move in the direction opposite to the gradient, with the value α scaling the magnitude of the movement along that direction. Since we scale our step rather than moving by exactly the amount of the gradient, we do not move in a smooth straight line towards the minima but take zig-zag steps, bigger than the gradient itself, which can be seen on the contour plot below (considering that the cost function is a paraboloid).
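Putting the update rule into code makes the sign flip concrete. Here is a minimal sketch (an illustrative addition, with an arbitrary starting point and learning rate) of gradient descent on the one-parameter cost J(θ) = θ², whose derivative is 2θ:

```python
# Minimal gradient descent on J(theta) = theta^2.
# The update theta := theta - alpha * dJ/dtheta walks against the gradient.
def dJ(theta):
    return 2 * theta

theta = 3.0     # arbitrary starting point on the parabola
alpha = 0.1     # learning rate, scales the step size
for _ in range(100):
    theta = theta - alpha * dJ(theta)

print(theta)    # very close to 0, the minimum of J
```

Each iteration multiplies θ by (1 - 2α) = 0.8, so the parameter shrinks geometrically towards the minimum at θ = 0; a too-large α would instead overshoot and diverge.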

Descending to minima with gradient descent


Hey, I assume you finished reading! I would love to hear your feedback, and if you found any error or mistake in this blog post, please do not hesitate to reach out to me.