Calculus Behind Gradient Descent

2022-02-12 · 6 min read

Derivative and Partial Derivative Intuition


We’ve always heard in college lectures that the derivative gives us the rate of change, or the slope of a function, or rise over run. These all mean the same thing: they tell us how fast or slow the output changes when the input changes.

Derivative of Function of Single Variable


A straight line with the equation y = 2x has a constant slope of 2, which means that at any point on the line, changing the value of x by some amount changes y by twice that amount. This is a rather simple example, so let's dive into a more complicated one. Consider the equation of a parabola, y = x^2, whose plot looks like:

Parabola: y = x ^ 2

Now, while you continue reading, imagine you are standing at the base, or the minima, of the parabola at x = 0, where the derivative dy/dx = 0. This means that even if you step a little bit towards the right or left, i.e. in the x direction, you would not have moved much in the y direction. Notice that I wrote not moved much and not zero, even though the change in y given by the derivative dy/dx = 2x at x = 0 must be zero. This is because the definition of the derivative from first principles is the limiting value of the rise, f(x + h) - f(x), over the run, h, as h → 0.

\dfrac{dy}{dx} = \lim_{h \to 0} \dfrac{f(x + h) - f(x)}{h}

To be more clear, say you are still at x = 0 and you move a little bit towards the right by an amount 0.01, so the change in x is 0.01. The change in y is f(0 + 0.01) - f(0) = 0.0001 - 0 = 0.0001. Hence the derivative, rise over run, is 0.0001/0.01 = 0.01.

Now, as we decrease h, i.e. the change in x, the value of the derivative moves closer and closer to 0. Let's take another example with a smaller value, h = 0.0001: the change in y is f(0 + 0.0001) - f(0) = 0.00000001, so rise over run is 0.00000001/0.0001 = 0.0001. Thus we see that the value of rise/run gets closer and closer to 0 as we decrease the change in x. So we have to understand that the derivative is the limiting value of the ratio rise/run as we decrease the run, or change in x.
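We can watch this limit play out numerically. Below is a small illustrative sketch (not from the original post) that estimates rise over run for y = x^2 at x = 0 with shrinking values of h; the estimate shrinks towards the true derivative, 0.

```python
# Finite-difference estimate of dy/dx for y = x^2 at x = 0.
# As h shrinks, rise/run gets closer and closer to the true derivative, 0.
def f(x):
    return x ** 2

for h in [0.01, 0.0001, 0.000001]:
    rise = f(0 + h) - f(0)
    print(h, rise / h)  # rise/run equals h itself for this function at x = 0
```

Each printed estimate is exactly h here, since f(0 + h) - f(0) = h², which is why the ratio heads to 0 as h does.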

Now, let's assume that you climbed up the curve and reached the point x = 2. The curve is a little steeper at this point than it was at x = 0 and has a slope of 4, which means if you take a small step from here, say h = 0.001, then you rise by an amount of roughly 4 × 0.001 = 0.004. As you move up, the steepness increases, and even a small step in the x direction causes you to rise by a significant amount in the y direction.

Derivative of Function of Multiple Variables


Let's jump to functions of multiple variables, and since we were walking on a parabola before, I thought it would be interesting to walk on a paraboloid now. Below we can see the shape of a paraboloid with equation f(x, y) = x^2 + y^2.

Plot of: f(x, y) = x^2 + y^2

Now, how do we calculate the derivative of multivariable functions?

Here, partial derivatives come to our rescue. For a function of multiple variables, we differentiate the function with respect to each of the variables, pretending all the other variables are constant. So for a function of n variables, we get n partial derivatives. Hence, for the paraboloid, which is a function of 2 variables, we get 2 partial derivatives: one with respect to x and the other with respect to y.

\dfrac{\partial}{\partial x} (x^2 + y^2) = 2x \hspace{1cm} \dfrac{\partial}{\partial y} (x^2 + y^2) = 2y

Understanding Partial Derivatives


Khan Academy has brilliant content on this topic here.

Understand that the derivative of a function of a single variable is the building block of the derivative of a function of multiple variables. From first principles, the partial derivatives can be defined as:

\dfrac{\partial}{\partial x} f(x, y) = \lim_{\Delta x \to 0} \dfrac{f(x + \Delta x, y) - f(x, y)}{\Delta x}

\dfrac{\partial}{\partial y} f(x, y) = \lim_{\Delta y \to 0} \dfrac{f(x, y + \Delta y) - f(x, y)}{\Delta y}
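These first-principles definitions translate directly into code. Here is a minimal sketch (my addition, not from the post) that estimates both partial derivatives of f(x, y) = x^2 + y^2 by nudging one variable while holding the other fixed; `partial_x` and `partial_y` are hypothetical helper names.

```python
# Numerical partial derivatives of f(x, y) = x^2 + y^2 from first principles:
# nudge one variable by a small amount, hold the other one fixed.
def f(x, y):
    return x ** 2 + y ** 2

def partial_x(f, x, y, dx=1e-6):
    return (f(x + dx, y) - f(x, y)) / dx

def partial_y(f, x, y, dy=1e-6):
    return (f(x, y + dy) - f(x, y)) / dy

# Analytically, the partials are 2x and 2y; at (1, 2) that is 2 and 4.
print(partial_x(f, 1, 2), partial_y(f, 1, 2))
```

The printed estimates sit within a rounding error of the analytical values 2x = 2 and 2y = 4.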

To understand partial derivatives graphically, when we say pretend y is a constant while differentiating the function with respect to x, imagine slicing the plot of the function with a plane that is parallel to the x-axis and perpendicular to the y-axis, so that for every point on the plane, y has the same constant value. Now, increase your visualizing capacity and imagine what we would see on the cross-section when this plane slices the paraboloid 🤔. We would see a parabola. If the plane passes through y = 0, then every point on that parabola has y = 0. If we instead sliced the paraboloid with a plane parallel to the x-axis intersecting the y-axis at y = 6, then every point on the sliced parabola would be of the form (x, 6). So the slope of this parabola depends only on the changing values of x, and even if the curve had a mind of its own and wanted its slope to depend on y, it never could, because the sliced parabola is stuck with only one value of y.

Now, the derivative of the function with respect to all the variables is:

\dfrac{df(x, y)}{d(x, y)} = \dfrac{\partial f}{\partial x} + \dfrac{\partial f}{\partial y}

or more formally, the change in function with small changes in the variables is:

df(x, y) = \dfrac{\partial f}{\partial x} dx + \dfrac{\partial f}{\partial y} dy

For the paraboloid above, at the point (1, 2), the value of the function is f(1, 2) = 1^2 + 2^2 = 5. For small changes dx = 0.001, dy = 0.001, the change in the function is df = f(1.001, 2.001) - f(1, 2) = 5.006002 - 5 = 0.006002. Now, the change predicted by the partial derivatives is df = 2x dx + 2y dy = 2*1*0.001 + 2*2*0.001 = 0.006 ≈ 0.006002.
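We can verify this worked example directly. The sketch below (an illustrative addition) compares the actual change in the function against the total-differential approximation at (1, 2):

```python
# Check the total-differential approximation df ≈ 2x dx + 2y dy
# for f(x, y) = x^2 + y^2 at the point (1, 2) with dx = dy = 0.001.
def f(x, y):
    return x ** 2 + y ** 2

x, y, dx, dy = 1, 2, 0.001, 0.001
actual_change = f(x + dx, y + dy) - f(x, y)   # 0.006002
approx_change = 2 * x * dx + 2 * y * dy       # 0.006
print(actual_change, approx_change)
```

The tiny leftover, 0.000002, is the dx² + dy² term that the linear approximation ignores; it vanishes faster than dx and dy as the steps shrink.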

Gradient of a function


I won’t be going into much depth about directional derivatives and vector fields, but those who want to understand them in a more intuitive manner can check out this → Directional Derivative on Khan Academy.

As Grant Sanderson says in this video on Khan Academy:

Gradient is a way of packing all the partial derivative information of a function.

We use nabla, the upside-down delta symbol ∇, to represent the gradient operator. The gradient, when applied to a function f, outputs a vector of the partial derivatives of the function with respect to all its variables. Let's work with a function of 2 variables for now.

\nabla f(x, y) = \begin{bmatrix} \dfrac{\partial}{\partial x} f(x, y) \\ \dfrac{\partial}{\partial y} f(x, y) \end{bmatrix}

For a function of n variables, the gradient outputs a vector of n dimensions. We know we can imagine a vector as an arrow pointing in a specific direction, with its length giving us its magnitude, but where does the vector given by the gradient point?

This is where things get interesting: the gradient vector points in the direction of the maximum change in the function from the point (x, y) where the gradient was calculated, i.e. the direction of steepest ascent. Let me help you imagine this.

Consider the following function: f(x, y) = x^2 - y^2.

Plot of: f(x, y) = x^2 - y^2

The gradient of this function is:

\nabla f(x, y) = \begin{bmatrix} 2x \\ -2y \end{bmatrix}

The gradient of the function at (x, y) = (1, 2) is:

\nabla f(1, 2) = \begin{bmatrix} 2 \\ -4 \end{bmatrix}

Below we can see a contour plot of the above function and the gradient vector at the point (x, y) = (1, 2). For a function of 2 variables, a contour plot shows the cross-sections of the function when sliced by planes parallel to the x-y plane; every point (x, y) on a specific contour line gives us the same value of the function. For better intuition and further understanding, this video is recommended.

Contour plot showing the gradient vector at (1, 2)

So we can say that each element in the gradient vector tells us the direction and distance to move along the axis of that component. For the above example, the value of the gradient vector at (1, 2) was \begin{bmatrix} 2 \\ -4 \end{bmatrix}, so from the point (1, 2), we move 2 units in the positive x direction and 4 units in the negative y direction to ascend by the maximum amount.
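As a quick sanity check, here is an illustrative sketch (not part of the original post) that estimates the gradient of f(x, y) = x^2 - y^2 at (1, 2) with finite differences; `gradient` is a hypothetical helper name.

```python
# Numerical gradient of f(x, y) = x^2 - y^2 at (1, 2);
# analytically it is [2x, -2y] = [2, -4].
def f(x, y):
    return x ** 2 - y ** 2

def gradient(f, x, y, h=1e-6):
    gx = (f(x + h, y) - f(x, y)) / h   # partial with respect to x
    gy = (f(x, y + h) - f(x, y)) / h   # partial with respect to y
    return gx, gy

print(gradient(f, 1, 2))
```

The two components come out within rounding error of 2 and -4, matching the analytical gradient above.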

If you are still confused by all this, consider a function of a single variable: the parabola y = x^2, which has derivative f'(x) = 2x. For positive and negative values of x, calculate the value of the derivative and notice in which direction along the x-axis the gradient would move you to increase the function.

Vector Fields


We know that the gradient is a vector of the partial derivatives, and when we draw the downscaled gradient vector for every point on the function, we get a vector field where each vector points in the direction of steepest ascent. Below, we can see the gradient vector field for the function f(x, y) = xy.

Gradient Vector Field for f(x,y) = xy

The above image is clipped from here
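To compute the arrows in such a field ourselves, we only need the gradient at each grid point. A minimal sketch (my addition), using the fact that for f(x, y) = xy the gradient is (y, x):

```python
# Gradient vector field for f(x, y) = x*y: the partials are
# df/dx = y and df/dy = x, so the arrow at each point is (y, x).
def grad_xy(x, y):
    return (y, x)

# Gradient arrows over a small integer grid, keyed by grid point.
field = {(x, y): grad_xy(x, y) for x in range(-2, 3) for y in range(-2, 3)}
print(field[(1, 2)])   # (2, 1)
```

In a real plot these arrows would be downscaled so neighbouring vectors don't overlap, but their directions would match the image above.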

Directional Derivative


Suppose you are climbing a hill. At a certain arbitrary point on the hill, there are many directions you could take to climb, and for every different direction you take, you ascend by a different amount. We can calculate this amount of change along any direction using the directional derivative, but among all these directions, there is one specific direction that gives us the maximum change. As already stated, this is the direction pointed out by the gradient vector. The directional derivative is calculated as the dot product of the gradient and a unit direction vector. To understand why the gradient actually points in the direction of maximum change, or steepest ascent, this content from Khan Academy is recommended.

Applying it all to ML


In ML, we use an optimization objective which we call the cost function. The cost function is a function of the learning parameters θ, or the weights. A simple cost function used to calculate the cost of a model's predictions for a value of θ is the mean squared error cost function:

J(\theta) = \dfrac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2

We want the cost we pay for our predictions, i.e. the measure of the difference between our predicted values and the actual values, to be as small as possible. To minimize this cost function, we can use the same theory of the gradient, which gives us the direction of maximum ascent, but instead of ascending we go in the opposite direction and descend towards the minima of the function. So we iterate a number of times; on each iteration we find the gradient, descend, and update θ with a new set of values until the cost converges to an optimum value, the ideal goal being to reach the minima where the cost is zero. Gradient descent is applied as:

\theta_{k} := \theta_{k} - \alpha \dfrac{\partial}{\partial \theta_{k}} J(\theta)

We know the gradient gives a vector pointing in the direction of the maximum increase of the function, so each component of the vector must also point in the direction of maximum change along its axis. The cost function is a function of the parameters θ, so in gradient descent, for every parameter θ_k, consider the locus of the cost function along the axis of θ_k; for a paraboloid cost function, it is a parabola.

Parabola: J = theta^2

The derivative is 2θ, so for positive θ the derivative tells us to increase θ to increase J, and for negative θ it tells us to decrease θ to increase J. But what we want is to decrease J, so notice the "-" sign in the implementation of gradient descent above: it means move in the direction opposite to the gradient, with the value α scaling the magnitude of the movement along that direction. Since we scale our step rather than moving by exactly the amount of the gradient, we do not move in a smooth straight line towards the minima but take zig-zag steps, bigger than the gradient itself, which can be seen on the contour plot below (considering that the cost function is a paraboloid).
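Putting the update rule into code makes the sign flip concrete. Here is a minimal sketch (an illustrative addition, with an arbitrary starting point and learning rate) of gradient descent on the one-parameter cost J(θ) = θ², whose derivative is 2θ:

```python
# Minimal gradient descent on J(theta) = theta^2.
# The update theta := theta - alpha * dJ/dtheta walks against the gradient.
def dJ(theta):
    return 2 * theta

theta = 3.0     # arbitrary starting point on the parabola
alpha = 0.1     # learning rate, scales the step size
for _ in range(100):
    theta = theta - alpha * dJ(theta)

print(theta)    # very close to 0, the minimum of J
```

Each iteration multiplies θ by (1 - 2α) = 0.8, so the parameter shrinks geometrically towards the minimum at θ = 0; a too-large α would instead overshoot and diverge.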

Descending to minima with gradient descent


Hey, I assume you finished reading! I would love to hear your feedback, and if you found any error or mistake in this blog post, please do not hesitate to reach out to me.