Intuition Behind Gradient Descent

Gradient Descent (GD) is an optimization algorithm. The algorithm is simple:

1. Get a starting $x^0$
2. Do the update $x^{k+1} = x^k - t \nabla f(x^k)$

There are certain keywords or phrases which you would have seen or heard, but not understood, such as:

• “Moving down the hill”
• “direction of steepest descent”
• “direction of lowest energy”
• your instructor would draw a ‘U-shaped’ curve and move his shiny marker down one side to the bottom…

I’ll try to demystify all of these

Before you understand GD, you must understand what smooth convexity gives you. Because understanding smooth convexity tells you what GD is actually doing and how it was derived.

Smooth: A function is smooth, if it’s gradient and hessian exists at all points in the domain of the function. If the gradient and hessian exists at all points of the function, then we can generate a second order taylor approximation at each point of the function. Intuitively, we can draw a quadratic (‘U-shaped’ curve) at all points of the function. The equation for this quadratic is the well – known taylor approximation: $f(y) = f(x^k) + (y - x)^T \nabla f(x^k) + \frac{1}{2t}||y - x^k||^2$

Refer to the figure below: Convex: Now that each point of the function has it’s own ‘U-shaped curve’, we further say that the entire U curve lies above the original function. So how does this help us?

Well, think about this. You know from calculus that minimizing a quadratic is trivial. So if you take a point and its respective U curve (or quadratic approximation) and minimize that, then from convexity, that minimum will lie above your original function. The value at which the quadratic or U-Curve is minimized is the next iterate. Because the U-curve or quadratic is bound to exist (Thank you Smoothness) and the value of the function at the minimizer is ALWAYS lower than the minimum of the quadratic (Thank you Convexity).

So in GD, at each update or current iteration of $x^k$, what you’re essentially doing is

1. Building the quadratic (or U-curve) for that current iterate: $f(y) = f(x^k) + (y - x)^T \nabla f(x^k) + \frac{1}{2t}||y - x^k||^2$
2. Finding the point that minimizes the quadratic (U-curve) $\frac{\partial f(y)}{\partial y} = 0 \implies \nabla f(x^k) + \frac{1}{t}(y - x^k) = 0 \implies y = x^k - t \nabla f(x^k)$

This $y$ is your $x^{k+1}$.