Gradient descent

The downhill-walking algorithm that trains models: nudge every parameter to shrink the error, then repeat.

Gradient descent is the basic algorithm that lets a model learn: the optimization engine underneath almost all modern machine learning. Training starts with a model that’s wrong, scores how wrong with a loss (a single number measuring error), and then asks, for every one of the model’s parameters: “which way should I nudge this, and how hard, to make the loss a little smaller?” That whole bundle of directions-and-magnitudes comes from the gradient: the slope of the loss with respect to each parameter. Take one small step downhill, opposite the slope, and the model gets slightly less wrong; repeat millions of times and the error grinds toward a bottom.
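To make the loop concrete, here is a minimal sketch in plain Python. It assumes a deliberately tiny model (a line y = w*x + b), mean squared error as the loss, and a hand-derived gradient; the function names, the toy data, and the learning rate of 0.05 are illustrative choices, not anything a real training library prescribes.

```python
def loss(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b on the data."""
    n = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


def gradient(w, b, xs, ys):
    """Slope of the loss with respect to each parameter (w and b)."""
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db


def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Start from a wrong model and repeatedly step against the gradient."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        dw, db = gradient(w, b, xs, ys)
        w -= learning_rate * dw  # nudge each parameter downhill
        b -= learning_rate * db
    return w, b


# Toy data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

start_loss = loss(0.0, 0.0, xs, ys)
w, b = gradient_descent(xs, ys)
end_loss = loss(w, b, xs, ys)
```

On this toy problem the parameters should settle near w = 2 and b = 1. A real model has millions of parameters and gets its gradient from automatic differentiation rather than a hand-written formula, but the update line is the same idea.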

The size of each step is the learning rate: too large and the model overshoots and thrashes, too small and training crawls. In practice you rarely use the whole dataset for each step: stochastic gradient descent (SGD) estimates the slope from one small batch at a time, and smarter variants like Adam adapt the step size per parameter as they go.
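Both ideas fit in a small sketch. The first half runs plain gradient descent on f(x) = x**2 with three learning rates to show overshooting, crawling, and a sensible middle. The second half is a bare-bones mini-batch SGD for a one-parameter model y = w*x. Everything here (function names, batch size, the seed, the toy data) is an assumption made for illustration, and Adam is left out to keep it short.

```python
import random


def descend(learning_rate, steps=20, x=1.0):
    """Plain gradient descent on f(x) = x**2, whose gradient is 2*x."""
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x


# Each step multiplies x by (1 - 2 * learning_rate).
too_large = descend(1.1)    # factor -1.2: flips sign and grows every step
too_small = descend(0.001)  # factor 0.998: barely moves in 20 steps
reasonable = descend(0.1)   # factor 0.8: shrinks steadily toward 0


def sgd(xs, ys, learning_rate=0.01, batch_size=4, epochs=200, seed=0):
    """Mini-batch SGD for the model y = w*x under squared error."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Slope estimated from this batch only, not the whole dataset.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * grad
    return w


# Toy, noise-free data from y = 3x (an assumption for illustration).
xs = [float(i) for i in range(1, 9)]
ys = [3.0 * x for x in xs]
w = sgd(xs, ys)
```

With these settings the large learning rate ends further from zero than it started, the small one has hardly left its starting point, and the SGD run should land close to w = 3 even though no single step ever saw all eight points.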

One thing worth keeping straight: gradient descent only ever drives the training loss down. Whether the model is genuinely learning rather than memorizing is what validation loss is there to catch.
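A hedged sketch of what that looks like in a training loop: the gradient is computed from the training points only, while the loss on a held-out validation set is merely recorded. The data, the names, and the "remember the best validation loss" bookkeeping are assumptions for illustration; a model this small will not memorize much, so the point is the bookkeeping, not a dramatic divergence between the two curves.

```python
def mse(w, b, points):
    """Mean squared error of y = w*x + b on a list of (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in points) / len(points)


def train_with_validation(train, val, learning_rate=0.02, steps=500):
    """Descend on the training loss; only observe the validation loss."""
    w, b = 0.0, 0.0
    history = []                      # (training loss, validation loss) per step
    best = (mse(w, b, val), w, b)     # best validation loss seen so far
    n = len(train)
    for _ in range(steps):
        # The gradient uses training data only.
        dw = sum(2 * (w * x + b - y) * x for x, y in train) / n
        db = sum(2 * (w * x + b - y) for x, y in train) / n
        w -= learning_rate * dw
        b -= learning_rate * db
        val_loss = mse(w, b, val)     # measured, never optimized
        history.append((mse(w, b, train), val_loss))
        if val_loss < best[0]:
            best = (val_loss, w, b)
    return history, best


# Hypothetical toy data: roughly y = 2x + 1 with small hand-written noise.
train = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8), (4.0, 9.1)]
val = [(0.5, 2.0), (1.5, 4.1), (2.5, 5.9), (3.5, 8.0)]
history, best = train_with_validation(train, val)
```

If the training column of `history` keeps falling while the validation column flattens or climbs, that gap is the warning sign the paragraph above describes.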

And the machinery doesn’t have to move a model’s weights at all: freeze the model and run the same descent over its input instead, and you can search a latent space for the point that produces what you want, which is exactly how GAN-based upscalers hunt for a face that matches a blurry photo.
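The same trick can be sketched with a toy stand-in. Below, `generate` plays the role of a frozen model that turns a 2-number latent into a 4-number "image", and `downscale` stands in for blurring. Both are made-up functions for illustration, not a real GAN or any particular upscaler's method. The gradient is approximated with finite differences to keep the example dependency-free; a real system would backpropagate through the frozen network instead.

```python
def generate(z):
    """Frozen toy 'generator': maps a 2-number latent to a 4-number image."""
    return [z[0], z[1], z[0] + z[1], z[0] - z[1]]


def downscale(image):
    """Average adjacent pairs, standing in for blurring or downsampling."""
    return [(image[0] + image[1]) / 2, (image[2] + image[3]) / 2]


def loss(z, low_res_target):
    """Squared distance between the downscaled output and the blurry target."""
    low_res = downscale(generate(z))
    return sum((a - b) ** 2 for a, b in zip(low_res, low_res_target))


def numerical_gradient(z, low_res_target, eps=1e-4):
    """Central-difference estimate of the loss slope with respect to z."""
    grad = []
    for i in range(len(z)):
        up = list(z)
        up[i] += eps
        down = list(z)
        down[i] -= eps
        grad.append((loss(up, low_res_target) - loss(down, low_res_target)) / (2 * eps))
    return grad


def search_latent(low_res_target, learning_rate=0.2, steps=500):
    """Descend over the input z; the model itself is never modified."""
    z = [0.0, 0.0]
    for _ in range(steps):
        grad = numerical_gradient(z, low_res_target)
        z = [zi - learning_rate * gi for zi, gi in zip(z, grad)]
    return z


# Pretend this blurry target came from the (unknown) latent [1.0, -2.0].
target = downscale(generate([1.0, -2.0]))
z = search_latent(target)
sharp = generate(z)  # the full-resolution output that matches the target
```

In this toy the search should recover a latent close to [1.0, -2.0], because only one latent matches the target. With a real generator many different latents can downscale to the same blurry input, so the result is one plausible match rather than the original.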