How gradient descent works, and why it sits underneath every modern AI system

The algorithm that powers every AI tool you have used this year is older than the personal computer, the internet and the mobile phone. It was first described by the French mathematician Augustin-Louis Cauchy in 1847.

It is called gradient descent. And once you understand it, a lot of the magic around large language models stops being mysterious and starts being interesting.

Here is the picture to hold in your head. Imagine you are standing somewhere on a mountainous landscape, blindfolded, and your job is to find the lowest point in the valley. You cannot see anything. But you can feel the slope of the ground under your feet. So you take a small step in whichever direction feels like it is going downhill the most. Then you stop, feel the slope again, take another step. And another. Eventually, you arrive at the bottom of some valley.

That is gradient descent. The slope you feel under your feet is what mathematicians call the gradient. Each step takes you a little further down. Keep going long enough, and you converge on a low point.
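
For readers who like to see the walk written down, here is a minimal sketch in Python. The landscape, the starting point, the step size and the number of steps are all made up for illustration; a real problem would supply its own.

```python
def height(x, y):
    """A hypothetical bowl-shaped landscape whose lowest point is at (1, -2)."""
    return (x - 1) ** 2 + 2 * (y + 2) ** 2


def slope(x, y):
    """The gradient of height: how steeply the ground rises along x and along y."""
    return 2 * (x - 1), 4 * (y + 2)


def descend(x, y, step_size=0.1, steps=100):
    """Feel the slope, take a small step against it, and repeat."""
    for _ in range(steps):
        dx, dy = slope(x, y)
        x -= step_size * dx  # stepping against the gradient means going downhill
        y -= step_size * dy
    return x, y


start = (4.0, 3.0)  # an arbitrary place to be dropped, blindfolded
x, y = descend(*start)
print("ended near", round(x, 3), round(y, 3), "at height", round(height(x, y), 6))
```

The walker never looks at the whole landscape. It only ever asks for the slope at the spot where it is standing, which is exactly the blindfold in the picture above.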

The thing that makes this useful, and the reason it sits underneath every modern AI system, is that the landscape can represent almost any problem you care to solve. You just need a way to measure how wrong your answer is. That measure is called a loss function. A high point in the landscape means your answer is badly wrong. A low point means your answer is closer to correct. Gradient descent walks you, step by step, towards being less wrong.
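
To make the idea of a loss function concrete, here is a small sketch, again in Python. It assumes a toy problem that is not from any real system: four invented data points that follow y = 2x, a one-number model that guesses y = w * x, and mean squared error as the measure of wrongness. The learning rate and step count are arbitrary choices.

```python
# Invented data that follows y = 2 * x, so the "right answer" is w = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def loss(w):
    """Mean squared error: the average squared gap between guess and truth."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def loss_gradient(w):
    """Derivative of the loss with respect to w: which way is uphill, and how steep."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.0               # start with a badly wrong guess
history = [loss(w)]   # keep a record of how wrong we were at each step
for _ in range(200):
    w -= 0.01 * loss_gradient(w)  # small step downhill on the loss landscape
    history.append(loss(w))

print("learned w:", round(w, 4))
print("loss went from", history[0], "to", round(history[-1], 8))
```

The height of the landscape here is simply the value of loss(w). Swap in a different loss function and the same loop walks downhill on a different problem.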

What does this look like for a large language model?

Imagine you have a small language model and you want to train it on a single sentence. The model sees the words “the cat sat on the” and has to predict the next word. The correct answer is “mat”. The model, untrained, might predict “ceiling” with high confidence. The loss function says, in effect, you got that wrong, here is by how much. Gradient descent then says, here is how to adjust the millions of numbers inside the model very slightly so that next time it sees these same words, “mat” becomes a bit more likely and “ceiling” becomes a bit less likely.
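
The sketch below shows one such adjustment in miniature. It is heavily simplified and the details are assumptions for illustration: a three-word vocabulary, invented starting scores that favour “ceiling”, cross-entropy as the loss, and a learning rate of 1.0. A real model would adjust the internal parameters that produce these scores; here the scores are nudged directly to keep the example short.

```python
import math

vocabulary = ["mat", "ceiling", "sofa"]   # a toy vocabulary, not a real one
target = vocabulary.index("mat")          # the correct next word


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    biggest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Loss is large when the correct word was given a low probability."""
    return -math.log(probs[target_index])


scores = [0.0, 3.0, 0.5]  # invented scores: the untrained model prefers "ceiling"
before = softmax(scores)

# For softmax followed by cross-entropy, the gradient with respect to each
# score is (predicted probability - 1) for the correct word and
# (predicted probability) for every other word.
gradient = [p - (1.0 if i == target else 0.0) for i, p in enumerate(before)]

learning_rate = 1.0  # an arbitrary choice for this sketch
scores = [s - learning_rate * g for s, g in zip(scores, gradient)]
after = softmax(scores)

for word, b, a in zip(vocabulary, before, after):
    print(f"{word}: {b:.3f} -> {a:.3f}")
```

After the single step, “mat” has gained probability and “ceiling” has lost some, which is the whole of what one training update is trying to achieve.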

Now do that for trillions of words. That is essentially how Claude, GPT, Gemini and every other modern language model are trained.

The mechanism by which the gradient gets calculated through a neural network is called backpropagation. It was published in its modern form in 1986 by David Rumelhart, Geoffrey Hinton and Ronald Williams, and Hinton went on to share the 2018 Turing Award for his decades of work on deep learning, of which this paper is a foundational piece. The maths is not the point here. The point is what backpropagation does. It takes the final error, the difference between the model’s answer and the correct answer, and pushes that error back through every layer of the network, working out how much each individual parameter contributed to the mistake. Gradient descent then nudges every parameter slightly in the direction that reduces the error.
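
For the curious, here is backpropagation on about the smallest network that still has two layers. Everything specific in it is an assumption made for illustration: one input, one hidden unit with a tanh activation, one output, squared error as the loss, and invented starting values. The numerical check near the end is only there to confirm that the blame assigned by the chain rule matches what you get by wiggling a parameter and watching the loss.

```python
import math


def forward(w1, w2, x):
    """Layer 1: hidden = tanh(w1 * x). Layer 2: prediction = w2 * hidden."""
    hidden = math.tanh(w1 * x)
    prediction = w2 * hidden
    return hidden, prediction


def loss_fn(w1, w2, x, y):
    """Squared difference between the network's answer and the correct answer."""
    _, prediction = forward(w1, w2, x)
    return (prediction - y) ** 2


def backprop(w1, w2, x, y):
    """Push the error backwards, layer by layer, using the chain rule."""
    hidden, prediction = forward(w1, w2, x)
    d_prediction = 2 * (prediction - y)        # how the loss changes with the output
    d_w2 = d_prediction * hidden               # blame assigned to the second layer
    d_hidden = d_prediction * w2               # error passed back to the hidden unit
    d_w1 = d_hidden * (1 - hidden ** 2) * x    # blame assigned to the first layer
    return d_w1, d_w2


x, y = 1.5, 0.5       # one invented training example
w1, w2 = 0.8, -0.4    # invented starting parameters
loss_before = loss_fn(w1, w2, x, y)
d_w1, d_w2 = backprop(w1, w2, x, y)

# Sanity check: estimate the same slope for w1 by nudging it up and down a little.
eps = 1e-6
numeric_d_w1 = (loss_fn(w1 + eps, w2, x, y) - loss_fn(w1 - eps, w2, x, y)) / (2 * eps)

# One gradient descent step using the blame that backpropagation assigned.
w1 -= 0.1 * d_w1
w2 -= 0.1 * d_w2
loss_after = loss_fn(w1, w2, x, y)

print("backprop slope for w1:", round(d_w1, 6), "numerical estimate:", round(numeric_d_w1, 6))
print("loss before:", round(loss_before, 4), "loss after one step:", round(loss_after, 4))
```

A real network does the same bookkeeping for billions of parameters at once, and in practice a library computes these derivatives automatically rather than anyone writing them out by hand.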

That is it. That is the entire learning algorithm. A loss function to measure wrongness. A gradient to indicate which way is downhill. Backpropagation to assign blame to every part of the model. Repeat across trillions of words on a data centre full of specialist chips, and you eventually have something that can write a love letter, summarise a contract, or generate a passable cover letter for a junior broker role.

Two things are worth sitting with.

The first is the elegance. The most powerful AI systems humanity has ever built run on a procedure that a Victorian mathematician would recognise. The deep learning revolution did not come from a new theoretical insight. It came from realising that if you applied a very old algorithm to enormous models, trained on enormous datasets, using enormous compute, something extraordinary happens. Scale, more than novelty, has been the secret.

The second is the honest limit. Gradient descent does not understand anything. It finds patterns that reduce prediction error. When a language model writes you a paragraph that feels insightful, what has actually happened is that, over months of training on a vast scrape of the internet, the system has discovered that certain combinations of words tend to follow certain other combinations of words in ways that minimise loss. The output looks like reasoning because reasoning is part of what humans wrote on the internet. The system is reflecting that pattern back at you.

None of this is intended as criticism. Powerful pattern matching is still extraordinarily useful, and the line between very sophisticated pattern matching and what we call understanding may turn out to be thinner than our intuitions suggest. The mechanism is just worth holding clearly in mind, particularly for anyone using these tools in a regulated environment, in legal or financial advice for instance. The model settles at a low point on a landscape it learned from text. Whether you want to call that thinking is a question of philosophy, not a question of mechanism.

There is something quietly philosophical about all of this. The smartest machines we have ever built work by stumbling, step by tiny step, towards being less wrong. They have no map. They have no destination in mind. They just know which way is downhill.

You could probably build a whole worldview out of that.

Sources: Cauchy, Méthode générale pour la résolution des systèmes d’équations simultanées (1847). Rumelhart, Hinton and Williams, Learning representations by back-propagating errors (Nature, 1986). Geoffrey Hinton’s Turing Award lecture is hosted at amturing.acm.org.
