What is stochastic gradient descent?

Length: 3 min
Published: June 9, 2026

Stochastic gradient descent (SGD) is a variant of gradient descent, the algorithm models use to learn. Plain gradient descent measures the error across the entire dataset before taking a single step. With millions of examples, that is far too slow. SGD instead estimates the direction from a small random sample of the data, takes a step, and repeats. "Stochastic" simply means random: each step is based on a different random slice.

Strictly speaking, SGD uses one random example per step. In practice almost everyone uses mini-batch SGD, which uses a small random batch (say 32 or 256 examples) per step instead. It is the default way modern neural networks and large language models are trained.
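
To make the loop concrete, here is a minimal sketch of mini-batch SGD fitting a straight-line model with NumPy. The function name `minibatch_sgd`, the synthetic data, and the chosen learning rate and batch size are all assumptions made for illustration, not a reference implementation.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=32, lr=0.05, epochs=20, seed=0):
    """Fit linear weights w by minimising mean squared error with mini-batch SGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)              # fresh random order every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of the mean squared error, computed on this batch only.
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)
            w -= lr * grad                      # take one small step downhill
    return w

# Synthetic data (hypothetical): y depends linearly on X, plus a little noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=2000)

w = minibatch_sgd(X, y)
print("learned:", np.round(w, 2))
print("target: ", true_w)
```

Each step reads only 32 rows, never the full dataset, yet the learned weights should land close to `true_w`. Setting `batch_size=1` gives the strict one-example form, and setting it to `len(y)` turns the loop back into plain gradient descent.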

In plain words

Finding the lowest valley in a country by surveying every slope first would take forever. Instead you ask a few people nearby which way is downhill, take a step, and ask again. Each answer is a little noisy, but you keep moving and usually still end up near the bottom. SGD trades perfect information for speed, and the speed usually wins.

Why it matters

It makes large-scale training possible. Without it, training on internet-scale data would be impossibly slow, because each step would need to read everything.
The noise can help. Because each step uses a different sample, the path is jittery. That randomness can bump the model out of shallow bad spots a smoother method would get stuck in.
Batch size is a real lever. Small batches mean noisier, faster steps; large batches mean smoother, slower ones. The choice affects both speed and final quality.
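
The link between batch size and noise can be measured directly. The sketch below, which assumes a made-up linear regression problem and a hypothetical helper `gradient_noise`, compares the gradient from random batches of different sizes against the gradient computed on the full dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(size=10_000)
w = np.zeros(5)  # an arbitrary point at which to compare gradients

def mse_grad(Xb, yb, w):
    """Gradient of the mean squared error on the given rows."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = mse_grad(X, y, w)  # what plain gradient descent would use

def gradient_noise(batch_size, trials=500):
    """Average distance between a random batch's gradient and the full gradient."""
    errors = []
    for _ in range(trials):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        errors.append(np.linalg.norm(mse_grad(X[idx], y[idx], w) - full_grad))
    return float(np.mean(errors))

noise = {b: gradient_noise(b) for b in (1, 32, 256)}
for b, err in noise.items():
    print(f"batch size {b:>3}: mean gradient error {err:.3f}")
```

The error should shrink as the batch grows, but each larger batch also costs more computation per step. That trade-off is the lever described above.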

Common pitfalls

Learning rate set wrong. As with all gradient descent, too large a step diverges and too small a step crawls. With SGD the noise makes this even more sensitive.
Batch size as an afterthought. Picking it arbitrarily wastes hardware or hurts results. Batch size interacts with the learning rate, so the two should be tuned together.
Expecting a smooth curve. SGD's loss bounces around on the way down. That is normal, not a bug. Judge progress over many steps, not one.
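
Two of these pitfalls can be seen in a small experiment. The sketch below is an illustration under assumed settings: the synthetic data, the three learning rates, and the helpers `batch_losses` and `tail_average` are all hypothetical. It runs the same mini-batch SGD loop with a learning rate that is too small, one that is reasonable, and one that is too large, then judges each run by an average over its last 20 steps rather than by any single step.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=5000)

def batch_losses(lr, steps=100, batch_size=32, seed=0):
    """Run mini-batch SGD and return the loss measured on each step's batch."""
    step_rng = np.random.default_rng(seed)
    w = np.zeros(3)
    losses = []
    for _ in range(steps):
        idx = step_rng.choice(len(y), size=batch_size, replace=False)
        residual = X[idx] @ w - y[idx]
        losses.append(float(np.mean(residual ** 2)))   # loss before the update
        grad = 2.0 * X[idx].T @ residual / batch_size
        w -= lr * grad
    return np.array(losses)

def tail_average(losses, window=20):
    """Average the last few steps so one noisy batch does not mislead."""
    return float(np.mean(losses[-window:]))

results = {lr: batch_losses(lr) for lr in (0.001, 0.05, 1.5)}
for lr, losses in results.items():
    print(f"lr={lr}: average loss over last 20 steps = {tail_average(losses):.3g}")

# Count how often the per-step loss went up in the well-behaved run.
rises = int(np.sum(np.diff(results[0.05]) > 0))
print(f"lr=0.05: loss rose on {rises} of {len(results[0.05]) - 1} steps")
```

On this toy problem the smallest rate should barely move the loss, the middle one should settle near the noise floor, and the largest should blow up. The middle run should also show the loss rising on some individual steps, which is why progress is better read from the averaged curve.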

What is gradient descent? - The base algorithm SGD speeds up, explained from the ground.
What is a neural network? - The structure SGD is most often used to train.
What is overfitting? - The failure mode training has to guard against while it learns.
