Many learning algorithms look unrelated at first.
Gradient descent comes from optimization. Monte Carlo methods come from statistical estimation. Temporal difference learning belongs to reinforcement learning. Robbins-Monro stochastic approximation sounds like numerical analysis. Kalman filtering lives closer to control and signal processing.
But when I look at the update rules side by side, I keep seeing the same mathematical habit:
In its smallest form, this becomes:
Keep the current estimate . Compare it with a target or feedback signal . Move a little in the direction of the error. Repeat.
This article is not trying to say that all learning algorithms are the same. They are not. Gradient descent optimizes a loss. Monte Carlo estimates returns. Temporal difference learning bootstraps from the current value estimate. Q-learning performs off-policy control. Kalman filtering is probabilistic state estimation.
The claim is weaker, but more useful: many learning algorithms can be understood as controlled approximation under uncertainty. They start with an imperfect estimate, receive imperfect feedback, correct cautiously, and repeat until the error starts to become structure.
The Smallest Form of Learning
Before neural networks, reinforcement learning, or intelligence, there is a much smaller operation: estimate something unknown from repeated observations.
Suppose is my current estimate and is a new observation. A basic correction rule is:
The term is the error between what I currently believe and what I just saw. The step size controls how much I trust the new information.
If is large, the estimate reacts quickly but becomes noisy. If is small, the estimate is stable but slow to adapt. This tradeoff already contains a surprising amount of learning theory.
A running mean has exactly this form:
An exponential moving average uses a fixed step size:
The simplest learning algorithm is not mysterious. It is repeated correction of an estimate.
Correction Through Error
In supervised learning, the correction usually comes from prediction error.
If a model predicts and the target is , then the error is:
The LMS rule, also known as the delta rule, makes the connection very explicit:
The parameter vector is the current estimate. The term tells us how the weights should change to reduce the prediction error for this example.
The perceptron update has the same spirit. When an example is misclassified, we update:
Again, the model uses a mistake as a direction for correction. The update happens in parameter space rather than directly in prediction space, but the outer logic remains the same: observe feedback, compute a correction, move the current estimate.
Correction Through Loss
Gradient descent generalizes error correction from scalar estimates to high-dimensional parameter spaces.
The update is:
At first this looks different from , because the correction is now a gradient. But for squared error,
the correction direction is proportional to:
Gradient descent can therefore be read as a geometric version of error correction. Instead of moving directly toward a visible target, we move in the direction that most decreases a loss function.
Stochastic gradient descent adds another layer:
The gradient is only estimated from a sample or mini-batch. That makes SGD a noisy approximation procedure. It does not see the whole objective surface at every step. It receives a partial signal, takes a cautious step, and repeats.
Stochastic Approximation
Stochastic approximation gives a more general mathematical language for this pattern.
A common form is:
where is the average direction toward the solution, is sampling noise, and is the step size.
In the Robbins-Monro setting, the goal is often to find a root:
The difficulty is that cannot be observed exactly. We only get noisy samples. So the algorithm moves as if it were following , while constantly being disturbed by noise.
The classical step-size condition is:
The first condition says the algorithm must keep moving enough to reach the target. The second says the steps must shrink enough for noise not to dominate forever.
This is one of the cleanest mathematical versions of learning as controlled approximation: move toward a solution, but do not trust any single noisy observation too much.
Correction Toward a Fixed Point
Reinforcement learning adds a new complication. The target is not always an external label. Sometimes the target depends on the current estimate.
For a fixed policy, the value function satisfies a Bellman equation:
Equivalently:
This is a fixed-point equation. Instead of fitting fixed labels, we are looking for a value function that remains unchanged after applying the Bellman operator.
That is why value-based reinforcement learning should not be treated as ordinary supervised learning. In supervised learning, the target usually lives outside the model. In value learning, the target may be bootstrapped from the model’s current belief about the future.
Dynamic Programming
Dynamic programming is the exact, model-based version of Bellman approximation.
In value iteration, the update is:
In policy evaluation, for a fixed policy :
These formulas use the model directly:
Because the transition probabilities and rewards are known, dynamic programming can apply the Bellman operator with exact expectations. It is fixed-point iteration without sampling noise.
Sampled Bellman Approximation
Reinforcement learning becomes necessary when the model is unknown or too expensive to use directly. We no longer apply the exact Bellman operator. We sample experience and build targets from it.
The outer template returns:
The main difference between many RL algorithms is how they define the target .
Monte Carlo Prediction
Monte Carlo prediction uses the full return:
Here:
The target is the sampled return after time .
Temporal Difference Prediction
TD prediction uses one real reward plus a bootstrapped estimate of the future:
Here:
Sarsa
Sarsa updates an action-value estimate using the action actually taken next:
The target is:
Q-learning
Q-learning uses the greedy next action under the current value estimate:
The target is:
The algorithms differ mainly in how they construct .
| Algorithm | Current Estimate | Target | Main Difference |
|---|---|---|---|
| Monte Carlo | uses the full sampled return | ||
| TD | bootstraps from the next state | ||
| Sarsa | uses the actual next action | ||
| Q-learning | uses the greedy next action |
The outer update shape stays stable. The semantics change through the target.
Monte Carlo vs TD
Monte Carlo and TD illustrate a basic statistical tradeoff.
Monte Carlo uses:
It waits for the actual return. This gives a target with low bias, but high variance. It also requires the episode to finish before the update can be made.
Temporal difference learning uses:
This target is biased because it uses the current value estimate, but it often has lower variance and can be used online.
One sentence version:
Monte Carlo trusts the sampled future. TD trusts its current estimate of the future.
Neither choice is universally better. They are different ways of controlling approximation error.
Sarsa vs Q-learning
Sarsa and Q-learning show that the same update shape can carry different policy meanings.
Sarsa uses:
The next action is the action actually taken by the behavior policy. This makes Sarsa on-policy.
Q-learning uses:
This target evaluates the greedy action, regardless of what the behavior policy actually did. That makes Q-learning off-policy.
So the same correction template does not imply the same algorithmic meaning. The target defines what is being learned.
Why It Can Work
The convergence story usually combines two ideas: contraction and step size.
When , many Bellman operators are contractions:
Intuitively, a contraction pulls value functions closer together. Repeated application moves estimates toward a unique fixed point.
Sampling adds noise, so step size becomes important. The update must move enough to approach the fixed point, but not so aggressively that random samples dominate the process.
This is why Bellman fixed-point theory and stochastic approximation naturally meet inside reinforcement learning.
Nearby Ideas
The same correction-and-approximation pattern appears outside reinforcement learning.
In bandit learning, the value estimate for an action can be updated by:
In a Kalman filter, the correction step is:
This is prediction plus gain times innovation.
The EM algorithm is also iterative approximation: estimate hidden structure, update parameters, and repeat. It does not fit the form as directly, but it belongs to the broader family of procedures that replace one impossible problem with a sequence of easier approximation steps.
Mirror descent generalizes gradient descent by changing the geometry in which correction happens. The idea of “move in the direction of improvement” remains, but distance and update geometry are no longer Euclidean by default.
The point is not to flatten these algorithms into one formula. The point is to notice a shared habit across statistics, optimization, control, and machine learning.
Where the Analogy Becomes Dangerous
The shared update shape is useful, but it can also mislead.
Same shape does not mean same guarantees.
Gradient descent optimizes a loss. Monte Carlo estimates an expectation. TD solves a bootstrapped fixed-point problem. Q-learning performs off-policy control. Kalman filtering performs probabilistic state estimation. EM maximizes likelihood through latent-variable approximation.
Deep reinforcement learning makes the warning sharper. Instead of updating table entries, we update a function approximator:
This introduces the deadly triad:
- bootstrapping
- off-policy learning
- function approximation
Each part is manageable in some settings. Together, they can create instability.
So the correction template is a mental model, not a proof. It helps organize formulas, but it does not erase the assumptions behind each algorithm.
Learning as Controlled Approximation
The phrase I keep returning to is:
Learning algorithms are disciplined ways of becoming less wrong.
They begin with an imperfect estimate, receive imperfect feedback, and apply cautious correction. Sometimes the correction is an error. Sometimes it is a gradient. Sometimes it is a sampled return. Sometimes it is a bootstrapped Bellman target. Sometimes it is an innovation term in a state estimator.
| Lens | What Learning Looks Like |
|---|---|
| Statistics | estimating an unknown quantity from samples |
| Optimization | moving parameters along an objective surface |
| Numerical analysis | iterating toward a fixed point |
| Reinforcement learning | applying sampled Bellman backups |
| Control theory | correcting predictions using observed innovation |
| Stochastic approximation | moving toward a solution under noisy feedback |
Learning is not magic. It is approximation under uncertainty, repeated carefully enough that error gradually becomes structure.
Discussion