Many learning algorithms look unrelated at first.

Gradient descent comes from optimization. Monte Carlo methods come from statistical estimation. Temporal difference learning belongs to reinforcement learning. Robbins-Monro stochastic approximation sounds like numerical analysis. Kalman filtering lives closer to control and signal processing.

But when I look at the update rules side by side, I keep seeing the same mathematical habit:

new estimate=old estimate+step size×correction.\begin{aligned} \text{new estimate} &= \text{old estimate} \\ &\quad + \text{step size}\times\text{correction}. \end{aligned}

In its smallest form, this becomes:

XX+α(YX).X \leftarrow X+\alpha(Y-X).

Keep the current estimate XX. Compare it with a target or feedback signal YY. Move a little in the direction of the error. Repeat.

This article is not trying to say that all learning algorithms are the same. They are not. Gradient descent optimizes a loss. Monte Carlo estimates returns. Temporal difference learning bootstraps from the current value estimate. Q-learning performs off-policy control. Kalman filtering is probabilistic state estimation.

The claim is weaker, but more useful: many learning algorithms can be understood as controlled approximation under uncertainty. They start with an imperfect estimate, receive imperfect feedback, correct cautiously, and repeat until the error starts to become structure.

The Smallest Form of Learning

Before neural networks, reinforcement learning, or intelligence, there is a much smaller operation: estimate something unknown from repeated observations.

Suppose XX is my current estimate and YY is a new observation. A basic correction rule is:

XX+α(YX).X \leftarrow X+\alpha(Y-X).

The term YXY-X is the error between what I currently believe and what I just saw. The step size α\alpha controls how much I trust the new information.

If α\alpha is large, the estimate reacts quickly but becomes noisy. If α\alpha is small, the estimate is stable but slow to adapt. This tradeoff already contains a surprising amount of learning theory.

A running mean has exactly this form:

μn+1=μn+1n+1(xn+1μn).\mu_{n+1}=\mu_n+\frac{1}{n+1}(x_{n+1}-\mu_n).

An exponential moving average uses a fixed step size:

mt+1=mt+α(xtmt).m_{t+1}=m_t+\alpha(x_t-m_t).

The simplest learning algorithm is not mysterious. It is repeated correction of an estimate.

Correction Through Error

In supervised learning, the correction usually comes from prediction error.

If a model predicts y^\hat{y} and the target is yy, then the error is:

yy^.y-\hat{y}.

The LMS rule, also known as the delta rule, makes the connection very explicit:

wt+1=wt+α(yy^)x.w_{t+1}=w_t+\alpha(y-\hat{y})x.

The parameter vector ww is the current estimate. The term (yy^)x(y-\hat{y})x tells us how the weights should change to reduce the prediction error for this example.

The perceptron update has the same spirit. When an example is misclassified, we update:

wt+1=wt+αyx.w_{t+1}=w_t+\alpha yx.

Again, the model uses a mistake as a direction for correction. The update happens in parameter space rather than directly in prediction space, but the outer logic remains the same: observe feedback, compute a correction, move the current estimate.

Correction Through Loss

Gradient descent generalizes error correction from scalar estimates to high-dimensional parameter spaces.

The update is:

θt+1=θtαθL(θt).\theta_{t+1}=\theta_t-\alpha\nabla_\theta L(\theta_t).

At first this looks different from X+α(YX)X+\alpha(Y-X), because the correction is now a gradient. But for squared error,

L=12(YX)2,L=\frac{1}{2}(Y-X)^2,

the correction direction is proportional to:

YX.Y-X.

Gradient descent can therefore be read as a geometric version of error correction. Instead of moving directly toward a visible target, we move in the direction that most decreases a loss function.

Stochastic gradient descent adds another layer:

θt+1=θtαθL^(θt).\theta_{t+1}=\theta_t-\alpha \widehat{\nabla_\theta L}(\theta_t).

The gradient is only estimated from a sample or mini-batch. That makes SGD a noisy approximation procedure. It does not see the whole objective surface at every step. It receives a partial signal, takes a cautious step, and repeats.

Stochastic Approximation

Stochastic approximation gives a more general mathematical language for this pattern.

A common form is:

xt+1=xt+αt(h(xt)+Mt+1),x_{t+1}=x_t+\alpha_t(h(x_t)+M_{t+1}),

where h(xt)h(x_t) is the average direction toward the solution, Mt+1M_{t+1} is sampling noise, and αt\alpha_t is the step size.

In the Robbins-Monro setting, the goal is often to find a root:

h(x)=0.h(x^*)=0.

The difficulty is that h(x)h(x) cannot be observed exactly. We only get noisy samples. So the algorithm moves as if it were following hh, while constantly being disturbed by noise.

The classical step-size condition is:

tαt=,tαt2<.\sum_t \alpha_t = \infty,\quad \sum_t \alpha_t^2 < \infty.

The first condition says the algorithm must keep moving enough to reach the target. The second says the steps must shrink enough for noise not to dominate forever.

This is one of the cleanest mathematical versions of learning as controlled approximation: move toward a solution, but do not trust any single noisy observation too much.

Correction Toward a Fixed Point

Reinforcement learning adds a new complication. The target is not always an external label. Sometimes the target depends on the current estimate.

For a fixed policy, the value function satisfies a Bellman equation:

Vπ=rπ+γPπVπ.V^\pi = r^\pi + \gamma P^\pi V^\pi.

Equivalently:

Vπ=TπVπ.V^\pi = T^\pi V^\pi.

This is a fixed-point equation. Instead of fitting fixed labels, we are looking for a value function that remains unchanged after applying the Bellman operator.

That is why value-based reinforcement learning should not be treated as ordinary supervised learning. In supervised learning, the target usually lives outside the model. In value learning, the target may be bootstrapped from the model’s current belief about the future.

Dynamic Programming

Dynamic programming is the exact, model-based version of Bellman approximation.

In value iteration, the update is:

Vk+1(s)=maxa[R(s,a)+γsP(ss,a)Vk(s)].V_{k+1}(s)=\max_a\left[R(s,a)+\gamma\sum_{s'}P(s'|s,a)V_k(s')\right].

In policy evaluation, for a fixed policy π\pi:

Vk+1(s)=Rπ(s)+γsPπ(ss)Vk(s).V_{k+1}(s)=R^\pi(s)+\gamma\sum_{s'}P^\pi(s'|s)V_k(s').

These formulas use the model directly:

P(ss,a),R(s,a).P(s'|s,a),\quad R(s,a).

Because the transition probabilities and rewards are known, dynamic programming can apply the Bellman operator with exact expectations. It is fixed-point iteration without sampling noise.

Sampled Bellman Approximation

Reinforcement learning becomes necessary when the model is unknown or too expensive to use directly. We no longer apply the exact Bellman operator. We sample experience and build targets from it.

The outer template returns:

XX+α(YX).X \leftarrow X+\alpha(Y-X).

The main difference between many RL algorithms is how they define the target YY.

Monte Carlo Prediction

Monte Carlo prediction uses the full return:

V(St)V(St)+α(GtV(St)).V(S_t)\leftarrow V(S_t)+\alpha(G_t-V(S_t)).

Here:

Y=Gt.Y=G_t.

The target is the sampled return after time tt.

Temporal Difference Prediction

TD prediction uses one real reward plus a bootstrapped estimate of the future:

V(St)V(St)+α(Rt+1+γV(St+1)V(St)).V(S_t)\leftarrow V(S_t)+\alpha(R_{t+1}+\gamma V(S_{t+1})-V(S_t)).

Here:

Y=Rt+1+γV(St+1).Y=R_{t+1}+\gamma V(S_{t+1}).

Sarsa

Sarsa updates an action-value estimate using the action actually taken next:

Q(St,At)Q(St,At)+α[Rt+1+γQ(St+1,At+1)Q(St,At)].Q(S_t,A_t)\leftarrow Q(S_t,A_t)+\alpha[R_{t+1}+\gamma Q(S_{t+1},A_{t+1})-Q(S_t,A_t)].

The target is:

Y=Rt+1+γQ(St+1,At+1).Y=R_{t+1}+\gamma Q(S_{t+1},A_{t+1}).

Q-learning

Q-learning uses the greedy next action under the current value estimate:

Q(St,At)Q(St,At)+α[Rt+1+γmaxaQ(St+1,a)Q(St,At)].Q(S_t,A_t)\leftarrow Q(S_t,A_t)+\alpha[R_{t+1}+\gamma\max_a Q(S_{t+1},a)-Q(S_t,A_t)].

The target is:

Y=Rt+1+γmaxaQ(St+1,a).Y=R_{t+1}+\gamma\max_a Q(S_{t+1},a).

The algorithms differ mainly in how they construct YY.

AlgorithmCurrent Estimate XXTarget YYMain Difference
Monte CarloV(St)V(S_t)GtG_tuses the full sampled return
TDV(St)V(S_t)R+γV(S)R+\gamma V(S')bootstraps from the next state
SarsaQ(S,A)Q(S,A)R+γQ(S,A)R+\gamma Q(S',A')uses the actual next action
Q-learningQ(S,A)Q(S,A)R+γmaxaQ(S,a)R+\gamma\max_a Q(S',a)uses the greedy next action

The outer update shape stays stable. The semantics change through the target.

Monte Carlo vs TD

Monte Carlo and TD illustrate a basic statistical tradeoff.

Monte Carlo uses:

Y=Gt.Y=G_t.

It waits for the actual return. This gives a target with low bias, but high variance. It also requires the episode to finish before the update can be made.

Temporal difference learning uses:

Y=Rt+1+γV(St+1).Y=R_{t+1}+\gamma V(S_{t+1}).

This target is biased because it uses the current value estimate, but it often has lower variance and can be used online.

One sentence version:

Monte Carlo trusts the sampled future. TD trusts its current estimate of the future.

Neither choice is universally better. They are different ways of controlling approximation error.

Sarsa vs Q-learning

Sarsa and Q-learning show that the same update shape can carry different policy meanings.

Sarsa uses:

R+γQ(S,A).R+\gamma Q(S',A').

The next action AA' is the action actually taken by the behavior policy. This makes Sarsa on-policy.

Q-learning uses:

R+γmaxaQ(S,a).R+\gamma\max_a Q(S',a).

This target evaluates the greedy action, regardless of what the behavior policy actually did. That makes Q-learning off-policy.

So the same correction template does not imply the same algorithmic meaning. The target defines what is being learned.

Why It Can Work

The convergence story usually combines two ideas: contraction and step size.

When γ<1\gamma<1, many Bellman operators are contractions:

TVTUγVU.\|TV-TU\|_\infty \le \gamma \|V-U\|_\infty.

Intuitively, a contraction pulls value functions closer together. Repeated application moves estimates toward a unique fixed point.

Sampling adds noise, so step size becomes important. The update must move enough to approach the fixed point, but not so aggressively that random samples dominate the process.

This is why Bellman fixed-point theory and stochastic approximation naturally meet inside reinforcement learning.

Nearby Ideas

The same correction-and-approximation pattern appears outside reinforcement learning.

In bandit learning, the value estimate for an action can be updated by:

Qn+1(a)=Qn(a)+α(RnQn(a)).Q_{n+1}(a)=Q_n(a)+\alpha(R_n-Q_n(a)).

In a Kalman filter, the correction step is:

x^tt=x^tt1+Kt(ztHx^tt1).\hat{x}_{t|t}=\hat{x}_{t|t-1}+K_t(z_t-H\hat{x}_{t|t-1}).

This is prediction plus gain times innovation.

The EM algorithm is also iterative approximation: estimate hidden structure, update parameters, and repeat. It does not fit the X+α(YX)X+\alpha(Y-X) form as directly, but it belongs to the broader family of procedures that replace one impossible problem with a sequence of easier approximation steps.

Mirror descent generalizes gradient descent by changing the geometry in which correction happens. The idea of “move in the direction of improvement” remains, but distance and update geometry are no longer Euclidean by default.

The point is not to flatten these algorithms into one formula. The point is to notice a shared habit across statistics, optimization, control, and machine learning.

Where the Analogy Becomes Dangerous

The shared update shape is useful, but it can also mislead.

Same shape does not mean same guarantees.

Gradient descent optimizes a loss. Monte Carlo estimates an expectation. TD solves a bootstrapped fixed-point problem. Q-learning performs off-policy control. Kalman filtering performs probabilistic state estimation. EM maximizes likelihood through latent-variable approximation.

Deep reinforcement learning makes the warning sharper. Instead of updating table entries, we update a function approximator:

Qθ(s,a).Q_\theta(s,a).

This introduces the deadly triad:

  • bootstrapping
  • off-policy learning
  • function approximation

Each part is manageable in some settings. Together, they can create instability.

So the correction template is a mental model, not a proof. It helps organize formulas, but it does not erase the assumptions behind each algorithm.

Learning as Controlled Approximation

The phrase I keep returning to is:

Learning algorithms are disciplined ways of becoming less wrong.

They begin with an imperfect estimate, receive imperfect feedback, and apply cautious correction. Sometimes the correction is an error. Sometimes it is a gradient. Sometimes it is a sampled return. Sometimes it is a bootstrapped Bellman target. Sometimes it is an innovation term in a state estimator.

LensWhat Learning Looks Like
Statisticsestimating an unknown quantity from samples
Optimizationmoving parameters along an objective surface
Numerical analysisiterating toward a fixed point
Reinforcement learningapplying sampled Bellman backups
Control theorycorrecting predictions using observed innovation
Stochastic approximationmoving toward a solution under noisy feedback

Learning is not magic. It is approximation under uncertainty, repeated carefully enough that error gradually becomes structure.