Expectation and Variance

Expectation and variance are two fundamental concepts in probability theory that describe the central tendency and spread of a random variable’s distribution.

Expected Value (Mean)

The expected value, also known as the mean or expectation, represents the average value of a random variable over many trials.

For Discrete Random Variables

$$\mathbb{E}[X] = \mu_X = \sum_{x} x \cdot p_X(x)$$

where:

  • $p_X(x)$ is the probability mass function (PMF)
  • The sum is taken over all possible values of $X$

Properties:

  • Linearity and Homogeneity: $\mathbb{E}[aX + b] = a\mathbb{E}[X] + b$
  • For two random variables: $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$
  • For independent random variables: $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$

Proofs of Key Properties

Proof: Linearity

For $a, b \in \mathbb{R}$:

$$\begin{aligned} \mathbb{E}[aX + b] &= \sum_{x} (ax + b) \cdot p_X(x) \\ &= a\sum_{x} x \cdot p_X(x) + b\sum_{x} p_X(x) \\ &= a\mathbb{E}[X] + b \end{aligned}$$

Proof: Additivity

$$\begin{aligned} \mathbb{E}[X + Y] &= \sum_{x}\sum_{y} (x + y) \cdot p_{X,Y}(x,y) \\ &= \sum_{x}\sum_{y} x \cdot p_{X,Y}(x,y) + \sum_{x}\sum_{y} y \cdot p_{X,Y}(x,y) \\ &= \mathbb{E}[X] + \mathbb{E}[Y] \end{aligned}$$

Proof: Product for Independent Variables

If $X$ and $Y$ are independent, then $p_{X,Y}(x,y) = p_X(x)p_Y(y)$, so:

$$\begin{aligned} \mathbb{E}[XY] &= \sum_{x}\sum_{y} xy \cdot p_{X,Y}(x,y) \\ &= \sum_{x}\sum_{y} xy \cdot p_X(x)p_Y(y) \\ &= \left(\sum_{x} x p_X(x)\right)\left(\sum_{y} y p_Y(y)\right) \\ &= \mathbb{E}[X]\mathbb{E}[Y] \end{aligned}$$
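
The three properties can be checked exactly on a small case. The sketch below assumes two independent fair dice and uses exact fractions; the helper names (`expectation`, `joint_expectation`) and the constants `a`, `b` are illustrative choices, not part of the text above.

```python
from fractions import Fraction
from itertools import product

# PMF of a fair six-sided die, stored as {value: probability}.
die = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(pmf):
    """E[X] = sum of x * p(x) over the support."""
    return sum(x * p for x, p in pmf.items())

# Joint PMF of two independent dice: p(x, y) = p(x) * p(y).
joint = {(x, y): die[x] * die[y] for x, y in product(die, die)}

def joint_expectation(g):
    """E[g(X, Y)] computed from the joint PMF."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

mean = expectation(die)
a, b = 3, 2  # arbitrary constants for the linearity check
lhs_linear = sum((a * x + b) * p for x, p in die.items())
e_sum = joint_expectation(lambda x, y: x + y)
e_prod = joint_expectation(lambda x, y: x * y)

print(lhs_linear == a * mean + b)  # linearity
print(e_sum == mean + mean)        # additivity
print(e_prod == mean * mean)       # product rule under independence
```

Working with `Fraction` rather than floats keeps the comparisons exact, so each equality is tested as an identity instead of up to rounding error.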

For Continuous Random Variables

$$\mathbb{E}[X] = \mu_X = \int_{-\infty}^{\infty} x \cdot f_X(x)\,dx$$

where:

  • $f_X(x)$ is the probability density function (PDF)

Variance

Variance measures how much the values of a random variable deviate from its mean.

Definition: Variance

$$\mathbb{V}(X) = \sigma_X^2 = \mathbb{E}[(X - \mu_X)^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

For Discrete Random Variables

$$\mathbb{V}(X) = \sum_{x} (x - \mu_X)^2 \cdot p_X(x)$$

For Continuous Random Variables

$$\mathbb{V}(X) = \int_{-\infty}^{\infty} (x - \mu_X)^2 \cdot f_X(x)\,dx$$

Standard Deviation

The standard deviation is the square root of the variance: $\sigma_X = \sqrt{\mathbb{V}(X)}$

Properties of Variance

  • $\mathbb{V}(X) \geq 0$
  • $\mathbb{V}(a) = 0$ for any constant $a$
  • $\mathbb{V}(aX) = a^2 \mathbb{V}(X)$
  • $\mathbb{V}(X + a) = \mathbb{V}(X)$
  • For independent random variables: $\mathbb{V}(X + Y) = \mathbb{V}(X) + \mathbb{V}(Y)$

Proofs of Variance Properties

Proof: Scaling

For $a \in \mathbb{R}$:

$$\begin{aligned} \mathbb{V}(aX) &= \mathbb{E}[(aX - \mathbb{E}[aX])^2] \\ &= \mathbb{E}[(aX - a\mathbb{E}[X])^2] \\ &= \mathbb{E}[a^2(X - \mathbb{E}[X])^2] \\ &= a^2\mathbb{E}[(X - \mathbb{E}[X])^2] \\ &= a^2\mathbb{V}(X) \end{aligned}$$

Proof: Shift Invariance

$$\begin{aligned} \mathbb{V}(X + a) &= \mathbb{E}[(X + a - \mathbb{E}[X + a])^2] \\ &= \mathbb{E}[(X + a - \mathbb{E}[X] - a)^2] \\ &= \mathbb{E}[(X - \mathbb{E}[X])^2] \\ &= \mathbb{V}(X) \end{aligned}$$

Proof: Additivity for Independent Variables

If $X$ and $Y$ are independent:

$$\begin{aligned} \mathbb{V}(X + Y) &= \mathbb{E}[(X + Y)^2] - (\mathbb{E}[X + Y])^2 \\ &= \mathbb{E}[X^2 + 2XY + Y^2] - (\mathbb{E}[X] + \mathbb{E}[Y])^2 \\ &= \mathbb{E}[X^2] + 2\mathbb{E}[X]\mathbb{E}[Y] + \mathbb{E}[Y^2] - \mathbb{E}[X]^2 - 2\mathbb{E}[X]\mathbb{E}[Y] - \mathbb{E}[Y]^2 \\ &= (\mathbb{E}[X^2] - \mathbb{E}[X]^2) + (\mathbb{E}[Y^2] - \mathbb{E}[Y]^2) \\ &= \mathbb{V}(X) + \mathbb{V}(Y) \end{aligned}$$
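
As a sanity check on the scaling, shift, and additivity properties, the following sketch recomputes each variance from the PMF of the transformed variable. It assumes fair dice as the test distribution; `transform`, `sum_of_independent`, and the constant `a = 4` are hypothetical helpers and values chosen for illustration.

```python
from fractions import Fraction
from itertools import product

die = {x: Fraction(1, 6) for x in range(1, 7)}

def variance(pmf):
    """V(X) = E[(X - mu)^2], computed directly from the PMF."""
    mean = sum(x * p for x, p in pmf.items())
    return sum((x - mean) ** 2 * p for x, p in pmf.items())

def transform(pmf, g):
    """PMF of g(X): merge the probabilities of values with the same image."""
    out = {}
    for x, p in pmf.items():
        out[g(x)] = out.get(g(x), 0) + p
    return out

def sum_of_independent(pmf_x, pmf_y):
    """PMF of X + Y when X and Y are independent (discrete convolution)."""
    out = {}
    for (x, px), (y, py) in product(pmf_x.items(), pmf_y.items()):
        out[x + y] = out.get(x + y, 0) + px * py
    return out

a = 4
var_x = variance(die)
scaled = variance(transform(die, lambda x: a * x))   # V(aX)
shifted = variance(transform(die, lambda x: x + a))  # V(X + a)
summed = variance(sum_of_independent(die, die))      # V(X + Y)

print(scaled == a**2 * var_x, shifted == var_x, summed == 2 * var_x)
```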

Examples

Example: Discrete Case (Die Roll)

For a fair six-sided die:

  • PMF: $p_X(x) = \frac{1}{6}$ for $x \in \{1, 2, 3, 4, 5, 6\}$

Expected Value:

$$\mathbb{E}[X] = \sum_{x=1}^{6} x \cdot \frac{1}{6} = \frac{1+2+3+4+5+6}{6} = \frac{21}{6} = 3.5$$

Variance:

$$\mathbb{E}[X^2] = \sum_{x=1}^{6} x^2 \cdot \frac{1}{6} = \frac{1+4+9+16+25+36}{6} = \frac{91}{6}$$

$$\mathbb{V}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \frac{91}{6} - (3.5)^2 = \frac{91}{6} - \frac{49}{4} = \frac{182 - 147}{12} = \frac{35}{12} \approx 2.92$$

Example: Continuous Case (Normal Distribution)

For $X \sim N(\mu, \sigma^2)$:

  • PDF: $f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

Expected Value: $\mathbb{E}[X] = \mu$

Variance: $\mathbb{V}(X) = \sigma^2$

Example: Continuous Case (Uniform Distribution)

For $X \sim U(a, b)$:

  • PDF: $f_X(x) = \frac{1}{b-a}$ for $a \leq x \leq b$

Expected Value: $\mathbb{E}[X] = \int_a^b x \cdot \frac{1}{b-a}\,dx = \frac{a+b}{2}$

Variance: $\mathbb{V}(X) = \int_a^b \left(x - \frac{a+b}{2}\right)^2 \cdot \frac{1}{b-a}\,dx = \frac{(b-a)^2}{12}$
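
The closed forms in the two continuous examples can be cross-checked by integrating the densities numerically. This is a rough sketch using a plain midpoint rule; the parameter values (`a = 2`, `b = 5`, `mu = 1`, `sigma = 2`), the grid size, and the truncation of the normal density to ten standard deviations are all assumptions made for the illustration.

```python
import math

def integrate(f, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

def mean_and_variance(pdf, lo, hi):
    """E[X] and V(X) from the defining integrals of a density on [lo, hi]."""
    mean = integrate(lambda x: x * pdf(x), lo, hi)
    var = integrate(lambda x: (x - mean) ** 2 * pdf(x), lo, hi)
    return mean, var

# Uniform(a, b): expect (a + b) / 2 and (b - a)^2 / 12.
a, b = 2.0, 5.0
uni_mean, uni_var = mean_and_variance(lambda x: 1.0 / (b - a), a, b)

# Normal(mu, sigma^2): expect mu and sigma^2.
mu, sigma = 1.0, 2.0

def normal_pdf(x):
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# The tails beyond 10 standard deviations carry negligible mass.
norm_mean, norm_var = mean_and_variance(normal_pdf, mu - 10 * sigma, mu + 10 * sigma)

print(uni_mean, uni_var, norm_mean, norm_var)
```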

Expectation of Functions of Random Variables

When we apply a function to a random variable, we obtain a new random variable. Computing the expectation of this new random variable is a fundamental problem in probability theory.

Law of the Unconscious Statistician (LOTUS)

The core principle for computing expectations of functions of random variables is the Law of the Unconscious Statistician (LOTUS). This law states that to compute $\mathbb{E}[g(X)]$, we don’t need to first find the distribution of $g(X)$. Instead, we can work directly with the original distribution of $X$.

Computation Formula

For a function $g: \mathbb{R} \to \mathbb{R}$ and random variable $X$, the expectation of $g(X)$ is:

$$\mathbb{E}[g(X)] = \begin{cases} \sum_{x} g(x) \cdot p_X(x) & \text{(discrete)} \\ \int_{-\infty}^{\infty} g(x) \cdot f_X(x)\,dx & \text{(continuous)} \end{cases}$$

Important Properties

  1. Linearity: $\mathbb{E}[a \cdot g(X) + b \cdot h(X)] = a\mathbb{E}[g(X)] + b\mathbb{E}[h(X)]$
  2. Monotonicity: If $g(x) \leq h(x)$ for all $x$, then $\mathbb{E}[g(X)] \leq \mathbb{E}[h(X)]$
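
To see what LOTUS saves, the sketch below computes $\mathbb{E}[g(X)]$ two ways for $g(x) = x^2$: once by weighting $g(x)$ with the PMF of $X$, and once by first constructing the PMF of $Y = g(X)$. The distribution (uniform on five integers) is an assumed toy example.

```python
from fractions import Fraction

# Assumed toy distribution: X uniform on {-2, -1, 0, 1, 2}.
pmf_x = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}

def g(x):
    return x * x

# LOTUS: weight g(x) by the PMF of X directly.
lotus = sum(g(x) * p for x, p in pmf_x.items())

# Long way: build the PMF of Y = g(X) first, then take its mean.
pmf_y = {}
for x, p in pmf_x.items():
    pmf_y[g(x)] = pmf_y.get(g(x), 0) + p
direct = sum(y * p for y, p in pmf_y.items())

print(lotus, direct)
```

Because $g$ is not one-to-one here, the long way has to merge the probabilities of $x$ and $-x$; LOTUS skips that bookkeeping.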

Application Examples

Example: Expectation of Square Function

For any random variable $X$, computing $\mathbb{E}[X^2]$:

  • Discrete case: $\mathbb{E}[X^2] = \sum_{x} x^2 \cdot p_X(x)$
  • Continuous case: $\mathbb{E}[X^2] = \int_{-\infty}^{\infty} x^2 \cdot f_X(x)\,dx$

This result gives the standard variance formula: $\mathbb{V}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$

Example: Expectation of Exponential Function

For any random variable $X$, computing $\mathbb{E}[e^{tX}]$:

  • Discrete case: $\mathbb{E}[e^{tX}] = \sum_{x} e^{tx} \cdot p_X(x)$
  • Continuous case: $\mathbb{E}[e^{tX}] = \int_{-\infty}^{\infty} e^{tx} \cdot f_X(x)\,dx$

This is the definition of the moment generating function, which has wide applications in probability theory.
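
One standard use of the moment generating function is that its derivatives at $t = 0$ give the moments of $X$. The sketch below illustrates this for a fair die, approximating the derivatives with central finite differences; the step size `h` and the name `mgf` are assumptions of the example.

```python
import math

die = {x: 1 / 6 for x in range(1, 7)}

def mgf(t):
    """M(t) = E[exp(tX)], evaluated via LOTUS on the die PMF."""
    return sum(math.exp(t * x) * p for x, p in die.items())

h = 1e-4  # finite-difference step, chosen by hand
first_moment = (mgf(h) - mgf(-h)) / (2 * h)                 # approximates E[X]
second_moment = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h**2    # approximates E[X^2]

print(first_moment, second_moment)
```

The two values should land close to $3.5$ and $91/6$, matching the die-roll example earlier in the section.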

Numerical Estimation Methods

When functions are complex or distributions are non-standard, analytical solutions may be difficult to obtain. In such cases, we can use Taylor series approximation for numerical estimation.

Taylor Series Approximation Method

For a random variable $X$ with mean $\mu$ and variance $\sigma^2$, the expectation and variance of $f(X)$ can be approximated using a Taylor expansion.

Proof: Approximation Derivation for Expectation

  1. Perform a second-order Taylor expansion of $f(X)$ around $\mu$:
     $$f(X) = f(\mu) + f'(\mu)(X-\mu) + \frac{f''(\mu)}{2}(X-\mu)^2 + R_2$$
     where $R_2$ is the remainder term.

  2. Take the expectation of both sides:
     $$\mathbb{E}[f(X)] = \mathbb{E}[f(\mu)] + \mathbb{E}[f'(\mu)(X-\mu)] + \mathbb{E}\left[\frac{f''(\mu)}{2}(X-\mu)^2\right] + \mathbb{E}[R_2]$$

  3. Since $f(\mu)$, $f'(\mu)$, and $f''(\mu)$ are constants:
     $$\mathbb{E}[f(X)] = f(\mu) + f'(\mu)\mathbb{E}[X-\mu] + \frac{f''(\mu)}{2}\mathbb{E}[(X-\mu)^2] + \mathbb{E}[R_2]$$

  4. Using $\mathbb{E}[X-\mu] = 0$ and $\mathbb{E}[(X-\mu)^2] = \sigma^2$, and ignoring the remainder term:
     $$\mathbb{E}[f(X)] \approx f(\mu) + \frac{f''(\mu)}{2}\sigma^2$$

Proof: Approximation Derivation for Variance

  1. Use a first-order Taylor expansion (usually sufficient for variance calculation):
     $$f(X) \approx f(\mu) + f'(\mu)(X-\mu)$$

  2. Since $f(\mu)$ is constant, it doesn’t affect the variance:
     $$\mathbb{V}[f(X)] \approx \mathbb{V}[f'(\mu)(X-\mu)]$$

  3. The constant factor $f'(\mu)$ comes out squared:
     $$\mathbb{V}[f(X)] \approx [f'(\mu)]^2 \mathbb{V}[X-\mu]$$

  4. Since $\mathbb{V}[X-\mu] = \mathbb{V}[X] = \sigma^2$:
     $$\mathbb{V}[f(X)] \approx [f'(\mu)]^2 \sigma^2$$

Summary Formulas:

$$\begin{aligned} \mathbb{E}\left[f(X)\right] &\approx f(\mu) + f''(\mu)\frac{\sigma^2}{2} \\ \mathbb{V}\left[f(X)\right] &\approx \left(f'(\mu)\right)^2\sigma^2 \end{aligned}$$

Approximation Accuracy Notes

  • The expectation approximation uses a second-order expansion, so it includes the curvature correction $\frac{f''(\mu)}{2}\sigma^2$ that a first-order expansion would miss
  • The variance approximation uses a first-order expansion; for strongly nonlinear functions, higher-order terms may be needed
  • When $f$ is a linear function, both approximations are exact
  • The more concentrated the distribution of $X$ (smaller $\sigma^2$), the better the approximation
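
To get a feel for the size of the error, the sketch below compares both approximations against exact values for $f(x) = \ln x$. The test distribution (uniform on $\{9, 10, 11\}$) is an assumption picked so that $X$ is concentrated near its mean, and the derivatives `f1`, `f2` are written out by hand.

```python
import math

# Assumed toy distribution: X uniform on {9, 10, 11}.
values = (9, 10, 11)
prob = 1 / 3
mu = sum(x * prob for x in values)
sigma2 = sum((x - mu) ** 2 * prob for x in values)

f = math.log

def f1(x):
    return 1 / x        # first derivative of ln x

def f2(x):
    return -1 / x**2    # second derivative of ln x

# Exact moments of f(X), computed term by term via LOTUS.
exact_mean = sum(f(x) * prob for x in values)
exact_var = sum((f(x) - exact_mean) ** 2 * prob for x in values)

# Taylor-based approximations from the summary formulas.
approx_mean = f(mu) + f2(mu) * sigma2 / 2
approx_var = f1(mu) ** 2 * sigma2

print(exact_mean, approx_mean)
print(exact_var, approx_var)
```

For this case the second-order mean estimate is much closer to the exact value than the plain plug-in $f(\mu)$, which is the effect described in the accuracy notes above.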

Covariance and Correlation

When working with multiple random variables, we often want to measure their relationship.

Covariance

$$\text{Cov}(X,Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$$

Correlation Coefficient

$$\rho_{X,Y} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$$

Properties:

  • $-1 \leq \rho_{X,Y} \leq 1$
  • $\rho = 1$: Perfect positive linear relationship
  • $\rho = -1$: Perfect negative linear relationship
  • $\rho = 0$: No linear relationship (but may have non-linear relationship)
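
The last property is easy to demonstrate on a small joint PMF. In the sketch below, $Y = 2X + 1$ gives $\rho = 1$, while $Y = X^2$ gives $\rho = 0$ even though $Y$ is completely determined by $X$. The helper `cov_and_corr` and both joint distributions are hypothetical examples.

```python
import math
from fractions import Fraction

def cov_and_corr(joint):
    """Covariance and correlation from a joint PMF {(x, y): probability}."""
    ex = sum(x * p for (x, y), p in joint.items())
    ey = sum(y * p for (x, y), p in joint.items())
    exy = sum(x * y * p for (x, y), p in joint.items())
    var_x = sum((x - ex) ** 2 * p for (x, y), p in joint.items())
    var_y = sum((y - ey) ** 2 * p for (x, y), p in joint.items())
    cov = exy - ex * ey
    return cov, float(cov) / math.sqrt(var_x * var_y)

third = Fraction(1, 3)
linear = {(x, 2 * x + 1): third for x in (-1, 0, 1)}   # Y = 2X + 1
quadratic = {(x, x * x): third for x in (-1, 0, 1)}    # Y = X^2

cov_lin, rho_lin = cov_and_corr(linear)
cov_quad, rho_quad = cov_and_corr(quadratic)

print(cov_lin, rho_lin)
print(cov_quad, rho_quad)
```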

Common Distributions and Their Moments

Distribution | Expected Value | Variance
--- | --- | ---
Bernoulli(p) | $p$ | $p(1-p)$
Binomial(n,p) | $np$ | $np(1-p)$
Poisson(λ) | $\lambda$ | $\lambda$
Uniform(a,b) | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$
Normal(μ,σ²) | $\mu$ | $\sigma^2$
Exponential(λ) | $\frac{1}{\lambda}$ | $\frac{1}{\lambda^2}$
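
Any row of the table can be verified from the PMF or PDF definition. As one example, the sketch below checks the Binomial row exactly; the parameters `n = 10` and `p = 3/10` are arbitrary illustrative values.

```python
from fractions import Fraction
from math import comb

n, p = 10, Fraction(3, 10)

# Binomial PMF: P(X = k) = C(n, k) * p^k * (1 - p)^(n - k).
binom = {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

total = sum(binom.values())
mean = sum(k * q for k, q in binom.items())
var = sum((k - mean) ** 2 * q for k, q in binom.items())

print(total == 1, mean == n * p, var == n * p * (1 - p))
```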

Important Theorems

Theorem: Law of Large Numbers

For i.i.d. random variables $X_1, X_2, \ldots, X_n$ with finite mean $\mu$, the sample mean converges in probability to $\mu$:

$$\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{P} \mu \text{ as } n \to \infty$$

Theorem: Central Limit Theorem

For i.i.d. random variables with mean $\mu$ and finite variance $\sigma^2 > 0$, the standardized sum converges in distribution to a standard normal:

$$\frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma\sqrt{n}} \xrightarrow{D} N(0,1) \text{ as } n \to \infty$$
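
Both theorems can be observed in a quick simulation. The sketch below is only illustrative: the seed, the sample sizes, the choice of die rolls and Uniform(0, 1) draws, and the use of the interval $[-1.96, 1.96]$ as a rough normality check are all assumptions of the example.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is reproducible

# Law of large numbers: the mean of many fair die rolls approaches 3.5.
rolls = [rng.randint(1, 6) for _ in range(100_000)]
sample_mean = sum(rolls) / len(rolls)

# Central limit theorem: standardized sums of n Uniform(0, 1) draws.
n, reps = 30, 5_000
mu, sigma = 0.5, math.sqrt(1 / 12)  # mean and standard deviation of Uniform(0, 1)
z = [
    (sum(rng.random() for _ in range(n)) - n * mu) / (sigma * math.sqrt(n))
    for _ in range(reps)
]

# If z is roughly N(0, 1), about 95% of values fall within +/- 1.96.
coverage = sum(abs(v) <= 1.96 for v in z) / reps

print(sample_mean, coverage)
```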

Expectation with Multiple Random Variables

When working with functions of multiple random variables, we need to understand how to compute their expectations.

Expectation of Functions of Multiple Variables

For a function $g(X,Y)$ of two random variables, the expectation is computed using the joint distribution:

$$\mathbb{E}[g(X,Y)] = \begin{cases} \sum_{x}\sum_{y} g(x,y) \cdot p_{X,Y}(x,y) & \text{(discrete)} \\ \iint_{\mathbb{R}^2} g(x,y) \cdot f_{X,Y}(x,y)\,dx\,dy & \text{(continuous)} \end{cases}$$

Key Properties

From this definition, we derive important properties:

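  1. Linearity: $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$ (always holds when both expectations exist, with or without independence)
  2. Products: $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$ (holds whenever $X$ and $Y$ are independent; independence is sufficient but not necessary, since the identity is equivalent to $\text{Cov}(X,Y) = 0$)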

Computing Expectations from Joint Distributions

Geometric Interpretation for Continuous Case

For a joint probability density function $f(x,y)$, computing $\mathbb{E}[X]$ involves integrating over the entire plane:

$$\mathbb{E}[X] = \iint_{\mathbb{R}^2} x \cdot f(x,y)\,dx\,dy$$

This can be understood geometrically as finding the “center of mass” in the x-direction of the 3D surface formed by the joint density.

The computation can be done in two equivalent ways:

  1. Direct integration: Integrate $x \cdot f(x,y)$ over the entire plane
  2. Using marginal density: First find $f_X(x) = \int_{-\infty}^{\infty} f(x,y)\,dy$, then compute $\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f_X(x)\,dx$

The second approach works because:

$$\mathbb{E}[X] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x \cdot f(x,y)\,dy\,dx = \int_{-\infty}^{\infty} x \left(\int_{-\infty}^{\infty} f(x,y)\,dy\right) dx = \int_{-\infty}^{\infty} x \cdot f_X(x)\,dx$$

Connection to Discrete Case

Similarly, for discrete random variables:

$$\mathbb{E}[X] = \sum_{x}\sum_{y} x \cdot p_{X,Y}(x,y) = \sum_{x} x \left(\sum_{y} p_{X,Y}(x,y)\right) = \sum_{x} x \cdot p_X(x)$$

This shows that whether we work with joint distributions directly or first compute marginal distributions, we arrive at the same expectation.
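
The equivalence of the two routes can be confirmed on a small discrete case. The joint PMF below is a made-up example on $\{0, 1, 2\} \times \{0, 1\}$ in which $X$ and $Y$ are not independent.

```python
from fractions import Fraction

# Hypothetical joint PMF {(x, y): probability}; the entries sum to 1.
joint = {
    (0, 0): Fraction(1, 10), (0, 1): Fraction(2, 10),
    (1, 0): Fraction(3, 10), (1, 1): Fraction(1, 10),
    (2, 0): Fraction(1, 10), (2, 1): Fraction(2, 10),
}

# Route 1: sum x * p(x, y) over every pair.
direct = sum(x * p for (x, y), p in joint.items())

# Route 2: marginalize out y, then take the mean of the marginal PMF.
marginal_x = {}
for (x, y), p in joint.items():
    marginal_x[x] = marginal_x.get(x, 0) + p
via_marginal = sum(x * p for x, p in marginal_x.items())

print(direct, via_marginal)
```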

Conditional Expectation

The conditional expectation of $Y$ given $X = x$ is:

$$\mathbb{E}[Y|X = x] = \begin{cases} \sum_{y} y \cdot p_{Y|X}(y|x) & \text{(discrete)} \\ \int_{-\infty}^{\infty} y \cdot f_{Y|X}(y|x)\,dy & \text{(continuous)} \end{cases}$$

This leads to the law of total expectation: $\mathbb{E}[Y] = \mathbb{E}[\mathbb{E}[Y|X]]$
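
A small discrete check of the law of total expectation is sketched below. It assumes a made-up joint PMF, uses $p_{Y|X}(y|x) = p_{X,Y}(x,y) / p_X(x)$ for the conditional PMF, and the helper name `cond_mean_y` is illustrative.

```python
from fractions import Fraction

# Hypothetical joint PMF {(x, y): probability}; the entries sum to 1.
joint = {
    (0, 0): Fraction(1, 10), (0, 1): Fraction(2, 10),
    (1, 0): Fraction(3, 10), (1, 1): Fraction(1, 10),
    (2, 0): Fraction(1, 10), (2, 1): Fraction(2, 10),
}

# Marginal PMF of X.
p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p

def cond_mean_y(x0):
    """E[Y | X = x0] = sum over y of y * p(x0, y) / p_X(x0)."""
    return sum(y * p for (x, y), p in joint.items() if x == x0) / p_x[x0]

# Left side: E[Y] from the joint PMF.
e_y = sum(y * p for (x, y), p in joint.items())

# Right side: average the conditional means over the distribution of X.
tower = sum(cond_mean_y(x) * p for x, p in p_x.items())

print(e_y, tower)
```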