Common Mathematical Proofs

Frequently Referenced Proofs

1 Overview

This page collects proofs for results that are frequently cited in econometrics and ML textbooks without full derivation. Each proof is presented with a theorem statement, step-by-step derivation, and a note on where the result is used.


2 A. Linear Algebra Proofs

2.1 Proof 1: \((AB)^T = B^T A^T\)

Theorem. For matrices \(A \in \mathbb{R}^{m \times n}\) and \(B \in \mathbb{R}^{n \times p}\): \[ (AB)^T = B^T A^T \]

Proof.

Let \(C = AB\), so \(C \in \mathbb{R}^{m \times p}\) with entries: \[ C_{ij} = (AB)_{ij} = \sum_{k=1}^n A_{ik} B_{kj} \]

The transpose \(C^T\) has entries: \[ (C^T)_{ij} = C_{ji} = \sum_{k=1}^n A_{jk} B_{ki} \]

Now compute \((B^T A^T)_{ij}\). We have \((B^T)_{ik} = B_{ki}\) and \((A^T)_{kj} = A_{jk}\), so: \[ (B^T A^T)_{ij} = \sum_{k=1}^n (B^T)_{ik} (A^T)_{kj} = \sum_{k=1}^n B_{ki} A_{jk} = \sum_{k=1}^n A_{jk} B_{ki} \]

Since \((C^T)_{ij} = (B^T A^T)_{ij}\) for all \(i, j\), we conclude \((AB)^T = B^T A^T\). \(\square\)

Tip

Where it’s used: Constantly. Transposing products appears in OLS (\(\hat{\beta} = (X^TX)^{-1}X^Ty\)), matrix calculus derivations, and any manipulation of quadratic forms.
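The identity is easy to spot-check numerically. A minimal NumPy sketch (the shapes and seed are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # A in R^{3x4}
B = rng.standard_normal((4, 2))   # B in R^{4x2}

lhs = (A @ B).T    # transpose of the product
rhs = B.T @ A.T    # product of transposes, in reversed order
assert np.allclose(lhs, rhs)
assert lhs.shape == (2, 3)
```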


2.2 Proof 2: \((AB)^{-1} = B^{-1} A^{-1}\)

Theorem. For invertible matrices \(A, B \in \mathbb{R}^{n \times n}\): \[ (AB)^{-1} = B^{-1} A^{-1} \]

Proof.

We show that \(B^{-1}A^{-1}\) is the inverse of \(AB\) by verifying both left and right inverse conditions.

Right inverse: \[ (AB)(B^{-1}A^{-1}) = A(BB^{-1})A^{-1} = A \cdot I \cdot A^{-1} = AA^{-1} = I \]

Left inverse: \[ (B^{-1}A^{-1})(AB) = B^{-1}(A^{-1}A)B = B^{-1} \cdot I \cdot B = B^{-1}B = I \]

Since both products equal \(I\), \(B^{-1}A^{-1}\) is the (unique) inverse of \(AB\). \(\square\)

Tip

Where it’s used: Manipulating the OLS formula, GLS, and any expression involving inverse of a matrix product. Note the reversal of order — a common source of errors.
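The order reversal can be verified numerically, along with the fact that the "wrong" order generally fails. A NumPy sketch (random square matrices are invertible with probability 1):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

lhs = np.linalg.inv(A @ B)
rhs = np.linalg.inv(B) @ np.linalg.inv(A)    # note the reversed order
wrong = np.linalg.inv(A) @ np.linalg.inv(B)  # the common mistake

assert np.allclose(lhs, rhs)
assert not np.allclose(lhs, wrong)  # fails unless A and B happen to commute
```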


2.3 Proof 3: \(\text{tr}(AB) = \text{tr}(BA)\)

Theorem. For \(A \in \mathbb{R}^{m \times n}\) and \(B \in \mathbb{R}^{n \times m}\): \[ \text{tr}(AB) = \text{tr}(BA) \]

Proof.

By definition, \(\text{tr}(AB) = \sum_{i=1}^m (AB)_{ii}\). Expanding: \[ \text{tr}(AB) = \sum_{i=1}^m (AB)_{ii} = \sum_{i=1}^m \sum_{k=1}^n A_{ik} B_{ki} \]

Rearranging the sum (finite sums commute): \[ = \sum_{k=1}^n \sum_{i=1}^m B_{ki} A_{ik} = \sum_{k=1}^n (BA)_{kk} = \text{tr}(BA) \]

\(\square\)

Tip

Where it’s used: Matrix calculus (deriving \(\frac{\partial}{\partial X}\text{tr}(AX) = A^T\)), simplifying likelihood expressions, information geometry. The cyclic property extends: \(\text{tr}(ABC) = \text{tr}(CAB) = \text{tr}(BCA)\).
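Both the basic identity and the cyclic extension can be checked numerically (a NumPy sketch with illustrative shapes):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 3))

# tr(AB) = tr(BA), even though AB is 3x3 and BA is 5x5
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# cyclic property with three factors: tr(CAB) = tr(BCA)
C = rng.standard_normal((3, 3))
t1 = np.trace(C @ A @ B)
t2 = np.trace(B @ C @ A)
assert np.isclose(t1, t2)
```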


2.4 Proof 4: \(\det(AB) = \det(A)\det(B)\)

Theorem. For \(A, B \in \mathbb{R}^{n \times n}\): \[ \det(AB) = \det(A)\det(B) \]

Proof (via elementary matrices).

Every invertible matrix can be expressed as a product of elementary matrices \(E_1, \ldots, E_r\) (row operations). The determinant of an elementary matrix is a simple scalar, and \(\det(E_i A) = \det(E_i)\det(A)\) holds by direct computation for each type of row operation. By induction, \(\det(AB) = \det(A)\det(B)\) for invertible \(A\).

For singular \(A\): if \(\det(A) = 0\), the columns of \(A\) are linearly dependent, so \(\text{rank}(A) < n\). Since each column of \(AB\) is a linear combination of the columns of \(A\), we have \(\text{rank}(AB) \leq \text{rank}(A) < n\), so \(AB\) is also singular and \(\det(AB) = 0 = 0 \cdot \det(B) = \det(A)\det(B)\).

Geometric intuition: \(\det(A)\) measures the volume scaling factor of the linear map \(A\). Composing two linear maps scales volume by \(\det(A) \cdot \det(B)\). \(\square\)

Tip

Where it’s used: Likelihood calculations involving multivariate normal densities (which include \(\det(\Sigma)\)); Jacobian of change-of-variables; matrix determinant lemma.
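A numerical check covering both the generic and the singular case (a NumPy sketch; the matrices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))

# singular case: det(AB) = 0 when det(A) = 0
S = A.copy()
S[:, 0] = S[:, 1]   # force linearly dependent columns
assert np.isclose(np.linalg.det(S @ B), 0.0, atol=1e-8)
```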


2.5 Proof 5: \(\text{tr}(A) = \sum_i \lambda_i\)

Theorem. The trace of a square matrix equals the sum of its eigenvalues (counted with multiplicity).

Proof.

The characteristic polynomial of \(A \in \mathbb{R}^{n \times n}\) is: \[ p(\lambda) = \det(\lambda I - A) \]

The eigenvalues \(\lambda_1, \ldots, \lambda_n\) are the roots, so: \[ p(\lambda) = (\lambda - \lambda_1)(\lambda - \lambda_2)\cdots(\lambda - \lambda_n) \]

Expanding the right side, the coefficient of \(\lambda^{n-1}\) is \(-(\lambda_1 + \lambda_2 + \cdots + \lambda_n)\).

On the left side, \(\det(\lambda I - A)\) expands as follows. The only term contributing to \(\lambda^{n-1}\) comes from the product of diagonal entries \((\lambda - a_{11})(\lambda - a_{22})\cdots(\lambda - a_{nn})\): any permutation term in the Leibniz expansion that uses an off-diagonal entry omits at least two diagonal factors and so has degree at most \(n-2\). The diagonal product contributes \(-(\sum_i a_{ii})\lambda^{n-1} = -\text{tr}(A)\lambda^{n-1}\).

Matching coefficients of \(\lambda^{n-1}\): \[ -\text{tr}(A) = -(\lambda_1 + \cdots + \lambda_n) \implies \text{tr}(A) = \sum_{i=1}^n \lambda_i \]

\(\square\)

Tip

Where it’s used: Checking computations, understanding the trace as an eigenvalue-based quantity, PCA (variance explained = sum of eigenvalues of \(\Sigma\)).
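A quick numerical check (a NumPy sketch; note the eigenvalues of a general real matrix are complex, but their imaginary parts cancel in conjugate pairs):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 5))
eigvals = np.linalg.eigvals(A)   # complex in general

assert np.isclose(np.trace(A), eigvals.sum().real)
assert abs(eigvals.sum().imag) < 1e-10  # imaginary parts cancel
```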


2.6 Proof 6: \(\det(A) = \prod_i \lambda_i\)

Theorem. The determinant of a square matrix equals the product of its eigenvalues.

Proof.

Using the characteristic polynomial from Proof 5, evaluate \(p(\lambda)\) at \(\lambda = 0\): \[ p(0) = \det(0 \cdot I - A) = \det(-A) = (-1)^n \det(A) \]

From the factored form: \[ p(0) = (0 - \lambda_1)(0 - \lambda_2)\cdots(0-\lambda_n) = (-1)^n \lambda_1 \lambda_2 \cdots \lambda_n \]

Therefore: \[ (-1)^n \det(A) = (-1)^n \prod_i \lambda_i \implies \det(A) = \prod_{i=1}^n \lambda_i \]

\(\square\)

Tip

Where it’s used: A matrix is invertible iff \(\det(A) \neq 0\) iff all eigenvalues are nonzero. Used to check positive definiteness (\(\det(A) > 0\) is necessary but not sufficient — need all \(\lambda_i > 0\)).
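The companion check to Proof 5, again with a NumPy sketch (the imaginary part of the product vanishes for a real matrix):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((5, 5))
eigvals = np.linalg.eigvals(A)

assert np.isclose(np.linalg.det(A), np.prod(eigvals).real)
```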


3 B. Calculus and Optimization Proofs

3.1 Proof 7: \(\frac{\partial}{\partial \mathbf{x}}(A\mathbf{x}) = A^T\)

Theorem. For \(A \in \mathbb{R}^{m \times n}\) and \(\mathbf{x} \in \mathbb{R}^n\), let \(f(\mathbf{x}) = A\mathbf{x}\). Then, in denominator-layout notation: \[ \frac{\partial f}{\partial \mathbf{x}} = A^T \] Equivalently, \(\frac{\partial (A\mathbf{x})_i}{\partial x_j} = A_{ij}\), so the Jacobian of \(A\mathbf{x}\) is \(A\), and the gradient of the scalar \(\mathbf{a}^T A \mathbf{x}\) with respect to \(\mathbf{x}\) is \(A^T \mathbf{a}\).

Proof.

Consider the scalar function \(g(\mathbf{x}) = \mathbf{a}^T A \mathbf{x}\) where \(\mathbf{a} \in \mathbb{R}^m\) is fixed.

\[ g(\mathbf{x}) = \mathbf{a}^T A \mathbf{x} = \sum_{i=1}^m a_i (A\mathbf{x})_i = \sum_{i=1}^m a_i \sum_{j=1}^n A_{ij} x_j = \sum_{j=1}^n \left(\sum_{i=1}^m A_{ij} a_i\right) x_j \]

The partial derivative with respect to \(x_k\): \[ \frac{\partial g}{\partial x_k} = \sum_{i=1}^m A_{ik} a_i = (A^T \mathbf{a})_k \]

Therefore \(\nabla_\mathbf{x} g = A^T \mathbf{a}\). In particular, if \(g(\mathbf{x}) = \mathbf{e}_j^T A \mathbf{x} = (A\mathbf{x})_j\), then \(\nabla_\mathbf{x}(A\mathbf{x})_j = A^T \mathbf{e}_j\), the \(j\)-th column of \(A^T\) (i.e., the \(j\)-th row of \(A\)). \(\square\)

Tip

Where it’s used: Foundation of all matrix calculus. Essential for deriving gradient descent updates, normal equations, and variational problems.
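The closed-form gradient \(A^T\mathbf{a}\) can be confirmed against central finite differences, which are exact (up to rounding) for a linear function. A NumPy sketch with arbitrary shapes:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((3, 4))
a = rng.standard_normal(3)
x = rng.standard_normal(4)

def g(x):
    return a @ A @ x   # scalar: a^T A x

# central finite-difference gradient vs the closed form A^T a
eps = 1e-6
grad_fd = np.array([(g(x + eps * np.eye(4)[k]) - g(x - eps * np.eye(4)[k])) / (2 * eps)
                    for k in range(4)])
grad_closed = A.T @ a
assert np.allclose(grad_fd, grad_closed, atol=1e-6)
```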


3.2 Proof 8: \(\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^T A \mathbf{x}) = 2A\mathbf{x}\) (for symmetric \(A\))

Theorem. For symmetric \(A \in \mathbb{R}^{n \times n}\) and \(\mathbf{x} \in \mathbb{R}^n\): \[ \frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^T A \mathbf{x}) = 2A\mathbf{x} \]

Proof.

Write \(f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x} = \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j\).

Taking the partial derivative with respect to \(x_k\): \[ \frac{\partial f}{\partial x_k} = \sum_{j=1}^n A_{kj} x_j + \sum_{i=1}^n A_{ik} x_i \]

The first sum is \((A\mathbf{x})_k\) and the second sum is \((A^T \mathbf{x})_k\). Using \(A^T = A\) (symmetry): \[ \frac{\partial f}{\partial x_k} = (A\mathbf{x})_k + (A\mathbf{x})_k = 2(A\mathbf{x})_k \]

In vector form: \(\nabla_\mathbf{x}(\mathbf{x}^T A \mathbf{x}) = 2A\mathbf{x}\). \(\square\)

Without symmetry: When \(A\) is not symmetric, \(\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^T A \mathbf{x}) = (A + A^T)\mathbf{x}\).

Tip

Where it’s used: Deriving OLS (the SSR is a quadratic form); quadratic programming; Mahalanobis distance derivatives.
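Both the symmetric case (\(2A\mathbf{x}\)) and the general case (\((A + A^T)\mathbf{x}\)) can be checked with finite differences, which are exact for quadratics up to rounding (a NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(7)
M = rng.standard_normal((4, 4))      # generic, non-symmetric
A_sym = (M + M.T) / 2                # symmetrized
x = rng.standard_normal(4)

def f(x, A):
    return x @ A @ x   # quadratic form x^T A x

eps = 1e-6
def fd_grad(A):
    return np.array([(f(x + eps * np.eye(4)[k], A) - f(x - eps * np.eye(4)[k], A)) / (2 * eps)
                     for k in range(4)])

assert np.allclose(fd_grad(A_sym), 2 * A_sym @ x, atol=1e-5)   # symmetric: 2Ax
assert np.allclose(fd_grad(M), (M + M.T) @ x, atol=1e-5)       # general: (A + A^T)x
```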


3.3 Proof 9: OLS Normal Equations

Theorem. The OLS estimator \(\hat{\beta}\) minimizes \(S(\beta) = (y - X\beta)^T(y - X\beta)\) and satisfies: \[ X^T X \hat{\beta} = X^T y \]

When \(X^TX\) is invertible: \(\hat{\beta} = (X^TX)^{-1}X^Ty\).

Proof.

Expand \(S(\beta)\): \[ S(\beta) = (y - X\beta)^T(y - X\beta) = y^Ty - y^TX\beta - \beta^TX^Ty + \beta^TX^TX\beta \]

Note \(y^TX\beta = (y^TX\beta)^T = \beta^TX^Ty\) (scalar equals its own transpose), so: \[ S(\beta) = y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta \]

Take the gradient with respect to \(\beta\) using results from Proofs 7 and 8. Note \(X^TX\) is symmetric: \[ \frac{\partial S}{\partial \beta} = -2X^Ty + 2X^TX\beta \]

Set equal to zero (first-order condition): \[ -2X^Ty + 2X^TX\beta = 0 \implies X^TX\beta = X^Ty \]

Second-order condition: \(\frac{\partial^2 S}{\partial \beta \partial \beta^T} = 2X^TX\), which is positive semidefinite. If \(X\) has full column rank (rank \(k\)), then \(X^TX\) is positive definite, confirming a global minimum.

Solution: Premultiply by \((X^TX)^{-1}\): \[ \hat{\beta} = (X^TX)^{-1}X^Ty \quad \square \]

Tip

Where it’s used: Foundation of all linear regression theory. The normal equations appear in every econometrics textbook and are the basis for understanding IV, GLS, and 2SLS as generalizations.
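The derivation can be checked numerically: solving the normal equations should zero out the first-order condition and reproduce the least-squares solution. A sketch with simulated data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 100, 3
X = rng.standard_normal((n, k))
y = rng.standard_normal(n)

# solve the normal equations X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# first-order condition: X^T (y - X beta_hat) = 0
assert np.allclose(X.T @ (y - X @ beta_hat), 0, atol=1e-8)

# agrees with the dedicated least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```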


4 C. Probability and Statistics Proofs

4.1 Proof 10: \(E[\bar{X}] = \mu\)

Theorem. Let \(X_1, \ldots, X_n\) be i.i.d. with \(E[X_i] = \mu\). Then \(E[\bar{X}] = \mu\).

Proof.

By linearity of expectation: \[ E[\bar{X}] = E\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n}\sum_{i=1}^n E[X_i] = \frac{1}{n}\sum_{i=1}^n \mu = \frac{n\mu}{n} = \mu \quad \square \]

Tip

Where it’s used: Establishing unbiasedness of the sample mean; basis for justifying MOM estimators.
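Unbiasedness is a statement about the average over repeated samples, which a small Monte Carlo experiment illustrates (a sketch; the distribution, sample size, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
mu, n, reps = 2.0, 50, 20_000
samples = rng.normal(loc=mu, scale=1.0, size=(reps, n))
xbars = samples.mean(axis=1)   # one sample mean per replication

# the average of many sample means should be very close to mu
assert abs(xbars.mean() - mu) < 0.01
```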


4.2 Proof 11: Unbiasedness of Sample Variance

Theorem. Let \(X_1, \ldots, X_n\) be i.i.d. with \(E[X_i] = \mu\) and \(\text{Var}(X_i) = \sigma^2\). Define: \[ S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2 \]

Then \(E[S^2] = \sigma^2\).

Proof.

First, note the identity: \[ \sum_{i=1}^n (X_i - \bar{X})^2 = \sum_{i=1}^n X_i^2 - n\bar{X}^2 \]

Derivation: \(\sum(X_i - \bar{X})^2 = \sum X_i^2 - 2\bar{X}\sum X_i + n\bar{X}^2 = \sum X_i^2 - 2n\bar{X}^2 + n\bar{X}^2 = \sum X_i^2 - n\bar{X}^2\).

Now take expectations. Using \(E[X_i^2] = \text{Var}(X_i) + (E[X_i])^2 = \sigma^2 + \mu^2\): \[ E\left[\sum_{i=1}^n X_i^2\right] = n(\sigma^2 + \mu^2) \]

For \(\bar{X}\): \(E[\bar{X}] = \mu\) and \(\text{Var}(\bar{X}) = \sigma^2/n\), so \(E[\bar{X}^2] = \text{Var}(\bar{X}) + (E[\bar{X}])^2 = \sigma^2/n + \mu^2\).

Thus: \[ E\left[\sum_{i=1}^n (X_i-\bar{X})^2\right] = n(\sigma^2+\mu^2) - n(\sigma^2/n + \mu^2) = n\sigma^2 + n\mu^2 - \sigma^2 - n\mu^2 = (n-1)\sigma^2 \]

Therefore: \[ E[S^2] = \frac{1}{n-1}E\left[\sum_{i=1}^n(X_i-\bar{X})^2\right] = \frac{(n-1)\sigma^2}{n-1} = \sigma^2 \quad \square \]

Tip

Where it’s used: Justifies dividing by \(n-1\) in the sample variance formula; motivation for the \(n-k\) denominator in OLS residual variance \(\hat{\sigma}^2\).
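The \(n-1\) versus \(n\) distinction shows up cleanly in simulation: averaging the two estimators over many replications recovers \(\sigma^2\) and \(\frac{n-1}{n}\sigma^2\) respectively (a sketch with illustrative parameters; NumPy's `ddof` argument controls the denominator):

```python
import numpy as np

rng = np.random.default_rng(10)
sigma2, n, reps = 4.0, 10, 50_000
samples = rng.normal(loc=0.0, scale=2.0, size=(reps, n))

s2_unbiased = samples.var(axis=1, ddof=1)  # divide by n-1
s2_biased = samples.var(axis=1, ddof=0)    # divide by n

assert abs(s2_unbiased.mean() - sigma2) < 0.05
# the ddof=0 estimator is biased downward by the factor (n-1)/n
assert abs(s2_biased.mean() - sigma2 * (n - 1) / n) < 0.05
```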


4.3 Proof 12: Cauchy-Schwarz Inequality

Theorem. For random variables \(X, Y\) with finite second moments: \[ |E[XY]|^2 \leq E[X^2]\, E[Y^2] \]

Proof.

For any \(t \in \mathbb{R}\), since \((X + tY)^2 \geq 0\) pointwise: \[ E[(X + tY)^2] \geq 0 \]

Expanding: \[ E[X^2] + 2t\,E[XY] + t^2\,E[Y^2] \geq 0 \]

This is a quadratic in \(t\): \(at^2 + bt + c \geq 0\) where \(a = E[Y^2]\), \(b = 2E[XY]\), \(c = E[X^2]\).

A quadratic \(at^2 + bt + c \geq 0\) for all \(t\) iff its discriminant \(b^2 - 4ac \leq 0\) (assuming \(a > 0\); if \(a = E[Y^2] = 0\), then \(Y = 0\) a.s. and the inequality holds trivially): \[ 4(E[XY])^2 - 4\,E[Y^2]\,E[X^2] \leq 0 \implies |E[XY]|^2 \leq E[X^2]\,E[Y^2] \quad \square \]

Tip

Where it’s used: Bounding covariances; proving correlation \(|\rho| \leq 1\); proving that \(\text{Cov}(X,Y)^2 \leq \text{Var}(X)\text{Var}(Y)\); information-theoretic bounds.
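The same argument applies to the empirical distribution of any sample, so the inequality holds exactly for sample moments, with equality when one variable is a scalar multiple of the other (a NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.standard_normal(1000)
y = rng.standard_normal(1000)

# (mean of xy)^2 <= (mean of x^2)(mean of y^2), exactly
lhs = np.mean(x * y) ** 2
rhs = np.mean(x ** 2) * np.mean(y ** 2)
assert lhs <= rhs

# equality when y is a scalar multiple of x
lhs_eq = np.mean(x * (3 * x)) ** 2
rhs_eq = np.mean(x ** 2) * np.mean((3 * x) ** 2)
assert np.isclose(lhs_eq, rhs_eq)
```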


4.4 Proof 13: Jensen’s Inequality

Theorem. Let \(\phi: \mathbb{R} \to \mathbb{R}\) be a convex function and \(X\) a random variable with \(E[|X|] < \infty\). Then: \[ \phi(E[X]) \leq E[\phi(X)] \]

Proof (supporting hyperplane argument).

Since \(\phi\) is convex, for any point \(\mu = E[X]\), there exists a supporting hyperplane at \(\mu\): a constant \(c\) such that: \[ \phi(x) \geq \phi(\mu) + c(x - \mu) \quad \forall x \in \mathbb{R} \]

(For differentiable \(\phi\), take \(c = \phi'(\mu)\); convexity guarantees such a \(c\) always exists.)

Applying this inequality with \(x = X\) (random): \[ \phi(X) \geq \phi(\mu) + c(X - \mu) \]

Taking expectations of both sides: \[ E[\phi(X)] \geq E[\phi(\mu) + c(X - \mu)] = \phi(\mu) + c\underbrace{(E[X] - \mu)}_{=0} = \phi(\mu) = \phi(E[X]) \quad \square \]

Tip

Where it’s used: Information theory (proof of KL divergence \(\geq 0\)); EM algorithm derivation (ELBO); proving that geometric mean \(\leq\) arithmetic mean; risk bounds in learning theory.
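Applied to the empirical distribution of a sample, Jensen's inequality says \(\phi(\bar{x}) \leq \overline{\phi(x)}\) for any convex \(\phi\), which holds exactly (a NumPy sketch with a few standard convex functions):

```python
import numpy as np

rng = np.random.default_rng(12)
x = rng.standard_normal(1000)

# phi(mean(x)) <= mean(phi(x)) for convex phi, on any sample
for phi in (np.exp, np.square, np.abs):
    assert phi(x.mean()) <= phi(x).mean() + 1e-12
```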


4.5 Proof 14: KL Divergence \(\geq 0\) (Gibbs’ Inequality)

Theorem. For distributions \(P\) and \(Q\) with \(P \ll Q\) (P absolutely continuous w.r.t. Q): \[ D_{KL}(P \| Q) = E_P\left[\log\frac{p(X)}{q(X)}\right] \geq 0 \] with equality iff \(P = Q\) a.e.

Proof (via Jensen’s inequality).

\[ -D_{KL}(P \| Q) = E_P\left[\log\frac{q(X)}{p(X)}\right] \]

Since \(\log\) is concave (equivalently, \(-\log\) is convex), apply Jensen’s inequality with \(\phi = -\log\) (convex): \[ E_P\left[\log\frac{q(X)}{p(X)}\right] \leq \log E_P\left[\frac{q(X)}{p(X)}\right] \]

Compute the right side: \[ E_P\left[\frac{q(X)}{p(X)}\right] = \int \frac{q(x)}{p(x)} p(x)\,dx = \int q(x)\,dx = 1 \]

Therefore: \[ -D_{KL}(P\|Q) \leq \log(1) = 0 \implies D_{KL}(P\|Q) \geq 0 \quad \square \]

Equality holds iff \(q(x)/p(x)\) is constant a.e.; since both densities integrate to 1, that constant must be 1, giving \(P = Q\).

Tip

Where it’s used: Variational inference (ELBO maximization); proving KL is a valid divergence measure; information theory (data processing inequality, channel capacity); EM algorithm.
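For discrete distributions the result is straightforward to verify directly; the check below also confirms that KL is not symmetric (a sketch; the two distributions are arbitrary illustrations):

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions (assumes q > 0 where p > 0)."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])

assert kl(p, q) >= 0                        # Gibbs' inequality
assert np.isclose(kl(p, p), 0.0)            # equality iff P = Q
assert not np.isclose(kl(p, q), kl(q, p))   # KL is not symmetric
```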


4.6 Proof 15: Unbiasedness of OLS

Theorem. Under the linear model \(y = X\beta + \varepsilon\) with strict exogeneity \(E[\varepsilon | X] = 0\), the OLS estimator is unbiased: \[ E[\hat{\beta} | X] = \beta \]

Proof.

Substitute \(y = X\beta + \varepsilon\) into \(\hat{\beta} = (X^TX)^{-1}X^Ty\): \[ \hat{\beta} = (X^TX)^{-1}X^T(X\beta + \varepsilon) = (X^TX)^{-1}X^TX\beta + (X^TX)^{-1}X^T\varepsilon = \beta + (X^TX)^{-1}X^T\varepsilon \]

Taking conditional expectation given \(X\): \[ E[\hat{\beta} | X] = \beta + (X^TX)^{-1}X^T \underbrace{E[\varepsilon | X]}_{= 0} = \beta \quad \square \]

Tip

Where it’s used: Fundamental justification for OLS. Strict exogeneity (\(E[\varepsilon_i|X] = 0\), not just \(E[\varepsilon_i|x_i] = 0\)) is the key assumption — violated in dynamic panels and endogenous regressors.
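A Monte Carlo illustration: with a fixed design and mean-zero errors, the average of \(\hat{\beta}\) over many replications recovers the true \(\beta\) (a sketch; sample size, coefficients, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(13)
n, reps = 50, 5000
beta = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((n, 3))   # fixed design across replications

estimates = np.empty((reps, 3))
for r in range(reps):
    eps = rng.standard_normal(n)  # E[eps | X] = 0
    y = X @ beta + eps
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)

# the Monte Carlo average of beta_hat should be close to the true beta
assert np.allclose(estimates.mean(axis=0), beta, atol=0.05)
```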


4.7 Proof 16: Gauss-Markov Theorem (Sketch)

Theorem. Under classical OLS assumptions (linearity, strict exogeneity, no multicollinearity, homoskedasticity and no autocorrelation: \(\text{Var}(\varepsilon | X) = \sigma^2 I\)), the OLS estimator \(\hat{\beta} = (X^TX)^{-1}X^Ty\) is BLUE: the Best Linear Unbiased Estimator.

Proof (sketch).

Consider any linear estimator \(\tilde{\beta} = Cy\) where \(C\) is a \(k \times n\) matrix (possibly depending on \(X\) but not \(y\)). For unbiasedness: \[ E[\tilde{\beta}|X] = CE[y|X] = CX\beta \stackrel{!}{=} \beta \quad \forall \beta \] requires \(CX = I_k\).

Write \(C = (X^TX)^{-1}X^T + D\), i.e., define \(D = C - (X^TX)^{-1}X^T\). The unbiasedness condition \(CX = I\) then forces \(DX = 0\).

The variance of \(\tilde{\beta}\): \[ \text{Var}(\tilde{\beta}|X) = C\,\text{Var}(y|X)\,C^T = \sigma^2 CC^T \]

Expanding \(CC^T\): \[ CC^T = \left[(X^TX)^{-1}X^T + D\right]\left[X(X^TX)^{-1} + D^T\right] \] \[ = (X^TX)^{-1} + (X^TX)^{-1}X^TD^T + DX(X^TX)^{-1} + DD^T \]

Since \(DX = 0\), the middle two terms vanish: \[ CC^T = (X^TX)^{-1} + DD^T \]

Therefore: \[ \text{Var}(\tilde{\beta}|X) = \sigma^2(X^TX)^{-1} + \sigma^2 DD^T = \text{Var}(\hat{\beta}_{OLS}|X) + \sigma^2 DD^T \]

Since \(DD^T\) is positive semidefinite (\(\mathbf{v}^T DD^T \mathbf{v} = \|D^T\mathbf{v}\|^2 \geq 0\)): \[ \text{Var}(\tilde{\beta}|X) \succeq \text{Var}(\hat{\beta}_{OLS}|X) \quad \square \]

Tip

Where it’s used: Fundamental justification for using OLS when classical assumptions hold. The theorem breaks down under heteroskedasticity (use FGLS/HC standard errors) or autocorrelation (use HAC/GLS).
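The variance decomposition in the proof can be verified numerically: construct any \(D\) with \(DX = 0\) by projecting onto the orthogonal complement of \(\text{col}(X)\), and confirm that the competing estimator's variance exceeds the OLS variance by a positive semidefinite term (a NumPy sketch with an arbitrary design matrix):

```python
import numpy as np

rng = np.random.default_rng(14)
n, k = 30, 3
X = rng.standard_normal((n, k))

C_ols = np.linalg.inv(X.T @ X) @ X.T

# build another linear unbiased estimator C = C_ols + D with DX = 0
P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection onto col(X)
D = rng.standard_normal((k, n)) @ (np.eye(n) - P)  # rows orthogonal to col(X)
C = C_ols + D
assert np.allclose(C @ X, np.eye(k))   # still unbiased: CX = I

# Var(beta_tilde) - Var(beta_ols) = sigma^2 * D D^T, which is PSD
diff = C @ C.T - C_ols @ C_ols.T
assert np.allclose(diff, D @ D.T)                   # cross terms vanish
assert np.all(np.linalg.eigvalsh(diff) >= -1e-10)   # positive semidefinite
```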

