Common Mathematical Proofs

Frequently Referenced Proofs

1 Overview

This page collects proofs for results that are frequently cited in econometrics and ML textbooks without full derivation. Each proof is presented with a theorem statement, step-by-step derivation, and a note on where the result is used.


2 A. Linear Algebra Proofs

2.1 Proof 1: \((AB)^T = B^T A^T\)

Theorem. For matrices \(A \in \mathbb{R}^{m \times n}\) and \(B \in \mathbb{R}^{n \times p}\): \[ (AB)^T = B^T A^T \]

Proof.

Let \(C = AB\), so \(C \in \mathbb{R}^{m \times p}\) with entries: \[ C_{ij} = (AB)_{ij} = \sum_{k=1}^n A_{ik} B_{kj} \]

The transpose \(C^T\) has entries: \[ (C^T)_{ij} = C_{ji} = \sum_{k=1}^n A_{jk} B_{ki} \]

Now compute \((B^T A^T)_{ij}\). We have \((B^T)_{ik} = B_{ki}\) and \((A^T)_{kj} = A_{jk}\), so: \[ (B^T A^T)_{ij} = \sum_{k=1}^n (B^T)_{ik} (A^T)_{kj} = \sum_{k=1}^n B_{ki} A_{jk} = \sum_{k=1}^n A_{jk} B_{ki} \]

Since \((C^T)_{ij} = (B^T A^T)_{ij}\) for all \(i, j\), we conclude \((AB)^T = B^T A^T\). \(\square\)

Tip

Where it’s used: Constantly. Transposing products appears in OLS (\(\hat{\beta} = (X^TX)^{-1}X^Ty\)), matrix calculus derivations, and any manipulation of quadratic forms.
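The identity is easy to spot-check numerically. A minimal NumPy sketch (the shapes and seed are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # A in R^{3x4}
B = rng.standard_normal((4, 2))   # B in R^{4x2}

lhs = (A @ B).T    # transpose of the product
rhs = B.T @ A.T    # product of transposes, in reversed order
assert np.allclose(lhs, rhs)
assert lhs.shape == (2, 3)
```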


2.2 Proof 2: \((AB)^{-1} = B^{-1} A^{-1}\)

Theorem. For invertible matrices \(A, B \in \mathbb{R}^{n \times n}\): \[ (AB)^{-1} = B^{-1} A^{-1} \]

Proof.

We show that \(B^{-1}A^{-1}\) is the inverse of \(AB\) by verifying both left and right inverse conditions.

Right inverse: \[ (AB)(B^{-1}A^{-1}) = A(BB^{-1})A^{-1} = A \cdot I \cdot A^{-1} = AA^{-1} = I \]

Left inverse: \[ (B^{-1}A^{-1})(AB) = B^{-1}(A^{-1}A)B = B^{-1} \cdot I \cdot B = B^{-1}B = I \]

Since both products equal \(I\), \(B^{-1}A^{-1}\) is the (unique) inverse of \(AB\). \(\square\)

Tip

Where it’s used: Manipulating the OLS formula, GLS, and any expression involving inverse of a matrix product. Note the reversal of order — a common source of errors.
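The order reversal can be verified numerically, along with the fact that the "wrong" order generally fails. A NumPy sketch (random square matrices are invertible with probability 1):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

lhs = np.linalg.inv(A @ B)
rhs = np.linalg.inv(B) @ np.linalg.inv(A)    # note the reversed order
wrong = np.linalg.inv(A) @ np.linalg.inv(B)  # the common mistake

assert np.allclose(lhs, rhs)
assert not np.allclose(lhs, wrong)  # fails unless A and B happen to commute
```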


2.3 Proof 3: \(\text{tr}(AB) = \text{tr}(BA)\)

Theorem. For \(A \in \mathbb{R}^{m \times n}\) and \(B \in \mathbb{R}^{n \times m}\): \[ \text{tr}(AB) = \text{tr}(BA) \]

Proof.

By definition, \(\text{tr}(AB) = \sum_{i=1}^m (AB)_{ii}\). Expanding: \[ \text{tr}(AB) = \sum_{i=1}^m (AB)_{ii} = \sum_{i=1}^m \sum_{k=1}^n A_{ik} B_{ki} \]

Rearranging the sum (finite sums commute): \[ = \sum_{k=1}^n \sum_{i=1}^m B_{ki} A_{ik} = \sum_{k=1}^n (BA)_{kk} = \text{tr}(BA) \]

\(\square\)

Tip

Where it’s used: Matrix calculus (deriving \(\frac{\partial}{\partial X}\text{tr}(AX) = A^T\)), simplifying likelihood expressions, information geometry. The cyclic property extends: \(\text{tr}(ABC) = \text{tr}(CAB) = \text{tr}(BCA)\).
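Both the basic identity and the cyclic extension can be checked numerically (a NumPy sketch with illustrative shapes):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 3))

# tr(AB) = tr(BA), even though AB is 3x3 and BA is 5x5
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# cyclic property with three factors: tr(CAB) = tr(BCA)
C = rng.standard_normal((3, 3))
t1 = np.trace(C @ A @ B)
t2 = np.trace(B @ C @ A)
assert np.isclose(t1, t2)
```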


2.4 Proof 4: \(\det(AB) = \det(A)\det(B)\)

Theorem. For \(A, B \in \mathbb{R}^{n \times n}\): \[ \det(AB) = \det(A)\det(B) \]

Proof (via elementary matrices).

Every invertible matrix can be expressed as a product of elementary matrices \(E_1, \ldots, E_r\) (row operations). The determinant of an elementary matrix is a simple scalar, and \(\det(E_i A) = \det(E_i)\det(A)\) holds by direct computation for each type of row operation. By induction, \(\det(AB) = \det(A)\det(B)\) for invertible \(A\).

For singular \(A\): if \(\det(A) = 0\), the columns of \(A\) are linearly dependent, so \(\text{rank}(A) < n\). Since each column of \(AB\) is a linear combination of the columns of \(A\), we have \(\text{rank}(AB) \leq \text{rank}(A) < n\), so \(AB\) is also singular and \(\det(AB) = 0 = 0 \cdot \det(B) = \det(A)\det(B)\).

Geometric intuition: \(\det(A)\) measures the volume scaling factor of the linear map \(A\). Composing two linear maps scales volume by \(\det(A) \cdot \det(B)\). \(\square\)

Tip

Where it’s used: Likelihood calculations involving multivariate normal densities (which include \(\det(\Sigma)\)); Jacobian of change-of-variables; matrix determinant lemma.
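A numerical check covering both the generic and the singular case (a NumPy sketch; the matrices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))

# singular case: det(AB) = 0 when det(A) = 0
S = A.copy()
S[:, 0] = S[:, 1]   # force linearly dependent columns
assert np.isclose(np.linalg.det(S @ B), 0.0, atol=1e-8)
```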


2.5 Proof 5: \(\text{tr}(A) = \sum_i \lambda_i\)

Theorem. The trace of a square matrix equals the sum of its eigenvalues (counted with multiplicity).

Proof.

The characteristic polynomial of \(A \in \mathbb{R}^{n \times n}\) is: \[ p(\lambda) = \det(\lambda I - A) \]

The eigenvalues \(\lambda_1, \ldots, \lambda_n\) are the roots, so: \[ p(\lambda) = (\lambda - \lambda_1)(\lambda - \lambda_2)\cdots(\lambda - \lambda_n) \]

Expanding the right side, the coefficient of \(\lambda^{n-1}\) is \(-(\lambda_1 + \lambda_2 + \cdots + \lambda_n)\).

On the left side, \(\det(\lambda I - A)\) expands as follows. The only term contributing to \(\lambda^{n-1}\) comes from the product of diagonal entries \((\lambda - a_{11})(\lambda - a_{22})\cdots(\lambda - a_{nn})\): any permutation term in the Leibniz expansion that uses an off-diagonal entry omits at least two diagonal factors and so has degree at most \(n-2\). The diagonal product contributes \(-(\sum_i a_{ii})\lambda^{n-1} = -\text{tr}(A)\lambda^{n-1}\).

Matching coefficients of \(\lambda^{n-1}\): \[ -\text{tr}(A) = -(\lambda_1 + \cdots + \lambda_n) \implies \text{tr}(A) = \sum_{i=1}^n \lambda_i \]

\(\square\)

Tip

Where it’s used: Checking computations, understanding the trace as an eigenvalue-based quantity, PCA (variance explained = sum of eigenvalues of \(\Sigma\)).
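A quick numerical check (a NumPy sketch; note the eigenvalues of a general real matrix are complex, but their imaginary parts cancel in conjugate pairs):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 5))
eigvals = np.linalg.eigvals(A)   # complex in general

assert np.isclose(np.trace(A), eigvals.sum().real)
assert abs(eigvals.sum().imag) < 1e-10  # imaginary parts cancel
```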


2.6 Proof 6: \(\det(A) = \prod_i \lambda_i\)

Theorem. The determinant of a square matrix equals the product of its eigenvalues.

Proof.

Using the characteristic polynomial from Proof 5, evaluate \(p(\lambda)\) at \(\lambda = 0\): \[ p(0) = \det(0 \cdot I - A) = \det(-A) = (-1)^n \det(A) \]

From the factored form: \[ p(0) = (0 - \lambda_1)(0 - \lambda_2)\cdots(0-\lambda_n) = (-1)^n \lambda_1 \lambda_2 \cdots \lambda_n \]

Therefore: \[ (-1)^n \det(A) = (-1)^n \prod_i \lambda_i \implies \det(A) = \prod_{i=1}^n \lambda_i \]

\(\square\)

Tip

Where it’s used: A matrix is invertible iff \(\det(A) \neq 0\) iff all eigenvalues are nonzero. Used to check positive definiteness (\(\det(A) > 0\) is necessary but not sufficient — need all \(\lambda_i > 0\)).
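The companion check to Proof 5, again with a NumPy sketch (the imaginary part of the product vanishes for a real matrix):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((5, 5))
eigvals = np.linalg.eigvals(A)

assert np.isclose(np.linalg.det(A), np.prod(eigvals).real)
```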


3 B. Calculus and Optimization Proofs

3.1 Proof 7: \(\frac{\partial}{\partial \mathbf{x}}(A\mathbf{x}) = A^T\)

Theorem. For \(A \in \mathbb{R}^{m \times n}\) and \(\mathbf{x} \in \mathbb{R}^n\), let \(f(\mathbf{x}) = A\mathbf{x}\). Then, in denominator-layout notation: \[ \frac{\partial f}{\partial \mathbf{x}} = A^T \] Equivalently, \(\frac{\partial (A\mathbf{x})_i}{\partial x_j} = A_{ij}\), so the Jacobian of \(A\mathbf{x}\) is \(A\), and the gradient of the scalar \(\mathbf{a}^T A \mathbf{x}\) with respect to \(\mathbf{x}\) is \(A^T \mathbf{a}\).

Proof.

Consider the scalar function \(g(\mathbf{x}) = \mathbf{a}^T A \mathbf{x}\) where \(\mathbf{a} \in \mathbb{R}^m\) is fixed.

\[ g(\mathbf{x}) = \mathbf{a}^T A \mathbf{x} = \sum_{i=1}^m a_i (A\mathbf{x})_i = \sum_{i=1}^m a_i \sum_{j=1}^n A_{ij} x_j = \sum_{j=1}^n \left(\sum_{i=1}^m A_{ij} a_i\right) x_j \]

The partial derivative with respect to \(x_k\): \[ \frac{\partial g}{\partial x_k} = \sum_{i=1}^m A_{ik} a_i = (A^T \mathbf{a})_k \]

Therefore \(\nabla_\mathbf{x} g = A^T \mathbf{a}\). In particular, if \(g(\mathbf{x}) = \mathbf{e}_j^T A \mathbf{x} = (A\mathbf{x})_j\), then \(\nabla_\mathbf{x}(A\mathbf{x})_j = A^T \mathbf{e}_j\), the \(j\)-th column of \(A^T\) (i.e., the \(j\)-th row of \(A\)). \(\square\)

Tip

Where it’s used: Foundation of all matrix calculus. Essential for deriving gradient descent updates, normal equations, and variational problems.
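The closed-form gradient \(A^T\mathbf{a}\) can be confirmed against central finite differences, which are exact (up to rounding) for a linear function. A NumPy sketch with arbitrary shapes:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((3, 4))
a = rng.standard_normal(3)
x = rng.standard_normal(4)

def g(x):
    return a @ A @ x   # scalar: a^T A x

# central finite-difference gradient vs the closed form A^T a
eps = 1e-6
grad_fd = np.array([(g(x + eps * np.eye(4)[k]) - g(x - eps * np.eye(4)[k])) / (2 * eps)
                    for k in range(4)])
grad_closed = A.T @ a
assert np.allclose(grad_fd, grad_closed, atol=1e-6)
```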


3.2 Proof 8: \(\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^T A \mathbf{x}) = 2A\mathbf{x}\) (for symmetric \(A\))

Theorem. For symmetric \(A \in \mathbb{R}^{n \times n}\) and \(\mathbf{x} \in \mathbb{R}^n\): \[ \frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^T A \mathbf{x}) = 2A\mathbf{x} \]

Proof.

Write \(f(\mathbf{x}) = \mathbf{x}^T A \mathbf{x} = \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j\).

Taking the partial derivative with respect to \(x_k\): \[ \frac{\partial f}{\partial x_k} = \sum_{j=1}^n A_{kj} x_j + \sum_{i=1}^n A_{ik} x_i \]

The first sum is \((A\mathbf{x})_k\) and the second sum is \((A^T \mathbf{x})_k\). Using \(A^T = A\) (symmetry): \[ \frac{\partial f}{\partial x_k} = (A\mathbf{x})_k + (A\mathbf{x})_k = 2(A\mathbf{x})_k \]

In vector form: \(\nabla_\mathbf{x}(\mathbf{x}^T A \mathbf{x}) = 2A\mathbf{x}\). \(\square\)

Without symmetry: When \(A\) is not symmetric, \(\frac{\partial}{\partial \mathbf{x}}(\mathbf{x}^T A \mathbf{x}) = (A + A^T)\mathbf{x}\).

Tip

Where it’s used: Deriving OLS (the SSR is a quadratic form); quadratic programming; Mahalanobis distance derivatives.
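Both the symmetric case (\(2A\mathbf{x}\)) and the general case (\((A + A^T)\mathbf{x}\)) can be checked with finite differences, which are exact for quadratics up to rounding (a NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(7)
M = rng.standard_normal((4, 4))      # generic, non-symmetric
A_sym = (M + M.T) / 2                # symmetrized
x = rng.standard_normal(4)

def f(x, A):
    return x @ A @ x   # quadratic form x^T A x

eps = 1e-6
def fd_grad(A):
    return np.array([(f(x + eps * np.eye(4)[k], A) - f(x - eps * np.eye(4)[k], A)) / (2 * eps)
                     for k in range(4)])

assert np.allclose(fd_grad(A_sym), 2 * A_sym @ x, atol=1e-5)   # symmetric: 2Ax
assert np.allclose(fd_grad(M), (M + M.T) @ x, atol=1e-5)       # general: (A + A^T)x
```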


3.3 Proof 9: OLS Normal Equations

Theorem. The OLS estimator \(\hat{\beta}\) minimizes \(S(\beta) = (y - X\beta)^T(y - X\beta)\) and satisfies: \[ X^T X \hat{\beta} = X^T y \]

When \(X^TX\) is invertible: \(\hat{\beta} = (X^TX)^{-1}X^Ty\).

Proof.

Expand \(S(\beta)\): \[ S(\beta) = (y - X\beta)^T(y - X\beta) = y^Ty - y^TX\beta - \beta^TX^Ty + \beta^TX^TX\beta \]

Note \(y^TX\beta = (y^TX\beta)^T = \beta^TX^Ty\) (scalar equals its own transpose), so: \[ S(\beta) = y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta \]

Take the gradient with respect to \(\beta\) using results from Proofs 7 and 8. Note \(X^TX\) is symmetric: \[ \frac{\partial S}{\partial \beta} = -2X^Ty + 2X^TX\beta \]

Set equal to zero (first-order condition): \[ -2X^Ty + 2X^TX\beta = 0 \implies X^TX\beta = X^Ty \]

Second-order condition: \(\frac{\partial^2 S}{\partial \beta \partial \beta^T} = 2X^TX\), which is positive semidefinite. If \(X\) has full column rank (rank \(k\)), then \(X^TX\) is positive definite, confirming a global minimum.

Solution: Premultiply by \((X^TX)^{-1}\): \[ \hat{\beta} = (X^TX)^{-1}X^Ty \quad \square \]

Tip

Where it’s used: Foundation of all linear regression theory. The normal equations appear in every econometrics textbook and are the basis for understanding IV, GLS, and 2SLS as generalizations.
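The derivation can be checked numerically: solving the normal equations should zero out the first-order condition and reproduce the least-squares solution. A sketch with simulated data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 100, 3
X = rng.standard_normal((n, k))
y = rng.standard_normal(n)

# solve the normal equations X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# first-order condition: X^T (y - X beta_hat) = 0
assert np.allclose(X.T @ (y - X @ beta_hat), 0, atol=1e-8)

# agrees with the dedicated least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```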


4 C. Probability and Statistics Proofs

4.1 Proof 10: \(E[\bar{X}] = \mu\)

Theorem. Let \(X_1, \ldots, X_n\) be i.i.d. with \(E[X_i] = \mu\). Then \(E[\bar{X}] = \mu\).

Proof.

By linearity of expectation: \[ E[\bar{X}] = E\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n}\sum_{i=1}^n E[X_i] = \frac{1}{n}\sum_{i=1}^n \mu = \frac{n\mu}{n} = \mu \quad \square \]

Tip

Where it’s used: Establishing unbiasedness of the sample mean; basis for justifying MOM estimators.
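Unbiasedness is a statement about the average over repeated samples, which a small Monte Carlo experiment illustrates (a sketch; the distribution, sample size, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
mu, n, reps = 2.0, 50, 20_000
samples = rng.normal(loc=mu, scale=1.0, size=(reps, n))
xbars = samples.mean(axis=1)   # one sample mean per replication

# the average of many sample means should be very close to mu
assert abs(xbars.mean() - mu) < 0.01
```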


4.2 Proof 11: Unbiasedness of Sample Variance

Theorem. Let \(X_1, \ldots, X_n\) be i.i.d. with \(E[X_i] = \mu\) and \(\text{Var}(X_i) = \sigma^2\). Define: \[ S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2 \]

Then \(E[S^2] = \sigma^2\).

Proof.

First, note the identity: \[ \sum_{i=1}^n (X_i - \bar{X})^2 = \sum_{i=1}^n X_i^2 - n\bar{X}^2 \]

Derivation: \(\sum(X_i - \bar{X})^2 = \sum X_i^2 - 2\bar{X}\sum X_i + n\bar{X}^2 = \sum X_i^2 - 2n\bar{X}^2 + n\bar{X}^2 = \sum X_i^2 - n\bar{X}^2\).

Now take expectations. Using \(E[X_i^2] = \text{Var}(X_i) + (E[X_i])^2 = \sigma^2 + \mu^2\): \[ E\left[\sum_{i=1}^n X_i^2\right] = n(\sigma^2 + \mu^2) \]

For \(\bar{X}\): \(E[\bar{X}] = \mu\) and \(\text{Var}(\bar{X}) = \sigma^2/n\), so \(E[\bar{X}^2] = \text{Var}(\bar{X}) + (E[\bar{X}])^2 = \sigma^2/n + \mu^2\).

Thus: \[ E\left[\sum_{i=1}^n (X_i-\bar{X})^2\right] = n(\sigma^2+\mu^2) - n(\sigma^2/n + \mu^2) = n\sigma^2 + n\mu^2 - \sigma^2 - n\mu^2 = (n-1)\sigma^2 \]

Therefore: \[ E[S^2] = \frac{1}{n-1}E\left[\sum_{i=1}^n(X_i-\bar{X})^2\right] = \frac{(n-1)\sigma^2}{n-1} = \sigma^2 \quad \square \]

Tip

Where it’s used: Justifies dividing by \(n-1\) in the sample variance formula; motivation for the \(n-k\) denominator in OLS residual variance \(\hat{\sigma}^2\).
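The \(n-1\) versus \(n\) distinction shows up cleanly in simulation: averaging the two estimators over many replications recovers \(\sigma^2\) and \(\frac{n-1}{n}\sigma^2\) respectively (a sketch with illustrative parameters; NumPy's `ddof` argument controls the denominator):

```python
import numpy as np

rng = np.random.default_rng(10)
sigma2, n, reps = 4.0, 10, 50_000
samples = rng.normal(loc=0.0, scale=2.0, size=(reps, n))

s2_unbiased = samples.var(axis=1, ddof=1)  # divide by n-1
s2_biased = samples.var(axis=1, ddof=0)    # divide by n

assert abs(s2_unbiased.mean() - sigma2) < 0.05
# the ddof=0 estimator is biased downward by the factor (n-1)/n
assert abs(s2_biased.mean() - sigma2 * (n - 1) / n) < 0.05
```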


4.3 Proof 12: Cauchy-Schwarz Inequality

Theorem. For random variables \(X, Y\) with finite second moments: \[ |E[XY]|^2 \leq E[X^2]\, E[Y^2] \]

Proof.

For any \(t \in \mathbb{R}\), since \((X + tY)^2 \geq 0\) pointwise: \[ E[(X + tY)^2] \geq 0 \]

Expanding: \[ E[X^2] + 2t\,E[XY] + t^2\,E[Y^2] \geq 0 \]

This is a quadratic in \(t\): \(at^2 + bt + c \geq 0\) where \(a = E[Y^2]\), \(b = 2E[XY]\), \(c = E[X^2]\).

A quadratic \(at^2 + bt + c \geq 0\) for all \(t\) iff its discriminant \(b^2 - 4ac \leq 0\) (assuming \(a > 0\); if \(a = E[Y^2] = 0\), then \(Y = 0\) a.s. and the inequality holds trivially): \[ 4(E[XY])^2 - 4\,E[Y^2]\,E[X^2] \leq 0 \implies |E[XY]|^2 \leq E[X^2]\,E[Y^2] \quad \square \]

Tip

Where it’s used: Bounding covariances; proving correlation \(|\rho| \leq 1\); proving that \(\text{Cov}(X,Y)^2 \leq \text{Var}(X)\text{Var}(Y)\); information-theoretic bounds.
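The same argument applies to the empirical distribution of any sample, so the inequality holds exactly for sample moments, with equality when one variable is a scalar multiple of the other (a NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.standard_normal(1000)
y = rng.standard_normal(1000)

# (mean of xy)^2 <= (mean of x^2)(mean of y^2), exactly
lhs = np.mean(x * y) ** 2
rhs = np.mean(x ** 2) * np.mean(y ** 2)
assert lhs <= rhs

# equality when y is a scalar multiple of x
lhs_eq = np.mean(x * (3 * x)) ** 2
rhs_eq = np.mean(x ** 2) * np.mean((3 * x) ** 2)
assert np.isclose(lhs_eq, rhs_eq)
```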


4.4 Proof 13: Jensen’s Inequality

Theorem. Let \(\phi: \mathbb{R} \to \mathbb{R}\) be a convex function and \(X\) a random variable with \(E[|X|] < \infty\). Then: \[ \phi(E[X]) \leq E[\phi(X)] \]

Proof (supporting hyperplane argument).

Since \(\phi\) is convex, for any point \(\mu = E[X]\), there exists a supporting hyperplane at \(\mu\): a constant \(c\) such that: \[ \phi(x) \geq \phi(\mu) + c(x - \mu) \quad \forall x \in \mathbb{R} \]

(For differentiable \(\phi\), take \(c = \phi'(\mu)\); convexity guarantees such a \(c\) always exists.)

Applying this inequality with \(x = X\) (random): \[ \phi(X) \geq \phi(\mu) + c(X - \mu) \]

Taking expectations of both sides: \[ E[\phi(X)] \geq E[\phi(\mu) + c(X - \mu)] = \phi(\mu) + c\underbrace{(E[X] - \mu)}_{=0} = \phi(\mu) = \phi(E[X]) \quad \square \]

Tip

Where it’s used: Information theory (proof of KL divergence \(\geq 0\)); EM algorithm derivation (ELBO); proving that geometric mean \(\leq\) arithmetic mean; risk bounds in learning theory.
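Applied to the empirical distribution of a sample, Jensen's inequality says \(\phi(\bar{x}) \leq \overline{\phi(x)}\) for any convex \(\phi\), which holds exactly (a NumPy sketch with a few standard convex functions):

```python
import numpy as np

rng = np.random.default_rng(12)
x = rng.standard_normal(1000)

# phi(mean(x)) <= mean(phi(x)) for convex phi, on any sample
for phi in (np.exp, np.square, np.abs):
    assert phi(x.mean()) <= phi(x).mean() + 1e-12
```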


4.5 Proof 14: KL Divergence \(\geq 0\) (Gibbs’ Inequality)

Theorem. For distributions \(P\) and \(Q\) with \(P \ll Q\) (P absolutely continuous w.r.t. Q): \[ D_{KL}(P \| Q) = E_P\left[\log\frac{p(X)}{q(X)}\right] \geq 0 \] with equality iff \(P = Q\) a.e.

Proof (via Jensen’s inequality).

\[ -D_{KL}(P \| Q) = E_P\left[\log\frac{q(X)}{p(X)}\right] \]

Since \(\log\) is concave (equivalently, \(-\log\) is convex), apply Jensen’s inequality with \(\phi = -\log\) (convex): \[ E_P\left[\log\frac{q(X)}{p(X)}\right] \leq \log E_P\left[\frac{q(X)}{p(X)}\right] \]

Compute the right side: \[ E_P\left[\frac{q(X)}{p(X)}\right] = \int \frac{q(x)}{p(x)} p(x)\,dx = \int q(x)\,dx = 1 \]

Therefore: \[ -D_{KL}(P\|Q) \leq \log(1) = 0 \implies D_{KL}(P\|Q) \geq 0 \quad \square \]

Equality holds iff \(q(x)/p(x)\) is constant a.e.; since both densities integrate to 1, that constant must be 1, giving \(P = Q\).

Tip

Where it’s used: Variational inference (ELBO maximization); proving KL is a valid divergence measure; information theory (data processing inequality, channel capacity); EM algorithm.
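For discrete distributions the result is straightforward to verify directly; the check below also confirms that KL is not symmetric (a sketch; the two distributions are arbitrary illustrations):

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions (assumes q > 0 where p > 0)."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])

assert kl(p, q) >= 0                        # Gibbs' inequality
assert np.isclose(kl(p, p), 0.0)            # equality iff P = Q
assert not np.isclose(kl(p, q), kl(q, p))   # KL is not symmetric
```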


4.6 Proof 15: Unbiasedness of OLS

Theorem. Under the linear model \(y = X\beta + \varepsilon\) with strict exogeneity \(E[\varepsilon | X] = 0\), the OLS estimator is unbiased: \[ E[\hat{\beta} | X] = \beta \]

Proof.

Substitute \(y = X\beta + \varepsilon\) into \(\hat{\beta} = (X^TX)^{-1}X^Ty\): \[ \hat{\beta} = (X^TX)^{-1}X^T(X\beta + \varepsilon) = (X^TX)^{-1}X^TX\beta + (X^TX)^{-1}X^T\varepsilon = \beta + (X^TX)^{-1}X^T\varepsilon \]

Taking conditional expectation given \(X\): \[ E[\hat{\beta} | X] = \beta + (X^TX)^{-1}X^T \underbrace{E[\varepsilon | X]}_{= 0} = \beta \quad \square \]

Tip

Where it’s used: Fundamental justification for OLS. Strict exogeneity (\(E[\varepsilon_i|X] = 0\), not just \(E[\varepsilon_i|x_i] = 0\)) is the key assumption — violated in dynamic panels and endogenous regressors.
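A Monte Carlo illustration: with a fixed design and mean-zero errors, the average of \(\hat{\beta}\) over many replications recovers the true \(\beta\) (a sketch; sample size, coefficients, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(13)
n, reps = 50, 5000
beta = np.array([1.0, -2.0, 0.5])
X = rng.standard_normal((n, 3))   # fixed design across replications

estimates = np.empty((reps, 3))
for r in range(reps):
    eps = rng.standard_normal(n)  # E[eps | X] = 0
    y = X @ beta + eps
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)

# the Monte Carlo average of beta_hat should be close to the true beta
assert np.allclose(estimates.mean(axis=0), beta, atol=0.05)
```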


4.7 Proof 16: Gauss-Markov Theorem (Sketch)

Theorem. Under classical OLS assumptions (linearity, strict exogeneity, no multicollinearity, homoskedasticity and no autocorrelation: \(\text{Var}(\varepsilon | X) = \sigma^2 I\)), the OLS estimator \(\hat{\beta} = (X^TX)^{-1}X^Ty\) is BLUE: the Best Linear Unbiased Estimator.

Proof (sketch).

Consider any linear estimator \(\tilde{\beta} = Cy\) where \(C\) is a \(k \times n\) matrix (possibly depending on \(X\) but not \(y\)). For unbiasedness: \[ E[\tilde{\beta}|X] = CE[y|X] = CX\beta \stackrel{!}{=} \beta \quad \forall \beta \] requires \(CX = I_k\).

Write \(C = (X^TX)^{-1}X^T + D\), i.e., define \(D = C - (X^TX)^{-1}X^T\). The unbiasedness condition \(CX = I\) then forces \(DX = 0\).

The variance of \(\tilde{\beta}\): \[ \text{Var}(\tilde{\beta}|X) = C\,\text{Var}(y|X)\,C^T = \sigma^2 CC^T \]

Expanding \(CC^T\): \[ CC^T = \left[(X^TX)^{-1}X^T + D\right]\left[X(X^TX)^{-1} + D^T\right] \] \[ = (X^TX)^{-1} + (X^TX)^{-1}X^TD^T + DX(X^TX)^{-1} + DD^T \]

Since \(DX = 0\), the middle two terms vanish: \[ CC^T = (X^TX)^{-1} + DD^T \]

Therefore: \[ \text{Var}(\tilde{\beta}|X) = \sigma^2(X^TX)^{-1} + \sigma^2 DD^T = \text{Var}(\hat{\beta}_{OLS}|X) + \sigma^2 DD^T \]

Since \(DD^T\) is positive semidefinite (\(\mathbf{v}^T DD^T \mathbf{v} = \|D^T\mathbf{v}\|^2 \geq 0\)): \[ \text{Var}(\tilde{\beta}|X) \succeq \text{Var}(\hat{\beta}_{OLS}|X) \quad \square \]

Tip

Where it’s used: Fundamental justification for using OLS when classical assumptions hold. The theorem breaks down under heteroskedasticity (use FGLS/HC standard errors) or autocorrelation (use HAC/GLS).
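The variance decomposition in the proof can be verified numerically: construct any \(D\) with \(DX = 0\) by projecting onto the orthogonal complement of \(\text{col}(X)\), and confirm that the competing estimator's variance exceeds the OLS variance by a positive semidefinite term (a NumPy sketch with an arbitrary design matrix):

```python
import numpy as np

rng = np.random.default_rng(14)
n, k = 30, 3
X = rng.standard_normal((n, k))

C_ols = np.linalg.inv(X.T @ X) @ X.T

# build another linear unbiased estimator C = C_ols + D with DX = 0
P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection onto col(X)
D = rng.standard_normal((k, n)) @ (np.eye(n) - P)  # rows orthogonal to col(X)
C = C_ols + D
assert np.allclose(C @ X, np.eye(k))   # still unbiased: CX = I

# Var(beta_tilde) - Var(beta_ols) = sigma^2 * D D^T, which is PSD
diff = C @ C.T - C_ols @ C_ols.T
assert np.allclose(diff, D @ D.T)                   # cross terms vanish
assert np.all(np.linalg.eigvalsh(diff) >= -1e-10)   # positive semidefinite
```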

