Formula Cheat Sheets

Quick Reference for When You Forget the Formula

1 Overview

Seven sections of dense, scannable formulas. No derivations here; those live in Common Proofs. This page is for when you need a formula fast.


2 Section 1: Linear Algebra Essentials

2.1 Matrix Products

\[ (AB)_{ij} = \sum_k A_{ik} B_{kj} \]

\[ \text{Dimensions: } A \in \mathbb{R}^{m \times n},\ B \in \mathbb{R}^{n \times p} \Rightarrow AB \in \mathbb{R}^{m \times p} \]

2.2 Transpose Rules

Identity Formula
Transpose of product \((AB)^T = B^T A^T\)
Double transpose \((A^T)^T = A\)
Sum transpose \((A + B)^T = A^T + B^T\)
Scalar transpose \((\alpha A)^T = \alpha A^T\)
Triple product \((ABC)^T = C^T B^T A^T\)

2.3 Inverse Rules

Identity Formula Condition
Product inverse \((AB)^{-1} = B^{-1}A^{-1}\) Both invertible
Inverse transpose \((A^T)^{-1} = (A^{-1})^T\) \(A\) invertible
Double inverse \((A^{-1})^{-1} = A\) \(A\) invertible
Scalar inverse \((\alpha A)^{-1} = \alpha^{-1} A^{-1}\) \(\alpha \neq 0\)

2.4 2×2 Inverse

\[ \begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \]

Valid iff \(ad - bc \neq 0\).

2.5 Woodbury Matrix Identity

\[ (A + UCV)^{-1} = A^{-1} - A^{-1}U\left(C^{-1} + VA^{-1}U\right)^{-1}VA^{-1} \]

When to use: converting a large-matrix inverse into a small-matrix inverse. If \(A \in \mathbb{R}^{n\times n}\), \(U \in \mathbb{R}^{n \times k}\), \(C \in \mathbb{R}^{k \times k}\), and \(V \in \mathbb{R}^{k \times n}\) with \(k \ll n\), this replaces an \(n \times n\) inverse with a \(k \times k\) inverse. Requires \(A\) and \(C\) invertible.

Special case (Sherman-Morrison): \(U = \mathbf{u}\), \(C = 1\), \(V = \mathbf{v}^T\): \[ (A + \mathbf{u}\mathbf{v}^T)^{-1} = A^{-1} - \frac{A^{-1}\mathbf{u}\mathbf{v}^TA^{-1}}{1 + \mathbf{v}^TA^{-1}\mathbf{u}} \]

Matrix determinant lemma: \[ \det(A + \mathbf{u}\mathbf{v}^T) = (1 + \mathbf{v}^TA^{-1}\mathbf{u})\det(A) \]
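
These identities are easy to misremember; a minimal NumPy sanity check (random matrices and illustrative sizes, not part of the sheet) verifies Woodbury, Sherman-Morrison, and the determinant lemma numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))
U = rng.standard_normal((n, k))
C = np.eye(k)
V = rng.standard_normal((k, n))

Ainv = np.linalg.inv(A)
# Woodbury: only a k x k matrix, C^{-1} + V A^{-1} U, gets inverted
small = np.linalg.inv(np.linalg.inv(C) + V @ Ainv @ U)
woodbury = Ainv - Ainv @ U @ small @ V @ Ainv
direct = np.linalg.inv(A + U @ C @ V)
assert np.allclose(woodbury, direct)

# Sherman-Morrison (rank-1 update) and the matrix determinant lemma
u = rng.standard_normal(n)
v = rng.standard_normal(n)
sm = Ainv - np.outer(Ainv @ u, v @ Ainv) / (1 + v @ Ainv @ u)
assert np.allclose(sm, np.linalg.inv(A + np.outer(u, v)))
assert np.isclose(np.linalg.det(A + np.outer(u, v)),
                  (1 + v @ Ainv @ u) * np.linalg.det(A))
```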

2.6 Trace Identities

Identity Formula
Cyclic property \(\text{tr}(AB) = \text{tr}(BA)\)
Cyclic (triple) \(\text{tr}(ABC) = \text{tr}(CAB) = \text{tr}(BCA)\)
Linearity \(\text{tr}(A + B) = \text{tr}(A) + \text{tr}(B)\)
Scalar \(\text{tr}(\alpha A) = \alpha\,\text{tr}(A)\)
Transpose \(\text{tr}(A) = \text{tr}(A^T)\)
Quadratic form \(\mathbf{x}^TA\mathbf{x} = \text{tr}(\mathbf{x}^TA\mathbf{x}) = \text{tr}(A\mathbf{x}\mathbf{x}^T)\)
Frobenius inner product \(\text{tr}(A^TB) = \sum_{i,j}A_{ij}B_{ij}\)

2.7 Eigenvalue Identities

Quantity Formula
Trace \(\text{tr}(A) = \sum_{i=1}^n \lambda_i\)
Determinant \(\det(A) = \prod_{i=1}^n \lambda_i\)
Eigenvalues of \(A^{-1}\) \(1/\lambda_i\)
Eigenvalues of \(A^k\) \(\lambda_i^k\)
Eigenvalues of \(A + cI\) \(\lambda_i + c\)
Eigenvalues of \(A^T\) Same as \(A\)
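
A quick NumPy check of the first few rows on a random symmetric matrix (illustrative only; symmetry guarantees real eigenvalues):

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.standard_normal((4, 4))
A = S + S.T                      # symmetric => real spectrum
lam = np.linalg.eigvalsh(A)      # eigenvalues in ascending order

assert np.isclose(np.trace(A), lam.sum())        # tr(A) = sum of eigenvalues
assert np.isclose(np.linalg.det(A), lam.prod())  # det(A) = product of eigenvalues
# A + cI shifts every eigenvalue by c; A^{-1} inverts each eigenvalue
assert np.allclose(np.linalg.eigvalsh(A + 3 * np.eye(4)), lam + 3)
assert np.allclose(np.sort(np.linalg.eigvalsh(np.linalg.inv(A))), np.sort(1 / lam))
```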

3 Section 2: Matrix Calculus Quick Reference

Convention: \(f: \mathbb{R}^n \to \mathbb{R}\), gradient \(\nabla_\mathbf{x} f\) is a column vector.

3.1 Derivatives with Respect to a Vector \(\mathbf{x}\)

Expression Derivative \(\frac{\partial}{\partial \mathbf{x}}\)
\(\mathbf{a}^T \mathbf{x}\) \(\mathbf{a}\)
\(\mathbf{x}^T \mathbf{a}\) \(\mathbf{a}\)
\(\mathbf{x}^T \mathbf{x}\) \(2\mathbf{x}\)
\(\mathbf{x}^T A \mathbf{x}\) \((A + A^T)\mathbf{x}\)
\(\mathbf{x}^T A \mathbf{x}\) (A symmetric) \(2A\mathbf{x}\)
\(\mathbf{a}^T \mathbf{x} \mathbf{x}^T \mathbf{b}\) \((\mathbf{a}^T\mathbf{x})\mathbf{b} + (\mathbf{b}^T\mathbf{x})\mathbf{a}\)
\(\|A\mathbf{x} - \mathbf{b}\|^2\) \(2A^T(A\mathbf{x} - \mathbf{b})\)
\(\|A\mathbf{x} - \mathbf{b}\|^2_W\) (weighted) \(2A^TW(A\mathbf{x} - \mathbf{b})\)
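
When in doubt, any of these rows can be checked against central finite differences; a small NumPy sketch (shapes and helper name `num_grad` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)

def num_grad(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# ||Ax - b||^2  ->  2 A^T (Ax - b)
f = lambda x: np.sum((A @ x - b) ** 2)
assert np.allclose(num_grad(f, x), 2 * A.T @ (A @ x - b), atol=1e-4)

# x^T M x  ->  (M + M^T) x for a non-symmetric square M
M = rng.standard_normal((3, 3))
g = lambda x: x @ M @ x
assert np.allclose(num_grad(g, x), (M + M.T) @ x, atol=1e-4)
```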

3.2 Derivatives with Respect to a Matrix \(X\)

Expression Derivative \(\frac{\partial}{\partial X}\) Shape
\(\mathbf{a}^T X \mathbf{b}\) \(\mathbf{a}\mathbf{b}^T\) Same as \(X\)
\(\text{tr}(AX)\) \(A^T\) Same as \(X\)
\(\text{tr}(X^TA)\) \(A\) Same as \(X\)
\(\text{tr}(AXB)\) \(A^TB^T = (BA)^T\) Same as \(X\)
\(\text{tr}(AXBX^T)\) \(A^TXB^T + AXB\) Same as \(X\)
\(\text{tr}(X^TAX)\) \((A + A^T)X\) Same as \(X\)
\(\text{tr}(XAX^T)\) \(X(A + A^T)\) Same as \(X\)
\(\log\det(X)\) \(X^{-T} = (X^T)^{-1}\) Same as \(X\)
\(\log\det(X)\) (X symmetric) \(X^{-1}\) Same as \(X\)
\(\det(X)\) \(\det(X) X^{-T}\) Same as \(X\)
\(\text{tr}(X^{-1}A)\) \(-X^{-T}A^TX^{-T}\) Same as \(X\)

3.3 Second Derivatives (Hessians)

Expression Hessian \(\frac{\partial^2 f}{\partial \mathbf{x} \partial \mathbf{x}^T}\)
\(\mathbf{a}^T\mathbf{x}\) \(0\)
\(\mathbf{x}^T A \mathbf{x}\) \(A + A^T\)
\(\mathbf{x}^T A \mathbf{x}\) (A symmetric) \(2A\)
\(\|A\mathbf{x} - \mathbf{b}\|^2\) \(2A^TA\)

4 Section 3: Probability Distribution Summary

Each distribution is listed as an Item/Value table with, where applicable: Parameters, Support, PDF/PMF, Mean, Variance, MGF, and Notes.

4.1 Discrete Distributions

Bernoulli(\(p\))

Item Value
Parameters \(p \in [0,1]\)
Support \(\{0, 1\}\)
PMF \(P(X=k) = p^k(1-p)^{1-k}\)
Mean \(p\)
Variance \(p(1-p)\)
MGF \(1 - p + pe^t\)

Binomial(\(n, p\))

Item Value
Parameters \(n \in \mathbb{N}\), \(p \in [0,1]\)
Support \(\{0, 1, \ldots, n\}\)
PMF \(\binom{n}{k}p^k(1-p)^{n-k}\)
Mean \(np\)
Variance \(np(1-p)\)
MGF \((1 - p + pe^t)^n\)

Geometric(\(p\))

Item Value
Parameters \(p \in (0,1]\)
Support \(\{1, 2, 3, \ldots\}\)
PMF \((1-p)^{k-1}p\)
Mean \(1/p\)
Variance \((1-p)/p^2\)
Notes Number of trials until first success

Poisson(\(\lambda\))

Item Value
Parameters \(\lambda > 0\)
Support \(\{0, 1, 2, \ldots\}\)
PMF \(e^{-\lambda}\lambda^k / k!\)
Mean \(\lambda\)
Variance \(\lambda\)
MGF \(e^{\lambda(e^t - 1)}\)
Notes Limit of Binomial as \(n \to \infty\), \(np \to \lambda\)
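
The limit in the notes can be sanity-checked with the standard library alone (illustrative values \(\lambda = 2\), \(n = 10000\)):

```python
from math import comb, exp, factorial

lam, n = 2.0, 10_000
p = lam / n

def binom_pmf(k):
    """Binomial(n, p) probability mass at k."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def pois_pmf(k):
    """Poisson(lam) probability mass at k."""
    return exp(-lam) * lam**k / factorial(k)

# For large n with np = lam fixed, the two pmfs nearly coincide
for k in range(8):
    assert abs(binom_pmf(k) - pois_pmf(k)) < 1e-3
```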

Negative Binomial(\(r, p\))

Item Value
Parameters \(r > 0\), \(p \in (0,1)\)
Support \(\{0, 1, 2, \ldots\}\)
PMF \(\binom{k+r-1}{k}p^r(1-p)^k\)
Mean \(r(1-p)/p\)
Variance \(r(1-p)/p^2\)
Notes \(k\) failures before \(r\)-th success; overdispersed count model

4.2 Continuous Distributions

Uniform(\(a, b\))

Item Value
Parameters \(a < b\)
Support \([a, b]\)
PDF \(1/(b-a)\)
Mean \((a+b)/2\)
Variance \((b-a)^2/12\)
MGF \((e^{tb} - e^{ta})/(t(b-a))\)

Normal(\(\mu, \sigma^2\))

Item Value
Parameters \(\mu \in \mathbb{R}\), \(\sigma^2 > 0\)
Support \(\mathbb{R}\)
PDF \(\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\)
Mean \(\mu\)
Variance \(\sigma^2\)
MGF \(\exp(\mu t + \sigma^2 t^2/2)\)
Notes If \(X \sim N(\mu,\sigma^2)\), then \((X-\mu)/\sigma \sim N(0,1)\)

Standard Normal

Item Value
PDF \(\phi(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}\)
CDF \(\Phi(x) = \int_{-\infty}^x \phi(t)\,dt\)
Mean \(0\)
Variance \(1\)
Key quantiles \(\Phi(1.645) \approx 0.95\), \(\Phi(1.96) \approx 0.975\), \(\Phi(2.576) \approx 0.995\)

Exponential(\(\lambda\))

Item Value
Parameters \(\lambda > 0\) (rate)
Support \([0, \infty)\)
PDF \(\lambda e^{-\lambda x}\)
Mean \(1/\lambda\)
Variance \(1/\lambda^2\)
MGF \(\lambda/(\lambda - t)\) for \(t < \lambda\)
Notes Memoryless; hazard rate \(= \lambda\) (constant)

Gamma(\(\alpha, \beta\))

Item Value
Parameters \(\alpha > 0\) (shape), \(\beta > 0\) (rate)
Support \((0, \infty)\)
PDF \(\frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}\)
Mean \(\alpha/\beta\)
Variance \(\alpha/\beta^2\)
MGF \((\beta/(\beta-t))^\alpha\) for \(t < \beta\)
Notes Exponential \(= \text{Gamma}(1, \lambda)\); \(\chi^2(k) = \text{Gamma}(k/2, 1/2)\)

Beta(\(\alpha, \beta\))

Item Value
Parameters \(\alpha, \beta > 0\)
Support \([0, 1]\)
PDF \(\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}\) where \(B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)\)
Mean \(\alpha/(\alpha+\beta)\)
Variance \(\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\)
Notes Conjugate prior for Bernoulli/Binomial; \(\text{Uniform}(0,1) = \text{Beta}(1,1)\)

Chi-squared(\(k\))

Item Value
Parameters \(k > 0\) (degrees of freedom)
Support \((0, \infty)\)
PDF \(\frac{1}{2^{k/2}\Gamma(k/2)}x^{k/2-1}e^{-x/2}\) (the \(\text{Gamma}(k/2, 1/2)\) density)
Mean \(k\)
Variance \(2k\)
Notes If \(Z_i \stackrel{iid}{\sim} N(0,1)\), then \(\sum_{i=1}^k Z_i^2 \sim \chi^2(k)\)

Student’s \(t(k)\)

Item Value
Parameters \(k > 0\) (degrees of freedom)
Support \(\mathbb{R}\)
PDF \(\frac{\Gamma((k+1)/2)}{\sqrt{k\pi}\,\Gamma(k/2)}\left(1 + \frac{x^2}{k}\right)^{-(k+1)/2}\)
Mean \(0\) (for \(k > 1\))
Variance \(k/(k-2)\) (for \(k > 2\))
Notes \(t(k) \to N(0,1)\) as \(k \to \infty\); heavier tails for small \(k\)

\(F(m, n)\)

Item Value
Parameters \(m, n > 0\) (numerator/denominator df)
Support \((0, \infty)\)
Mean \(n/(n-2)\) for \(n > 2\)
Variance \(\frac{2n^2(m+n-2)}{m(n-2)^2(n-4)}\) for \(n > 4\)
Notes If \(U \sim \chi^2(m)\), \(V \sim \chi^2(n)\) independent: \(F = (U/m)/(V/n) \sim F(m,n)\)

Cauchy(\(x_0, \gamma\))

Item Value
Parameters \(x_0 \in \mathbb{R}\) (location), \(\gamma > 0\) (scale)
Support \(\mathbb{R}\)
PDF \(\frac{1}{\pi\gamma}\left[1 + \left(\frac{x-x_0}{\gamma}\right)^2\right]^{-1}\)
Mean Undefined
Variance Undefined
Notes \(t(1)\) distribution; no moments; ratio of two independent normals

Laplace(\(\mu, b\))

Item Value
Parameters \(\mu \in \mathbb{R}\) (location), \(b > 0\) (scale)
Support \(\mathbb{R}\)
PDF \(\frac{1}{2b}\exp\!\left(-\frac{|x-\mu|}{b}\right)\)
Mean \(\mu\)
Variance \(2b^2\)
Notes Prior for LASSO (\(b = 1/\lambda\)); equivalent to L1 regularization in MAP estimation

Multivariate Normal(\(\boldsymbol\mu, \Sigma\))

Item Value
Parameters \(\boldsymbol\mu \in \mathbb{R}^p\), \(\Sigma \in \mathbb{S}^p_{++}\)
Support \(\mathbb{R}^p\)
PDF \((2\pi)^{-p/2}\det(\Sigma)^{-1/2}\exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol\mu)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)\right)\)
Mean \(\boldsymbol\mu\)
Covariance \(\Sigma\)
MGF \(\exp(\boldsymbol\mu^T\mathbf{t} + \frac{1}{2}\mathbf{t}^T\Sigma\mathbf{t})\)
Notes Marginals and conditionals are also normal

Dirichlet(\(\boldsymbol\alpha\))

Item Value
Parameters \(\boldsymbol\alpha = (\alpha_1, \ldots, \alpha_K)\), \(\alpha_k > 0\)
Support Simplex \(\{\mathbf{p} : p_k \geq 0, \sum_k p_k = 1\}\)
PDF \(\frac{\Gamma(\alpha_0)}{\prod_k \Gamma(\alpha_k)}\prod_k p_k^{\alpha_k - 1}\) where \(\alpha_0 = \sum_k\alpha_k\)
Mean \(\alpha_k/\alpha_0\) for each component
Notes Conjugate prior for Categorical/Multinomial; generalization of Beta

Wishart(\(V, n\))

Item Value
Parameters \(V \in \mathbb{S}^p_{++}\) (scale), \(n > p-1\) (df)
Support \(\mathbb{S}^p_{++}\)
Mean \(nV\)
Notes Distribution of \(\sum_{i=1}^n \mathbf{x}_i\mathbf{x}_i^T\) for \(\mathbf{x}_i \stackrel{iid}{\sim} N(0, V)\); matrix generalization of \(\chi^2\); conjugate prior for precision matrix

5 Section 4: OLS Cheat Sheet

5.1 Setup

\[ y = X\beta + \varepsilon, \quad y \in \mathbb{R}^n,\ X \in \mathbb{R}^{n \times k},\ \beta \in \mathbb{R}^k,\ \varepsilon \in \mathbb{R}^n \]

5.2 Core Formulas

Quantity Formula Notes
OLS estimator \(\hat{\beta} = (X^TX)^{-1}X^Ty\) Requires \(\text{rank}(X) = k\)
Fitted values \(\hat{y} = X\hat{\beta} = P_X y\)
Residuals \(\hat{\varepsilon} = y - \hat{y} = M_X y\) \(M_X = I - P_X\)
Projection matrix \(P_X = X(X^TX)^{-1}X^T\) Symmetric, idempotent: \(P_X^2 = P_X\)
Annihilator \(M_X = I - P_X\) Symmetric, idempotent: \(M_X^2 = M_X\)
Error variance \(\hat{\sigma}^2 = \hat{\varepsilon}^T\hat{\varepsilon}/(n-k)\) Unbiased estimator of \(\sigma^2\)
Var of \(\hat{\beta}\) \(\text{Var}(\hat{\beta}|X) = \sigma^2(X^TX)^{-1}\) Under homoskedasticity
Std error of \(\hat{\beta}_j\) \(\hat{\text{se}}(\hat{\beta}_j) = \hat{\sigma}\sqrt{[(X^TX)^{-1}]_{jj}}\)

5.3 Goodness of Fit

Quantity Formula
SSR \(\hat{\varepsilon}^T\hat{\varepsilon} = y^TM_X y\)
SST \(y^T M_\iota y = \sum_i(y_i - \bar{y})^2\), where \(M_\iota = I - \iota\iota^T/n\) demeans (\(\iota\) = vector of ones)
SSE \(\text{SST} - \text{SSR}\)
\(R^2\) \(1 - \text{SSR}/\text{SST} = \text{SSE}/\text{SST}\)
\(\bar{R}^2\) \(1 - (1-R^2)(n-1)/(n-k)\)

5.4 Test Statistics

Test Statistic Distribution under \(H_0\)
\(t\)-test: \(H_0: \beta_j = 0\) \(t_j = \hat{\beta}_j / \hat{\text{se}}(\hat{\beta}_j)\) \(t(n-k)\)
\(F\)-test: \(H_0: R\beta = r\) (\(q\) restrictions) \(F = \frac{(R\hat{\beta}-r)^T[R(X^TX)^{-1}R^T]^{-1}(R\hat{\beta}-r)/q}{\hat{\sigma}^2}\) \(F(q, n-k)\)
\(F\)-test (model vs restricted) \(F = \frac{(\text{SSR}_0 - \text{SSR}_1)/q}{\text{SSR}_1/(n-k)}\) \(F(q, n-k)\)
Wald statistic \(W = (R\hat{\beta} - r)^T[\hat{V}(R\hat{\beta})]^{-1}(R\hat{\beta} - r)\) \(\chi^2(q)\) asymptotically

5.5 Estimator Generalizations

Estimator Formula When to use
GLS \(\hat{\beta}_{GLS} = (X^T\Omega^{-1}X)^{-1}X^T\Omega^{-1}y\) Known \(\Omega = \text{Var}(\varepsilon|X)/\sigma^2\)
FGLS Replace \(\Omega\) with \(\hat{\Omega}\) Estimated \(\Omega\) (feasible)
IV \(\hat{\beta}_{IV} = (Z^TX)^{-1}Z^Ty\) Just-identified: one instrument per regressor (\(\ell = k\))
2SLS \(\hat{\beta}_{2SLS} = (X^TP_ZX)^{-1}X^TP_Zy\) Overidentified (\(\ell \geq k\))
GMM \(\hat{\beta}_{GMM} = (X^TZW Z^TX)^{-1}X^TZWZ^Ty\) Efficient IV with weight matrix \(W\)

6 Section 5: Optimization Cheat Sheet

6.1 Unconstrained Optimization

Condition Formula Interpretation
First-order necessary (FOC) \(\nabla f(x^*) = 0\) Zero gradient at optimum
Second-order sufficient (SOC) — minimum \(\nabla^2 f(x^*) \succ 0\) Positive definite Hessian
Second-order sufficient (SOC) — maximum \(\nabla^2 f(x^*) \prec 0\) Negative definite Hessian
Saddle point \(\nabla^2 f(x^*)\) indefinite Neither min nor max

6.2 Constrained Optimization — Equality Constraints

Problem: \(\min_x f(x)\) subject to \(g(x) = 0\)

Lagrangian: \(\mathcal{L}(x, \lambda) = f(x) - \lambda^T g(x)\)

FOC: \(\nabla_x \mathcal{L} = 0\) and \(\nabla_\lambda \mathcal{L} = 0\), i.e., \(\nabla f(x) = \sum_i \lambda_i \nabla g_i(x)\) and \(g(x) = 0\).

6.3 KKT Conditions — Inequality Constraints

Problem: \(\min_x f(x)\) subject to \(g_i(x) \leq 0\), \(i = 1, \ldots, m\), and \(h_j(x) = 0\), \(j = 1, \ldots, p\).

KKT conditions (necessary for optimality):

\[ \nabla f(x^*) + \sum_i \mu_i \nabla g_i(x^*) + \sum_j \lambda_j \nabla h_j(x^*) = 0 \quad \text{(stationarity)} \]

\[ g_i(x^*) \leq 0 \quad \forall i \quad \text{(primal feasibility)} \]

\[ h_j(x^*) = 0 \quad \forall j \quad \text{(primal feasibility)} \]

\[ \mu_i \geq 0 \quad \forall i \quad \text{(dual feasibility)} \]

\[ \mu_i g_i(x^*) = 0 \quad \forall i \quad \text{(complementary slackness)} \]

6.4 Duality

Concept Formula
Dual function \(d(\lambda, \mu) = \inf_x \mathcal{L}(x, \lambda, \mu)\)
Weak duality \(d(\lambda, \mu) \leq p^*\) always
Strong duality \(d(\lambda^*, \mu^*) = p^*\) (holds under Slater’s condition for convex problems)
Duality gap \(p^* - d^*\)

6.5 Regularized Regression Formulas

Method Estimator Closed-form
Ridge (\(L^2\)) \(\min_\beta \|y - X\beta\|^2 + \lambda\|\beta\|^2\) \(\hat{\beta}_\lambda = (X^TX + \lambda I)^{-1}X^Ty\)
LASSO (\(L^1\)) \(\min_\beta \|y - X\beta\|^2 + \lambda\|\beta\|_1\) No closed form; coordinate descent
LASSO solution (orthonormal \(X\)) Soft-threshold: \(\hat{\beta}_j = \text{sign}(\hat{\beta}_j^{OLS})(|\hat{\beta}_j^{OLS}| - \lambda/2)_+\)
Elastic Net \(\min_\beta \|y-X\beta\|^2 + \alpha\lambda\|\beta\|_1 + (1-\alpha)\lambda\|\beta\|^2\) No closed form

Soft-threshold operator: \(S(z, \gamma) = \text{sign}(z)(|z| - \gamma)_+ = \begin{cases} z - \gamma & z > \gamma \\ 0 & |z| \leq \gamma \\ z + \gamma & z < -\gamma \end{cases}\)
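
A minimal NumPy sketch of the soft-threshold operator, plus a check that the ridge closed form satisfies its first-order condition (data and the helper name `soft_threshold` are illustrative):

```python
import numpy as np

def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0), elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

assert soft_threshold(3.0, 1.0) == 2.0     # z > gamma:   z - gamma
assert soft_threshold(0.5, 1.0) == 0.0     # |z| <= gamma: 0
assert soft_threshold(-3.0, 1.0) == -2.0   # z < -gamma:  z + gamma

# Ridge closed form: (X'X + lambda I)^{-1} X'y
rng = np.random.default_rng(4)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)
lam = 0.7
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
# FOC of ||y - Xb||^2 + lam ||b||^2: X'(Xb - y) + lam b = 0 at the solution
assert np.allclose(X.T @ (X @ beta_ridge - y) + lam * beta_ridge, 0, atol=1e-8)
```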

6.6 Gradient Descent Variants

Method Update Rule Notes
Gradient descent \(x_{t+1} = x_t - \eta \nabla f(x_t)\) Full gradient
SGD \(x_{t+1} = x_t - \eta \nabla f_i(x_t)\) Single sample \(i\)
Mini-batch SGD \(x_{t+1} = x_t - \eta \frac{1}{|B|}\sum_{i \in B}\nabla f_i(x_t)\) Batch \(B\)
Momentum \(v_{t+1} = \gamma v_t + \eta \nabla f(x_t)\); \(x_{t+1} = x_t - v_{t+1}\) \(\gamma \approx 0.9\)
Newton \(x_{t+1} = x_t - [\nabla^2 f(x_t)]^{-1}\nabla f(x_t)\) Quadratic convergence
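
A minimal sketch contrasting gradient descent and Newton's method on the quadratic \(f(x) = \frac{1}{2}x^TAx - b^Tx\), where Newton is exact in one step (numbers are illustrative):

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b              # gradient of 0.5 x'Ax - b'x
x_star = np.linalg.solve(A, b)          # unique minimizer

# Plain gradient descent with a fixed step size (eta < 2 / lambda_max)
x = np.zeros(2)
eta = 0.2
for _ in range(200):
    x = x - eta * grad(x)
assert np.allclose(x, x_star, atol=1e-6)

# Newton's method: a single step solves a quadratic exactly
x = np.zeros(2)
x = x - np.linalg.inv(A) @ grad(x)
assert np.allclose(x, x_star)
```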

7 Section 6: Information Theory Quick Reference

Quantity Formula Notes
Entropy \(H(X) = -\sum_x p(x)\log p(x)\) \(= -E[\log p(X)]\); bits if log base 2, nats if natural log
Differential entropy \(h(X) = -\int p(x)\log p(x)\,dx\) Can be negative
Conditional entropy \(H(X|Y) = -\sum_{x,y} p(x,y)\log p(x|y)\) \(= H(X,Y) - H(Y)\)
Joint entropy \(H(X,Y) = -\sum_{x,y} p(x,y)\log p(x,y)\)
Mutual information \(I(X;Y) = H(X) - H(X|Y)\) \(= H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)\)
KL divergence \(D_{KL}(P \| Q) = \sum_x p(x)\log\frac{p(x)}{q(x)}\) \(\geq 0\); not symmetric
Cross-entropy \(H(P, Q) = -\sum_x p(x)\log q(x)\) \(= H(P) + D_{KL}(P\|Q)\)
Binary cross-entropy \(L = -y\log\hat{p} - (1-y)\log(1-\hat{p})\) Standard logistic regression loss
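
The identity \(H(P,Q) = H(P) + D_{KL}(P\|Q)\) and the asymmetry of KL can be checked with the standard library (distributions are illustrative; everything in nats):

```python
from math import log

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

H_p = -sum(pi * log(pi) for pi in p)               # entropy H(P)
H_pq = -sum(pi * log(qi) for pi, qi in zip(p, q))  # cross-entropy H(P, Q)
D_kl = sum(pi * log(pi / qi) for pi, qi in zip(p, q))

assert D_kl >= 0                           # KL divergence is nonnegative
assert abs(H_pq - (H_p + D_kl)) < 1e-12    # H(P,Q) = H(P) + D_KL(P||Q)

# KL is not symmetric in general
D_kl_rev = sum(qi * log(qi / pi) for pi, qi in zip(p, q))
assert abs(D_kl - D_kl_rev) > 1e-6
```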

7.1 Entropy of Common Distributions

Distribution Entropy
Bernoulli(\(p\)) \(-p\log p - (1-p)\log(1-p)\)
\(\text{Uniform}\{1,\ldots,n\}\) \(\log n\)
\(N(\mu, \sigma^2)\) \(\frac{1}{2}\log(2\pi e \sigma^2)\)
\(N(\boldsymbol\mu, \Sigma)\) \(\frac{1}{2}\log\det(2\pi e\,\Sigma)\)
\(\text{Exponential}(\lambda)\) \(1 - \log\lambda\)

7.2 KL Divergence Between Gaussians

\[ D_{KL}\!\left(N(\mu_1, \Sigma_1) \| N(\mu_2, \Sigma_2)\right) = \frac{1}{2}\left[\text{tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_2-\mu_1)^T\Sigma_2^{-1}(\mu_2-\mu_1) - p + \log\frac{\det\Sigma_2}{\det\Sigma_1}\right] \]
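
A direct implementation of this formula (the helper name `kl_gauss` is illustrative), checked against the zero case and the 1-D scalar formula:

```python
import numpy as np

def kl_gauss(mu1, S1, mu2, S2):
    """KL( N(mu1, S1) || N(mu2, S2) ) for p-dimensional Gaussians."""
    p = len(mu1)
    S2inv = np.linalg.inv(S2)
    d = mu2 - mu1
    return 0.5 * (np.trace(S2inv @ S1) + d @ S2inv @ d - p
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

mu = np.array([0.0, 1.0])
S = np.array([[2.0, 0.3], [0.3, 1.0]])
assert np.isclose(kl_gauss(mu, S, mu, S), 0.0)  # KL(P || P) = 0

# 1-D case: log(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 s2^2) - 1/2
m1, s1, m2, s2 = 0.0, 1.0, 1.0, 2.0
scalar = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5
assert np.isclose(kl_gauss(np.array([m1]), np.array([[s1**2]]),
                           np.array([m2]), np.array([[s2**2]])), scalar)
```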


8 Section 7: Key Theorems Quick Reference

Theorem Statement Where It’s Used
Gauss-Markov Under classical OLS assumptions, \(\hat{\beta}_{OLS}\) is BLUE: \(\text{Var}(\hat{\beta}_{OLS}) \preceq \text{Var}(\tilde{\beta})\) for any linear unbiased \(\tilde{\beta}\) Justifies OLS when \(\text{Var}(\varepsilon|X) = \sigma^2 I\)
Frisch-Waugh-Lovell (FWL) In \(y = X_1\beta_1 + X_2\beta_2 + \varepsilon\), \(\hat{\beta}_2\) from regressing \(M_{X_1}y\) on \(M_{X_1}X_2\) Within estimator (FE), partialling out regressors
Rank-Nullity \(\text{rank}(A) + \text{nullity}(A) = n\) for \(A \in \mathbb{R}^{m \times n}\) Degrees of freedom, dimensionality analysis
Eckart-Young-Mirsky Best rank-\(k\) approximation of \(A\) in spectral/\(F\)-norm is \(\hat{A}_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i\mathbf{v}_i^T\) (truncated SVD) PCA, low-rank approximation, compression
Universal Approximation A single hidden layer NN with enough neurons can approximate any continuous function on a compact set Theoretical basis for neural networks
Central Limit Theorem (CLT) \(\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)\) as \(n \to \infty\) Asymptotic inference, hypothesis testing, OLS asymptotics
Law of Large Numbers (LLN) \(\bar{X}_n \xrightarrow{p} \mu\) as \(n \to \infty\) Consistency of estimators, justifying probability limits
Delta Method If \(\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, V)\) and \(g\) is differentiable, then \(\sqrt{n}(g(\hat{\theta}) - g(\theta_0)) \xrightarrow{d} N(0, [g'(\theta_0)]^T V g'(\theta_0))\) Asymptotic distribution of nonlinear transformations
Slutsky’s Theorem If \(X_n \xrightarrow{d} X\) and \(Y_n \xrightarrow{p} c\) (constant), then \(X_n + Y_n \xrightarrow{d} X + c\) and \(X_n Y_n \xrightarrow{d} cX\) Combining convergence results in asymptotic proofs
Continuous Mapping Theorem If \(X_n \xrightarrow{d} X\) and \(g\) is continuous, then \(g(X_n) \xrightarrow{d} g(X)\) Deriving distributions of test statistics
Jensen’s Inequality For convex \(\phi\): \(\phi(E[X]) \leq E[\phi(X)]\) Information theory proofs, ELBO derivation, EM algorithm
Bayes’ Theorem \(P(\theta|x) = \frac{P(x|\theta)P(\theta)}{P(x)}\), i.e., posterior \(\propto\) likelihood \(\times\) prior Bayesian inference, MAP estimation
Cauchy-Schwarz \(|E[XY]|^2 \leq E[X^2]E[Y^2]\); equivalently \(|\mathbf{x}^T\mathbf{y}|^2 \leq \|\mathbf{x}\|^2\|\mathbf{y}\|^2\) Bounding correlations, proving \(|\rho| \leq 1\)
Spectral Theorem Every real symmetric matrix \(A\) has \(A = Q\Lambda Q^T\) with orthogonal \(Q\) and real diagonal \(\Lambda\) PCA, quadratic forms, covariance analysis
Law of Iterated Expectations \(E[X] = E[E[X|Y]]\) Simplifying nested expectations; iterated projections
Law of Total Variance \(\text{Var}(X) = E[\text{Var}(X|Y)] + \text{Var}(E[X|Y])\) Variance decomposition; ANOVA; random effects models
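
The law of total variance can be sanity-checked by simulation on a two-group mixture (illustrative parameters, standard library only):

```python
import random
import statistics as st

random.seed(0)
# Y picks a group with probability 1/2 each; X | Y=g is normal
groups = {0: (0.0, 1.0), 1: (3.0, 2.0)}   # (mean, sd) per group
ys = [random.choice([0, 1]) for _ in range(200_000)]
xs = [random.gauss(*groups[y]) for y in ys]

var_x = st.pvariance(xs)
# Theoretical decomposition: E[Var(X|Y)] + Var(E[X|Y]) with E[X] = 1.5
e_cond_var = 0.5 * 1.0**2 + 0.5 * 2.0**2
var_cond_mean = 0.5 * (0.0 - 1.5)**2 + 0.5 * (3.0 - 1.5)**2
assert abs(var_x - (e_cond_var + var_cond_mean)) < 0.1
```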
