Formula Cheat Sheets
A Quick Reference for When You Forget a Formula
1 Overview
Seven sections of dense, scannable formulas. No derivations here — see Common Proofs for those. This page is for when you need a formula fast.
2 Section 1: Linear Algebra Essentials
2.1 Matrix Products
\[ (AB)_{ij} = \sum_k A_{ik} B_{kj} \]
\[ \text{Dimensions: } A \in \mathbb{R}^{m \times n},\ B \in \mathbb{R}^{n \times p} \Rightarrow AB \in \mathbb{R}^{m \times p} \]
2.2 Transpose Rules
| Identity | Formula |
|---|---|
| Transpose of product | \((AB)^T = B^T A^T\) |
| Double transpose | \((A^T)^T = A\) |
| Sum transpose | \((A + B)^T = A^T + B^T\) |
| Scalar transpose | \((\alpha A)^T = \alpha A^T\) |
| Triple product | \((ABC)^T = C^T B^T A^T\) |
2.3 Inverse Rules
| Identity | Formula | Condition |
|---|---|---|
| Product inverse | \((AB)^{-1} = B^{-1}A^{-1}\) | Both invertible |
| Inverse transpose | \((A^T)^{-1} = (A^{-1})^T\) | \(A\) invertible |
| Double inverse | \((A^{-1})^{-1} = A\) | \(A\) invertible |
| Scalar inverse | \((\alpha A)^{-1} = \alpha^{-1} A^{-1}\) | \(\alpha \neq 0\) |
2.4 2×2 Inverse
\[ \begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \]
Valid iff \(ad - bc \neq 0\).
2.5 Woodbury Matrix Identity
\[ (A + UCV)^{-1} = A^{-1} - A^{-1}U\left(C^{-1} + VA^{-1}U\right)^{-1}VA^{-1} \]
When to use: Converting a large-matrix inverse to a small-matrix inverse. If \(A \in \mathbb{R}^{n\times n}\) and \(U, V \in \mathbb{R}^{n \times k}\) with \(k \ll n\), this replaces an \(n \times n\) inverse with a \(k \times k\) inverse.
Special case (Sherman-Morrison): \(U = \mathbf{u}\), \(C = 1\), \(V = \mathbf{v}^T\): \[ (A + \mathbf{u}\mathbf{v}^T)^{-1} = A^{-1} - \frac{A^{-1}\mathbf{u}\mathbf{v}^TA^{-1}}{1 + \mathbf{v}^TA^{-1}\mathbf{u}} \]
Matrix determinant lemma: \[ \det(A + \mathbf{u}\mathbf{v}^T) = (1 + \mathbf{v}^TA^{-1}\mathbf{u})\det(A) \]
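The Sherman-Morrison formula and the determinant lemma are easy to sanity-check numerically. A minimal NumPy sketch (illustrative only; the matrices and vectors are arbitrary test data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n)) + n * np.eye(n)  # shifted to be well-conditioned
u = rng.standard_normal(n)
v = rng.standard_normal(n)

A_inv = np.linalg.inv(A)
# Sherman-Morrison: rank-1 update of the inverse
denom = 1.0 + v @ A_inv @ u
sm_inv = A_inv - np.outer(A_inv @ u, v @ A_inv) / denom

direct = np.linalg.inv(A + np.outer(u, v))
assert np.allclose(sm_inv, direct)

# Matrix determinant lemma
assert np.isclose(np.linalg.det(A + np.outer(u, v)), denom * np.linalg.det(A))
```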
2.6 Trace Identities
| Identity | Formula |
|---|---|
| Cyclic property | \(\text{tr}(AB) = \text{tr}(BA)\) |
| Cyclic (triple) | \(\text{tr}(ABC) = \text{tr}(CAB) = \text{tr}(BCA)\) |
| Linearity | \(\text{tr}(A + B) = \text{tr}(A) + \text{tr}(B)\) |
| Scalar | \(\text{tr}(\alpha A) = \alpha\,\text{tr}(A)\) |
| Transpose | \(\text{tr}(A) = \text{tr}(A^T)\) |
| Quadratic form | \(\mathbf{x}^TA\mathbf{x} = \text{tr}(\mathbf{x}^TA\mathbf{x}) = \text{tr}(A\mathbf{x}\mathbf{x}^T)\) |
| Frobenius inner product | \(\text{tr}(A^TB) = \sum_{i,j}A_{ij}B_{ij}\) |
2.7 Eigenvalue Identities
| Quantity | Formula |
|---|---|
| Trace | \(\text{tr}(A) = \sum_{i=1}^n \lambda_i\) |
| Determinant | \(\det(A) = \prod_{i=1}^n \lambda_i\) |
| Eigenvalues of \(A^{-1}\) | \(1/\lambda_i\) (\(A\) invertible) |
| Eigenvalues of \(A^k\) | \(\lambda_i^k\) |
| Eigenvalues of \(A + cI\) | \(\lambda_i + c\) |
| Eigenvalues of \(A^T\) | Same as \(A\) |
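These identities can be checked in a few lines of NumPy (illustrative, with an arbitrary random matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
lam = np.linalg.eigvals(A)

assert np.isclose(np.trace(A), lam.sum().real)        # trace = sum of eigenvalues
assert np.isclose(np.linalg.det(A), lam.prod().real)  # det = product of eigenvalues

# eigenvalues of A + cI are lambda_i + c
c = 2.5
lam_shift = np.linalg.eigvals(A + c * np.eye(4))
assert np.allclose(np.sort_complex(lam_shift), np.sort_complex(lam + c))
```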
3 Section 2: Matrix Calculus Quick Reference
Convention: \(f: \mathbb{R}^n \to \mathbb{R}\), gradient \(\nabla_\mathbf{x} f\) is a column vector.
3.1 Derivatives with Respect to a Vector \(\mathbf{x}\)
| Expression | Derivative \(\frac{\partial}{\partial \mathbf{x}}\) |
|---|---|
| \(\mathbf{a}^T \mathbf{x}\) | \(\mathbf{a}\) |
| \(\mathbf{x}^T \mathbf{a}\) | \(\mathbf{a}\) |
| \(\mathbf{x}^T \mathbf{x}\) | \(2\mathbf{x}\) |
| \(\mathbf{x}^T A \mathbf{x}\) | \((A + A^T)\mathbf{x}\) |
| \(\mathbf{x}^T A \mathbf{x}\) (A symmetric) | \(2A\mathbf{x}\) |
| \(\mathbf{a}^T \mathbf{x} \mathbf{x}^T \mathbf{b}\) | \((\mathbf{a}^T\mathbf{x})\mathbf{b} + (\mathbf{b}^T\mathbf{x})\mathbf{a}\) |
| \(\|A\mathbf{x} - \mathbf{b}\|^2\) | \(2A^T(A\mathbf{x} - \mathbf{b})\) |
| \(\|A\mathbf{x} - \mathbf{b}\|^2_W\) (weighted) | \(2A^TW(A\mathbf{x} - \mathbf{b})\) |
3.2 Derivatives with Respect to a Matrix \(X\)
| Expression | Derivative \(\frac{\partial}{\partial X}\) | Shape |
|---|---|---|
| \(\mathbf{a}^T X \mathbf{b}\) | \(\mathbf{a}\mathbf{b}^T\) | Same as \(X\) |
| \(\text{tr}(AX)\) | \(A^T\) | Same as \(X\) |
| \(\text{tr}(X^TA)\) | \(A\) | Same as \(X\) |
| \(\text{tr}(AXB)\) | \(A^TB^T = (BA)^T\) | Same as \(X\) |
| \(\text{tr}(AXBX^T)\) | \(A^TXB^T + AXB\) | Same as \(X\) |
| \(\text{tr}(X^TAX)\) | \((A + A^T)X\) | Same as \(X\) |
| \(\text{tr}(XAX^T)\) | \(X(A + A^T)\) | Same as \(X\) |
| \(\log\det(X)\) | \(X^{-T} = (X^T)^{-1}\) | Same as \(X\) |
| \(\log\det(X)\) (X symmetric) | \(X^{-1}\) | Same as \(X\) |
| \(\det(X)\) | \(\det(X) X^{-T}\) | Same as \(X\) |
| \(\text{tr}(X^{-1}A)\) | \(-X^{-T}A^TX^{-T}\) | Same as \(X\) |
3.3 Second Derivatives (Hessians)
| Expression | Hessian \(\frac{\partial^2 f}{\partial \mathbf{x} \partial \mathbf{x}^T}\) |
|---|---|
| \(\mathbf{a}^T\mathbf{x}\) | \(0\) |
| \(\mathbf{x}^T A \mathbf{x}\) | \(A + A^T\) |
| \(\mathbf{x}^T A \mathbf{x}\) (A symmetric) | \(2A\) |
| \(\|A\mathbf{x} - \mathbf{b}\|^2\) | \(2A^TA\) |
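Any entry in these tables can be verified against central finite differences. A minimal check of the gradient of \(\|A\mathbf{x} - \mathbf{b}\|^2\) (illustrative; random test data):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)
x = rng.standard_normal(3)

f = lambda x: np.sum((A @ x - b) ** 2)
grad = 2 * A.T @ (A @ x - b)   # table entry for the gradient
hess = 2 * A.T @ A             # table entry for the Hessian

# independent numerical check via central differences
eps = 1e-6
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
assert np.allclose(grad, num_grad, atol=1e-4)
assert np.allclose(hess, hess.T)   # Hessians are symmetric
```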
4 Section 3: Probability Distribution Summary
Each distribution below is summarized in a two-column table listing, as applicable: Parameters, Support, PDF/PMF, Mean, Variance, MGF, and Notes.
4.1 Discrete Distributions
Bernoulli(\(p\))
| Item | Value |
|---|---|
| Parameters | \(p \in [0,1]\) |
| Support | \(\{0, 1\}\) |
| PMF | \(P(X=k) = p^k(1-p)^{1-k}\) |
| Mean | \(p\) |
| Variance | \(p(1-p)\) |
| MGF | \(1 - p + pe^t\) |
Binomial(\(n, p\))
| Item | Value |
|---|---|
| Parameters | \(n \in \mathbb{N}\), \(p \in [0,1]\) |
| Support | \(\{0, 1, \ldots, n\}\) |
| PMF | \(\binom{n}{k}p^k(1-p)^{n-k}\) |
| Mean | \(np\) |
| Variance | \(np(1-p)\) |
| MGF | \((1 - p + pe^t)^n\) |
Geometric(\(p\))
| Item | Value |
|---|---|
| Parameters | \(p \in (0,1]\) |
| Support | \(\{1, 2, 3, \ldots\}\) |
| PMF | \((1-p)^{k-1}p\) |
| Mean | \(1/p\) |
| Variance | \((1-p)/p^2\) |
| Notes | Number of trials until first success |
Poisson(\(\lambda\))
| Item | Value |
|---|---|
| Parameters | \(\lambda > 0\) |
| Support | \(\{0, 1, 2, \ldots\}\) |
| PMF | \(e^{-\lambda}\lambda^k / k!\) |
| Mean | \(\lambda\) |
| Variance | \(\lambda\) |
| MGF | \(e^{\lambda(e^t - 1)}\) |
| Notes | Limit of Binomial as \(n \to \infty\), \(np \to \lambda\) |
Negative Binomial(\(r, p\))
| Item | Value |
|---|---|
| Parameters | \(r > 0\), \(p \in (0,1)\) |
| Support | \(\{0, 1, 2, \ldots\}\) |
| PMF | \(\binom{k+r-1}{k}p^r(1-p)^k\) |
| Mean | \(r(1-p)/p\) |
| Variance | \(r(1-p)/p^2\) |
| Notes | \(k\) failures before \(r\)-th success; overdispersed count model |
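The moments in the discrete tables can be cross-checked against `scipy.stats` (illustrative; note SciPy's `geom` uses the same support \(\{1, 2, \ldots\}\) as the table above):

```python
from scipy import stats

n, p, lam = 10, 0.3, 4.0

# means/variances from the tables above vs. scipy.stats
assert abs(stats.bernoulli(p).var() - p * (1 - p)) < 1e-12
assert abs(stats.binom(n, p).mean() - n * p) < 1e-12
assert abs(stats.geom(p).var() - (1 - p) / p**2) < 1e-9
assert abs(stats.poisson(lam).var() - lam) < 1e-12
```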
4.2 Continuous Distributions
Uniform(\(a, b\))
| Item | Value |
|---|---|
| Parameters | \(a < b\) |
| Support | \([a, b]\) |
| PDF | \(1/(b-a)\) |
| Mean | \((a+b)/2\) |
| Variance | \((b-a)^2/12\) |
| MGF | \((e^{tb} - e^{ta})/(t(b-a))\) |
Normal(\(\mu, \sigma^2\))
| Item | Value |
|---|---|
| Parameters | \(\mu \in \mathbb{R}\), \(\sigma^2 > 0\) |
| Support | \(\mathbb{R}\) |
| PDF | \(\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\) |
| Mean | \(\mu\) |
| Variance | \(\sigma^2\) |
| MGF | \(\exp(\mu t + \sigma^2 t^2/2)\) |
| Notes | If \(X \sim N(\mu,\sigma^2)\), then \((X-\mu)/\sigma \sim N(0,1)\) |
Standard Normal
| Item | Value |
|---|---|
| PDF | \(\phi(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}\) |
| CDF | \(\Phi(x) = \int_{-\infty}^x \phi(t)\,dt\) |
| Mean | \(0\) |
| Variance | \(1\) |
| Key quantiles | \(\Phi(1.645) \approx 0.95\), \(\Phi(1.96) \approx 0.975\), \(\Phi(2.576) \approx 0.995\) |
Exponential(\(\lambda\))
| Item | Value |
|---|---|
| Parameters | \(\lambda > 0\) (rate) |
| Support | \([0, \infty)\) |
| PDF | \(\lambda e^{-\lambda x}\) |
| Mean | \(1/\lambda\) |
| Variance | \(1/\lambda^2\) |
| MGF | \(\lambda/(\lambda - t)\) for \(t < \lambda\) |
| Notes | Memoryless; hazard rate \(= \lambda\) (constant) |
Gamma(\(\alpha, \beta\))
| Item | Value |
|---|---|
| Parameters | \(\alpha > 0\) (shape), \(\beta > 0\) (rate) |
| Support | \((0, \infty)\) |
| PDF | \(\frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}\) |
| Mean | \(\alpha/\beta\) |
| Variance | \(\alpha/\beta^2\) |
| MGF | \((\beta/(\beta-t))^\alpha\) for \(t < \beta\) |
| Notes | Exponential \(= \text{Gamma}(1, \lambda)\); \(\chi^2(k) = \text{Gamma}(k/2, 1/2)\) |
Beta(\(\alpha, \beta\))
| Item | Value |
|---|---|
| Parameters | \(\alpha, \beta > 0\) |
| Support | \([0, 1]\) |
| PDF | \(\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}\) where \(B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)\) |
| Mean | \(\alpha/(\alpha+\beta)\) |
| Variance | \(\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\) |
| Notes | Conjugate prior for Bernoulli/Binomial; \(\text{Uniform}(0,1) = \text{Beta}(1,1)\) |
Chi-squared(\(k\))
| Item | Value |
|---|---|
| Parameters | \(k > 0\) (degrees of freedom) |
| Support | \((0, \infty)\) |
| PDF | Same as \(\text{Gamma}(k/2, 1/2)\) |
| Mean | \(k\) |
| Variance | \(2k\) |
| Notes | If \(Z_i \stackrel{iid}{\sim} N(0,1)\), then \(\sum_{i=1}^k Z_i^2 \sim \chi^2(k)\) |
Student’s \(t(k)\)
| Item | Value |
|---|---|
| Parameters | \(k > 0\) (degrees of freedom) |
| Support | \(\mathbb{R}\) |
| PDF | \(\frac{\Gamma((k+1)/2)}{\sqrt{k\pi}\,\Gamma(k/2)}\left(1 + \frac{x^2}{k}\right)^{-(k+1)/2}\) |
| Mean | \(0\) (for \(k > 1\)) |
| Variance | \(k/(k-2)\) (for \(k > 2\)) |
| Notes | \(t(k) \to N(0,1)\) as \(k \to \infty\); heavier tails for small \(k\) |
\(F(m, n)\)
| Item | Value |
|---|---|
| Parameters | \(m, n > 0\) (numerator/denominator df) |
| Support | \((0, \infty)\) |
| Mean | \(n/(n-2)\) for \(n > 2\) |
| Variance | \(\frac{2n^2(m+n-2)}{m(n-2)^2(n-4)}\) for \(n > 4\) |
| Notes | If \(U \sim \chi^2(m)\), \(V \sim \chi^2(n)\) independent: \(F = (U/m)/(V/n) \sim F(m,n)\) |
Cauchy(\(x_0, \gamma\))
| Item | Value |
|---|---|
| Parameters | \(x_0 \in \mathbb{R}\) (location), \(\gamma > 0\) (scale) |
| Support | \(\mathbb{R}\) |
| PDF | \(\frac{1}{\pi\gamma}\left[1 + \left(\frac{x-x_0}{\gamma}\right)^2\right]^{-1}\) |
| Mean | Undefined |
| Variance | Undefined |
| Notes | \(t(1)\) distribution; no moments; ratio of two independent normals |
Laplace(\(\mu, b\))
| Item | Value |
|---|---|
| Parameters | \(\mu \in \mathbb{R}\) (location), \(b > 0\) (scale) |
| Support | \(\mathbb{R}\) |
| PDF | \(\frac{1}{2b}\exp\!\left(-\frac{|x-\mu|}{b}\right)\) |
| Mean | \(\mu\) |
| Variance | \(2b^2\) |
| Notes | Prior for LASSO (\(b = 1/\lambda\)); equivalent to L1 regularization in MAP estimation |
Multivariate Normal(\(\boldsymbol\mu, \Sigma\))
| Item | Value |
|---|---|
| Parameters | \(\boldsymbol\mu \in \mathbb{R}^p\), \(\Sigma \in \mathbb{S}^p_{++}\) |
| Support | \(\mathbb{R}^p\) |
| PDF | \((2\pi)^{-p/2}\det(\Sigma)^{-1/2}\exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol\mu)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)\right)\) |
| Mean | \(\boldsymbol\mu\) |
| Covariance | \(\Sigma\) |
| MGF | \(\exp(\boldsymbol\mu^T\mathbf{t} + \frac{1}{2}\mathbf{t}^T\Sigma\mathbf{t})\) |
| Notes | Marginals and conditionals are also normal |
Dirichlet(\(\boldsymbol\alpha\))
| Item | Value |
|---|---|
| Parameters | \(\boldsymbol\alpha = (\alpha_1, \ldots, \alpha_K)\), \(\alpha_k > 0\) |
| Support | Simplex \(\{\mathbf{p} : p_k \geq 0, \sum_k p_k = 1\}\) |
| PDF | \(\frac{\Gamma(\alpha_0)}{\prod_k \Gamma(\alpha_k)}\prod_k p_k^{\alpha_k - 1}\) where \(\alpha_0 = \sum_k\alpha_k\) |
| Mean | \(\alpha_k/\alpha_0\) for each component |
| Notes | Conjugate prior for Categorical/Multinomial; generalization of Beta |
Wishart(\(V, n\))
| Item | Value |
|---|---|
| Parameters | \(V \in \mathbb{S}^p_{++}\) (scale), \(n > p-1\) (df) |
| Support | \(\mathbb{S}^p_{++}\) |
| Mean | \(nV\) |
| Notes | Distribution of \(\sum_{i=1}^n \mathbf{x}_i\mathbf{x}_i^T\) for \(\mathbf{x}_i \stackrel{iid}{\sim} N(0, V)\); matrix generalization of \(\chi^2\); conjugate prior for precision matrix |
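The cross-distribution relationships noted above (Exponential and \(\chi^2\) as special Gammas, the Beta moments) can be verified with `scipy.stats` (illustrative; SciPy parameterizes Gamma with `scale` \(= 1/\beta\)):

```python
import numpy as np
from scipy import stats

x = np.linspace(0.1, 10, 50)

# chi^2(k) = Gamma(k/2, rate 1/2), i.e., scale = 2 in scipy's convention
k = 5
assert np.allclose(stats.chi2(k).pdf(x), stats.gamma(a=k/2, scale=2).pdf(x))

# Exponential(lambda) = Gamma(1, lambda)
lam = 1.7
assert np.allclose(stats.expon(scale=1/lam).pdf(x),
                   stats.gamma(a=1, scale=1/lam).pdf(x))

# Beta mean and variance from the table
a, b = 2.0, 3.0
B = stats.beta(a, b)
assert abs(B.mean() - a / (a + b)) < 1e-12
assert abs(B.var() - a*b / ((a+b)**2 * (a+b+1))) < 1e-12
```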
5 Section 4: OLS Cheat Sheet
5.1 Setup
\[ y = X\beta + \varepsilon, \quad y \in \mathbb{R}^n,\ X \in \mathbb{R}^{n \times k},\ \beta \in \mathbb{R}^k,\ \varepsilon \in \mathbb{R}^n \]
5.2 Core Formulas
| Quantity | Formula | Notes |
|---|---|---|
| OLS estimator | \(\hat{\beta} = (X^TX)^{-1}X^Ty\) | Requires \(\text{rank}(X) = k\) |
| Fitted values | \(\hat{y} = X\hat{\beta} = P_X y\) | |
| Residuals | \(\hat{\varepsilon} = y - \hat{y} = M_X y\) | \(M_X = I - P_X\) |
| Projection matrix | \(P_X = X(X^TX)^{-1}X^T\) | Symmetric, idempotent: \(P_X^2 = P_X\) |
| Annihilator | \(M_X = I - P_X\) | Symmetric, idempotent: \(M_X^2 = M_X\) |
| Error variance | \(\hat{\sigma}^2 = \hat{\varepsilon}^T\hat{\varepsilon}/(n-k)\) | Unbiased estimator of \(\sigma^2\) |
| Var of \(\hat{\beta}\) | \(\text{Var}(\hat{\beta}|X) = \sigma^2(X^TX)^{-1}\) | Under homoskedasticity |
| Std error of \(\hat{\beta}_j\) | \(\hat{\text{se}}(\hat{\beta}_j) = \hat{\sigma}\sqrt{[(X^TX)^{-1}]_{jj}}\) | |
5.3 Goodness of Fit
| Quantity | Formula |
|---|---|
| SSR | \(\hat{\varepsilon}^T\hat{\varepsilon} = y^TM_X y\) |
| SST | \(y^T M_\iota y = \sum_i(y_i - \bar{y})^2\) |
| SSE | \(\text{SST} - \text{SSR}\) |
| \(R^2\) | \(1 - \text{SSR}/\text{SST} = \text{SSE}/\text{SST}\) |
| \(\bar{R}^2\) | \(1 - (1-R^2)(n-1)/(n-k)\) |
5.4 Test Statistics
| Test | Statistic | Distribution under \(H_0\) |
|---|---|---|
| \(t\)-test: \(H_0: \beta_j = 0\) | \(t_j = \hat{\beta}_j / \hat{\text{se}}(\hat{\beta}_j)\) | \(t(n-k)\) |
| \(F\)-test: \(H_0: R\beta = r\) (\(q\) restrictions) | \(F = \frac{(R\hat{\beta}-r)^T[R(X^TX)^{-1}R^T]^{-1}(R\hat{\beta}-r)/q}{\hat{\sigma}^2}\) | \(F(q, n-k)\) |
| \(F\)-test (model vs restricted) | \(F = \frac{(\text{SSR}_0 - \text{SSR}_1)/q}{\text{SSR}_1/(n-k)}\) | \(F(q, n-k)\) |
| Wald statistic | \(W = (R\hat{\beta} - r)^T[\hat{V}(R\hat{\beta})]^{-1}(R\hat{\beta} - r)\) | \(\chi^2(q)\) asymptotically |
5.5 Estimator Generalizations
| Estimator | Formula | When to use |
|---|---|---|
| GLS | \(\hat{\beta}_{GLS} = (X^T\Omega^{-1}X)^{-1}X^T\Omega^{-1}y\) | Known \(\Omega = \text{Var}(\varepsilon|X)/\sigma^2\) |
| FGLS | Replace \(\Omega\) with \(\hat{\Omega}\) | Estimated \(\Omega\) (feasible) |
| IV | \(\hat{\beta}_{IV} = (Z^TX)^{-1}Z^Ty\) | One instrument per regressor (\(\ell = k\)) |
| 2SLS | \(\hat{\beta}_{2SLS} = (X^TP_ZX)^{-1}X^TP_Zy\) | Overidentified (\(\ell \geq k\)) |
| GMM | \(\hat{\beta}_{GMM} = (X^TZW Z^TX)^{-1}X^TZWZ^Ty\) | Efficient IV with weight matrix \(W\) |
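The core OLS formulas translate directly into NumPy. A minimal sketch on simulated data (illustrative; in practice prefer `np.linalg.lstsq` or a solver over an explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.standard_normal(n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                 # OLS estimator
resid = y - X @ beta_hat                     # residuals
sigma2_hat = resid @ resid / (n - k)         # unbiased error variance
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))  # standard errors

# P_X is idempotent; residuals are orthogonal to the column space of X
P = X @ XtX_inv @ X.T
assert np.allclose(P @ P, P)
assert np.allclose(X.T @ resid, 0, atol=1e-8)

# R^2 = 1 - SSR/SST
SST = np.sum((y - y.mean()) ** 2)
R2 = 1 - resid @ resid / SST
assert 0 <= R2 <= 1
```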
6 Section 5: Optimization Cheat Sheet
6.1 Unconstrained Optimization
| Condition | Formula | Interpretation |
|---|---|---|
| First-order necessary (FOC) | \(\nabla f(x^*) = 0\) | Zero gradient at optimum |
| Second-order sufficient (SOC) — minimum | \(\nabla^2 f(x^*) \succ 0\) | Positive definite Hessian |
| Second-order sufficient (SOC) — maximum | \(\nabla^2 f(x^*) \prec 0\) | Negative definite Hessian |
| Saddle point | \(\nabla^2 f(x^*)\) indefinite | Neither min nor max |
6.2 Constrained Optimization — Equality Constraints
Problem: \(\min_x f(x)\) subject to \(g(x) = 0\)
Lagrangian: \(\mathcal{L}(x, \lambda) = f(x) - \lambda^T g(x)\)
FOC: \(\nabla_x \mathcal{L} = 0\) and \(\nabla_\lambda \mathcal{L} = 0\), i.e., \(\nabla f(x) = \sum_i \lambda_i \nabla g_i(x)\) and \(g(x) = 0\).
6.3 KKT Conditions — Inequality Constraints
Problem: \(\min_x f(x)\) subject to \(g_i(x) \leq 0\), \(i = 1, \ldots, m\), and \(h_j(x) = 0\), \(j = 1, \ldots, p\).
KKT conditions (necessary at a local optimum, given a constraint qualification such as Slater's condition for convex problems):
\[ \nabla f(x^*) + \sum_i \mu_i \nabla g_i(x^*) + \sum_j \lambda_j \nabla h_j(x^*) = 0 \quad \text{(stationarity)} \]
\[ g_i(x^*) \leq 0 \quad \forall i \quad \text{(primal feasibility)} \]
\[ h_j(x^*) = 0 \quad \forall j \quad \text{(primal feasibility)} \]
\[ \mu_i \geq 0 \quad \forall i \quad \text{(dual feasibility)} \]
\[ \mu_i g_i(x^*) = 0 \quad \forall i \quad \text{(complementary slackness)} \]
6.4 Duality
| Concept | Formula |
|---|---|
| Dual function | \(d(\lambda, \mu) = \inf_x \mathcal{L}(x, \lambda, \mu)\) |
| Weak duality | \(d(\lambda, \mu) \leq p^*\) always |
| Strong duality | \(d(\lambda^*, \mu^*) = p^*\) (holds under Slater’s condition for convex problems) |
| Duality gap | \(p^* - d^*\) |
6.5 Regularized Regression Formulas
| Method | Estimator | Closed-form |
|---|---|---|
| Ridge (\(L^2\)) | \(\min_\beta \|y - X\beta\|^2 + \lambda\|\beta\|^2\) | \(\hat{\beta}_\lambda = (X^TX + \lambda I)^{-1}X^Ty\) |
| LASSO (\(L^1\)) | \(\min_\beta \|y - X\beta\|^2 + \lambda\|\beta\|_1\) | No closed form; coordinate descent |
| LASSO solution (orthonormal \(X\)) | Soft-threshold | \(\hat{\beta}_j = \text{sign}(\hat{\beta}_j^{OLS})(|\hat{\beta}_j^{OLS}| - \lambda/2)_+\) |
| Elastic Net | \(\min_\beta \|y-X\beta\|^2 + \alpha\lambda\|\beta\|_1 + (1-\alpha)\lambda\|\beta\|^2\) | No closed form |
Soft-threshold operator: \(S(z, \gamma) = \text{sign}(z)(|z| - \gamma)_+ = \begin{cases} z - \gamma & z > \gamma \\ 0 & |z| \leq \gamma \\ z + \gamma & z < -\gamma \end{cases}\)
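The soft-threshold operator and the ridge closed form are both one-liners in NumPy. A minimal sketch (illustrative; random test data):

```python
import numpy as np

def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

assert soft_threshold(3.0, 1.0) == 2.0    # z > gamma: shrink toward zero
assert soft_threshold(-0.5, 1.0) == 0.0   # |z| <= gamma: exactly zero
assert soft_threshold(-3.0, 1.0) == -2.0  # z < -gamma: shrink toward zero

# Ridge closed form: (X^T X + lambda I)^{-1} X^T y, via a linear solve
rng = np.random.default_rng(4)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)
lam = 0.5
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
```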
6.6 Gradient Descent Variants
| Method | Update Rule | Notes |
|---|---|---|
| Gradient descent | \(x_{t+1} = x_t - \eta \nabla f(x_t)\) | Full gradient |
| SGD | \(x_{t+1} = x_t - \eta \nabla f_i(x_t)\) | Single sample \(i\) |
| Mini-batch SGD | \(x_{t+1} = x_t - \eta \frac{1}{|B|}\sum_{i \in B}\nabla f_i(x_t)\) | Batch \(B\) |
| Momentum | \(v_{t+1} = \gamma v_t + \eta \nabla f(x_t)\); \(x_{t+1} = x_t - v_{t+1}\) | \(\gamma \approx 0.9\) |
| Newton | \(x_{t+1} = x_t - [\nabla^2 f(x_t)]^{-1}\nabla f(x_t)\) | Quadratic convergence |
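The plain gradient-descent update is easy to see on a strongly convex quadratic, where the minimizer is known in closed form. A minimal sketch (illustrative; step size chosen small enough for convergence):

```python
import numpy as np

# minimize f(x) = 0.5 x^T A x - b^T x; gradient A x - b, minimizer A^{-1} b
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # positive definite
b = np.array([1.0, 1.0])
x = np.zeros(2)
eta = 0.2                    # eta < 2 / lambda_max(A) guarantees convergence

for _ in range(500):
    x = x - eta * (A @ x - b)   # gradient descent update

assert np.allclose(x, np.linalg.solve(A, b), atol=1e-6)
```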
7 Section 6: Information Theory Quick Reference
| Quantity | Formula | Notes |
|---|---|---|
| Entropy | \(H(X) = -\sum_x p(x)\log p(x)\) | \(= -E[\log p(X)]\); bits if log base 2, nats if natural log |
| Differential entropy | \(h(X) = -\int p(x)\log p(x)\,dx\) | Can be negative |
| Conditional entropy | \(H(X|Y) = -\sum_{x,y} p(x,y)\log p(x|y)\) | \(= H(X,Y) - H(Y)\) |
| Joint entropy | \(H(X,Y) = -\sum_{x,y} p(x,y)\log p(x,y)\) | |
| Mutual information | \(I(X;Y) = H(X) - H(X|Y)\) | \(= H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)\) |
| KL divergence | \(D_{KL}(P \| Q) = \sum_x p(x)\log\frac{p(x)}{q(x)}\) | \(\geq 0\); not symmetric |
| Cross-entropy | \(H(P, Q) = -\sum_x p(x)\log q(x)\) | \(= H(P) + D_{KL}(P\|Q)\) |
| Binary cross-entropy | \(L = -y\log\hat{p} - (1-y)\log(1-\hat{p})\) | Standard logistic regression loss |
7.1 Entropy of Common Distributions
| Distribution | Entropy |
|---|---|
| Bernoulli(\(p\)) | \(-p\log p - (1-p)\log(1-p)\) |
| \(\text{Uniform}\{1,\ldots,n\}\) | \(\log n\) |
| \(N(\mu, \sigma^2)\) | \(\frac{1}{2}\log(2\pi e \sigma^2)\) |
| \(N(\boldsymbol\mu, \Sigma)\) | \(\frac{1}{2}\log\det(2\pi e\,\Sigma)\) |
| \(\text{Exponential}(\lambda)\) | \(1 - \log\lambda\) |
7.2 KL Divergence Between Gaussians
\[ D_{KL}\!\left(N(\mu_1, \Sigma_1) \| N(\mu_2, \Sigma_2)\right) = \frac{1}{2}\left[\text{tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_2-\mu_1)^T\Sigma_2^{-1}(\mu_2-\mu_1) - p + \log\frac{\det\Sigma_2}{\det\Sigma_1}\right] \]
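The Gaussian KL formula is straightforward to implement and check against its basic properties (zero for identical distributions, asymmetric in general). An illustrative sketch, where `kl_gauss` is just this page's formula:

```python
import numpy as np

def kl_gauss(mu1, S1, mu2, S2):
    """KL(N(mu1, S1) || N(mu2, S2)) for p-dimensional Gaussians."""
    p = len(mu1)
    S2_inv = np.linalg.inv(S2)
    d = mu2 - mu1
    return 0.5 * (np.trace(S2_inv @ S1) + d @ S2_inv @ d - p
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

I = np.eye(2)
mu = np.zeros(2)
assert np.isclose(kl_gauss(mu, I, mu, I), 0.0)  # identical distributions: KL = 0

S = np.diag([2.0, 1.0])
assert kl_gauss(mu, S, mu, I) > 0               # KL is nonnegative
# ...and not symmetric in its arguments
assert not np.isclose(kl_gauss(mu, S, mu, I), kl_gauss(mu, I, mu, S))
```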
8 Section 7: Key Theorems Quick Reference
| Theorem | Statement | Where It’s Used |
|---|---|---|
| Gauss-Markov | Under classical OLS assumptions, \(\hat{\beta}_{OLS}\) is BLUE: \(\text{Var}(\hat{\beta}_{OLS}) \preceq \text{Var}(\tilde{\beta})\) for any linear unbiased \(\tilde{\beta}\) | Justifies OLS when \(\text{Var}(\varepsilon|X) = \sigma^2 I\) |
| Frisch-Waugh-Lovell (FWL) | In \(y = X_1\beta_1 + X_2\beta_2 + \varepsilon\), \(\hat{\beta}_2\) equals the coefficient vector from regressing \(M_{X_1}y\) on \(M_{X_1}X_2\) | Within estimator (FE), partialling out regressors |
| Rank-Nullity | \(\text{rank}(A) + \text{nullity}(A) = n\) for \(A \in \mathbb{R}^{m \times n}\) | Degrees of freedom, dimensionality analysis |
| Eckart-Young-Mirsky | Best rank-\(k\) approximation of \(A\) in spectral/\(F\)-norm is \(\hat{A}_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i\mathbf{v}_i^T\) (truncated SVD) | PCA, low-rank approximation, compression |
| Universal Approximation | A single hidden layer NN with enough neurons can approximate any continuous function on a compact set | Theoretical basis for neural networks |
| Central Limit Theorem (CLT) | \(\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)\) as \(n \to \infty\) | Asymptotic inference, hypothesis testing, OLS asymptotics |
| Law of Large Numbers (LLN) | \(\bar{X}_n \xrightarrow{p} \mu\) as \(n \to \infty\) | Consistency of estimators, justifying probability limits |
| Delta Method | If \(\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, V)\) and \(g\) is differentiable, then \(\sqrt{n}(g(\hat{\theta}) - g(\theta_0)) \xrightarrow{d} N(0, [g'(\theta_0)]^T V g'(\theta_0))\) | Asymptotic distribution of nonlinear transformations |
| Slutsky’s Theorem | If \(X_n \xrightarrow{d} X\) and \(Y_n \xrightarrow{p} c\) (constant), then \(X_n + Y_n \xrightarrow{d} X + c\) and \(X_n Y_n \xrightarrow{d} cX\) | Combining convergence results in asymptotic proofs |
| Continuous Mapping Theorem | If \(X_n \xrightarrow{d} X\) and \(g\) is continuous, then \(g(X_n) \xrightarrow{d} g(X)\) | Deriving distributions of test statistics |
| Jensen’s Inequality | For convex \(\phi\): \(\phi(E[X]) \leq E[\phi(X)]\) | Information theory proofs, ELBO derivation, EM algorithm |
| Bayes’ Theorem | \(P(\theta|x) = \frac{P(x|\theta)P(\theta)}{P(x)}\), i.e., posterior \(\propto\) likelihood \(\times\) prior | Bayesian inference, MAP estimation |
| Cauchy-Schwarz | \(|E[XY]|^2 \leq E[X^2]E[Y^2]\); equivalently \(|\mathbf{x}^T\mathbf{y}|^2 \leq \|\mathbf{x}\|^2\|\mathbf{y}\|^2\) | Bounding correlations, proving \(|\rho| \leq 1\) |
| Spectral Theorem | Every real symmetric matrix \(A\) has \(A = Q\Lambda Q^T\) with orthogonal \(Q\) and real diagonal \(\Lambda\) | PCA, quadratic forms, covariance analysis |
| Law of Iterated Expectations | \(E[X] = E[E[X|Y]]\) | Simplifying nested expectations; iterated projections |
| Law of Total Variance | \(\text{Var}(X) = E[\text{Var}(X|Y)] + \text{Var}(E[X|Y])\) | Variance decomposition; ANOVA; random effects models |
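As one worked example from this table, the law of total variance holds exactly in-sample when both terms are weighted by group frequencies. An illustrative simulation (mixture of normals, with \(Y\) the group label):

```python
import numpy as np

rng = np.random.default_rng(5)
mus = np.array([-2.0, 0.0, 3.0])            # group means: E[X | Y = g]
y = rng.integers(0, 3, size=100_000)        # group labels
x = mus[y] + rng.standard_normal(y.size)    # X | Y = g  ~  N(mu_g, 1)

total = x.var()
probs = np.array([(y == g).mean() for g in range(3)])
within = sum(probs[g] * x[y == g].var() for g in range(3))   # E[Var(X|Y)]
means = np.array([x[y == g].mean() for g in range(3)])
between = np.sum(probs * means**2) - np.sum(probs * means)**2  # Var(E[X|Y])

# Var(X) = E[Var(X|Y)] + Var(E[X|Y]), exact for the empirical distribution
assert abs(total - (within + between)) < 1e-8
```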
Back to Appendix Overview | Previous: Notation Guide