Formula Cheat Sheets

Quick Reference for When You Forget the Formula

1 Overview

Seven sections of dense, scannable formulas. No derivations here; those live in Common Proofs. This page is for when you need a formula fast.


2 Section 1: Linear Algebra Essentials

2.1 Matrix Products

\[ (AB)_{ij} = \sum_k A_{ik} B_{kj} \]

\[ \text{Dimensions: } A \in \mathbb{R}^{m \times n},\ B \in \mathbb{R}^{n \times p} \Rightarrow AB \in \mathbb{R}^{m \times p} \]

2.2 Transpose Rules

Identity Formula
Transpose of product \((AB)^T = B^T A^T\)
Double transpose \((A^T)^T = A\)
Sum transpose \((A + B)^T = A^T + B^T\)
Scalar transpose \((\alpha A)^T = \alpha A^T\)
Triple product \((ABC)^T = C^T B^T A^T\)

2.3 Inverse Rules

Identity Formula Condition
Product inverse \((AB)^{-1} = B^{-1}A^{-1}\) Both invertible
Inverse transpose \((A^T)^{-1} = (A^{-1})^T\) \(A\) invertible
Double inverse \((A^{-1})^{-1} = A\) \(A\) invertible
Scalar inverse \((\alpha A)^{-1} = \alpha^{-1} A^{-1}\) \(\alpha \neq 0\)

2.4 2×2 Inverse

\[ \begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \]

Valid iff \(ad - bc \neq 0\).

2.5 Woodbury Matrix Identity

\[ (A + UCV)^{-1} = A^{-1} - A^{-1}U\left(C^{-1} + VA^{-1}U\right)^{-1}VA^{-1} \]

When to use: converting a large-matrix inverse into a small-matrix inverse. If \(A \in \mathbb{R}^{n\times n}\), \(U \in \mathbb{R}^{n \times k}\), \(C \in \mathbb{R}^{k \times k}\), and \(V \in \mathbb{R}^{k \times n}\) with \(k \ll n\), this replaces an \(n \times n\) inverse with a \(k \times k\) inverse. Requires \(A\) and \(C\) invertible.

Special case (Sherman-Morrison): \(U = \mathbf{u}\), \(C = 1\), \(V = \mathbf{v}^T\): \[ (A + \mathbf{u}\mathbf{v}^T)^{-1} = A^{-1} - \frac{A^{-1}\mathbf{u}\mathbf{v}^TA^{-1}}{1 + \mathbf{v}^TA^{-1}\mathbf{u}} \]

Matrix determinant lemma: \[ \det(A + \mathbf{u}\mathbf{v}^T) = (1 + \mathbf{v}^TA^{-1}\mathbf{u})\det(A) \]
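
These identities are easy to misremember; a minimal NumPy sanity check (random matrices and illustrative sizes, not part of the sheet) verifies Woodbury, Sherman-Morrison, and the determinant lemma numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))
U = rng.standard_normal((n, k))
C = np.eye(k)
V = rng.standard_normal((k, n))

Ainv = np.linalg.inv(A)
# Woodbury: only a k x k matrix, C^{-1} + V A^{-1} U, gets inverted
small = np.linalg.inv(np.linalg.inv(C) + V @ Ainv @ U)
woodbury = Ainv - Ainv @ U @ small @ V @ Ainv
direct = np.linalg.inv(A + U @ C @ V)
assert np.allclose(woodbury, direct)

# Sherman-Morrison (rank-1 update) and the matrix determinant lemma
u = rng.standard_normal(n)
v = rng.standard_normal(n)
sm = Ainv - np.outer(Ainv @ u, v @ Ainv) / (1 + v @ Ainv @ u)
assert np.allclose(sm, np.linalg.inv(A + np.outer(u, v)))
assert np.isclose(np.linalg.det(A + np.outer(u, v)),
                  (1 + v @ Ainv @ u) * np.linalg.det(A))
```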

2.6 Trace Identities

Identity Formula
Cyclic property \(\text{tr}(AB) = \text{tr}(BA)\)
Cyclic (triple) \(\text{tr}(ABC) = \text{tr}(CAB) = \text{tr}(BCA)\)
Linearity \(\text{tr}(A + B) = \text{tr}(A) + \text{tr}(B)\)
Scalar \(\text{tr}(\alpha A) = \alpha\,\text{tr}(A)\)
Transpose \(\text{tr}(A) = \text{tr}(A^T)\)
Quadratic form \(\mathbf{x}^TA\mathbf{x} = \text{tr}(\mathbf{x}^TA\mathbf{x}) = \text{tr}(A\mathbf{x}\mathbf{x}^T)\)
Frobenius inner product \(\text{tr}(A^TB) = \sum_{i,j}A_{ij}B_{ij}\)

2.7 Eigenvalue Identities

Quantity Formula
Trace \(\text{tr}(A) = \sum_{i=1}^n \lambda_i\)
Determinant \(\det(A) = \prod_{i=1}^n \lambda_i\)
Eigenvalues of \(A^{-1}\) \(1/\lambda_i\)
Eigenvalues of \(A^k\) \(\lambda_i^k\)
Eigenvalues of \(A + cI\) \(\lambda_i + c\)
Eigenvalues of \(A^T\) Same as \(A\)
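
A quick NumPy check of the first few rows on a random symmetric matrix (illustrative only; symmetry guarantees real eigenvalues):

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.standard_normal((4, 4))
A = S + S.T                      # symmetric => real spectrum
lam = np.linalg.eigvalsh(A)      # eigenvalues in ascending order

assert np.isclose(np.trace(A), lam.sum())        # tr(A) = sum of eigenvalues
assert np.isclose(np.linalg.det(A), lam.prod())  # det(A) = product of eigenvalues
# A + cI shifts every eigenvalue by c; A^{-1} inverts each eigenvalue
assert np.allclose(np.linalg.eigvalsh(A + 3 * np.eye(4)), lam + 3)
assert np.allclose(np.sort(np.linalg.eigvalsh(np.linalg.inv(A))), np.sort(1 / lam))
```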

3 Section 2: Matrix Calculus Quick Reference

Convention: \(f: \mathbb{R}^n \to \mathbb{R}\), gradient \(\nabla_\mathbf{x} f\) is a column vector.

3.1 Derivatives with Respect to a Vector \(\mathbf{x}\)

Expression Derivative \(\frac{\partial}{\partial \mathbf{x}}\)
\(\mathbf{a}^T \mathbf{x}\) \(\mathbf{a}\)
\(\mathbf{x}^T \mathbf{a}\) \(\mathbf{a}\)
\(\mathbf{x}^T \mathbf{x}\) \(2\mathbf{x}\)
\(\mathbf{x}^T A \mathbf{x}\) \((A + A^T)\mathbf{x}\)
\(\mathbf{x}^T A \mathbf{x}\) (A symmetric) \(2A\mathbf{x}\)
\(\mathbf{a}^T \mathbf{x} \mathbf{x}^T \mathbf{b}\) \((\mathbf{a}^T\mathbf{x})\mathbf{b} + (\mathbf{b}^T\mathbf{x})\mathbf{a}\)
\(\|A\mathbf{x} - \mathbf{b}\|^2\) \(2A^T(A\mathbf{x} - \mathbf{b})\)
\(\|A\mathbf{x} - \mathbf{b}\|^2_W\) (weighted) \(2A^TW(A\mathbf{x} - \mathbf{b})\)
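
When in doubt, any of these rows can be checked against central finite differences; a small NumPy sketch (shapes and helper name `num_grad` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)

def num_grad(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# ||Ax - b||^2  ->  2 A^T (Ax - b)
f = lambda x: np.sum((A @ x - b) ** 2)
assert np.allclose(num_grad(f, x), 2 * A.T @ (A @ x - b), atol=1e-4)

# x^T M x  ->  (M + M^T) x for a non-symmetric square M
M = rng.standard_normal((3, 3))
g = lambda x: x @ M @ x
assert np.allclose(num_grad(g, x), (M + M.T) @ x, atol=1e-4)
```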

3.2 Derivatives with Respect to a Matrix \(X\)

Expression Derivative \(\frac{\partial}{\partial X}\) Shape
\(\mathbf{a}^T X \mathbf{b}\) \(\mathbf{a}\mathbf{b}^T\) Same as \(X\)
\(\text{tr}(AX)\) \(A^T\) Same as \(X\)
\(\text{tr}(X^TA)\) \(A\) Same as \(X\)
\(\text{tr}(AXB)\) \(A^TB^T = (BA)^T\) Same as \(X\)
\(\text{tr}(AXBX^T)\) \(A^TXB^T + AXB\) Same as \(X\)
\(\text{tr}(X^TAX)\) \((A + A^T)X\) Same as \(X\)
\(\text{tr}(XAX^T)\) \(X(A + A^T)\) Same as \(X\)
\(\log\det(X)\) \(X^{-T} = (X^T)^{-1}\) Same as \(X\)
\(\log\det(X)\) (X symmetric) \(X^{-1}\) Same as \(X\)
\(\det(X)\) \(\det(X) X^{-T}\) Same as \(X\)
\(\text{tr}(X^{-1}A)\) \(-X^{-T}A^TX^{-T}\) Same as \(X\)

3.3 Second Derivatives (Hessians)

Expression Hessian \(\frac{\partial^2 f}{\partial \mathbf{x} \partial \mathbf{x}^T}\)
\(\mathbf{a}^T\mathbf{x}\) \(0\)
\(\mathbf{x}^T A \mathbf{x}\) \(A + A^T\)
\(\mathbf{x}^T A \mathbf{x}\) (A symmetric) \(2A\)
\(\|A\mathbf{x} - \mathbf{b}\|^2\) \(2A^TA\)

4 Section 3: Probability Distribution Summary

Each distribution is listed as an Item/Value table with, where applicable: Parameters, Support, PDF/PMF, Mean, Variance, MGF, and Notes.

4.1 Discrete Distributions

Bernoulli(\(p\))

Item Value
Parameters \(p \in [0,1]\)
Support \(\{0, 1\}\)
PMF \(P(X=k) = p^k(1-p)^{1-k}\)
Mean \(p\)
Variance \(p(1-p)\)
MGF \(1 - p + pe^t\)

Binomial(\(n, p\))

Item Value
Parameters \(n \in \mathbb{N}\), \(p \in [0,1]\)
Support \(\{0, 1, \ldots, n\}\)
PMF \(\binom{n}{k}p^k(1-p)^{n-k}\)
Mean \(np\)
Variance \(np(1-p)\)
MGF \((1 - p + pe^t)^n\)

Geometric(\(p\))

Item Value
Parameters \(p \in (0,1]\)
Support \(\{1, 2, 3, \ldots\}\)
PMF \((1-p)^{k-1}p\)
Mean \(1/p\)
Variance \((1-p)/p^2\)
Notes Number of trials until first success

Poisson(\(\lambda\))

Item Value
Parameters \(\lambda > 0\)
Support \(\{0, 1, 2, \ldots\}\)
PMF \(e^{-\lambda}\lambda^k / k!\)
Mean \(\lambda\)
Variance \(\lambda\)
MGF \(e^{\lambda(e^t - 1)}\)
Notes Limit of Binomial as \(n \to \infty\), \(np \to \lambda\)
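
The limit in the notes can be sanity-checked with the standard library alone (illustrative values \(\lambda = 2\), \(n = 10000\)):

```python
from math import comb, exp, factorial

lam, n = 2.0, 10_000
p = lam / n

def binom_pmf(k):
    """Binomial(n, p) probability mass at k."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def pois_pmf(k):
    """Poisson(lam) probability mass at k."""
    return exp(-lam) * lam**k / factorial(k)

# For large n with np = lam fixed, the two pmfs nearly coincide
for k in range(8):
    assert abs(binom_pmf(k) - pois_pmf(k)) < 1e-3
```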

Negative Binomial(\(r, p\))

Item Value
Parameters \(r > 0\), \(p \in (0,1)\)
Support \(\{0, 1, 2, \ldots\}\)
PMF \(\binom{k+r-1}{k}p^r(1-p)^k\)
Mean \(r(1-p)/p\)
Variance \(r(1-p)/p^2\)
Notes \(k\) failures before \(r\)-th success; overdispersed count model

4.2 Continuous Distributions

Uniform(\(a, b\))

Item Value
Parameters \(a < b\)
Support \([a, b]\)
PDF \(1/(b-a)\)
Mean \((a+b)/2\)
Variance \((b-a)^2/12\)
MGF \((e^{tb} - e^{ta})/(t(b-a))\)

Normal(\(\mu, \sigma^2\))

Item Value
Parameters \(\mu \in \mathbb{R}\), \(\sigma^2 > 0\)
Support \(\mathbb{R}\)
PDF \(\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\)
Mean \(\mu\)
Variance \(\sigma^2\)
MGF \(\exp(\mu t + \sigma^2 t^2/2)\)
Notes If \(X \sim N(\mu,\sigma^2)\), then \((X-\mu)/\sigma \sim N(0,1)\)

Standard Normal

Item Value
PDF \(\phi(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}\)
CDF \(\Phi(x) = \int_{-\infty}^x \phi(t)\,dt\)
Mean \(0\)
Variance \(1\)
Key quantiles \(\Phi(1.645) \approx 0.95\), \(\Phi(1.96) \approx 0.975\), \(\Phi(2.576) \approx 0.995\)

Exponential(\(\lambda\))

Item Value
Parameters \(\lambda > 0\) (rate)
Support \([0, \infty)\)
PDF \(\lambda e^{-\lambda x}\)
Mean \(1/\lambda\)
Variance \(1/\lambda^2\)
MGF \(\lambda/(\lambda - t)\) for \(t < \lambda\)
Notes Memoryless; hazard rate \(= \lambda\) (constant)

Gamma(\(\alpha, \beta\))

Item Value
Parameters \(\alpha > 0\) (shape), \(\beta > 0\) (rate)
Support \((0, \infty)\)
PDF \(\frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}\)
Mean \(\alpha/\beta\)
Variance \(\alpha/\beta^2\)
MGF \((\beta/(\beta-t))^\alpha\) for \(t < \beta\)
Notes Exponential \(= \text{Gamma}(1, \lambda)\); \(\chi^2(k) = \text{Gamma}(k/2, 1/2)\)

Beta(\(\alpha, \beta\))

Item Value
Parameters \(\alpha, \beta > 0\)
Support \([0, 1]\)
PDF \(\frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}\) where \(B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)\)
Mean \(\alpha/(\alpha+\beta)\)
Variance \(\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\)
Notes Conjugate prior for Bernoulli/Binomial; \(\text{Uniform}(0,1) = \text{Beta}(1,1)\)

Chi-squared(\(k\))

Item Value
Parameters \(k > 0\) (degrees of freedom)
Support \((0, \infty)\)
PDF \(\frac{1}{2^{k/2}\Gamma(k/2)}x^{k/2-1}e^{-x/2}\) (the \(\text{Gamma}(k/2, 1/2)\) density)
Mean \(k\)
Variance \(2k\)
Notes If \(Z_i \stackrel{iid}{\sim} N(0,1)\), then \(\sum_{i=1}^k Z_i^2 \sim \chi^2(k)\)

Student’s \(t(k)\)

Item Value
Parameters \(k > 0\) (degrees of freedom)
Support \(\mathbb{R}\)
PDF \(\frac{\Gamma((k+1)/2)}{\sqrt{k\pi}\,\Gamma(k/2)}\left(1 + \frac{x^2}{k}\right)^{-(k+1)/2}\)
Mean \(0\) (for \(k > 1\))
Variance \(k/(k-2)\) (for \(k > 2\))
Notes \(t(k) \to N(0,1)\) as \(k \to \infty\); heavier tails for small \(k\)

\(F(m, n)\)

Item Value
Parameters \(m, n > 0\) (numerator/denominator df)
Support \((0, \infty)\)
Mean \(n/(n-2)\) for \(n > 2\)
Variance \(\frac{2n^2(m+n-2)}{m(n-2)^2(n-4)}\) for \(n > 4\)
Notes If \(U \sim \chi^2(m)\), \(V \sim \chi^2(n)\) independent: \(F = (U/m)/(V/n) \sim F(m,n)\)

Cauchy(\(x_0, \gamma\))

Item Value
Parameters \(x_0 \in \mathbb{R}\) (location), \(\gamma > 0\) (scale)
Support \(\mathbb{R}\)
PDF \(\frac{1}{\pi\gamma}\left[1 + \left(\frac{x-x_0}{\gamma}\right)^2\right]^{-1}\)
Mean Undefined
Variance Undefined
Notes \(t(1)\) distribution; no moments; ratio of two independent normals

Laplace(\(\mu, b\))

Item Value
Parameters \(\mu \in \mathbb{R}\) (location), \(b > 0\) (scale)
Support \(\mathbb{R}\)
PDF \(\frac{1}{2b}\exp\!\left(-\frac{|x-\mu|}{b}\right)\)
Mean \(\mu\)
Variance \(2b^2\)
Notes Prior for LASSO (\(b = 1/\lambda\)); equivalent to L1 regularization in MAP estimation

Multivariate Normal(\(\boldsymbol\mu, \Sigma\))

Item Value
Parameters \(\boldsymbol\mu \in \mathbb{R}^p\), \(\Sigma \in \mathbb{S}^p_{++}\)
Support \(\mathbb{R}^p\)
PDF \((2\pi)^{-p/2}\det(\Sigma)^{-1/2}\exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol\mu)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)\right)\)
Mean \(\boldsymbol\mu\)
Covariance \(\Sigma\)
MGF \(\exp(\boldsymbol\mu^T\mathbf{t} + \frac{1}{2}\mathbf{t}^T\Sigma\mathbf{t})\)
Notes Marginals and conditionals are also normal

Dirichlet(\(\boldsymbol\alpha\))

Item Value
Parameters \(\boldsymbol\alpha = (\alpha_1, \ldots, \alpha_K)\), \(\alpha_k > 0\)
Support Simplex \(\{\mathbf{p} : p_k \geq 0, \sum_k p_k = 1\}\)
PDF \(\frac{\Gamma(\alpha_0)}{\prod_k \Gamma(\alpha_k)}\prod_k p_k^{\alpha_k - 1}\) where \(\alpha_0 = \sum_k\alpha_k\)
Mean \(\alpha_k/\alpha_0\) for each component
Notes Conjugate prior for Categorical/Multinomial; generalization of Beta

Wishart(\(V, n\))

Item Value
Parameters \(V \in \mathbb{S}^p_{++}\) (scale), \(n > p-1\) (df)
Support \(\mathbb{S}^p_{++}\)
Mean \(nV\)
Notes Distribution of \(\sum_{i=1}^n \mathbf{x}_i\mathbf{x}_i^T\) for \(\mathbf{x}_i \stackrel{iid}{\sim} N(0, V)\); matrix generalization of \(\chi^2\); conjugate prior for precision matrix

5 Section 4: OLS Cheat Sheet

5.1 Setup

\[ y = X\beta + \varepsilon, \quad y \in \mathbb{R}^n,\ X \in \mathbb{R}^{n \times k},\ \beta \in \mathbb{R}^k,\ \varepsilon \in \mathbb{R}^n \]

5.2 Core Formulas

Quantity Formula Notes
OLS estimator \(\hat{\beta} = (X^TX)^{-1}X^Ty\) Requires \(\text{rank}(X) = k\)
Fitted values \(\hat{y} = X\hat{\beta} = P_X y\)
Residuals \(\hat{\varepsilon} = y - \hat{y} = M_X y\) \(M_X = I - P_X\)
Projection matrix \(P_X = X(X^TX)^{-1}X^T\) Symmetric, idempotent: \(P_X^2 = P_X\)
Annihilator \(M_X = I - P_X\) Symmetric, idempotent: \(M_X^2 = M_X\)
Error variance \(\hat{\sigma}^2 = \hat{\varepsilon}^T\hat{\varepsilon}/(n-k)\) Unbiased estimator of \(\sigma^2\)
Var of \(\hat{\beta}\) \(\text{Var}(\hat{\beta}|X) = \sigma^2(X^TX)^{-1}\) Under homoskedasticity
Std error of \(\hat{\beta}_j\) \(\hat{\text{se}}(\hat{\beta}_j) = \hat{\sigma}\sqrt{[(X^TX)^{-1}]_{jj}}\)

5.3 Goodness of Fit

Quantity Formula
SSR \(\hat{\varepsilon}^T\hat{\varepsilon} = y^TM_X y\)
SST \(y^T M_\iota y = \sum_i(y_i - \bar{y})^2\), where \(M_\iota = I - \iota\iota^T/n\) demeans (\(\iota\) = vector of ones)
SSE \(\text{SST} - \text{SSR}\)
\(R^2\) \(1 - \text{SSR}/\text{SST} = \text{SSE}/\text{SST}\)
\(\bar{R}^2\) \(1 - (1-R^2)(n-1)/(n-k)\)

5.4 Test Statistics

Test Statistic Distribution under \(H_0\)
\(t\)-test: \(H_0: \beta_j = 0\) \(t_j = \hat{\beta}_j / \hat{\text{se}}(\hat{\beta}_j)\) \(t(n-k)\)
\(F\)-test: \(H_0: R\beta = r\) (\(q\) restrictions) \(F = \frac{(R\hat{\beta}-r)^T[R(X^TX)^{-1}R^T]^{-1}(R\hat{\beta}-r)/q}{\hat{\sigma}^2}\) \(F(q, n-k)\)
\(F\)-test (model vs restricted) \(F = \frac{(\text{SSR}_0 - \text{SSR}_1)/q}{\text{SSR}_1/(n-k)}\) \(F(q, n-k)\)
Wald statistic \(W = (R\hat{\beta} - r)^T[\hat{V}(R\hat{\beta})]^{-1}(R\hat{\beta} - r)\) \(\chi^2(q)\) asymptotically

5.5 Estimator Generalizations

Estimator Formula When to use
GLS \(\hat{\beta}_{GLS} = (X^T\Omega^{-1}X)^{-1}X^T\Omega^{-1}y\) Known \(\Omega = \text{Var}(\varepsilon|X)/\sigma^2\)
FGLS Replace \(\Omega\) with \(\hat{\Omega}\) Estimated \(\Omega\) (feasible)
IV \(\hat{\beta}_{IV} = (Z^TX)^{-1}Z^Ty\) Just-identified: one instrument per regressor (\(\ell = k\))
2SLS \(\hat{\beta}_{2SLS} = (X^TP_ZX)^{-1}X^TP_Zy\) Overidentified (\(\ell \geq k\))
GMM \(\hat{\beta}_{GMM} = (X^TZW Z^TX)^{-1}X^TZWZ^Ty\) Efficient IV with weight matrix \(W\)

6 Section 5: Optimization Cheat Sheet

6.1 Unconstrained Optimization

Condition Formula Interpretation
First-order necessary (FOC) \(\nabla f(x^*) = 0\) Zero gradient at optimum
Second-order sufficient (SOC) — minimum \(\nabla^2 f(x^*) \succ 0\) Positive definite Hessian
Second-order sufficient (SOC) — maximum \(\nabla^2 f(x^*) \prec 0\) Negative definite Hessian
Saddle point \(\nabla^2 f(x^*)\) indefinite Neither min nor max

6.2 Constrained Optimization — Equality Constraints

Problem: \(\min_x f(x)\) subject to \(g(x) = 0\)

Lagrangian: \(\mathcal{L}(x, \lambda) = f(x) - \lambda^T g(x)\)

FOC: \(\nabla_x \mathcal{L} = 0\) and \(\nabla_\lambda \mathcal{L} = 0\), i.e., \(\nabla f(x) = \sum_i \lambda_i \nabla g_i(x)\) and \(g(x) = 0\).

6.3 KKT Conditions — Inequality Constraints

Problem: \(\min_x f(x)\) subject to \(g_i(x) \leq 0\), \(i = 1, \ldots, m\), and \(h_j(x) = 0\), \(j = 1, \ldots, p\).

KKT conditions (necessary for optimality):

\[ \nabla f(x^*) + \sum_i \mu_i \nabla g_i(x^*) + \sum_j \lambda_j \nabla h_j(x^*) = 0 \quad \text{(stationarity)} \]

\[ g_i(x^*) \leq 0 \quad \forall i \quad \text{(primal feasibility)} \]

\[ h_j(x^*) = 0 \quad \forall j \quad \text{(primal feasibility)} \]

\[ \mu_i \geq 0 \quad \forall i \quad \text{(dual feasibility)} \]

\[ \mu_i g_i(x^*) = 0 \quad \forall i \quad \text{(complementary slackness)} \]

6.4 Duality

Concept Formula
Dual function \(d(\lambda, \mu) = \inf_x \mathcal{L}(x, \lambda, \mu)\)
Weak duality \(d(\lambda, \mu) \leq p^*\) always
Strong duality \(d(\lambda^*, \mu^*) = p^*\) (holds under Slater’s condition for convex problems)
Duality gap \(p^* - d^*\)

6.5 Regularized Regression Formulas

Method Estimator Closed-form
Ridge (\(L^2\)) \(\min_\beta \|y - X\beta\|^2 + \lambda\|\beta\|^2\) \(\hat{\beta}_\lambda = (X^TX + \lambda I)^{-1}X^Ty\)
LASSO (\(L^1\)) \(\min_\beta \|y - X\beta\|^2 + \lambda\|\beta\|_1\) No closed form; coordinate descent
LASSO solution (orthonormal \(X\)) Soft-threshold: \(\hat{\beta}_j = \text{sign}(\hat{\beta}_j^{OLS})(|\hat{\beta}_j^{OLS}| - \lambda/2)_+\)
Elastic Net \(\min_\beta \|y-X\beta\|^2 + \alpha\lambda\|\beta\|_1 + (1-\alpha)\lambda\|\beta\|^2\) No closed form

Soft-threshold operator: \(S(z, \gamma) = \text{sign}(z)(|z| - \gamma)_+ = \begin{cases} z - \gamma & z > \gamma \\ 0 & |z| \leq \gamma \\ z + \gamma & z < -\gamma \end{cases}\)
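
A minimal NumPy sketch of the soft-threshold operator, plus a check that the ridge closed form satisfies its first-order condition (data and the helper name `soft_threshold` are illustrative):

```python
import numpy as np

def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0), elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

assert soft_threshold(3.0, 1.0) == 2.0     # z > gamma:   z - gamma
assert soft_threshold(0.5, 1.0) == 0.0     # |z| <= gamma: 0
assert soft_threshold(-3.0, 1.0) == -2.0   # z < -gamma:  z + gamma

# Ridge closed form: (X'X + lambda I)^{-1} X'y
rng = np.random.default_rng(4)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)
lam = 0.7
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
# FOC of ||y - Xb||^2 + lam ||b||^2: X'(Xb - y) + lam b = 0 at the solution
assert np.allclose(X.T @ (X @ beta_ridge - y) + lam * beta_ridge, 0, atol=1e-8)
```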

6.6 Gradient Descent Variants

Method Update Rule Notes
Gradient descent \(x_{t+1} = x_t - \eta \nabla f(x_t)\) Full gradient
SGD \(x_{t+1} = x_t - \eta \nabla f_i(x_t)\) Single sample \(i\)
Mini-batch SGD \(x_{t+1} = x_t - \eta \frac{1}{|B|}\sum_{i \in B}\nabla f_i(x_t)\) Batch \(B\)
Momentum \(v_{t+1} = \gamma v_t + \eta \nabla f(x_t)\); \(x_{t+1} = x_t - v_{t+1}\) \(\gamma \approx 0.9\)
Newton \(x_{t+1} = x_t - [\nabla^2 f(x_t)]^{-1}\nabla f(x_t)\) Quadratic convergence
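
A minimal sketch contrasting gradient descent and Newton's method on the quadratic \(f(x) = \frac{1}{2}x^TAx - b^Tx\), where Newton is exact in one step (numbers are illustrative):

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b              # gradient of 0.5 x'Ax - b'x
x_star = np.linalg.solve(A, b)          # unique minimizer

# Plain gradient descent with a fixed step size (eta < 2 / lambda_max)
x = np.zeros(2)
eta = 0.2
for _ in range(200):
    x = x - eta * grad(x)
assert np.allclose(x, x_star, atol=1e-6)

# Newton's method: a single step solves a quadratic exactly
x = np.zeros(2)
x = x - np.linalg.inv(A) @ grad(x)
assert np.allclose(x, x_star)
```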

7 Section 6: Information Theory Quick Reference

Quantity Formula Notes
Entropy \(H(X) = -\sum_x p(x)\log p(x)\) \(= -E[\log p(X)]\); bits if log base 2, nats if natural log
Differential entropy \(h(X) = -\int p(x)\log p(x)\,dx\) Can be negative
Conditional entropy \(H(X|Y) = -\sum_{x,y} p(x,y)\log p(x|y)\) \(= H(X,Y) - H(Y)\)
Joint entropy \(H(X,Y) = -\sum_{x,y} p(x,y)\log p(x,y)\)
Mutual information \(I(X;Y) = H(X) - H(X|Y)\) \(= H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)\)
KL divergence \(D_{KL}(P \| Q) = \sum_x p(x)\log\frac{p(x)}{q(x)}\) \(\geq 0\); not symmetric
Cross-entropy \(H(P, Q) = -\sum_x p(x)\log q(x)\) \(= H(P) + D_{KL}(P\|Q)\)
Binary cross-entropy \(L = -y\log\hat{p} - (1-y)\log(1-\hat{p})\) Standard logistic regression loss
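
The identity \(H(P,Q) = H(P) + D_{KL}(P\|Q)\) and the asymmetry of KL can be checked with the standard library (distributions are illustrative; everything in nats):

```python
from math import log

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

H_p = -sum(pi * log(pi) for pi in p)               # entropy H(P)
H_pq = -sum(pi * log(qi) for pi, qi in zip(p, q))  # cross-entropy H(P, Q)
D_kl = sum(pi * log(pi / qi) for pi, qi in zip(p, q))

assert D_kl >= 0                           # KL divergence is nonnegative
assert abs(H_pq - (H_p + D_kl)) < 1e-12    # H(P,Q) = H(P) + D_KL(P||Q)

# KL is not symmetric in general
D_kl_rev = sum(qi * log(qi / pi) for pi, qi in zip(p, q))
assert abs(D_kl - D_kl_rev) > 1e-6
```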

7.1 Entropy of Common Distributions

Distribution Entropy
Bernoulli(\(p\)) \(-p\log p - (1-p)\log(1-p)\)
\(\text{Uniform}\{1,\ldots,n\}\) \(\log n\)
\(N(\mu, \sigma^2)\) \(\frac{1}{2}\log(2\pi e \sigma^2)\)
\(N(\boldsymbol\mu, \Sigma)\) \(\frac{1}{2}\log\det(2\pi e\,\Sigma)\)
\(\text{Exponential}(\lambda)\) \(1 - \log\lambda\)

7.2 KL Divergence Between Gaussians

\[ D_{KL}\!\left(N(\mu_1, \Sigma_1) \| N(\mu_2, \Sigma_2)\right) = \frac{1}{2}\left[\text{tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_2-\mu_1)^T\Sigma_2^{-1}(\mu_2-\mu_1) - p + \log\frac{\det\Sigma_2}{\det\Sigma_1}\right] \]
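
A direct implementation of this formula (the helper name `kl_gauss` is illustrative), checked against the zero case and the 1-D scalar formula:

```python
import numpy as np

def kl_gauss(mu1, S1, mu2, S2):
    """KL( N(mu1, S1) || N(mu2, S2) ) for p-dimensional Gaussians."""
    p = len(mu1)
    S2inv = np.linalg.inv(S2)
    d = mu2 - mu1
    return 0.5 * (np.trace(S2inv @ S1) + d @ S2inv @ d - p
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

mu = np.array([0.0, 1.0])
S = np.array([[2.0, 0.3], [0.3, 1.0]])
assert np.isclose(kl_gauss(mu, S, mu, S), 0.0)  # KL(P || P) = 0

# 1-D case: log(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 s2^2) - 1/2
m1, s1, m2, s2 = 0.0, 1.0, 1.0, 2.0
scalar = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5
assert np.isclose(kl_gauss(np.array([m1]), np.array([[s1**2]]),
                           np.array([m2]), np.array([[s2**2]])), scalar)
```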


8 Section 7: Key Theorems Quick Reference

Theorem Statement Where It’s Used
Gauss-Markov Under classical OLS assumptions, \(\hat{\beta}_{OLS}\) is BLUE: \(\text{Var}(\hat{\beta}_{OLS}) \preceq \text{Var}(\tilde{\beta})\) for any linear unbiased \(\tilde{\beta}\) Justifies OLS when \(\text{Var}(\varepsilon|X) = \sigma^2 I\)
Frisch-Waugh-Lovell (FWL) In \(y = X_1\beta_1 + X_2\beta_2 + \varepsilon\), \(\hat{\beta}_2\) from regressing \(M_{X_1}y\) on \(M_{X_1}X_2\) Within estimator (FE), partialling out regressors
Rank-Nullity \(\text{rank}(A) + \text{nullity}(A) = n\) for \(A \in \mathbb{R}^{m \times n}\) Degrees of freedom, dimensionality analysis
Eckart-Young-Mirsky Best rank-\(k\) approximation of \(A\) in spectral/\(F\)-norm is \(\hat{A}_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i\mathbf{v}_i^T\) (truncated SVD) PCA, low-rank approximation, compression
Universal Approximation A single hidden layer NN with enough neurons can approximate any continuous function on a compact set Theoretical basis for neural networks
Central Limit Theorem (CLT) \(\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)\) as \(n \to \infty\) Asymptotic inference, hypothesis testing, OLS asymptotics
Law of Large Numbers (LLN) \(\bar{X}_n \xrightarrow{p} \mu\) as \(n \to \infty\) Consistency of estimators, justifying probability limits
Delta Method If \(\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, V)\) and \(g\) is differentiable, then \(\sqrt{n}(g(\hat{\theta}) - g(\theta_0)) \xrightarrow{d} N(0, [g'(\theta_0)]^T V g'(\theta_0))\) Asymptotic distribution of nonlinear transformations
Slutsky’s Theorem If \(X_n \xrightarrow{d} X\) and \(Y_n \xrightarrow{p} c\) (constant), then \(X_n + Y_n \xrightarrow{d} X + c\) and \(X_n Y_n \xrightarrow{d} cX\) Combining convergence results in asymptotic proofs
Continuous Mapping Theorem If \(X_n \xrightarrow{d} X\) and \(g\) is continuous, then \(g(X_n) \xrightarrow{d} g(X)\) Deriving distributions of test statistics
Jensen’s Inequality For convex \(\phi\): \(\phi(E[X]) \leq E[\phi(X)]\) Information theory proofs, ELBO derivation, EM algorithm
Bayes’ Theorem \(P(\theta|x) = \frac{P(x|\theta)P(\theta)}{P(x)}\), i.e., posterior \(\propto\) likelihood \(\times\) prior Bayesian inference, MAP estimation
Cauchy-Schwarz \(|E[XY]|^2 \leq E[X^2]E[Y^2]\); equivalently \(|\mathbf{x}^T\mathbf{y}|^2 \leq \|\mathbf{x}\|^2\|\mathbf{y}\|^2\) Bounding correlations, proving \(|\rho| \leq 1\)
Spectral Theorem Every real symmetric matrix \(A\) has \(A = Q\Lambda Q^T\) with orthogonal \(Q\) and real diagonal \(\Lambda\) PCA, quadratic forms, covariance analysis
Law of Iterated Expectations \(E[X] = E[E[X|Y]]\) Simplifying nested expectations; iterated projections
Law of Total Variance \(\text{Var}(X) = E[\text{Var}(X|Y)] + \text{Var}(E[X|Y])\) Variance decomposition; ANOVA; random effects models
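
The law of total variance can be sanity-checked by simulation on a two-group mixture (illustrative parameters, standard library only):

```python
import random
import statistics as st

random.seed(0)
# Y picks a group with probability 1/2 each; X | Y=g is normal
groups = {0: (0.0, 1.0), 1: (3.0, 2.0)}   # (mean, sd) per group
ys = [random.choice([0, 1]) for _ in range(200_000)]
xs = [random.gauss(*groups[y]) for y in ys]

var_x = st.pvariance(xs)
# Theoretical decomposition: E[Var(X|Y)] + Var(E[X|Y]) with E[X] = 1.5
e_cond_var = 0.5 * 1.0**2 + 0.5 * 2.0**2
var_cond_mean = 0.5 * (0.0 - 1.5)**2 + 0.5 * (3.0 - 1.5)**2
assert abs(var_x - (e_cond_var + var_cond_mean)) < 0.1
```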
