Mathematical Notation Guide

A Guide to Reading the Symbols That Make Your Head Spin

1 Overview

This guide documents the notation conventions used throughout this course. It is organized by domain so you can quickly locate the type of symbol you are looking for. Where this course follows a specific source, it is noted.

Primary references: Hayashi (2000) Econometrics; Bishop (2006) Pattern Recognition and Machine Learning.


2 Section 1: Scalars, Vectors, Matrices, and Variables

The typographic convention determines what kind of object a symbol represents. Learn this once — it applies everywhere.

Convention Object Type Examples
Lowercase italic Scalar (number) \(x\), \(\alpha\), \(n\), \(\lambda\), \(t\)
Lowercase bold Column vector \(\mathbf{x}\), \(\boldsymbol{\beta}\), \(\boldsymbol{\mu}\), \(\mathbf{w}\)
Uppercase italic or bold Matrix \(X\), \(A\), \(\Sigma\), \(\mathbf{W}\), \(\mathbf{H}\)
Uppercase italic Random variable \(X\), \(Y\), \(Z\), \(U\)
Lowercase italic Realization (observed value) \(x\), \(y\), \(z\) (same letter, lowercase)
Calligraphic Set, hypothesis class, space \(\mathcal{X}\), \(\mathcal{H}\), \(\mathcal{F}\), \(\mathcal{D}\)
Blackboard bold Number set, probability, expectation \(\mathbb{R}\), \(\mathbb{E}[X]\), \(\mathbb{P}(A)\)
Warning

Ambiguity alert: In econometrics, \(X\) often does double duty as the data matrix and as a random variable. Context determines the interpretation. When the distinction matters, the data matrix is written explicitly as a fixed \(n \times k\) matrix.

2.1 Indexing Conventions

Notation Meaning
\(x_i\) The \(i\)-th element of vector \(\mathbf{x}\)
\(X_{ij}\) or \(X_{i,j}\) Element in row \(i\), column \(j\) of matrix \(X\)
\(X_{i \cdot}\) or \(x_i^T\) The \(i\)-th row of \(X\) (as a row vector)
\(X_{\cdot j}\) The \(j\)-th column of \(X\) (as a column vector)
\(x_{it}\) Observation \(i\) at time \(t\) (panel data)
\(\beta_j\) The \(j\)-th element of parameter vector \(\boldsymbol\beta\)
\(\lambda_i(A)\) The \(i\)-th eigenvalue of matrix \(A\)
\(\sigma_i(A)\) The \(i\)-th singular value of matrix \(A\)
\(A_{[i:j, k:l]}\) Submatrix (rows \(i\) to \(j\), columns \(k\) to \(l\))
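
The indexing conventions above map directly onto NumPy, with one caveat: the math notation here is 1-indexed while NumPy is 0-indexed (and NumPy slices exclude the endpoint). A small sketch with a hypothetical 3 x 4 matrix:

```python
import numpy as np

# Hypothetical 3x4 matrix to illustrate the indexing table above.
# NOTE: numpy is 0-indexed; the math notation in this guide is 1-indexed.
X = np.arange(12, dtype=float).reshape(3, 4)

x_23 = X[1, 2]        # X_{2,3}: row 2, column 3 (math) -> X[1, 2] (numpy)
row_2 = X[1, :]       # X_{2.}: the 2nd row of X
col_3 = X[:, 2]       # X_{.3}: the 3rd column of X
sub = X[0:2, 1:3]     # A_{[1:2, 2:3]}: rows 1-2, cols 2-3 (math, inclusive)

A = np.array([[2.0, 0.0], [0.0, 3.0]])
eigvals = np.linalg.eigvalsh(A)   # lambda_i(A), ascending, for symmetric A
```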

3 Section 2: Linear Algebra Notation

3.1 Matrix Operations

Symbol Meaning Notes
\(A^T\) or \(A'\) Transpose \((A^T)_{ij} = A_{ji}\); Hayashi uses \('\)
\(A^{-1}\) Inverse Exists iff \(A\) is square and \(\det(A) \neq 0\)
\(A^{-T}\) or \((A^T)^{-1}\) Inverse transpose \((A^{-1})^T = (A^T)^{-1}\)
\(A^{1/2}\) Matrix square root For PSD \(A\): the unique PSD \(B\) s.t. \(B^2 = A\)
\(A^{-1/2}\) Inverse square root \((A^{1/2})^{-1}\); used in whitening transforms
\(A^+\) or \(A^\dagger\) Moore-Penrose pseudoinverse Generalizes inverse to non-square/singular matrices
\(\text{tr}(A)\) Trace \(\text{tr}(A) = \sum_i A_{ii} = \sum_i \lambda_i\)
\(\det(A)\) or \(|A|\) Determinant \(\det(A) = \prod_i \lambda_i\)
\(\text{rank}(A)\) Rank Dimension of column space
\(\text{diag}(a_1, \ldots, a_n)\) Diagonal matrix Square matrix with \(a_i\) on diagonal, 0 elsewhere
\(\text{diag}(A)\) Diagonal extraction Column vector of diagonal elements of \(A\)
\(A \otimes B\) Kronecker product Block matrix; for \(B \in \mathbb{R}^{q \times r}\): \((A \otimes B)_{(i-1)q+k,(j-1)r+l} = A_{ij}B_{kl}\)
\(\text{vec}(A)\) Vectorization Stack columns of \(A\) into a single vector
\(\text{vech}(A)\) Half-vectorization Stack lower triangle (for symmetric \(A\))
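
Several identities in this table are easy to verify numerically. A NumPy sketch (illustrative values, not course code):

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric, positive definite
B = np.array([[0.0, 2.0], [5.0, 1.0]])

# Trace equals the sum of eigenvalues; determinant equals their product.
lam = np.linalg.eigvalsh(A)
assert np.isclose(np.trace(A), lam.sum())
assert np.isclose(np.linalg.det(A), lam.prod())

# (A^{-1})^T == (A^T)^{-1}
assert np.allclose(np.linalg.inv(A).T, np.linalg.inv(A.T))

# Matrix square root of a PSD matrix via its eigendecomposition.
w, V = np.linalg.eigh(A)
A_half = V @ np.diag(np.sqrt(w)) @ V.T
assert np.allclose(A_half @ A_half, A)

# vec stacks columns; numpy flattens row-major by default, so use order="F".
vec_B = B.flatten(order="F")   # (0, 5, 2, 1)

# The pseudoinverse recovers the inverse for nonsingular square matrices.
assert np.allclose(np.linalg.pinv(A), np.linalg.inv(A))
```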

3.2 Subspaces

Symbol Meaning
\(C(A)\) or \(\text{col}(A)\) Column space (range) of \(A\)
\(N(A)\) or \(\ker(A)\) Null space (kernel) of \(A\): \(\{x : Ax = 0\}\)
\(\dim(\mathcal{V})\) Dimension of subspace \(\mathcal{V}\)
\(\mathcal{V}^\perp\) Orthogonal complement of \(\mathcal{V}\)
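
The rank-nullity relationship behind this table, \(\text{rank}(A) + \dim N(A) = \#\text{columns}\), can be checked with a small NumPy sketch (the rank-1 matrix is illustrative):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])   # rank 1: second row = 2 * first row

r = np.linalg.matrix_rank(A)      # dim C(A)

# Null space basis from the SVD: right singular vectors beyond the rank.
U, s, Vt = np.linalg.svd(A)
null_basis = Vt[r:].T             # columns span N(A)

assert r + null_basis.shape[1] == A.shape[1]   # rank-nullity
assert np.allclose(A @ null_basis, 0.0)        # A x = 0 on the null space
```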

3.3 Matrix Norms

Symbol Name Formula
\(\|A\|_F\) Frobenius norm \(\sqrt{\text{tr}(A^TA)} = \sqrt{\sum_{i,j} A_{ij}^2}\)
\(\|A\|_2\) or \(\|A\|\) Spectral norm (operator 2-norm) \(\sigma_{\max}(A)\), largest singular value
\(\|A\|_1\) Matrix 1-norm \(\max_j \sum_i |A_{ij}|\) (max column sum)
\(\|A\|_\infty\) Matrix \(\infty\)-norm \(\max_i \sum_j |A_{ij}|\) (max row sum)
\(\|A\|_*\) Nuclear norm \(\sum_i \sigma_i(A)\), sum of singular values
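
Each matrix norm above is available through NumPy's `ord` argument. A quick sketch with an illustrative 2 x 2 matrix:

```python
import numpy as np

A = np.array([[1.0, -2.0], [3.0, 4.0]])

fro = np.linalg.norm(A, "fro")        # sqrt(1 + 4 + 9 + 16) = sqrt(30)
spec = np.linalg.norm(A, 2)           # largest singular value
one = np.linalg.norm(A, 1)            # max column sum: max(1+3, 2+4) = 6
inf = np.linalg.norm(A, np.inf)       # max row sum: max(1+2, 3+4) = 7
nuc = np.linalg.norm(A, "nuc")        # sum of singular values

assert np.isclose(fro, np.sqrt(30.0))
assert np.isclose(one, 6.0) and np.isclose(inf, 7.0)
# The nuclear norm dominates the spectral norm (it sums all singular values).
assert nuc >= spec
```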

3.4 Positive Definiteness

Symbol Meaning Condition
\(A \succ 0\) Positive definite (PD) \(\mathbf{x}^TA\mathbf{x} > 0\) for all \(\mathbf{x} \neq 0\)
\(A \succeq 0\) Positive semidefinite (PSD) \(\mathbf{x}^TA\mathbf{x} \geq 0\) for all \(\mathbf{x}\)
\(A \prec 0\) Negative definite \(\mathbf{x}^TA\mathbf{x} < 0\) for all \(\mathbf{x} \neq 0\)
\(A \preceq 0\) Negative semidefinite \(\mathbf{x}^TA\mathbf{x} \leq 0\) for all \(\mathbf{x}\)
\(A \succeq B\) \(A - B \succeq 0\) Used in Gauss-Markov comparisons
\(\mathbb{S}^n_+\) PSD cone Set of all \(n \times n\) PSD matrices
\(\mathbb{S}^n_{++}\) PD cone Set of all \(n \times n\) PD matrices
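
In practice, \(A \succ 0\) is checked via the eigenvalues (all positive) or, equivalently, by whether a Cholesky factorization succeeds. A sketch with two illustrative matrices:

```python
import numpy as np

A = np.array([[2.0, -1.0], [-1.0, 2.0]])   # PD (eigenvalues 1 and 3)
B = np.array([[1.0, 2.0], [2.0, 1.0]])     # indefinite (eigenvalues 3 and -1)

def is_pd(M):
    """True if symmetric M is positive definite (Cholesky test)."""
    try:
        np.linalg.cholesky(M)
        return True
    except np.linalg.LinAlgError:
        return False

assert is_pd(A) and not is_pd(B)

# The quadratic-form definition: x^T A x > 0 for nonzero x.
rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.standard_normal(2)
    assert x @ A @ x > 0
```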

4 Section 3: Probability and Statistics Notation

4.1 Basic Probability

Symbol Meaning Notes
\(P(A)\) or \(\Pr(A)\) Probability of event \(A\) \(P: \mathcal{F} \to [0,1]\)
\(p(x)\) or \(f(x)\) Probability mass/density function PMF (discrete) or PDF (continuous)
\(F(x)\) Cumulative distribution function \(F(x) = P(X \leq x)\)
\(F^{-1}(q)\) Quantile function Inverse CDF; the \(q\)-th quantile
\(P(A|B)\) Conditional probability \(P(A|B) = P(A \cap B)/P(B)\)
\(p(x|y)\) Conditional density Density of \(X\) given \(Y = y\)
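
The PDF, CDF, and quantile function correspond to `pdf`, `cdf`, and `ppf` in scipy.stats (ppf, "percent point function", is SciPy's name for \(F^{-1}\)). A standard-normal sketch:

```python
from scipy.stats import norm

x = 1.96
pdf_val = norm.pdf(x)   # phi(1.96), the density
cdf_val = norm.cdf(x)   # Phi(1.96), approximately 0.975
q = norm.ppf(0.975)     # F^{-1}(0.975), approximately 1.96

assert abs(cdf_val - 0.975) < 1e-3
assert abs(q - 1.96) < 1e-2   # ppf inverts cdf
```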

4.2 Moments and Parameters

Symbol Meaning Alternative
\(E[X]\) or \(\mathbb{E}[X]\) Expected value \(\mu\), \(\mu_X\)
\(E[X|Y]\) Conditional expectation Function of \(Y\)
\(\text{Var}(X)\) Variance \(\sigma^2\), \(\sigma_X^2\)
\(\text{Cov}(X,Y)\) Covariance \(\sigma_{XY}\)
\(\text{Cor}(X,Y)\) Correlation \(\rho_{XY} = \text{Cov}(X,Y)/(\sigma_X\sigma_Y)\)
\(M_X(t)\) Moment generating function \(E[e^{tX}]\)
\(\varphi_X(t)\) Characteristic function \(E[e^{itX}]\)

4.3 Distributional Notation

Symbol Meaning
\(X \sim P\) \(X\) is distributed according to distribution \(P\)
\(X \sim N(\mu, \sigma^2)\) \(X\) is normally distributed
\(X \sim N(\boldsymbol\mu, \Sigma)\) \(X\) is multivariate normal
\(X \stackrel{d}{=} Y\) \(X\) and \(Y\) are equal in distribution
\(X \stackrel{d}{\to} Y\) \(X\) converges in distribution to \(Y\)
\(X \perp Y\) \(X\) and \(Y\) are independent
\(X \perp Y \mid Z\) \(X\) and \(Y\) are conditionally independent given \(Z\)
\(X \mid Y = y\) The random variable \(X\) conditioned on \(Y = y\)

4.4 Key Distributions (Notation Summary)

Symbol Distribution
\(\phi(x)\) Standard normal PDF: \(\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\)
\(\Phi(x)\) Standard normal CDF: \(\int_{-\infty}^x \phi(t)\,dt\)
\(N(\mu, \sigma^2)\) Normal with mean \(\mu\), variance \(\sigma^2\)
\(N(\boldsymbol\mu, \Sigma)\) Multivariate normal
\(\chi^2(k)\) or \(\chi^2_k\) Chi-squared with \(k\) degrees of freedom
\(t(k)\) or \(t_k\) Student’s \(t\) with \(k\) degrees of freedom
\(F(m,n)\) or \(F_{m,n}\) \(F\)-distribution with \(m, n\) degrees of freedom
\(\text{Bernoulli}(p)\) Bernoulli with success probability \(p\)
\(\text{Bin}(n,p)\) Binomial
\(\text{Poisson}(\lambda)\) Poisson with rate \(\lambda\)
\(\text{Exp}(\lambda)\) Exponential with rate \(\lambda\)
\(\text{Gamma}(\alpha, \beta)\) Gamma (shape \(\alpha\), rate \(\beta\))
\(\text{Beta}(\alpha, \beta)\) Beta distribution

5 Section 4: Convergence Notation

Essential for asymptotics in econometrics.

Symbol Name Meaning
\(X_n \xrightarrow{p} X\) Convergence in probability \(P(|X_n - X| > \varepsilon) \to 0\) for all \(\varepsilon > 0\)
\(X_n \xrightarrow{d} X\) Convergence in distribution \(F_{X_n}(x) \to F_X(x)\) at continuity points
\(X_n \xrightarrow{a.s.} X\) Almost sure convergence \(P(\lim_{n\to\infty} X_n = X) = 1\)
\(X_n \xrightarrow{L^2} X\) Mean square convergence \(E[(X_n - X)^2] \to 0\)
\(X_n \xrightarrow{L^p} X\) \(L^p\) convergence \(E[|X_n - X|^p] \to 0\)
\(\text{plim}_{n\to\infty} X_n = X\) Probability limit Same as \(\xrightarrow{p}\); Hayashi notation
\(o_p(1)\) Converges to 0 in prob. \(X_n = o_p(1)\) means \(X_n \xrightarrow{p} 0\)
\(o_p(a_n)\) Little-o in probability \(X_n / a_n \xrightarrow{p} 0\)
\(O_p(1)\) Bounded in probability \(\forall \varepsilon > 0\), \(\exists M < \infty\): \(\limsup_n P(|X_n| > M) < \varepsilon\)
\(O_p(a_n)\) Big-O in probability \(X_n / a_n = O_p(1)\)
\(\sqrt{n}\)-consistent Rate of convergence \(\sqrt{n}(\hat{\theta} - \theta) = O_p(1)\); standard for MLE/OLS

Hierarchy (stronger to weaker): a.s. \(\Rightarrow\) \(p\) \(\Rightarrow\) \(d\), and \(L^2\) \(\Rightarrow\) \(p\) \(\Rightarrow\) \(d\). Note that a.s. convergence and \(L^2\) convergence do not imply each other.
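
These concepts can be seen by simulation. A Monte Carlo sketch (illustrative iid Exp(1) example, not from the course): the sample mean satisfies \(\bar{X}_n \xrightarrow{p} 1\), and \(\sqrt{n}(\bar{X}_n - 1)\) is \(O_p(1)\) with variance near 1 by the CLT.

```python
import numpy as np

rng = np.random.default_rng(42)
reps = 500

def sample_means(n):
    """reps sample means, each from n iid Exp(1) draws (E[X] = Var(X) = 1)."""
    return rng.exponential(1.0, size=(reps, n)).mean(axis=1)

# Convergence in probability: P(|X_bar_n - 1| > 0.1) shrinks as n grows.
p_small = np.mean(np.abs(sample_means(100) - 1.0) > 0.1)
p_large = np.mean(np.abs(sample_means(10_000) - 1.0) > 0.1)
assert p_large < p_small

# sqrt(n)-scaled deviations stay bounded in probability, variance near 1.
n = 10_000
z = np.sqrt(n) * (sample_means(n) - 1.0)
assert 0.8 < z.var() < 1.2
```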


6 Section 5: Calculus and Optimization Notation

6.1 Differential Calculus

Symbol Meaning Notes
\(\nabla f(\mathbf{x})\) Gradient Column vector: \((\partial f/\partial x_1, \ldots, \partial f/\partial x_n)^T\)
\(\nabla^2 f(\mathbf{x})\) or \(H_f(\mathbf{x})\) Hessian matrix \((H)_{ij} = \partial^2 f / \partial x_i \partial x_j\)
\(\frac{\partial f}{\partial \mathbf{x}}\) Jacobian (row gradient) Row vector \((\partial f/\partial x_1, \ldots, \partial f/\partial x_n)\); convention varies
\(Df(\mathbf{x})\) Jacobian (general) \(m \times n\) matrix for \(f: \mathbb{R}^n \to \mathbb{R}^m\)
\(\frac{\partial f}{\partial X}\) Matrix derivative Matrix of same shape as \(X\)
\(\frac{d}{dx}\) Total derivative For scalar functions of a scalar
\(\frac{\partial}{\partial x}\) Partial derivative Treating all other variables as constant
\(\dot{x}\) or \(x'\) Time derivative \(dx/dt\) in dynamic systems
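
The gradient definition above is exactly what a finite-difference check computes. A sketch for the illustrative quadratic \(f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^T A \mathbf{x}\) with symmetric \(A\), where \(\nabla f = A\mathbf{x}\) and the Hessian is \(A\):

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric, so grad f = A x
f = lambda x: 0.5 * x @ A @ x
x0 = np.array([1.0, -1.0])

grad_analytic = A @ x0   # (2, -1)

# Central finite differences for each partial derivative.
h = 1e-6
grad_fd = np.array([
    (f(x0 + h * e) - f(x0 - h * e)) / (2 * h)
    for e in np.eye(2)
])
assert np.allclose(grad_analytic, grad_fd, atol=1e-4)
```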

6.2 Optimization

Symbol Meaning
\(\text{argmin}_{x} f(x)\) Minimizer: \(x^* = \text{argmin}_x f(x)\) means \(f(x^*) \leq f(x)\) for all feasible \(x\)
\(\text{argmax}_{x} f(x)\) Maximizer
\(f^* = \min_x f(x)\) Minimum value of \(f\)
\(x^*\) Optimal solution
\(\mathcal{L}(x, \lambda)\) Lagrangian: \(f(x) - \lambda^T g(x)\) (sign convention varies by source)
\(\lambda^*\) Optimal Lagrange multiplier
\(g(x) \leq 0\) Inequality constraint
\(h(x) = 0\) Equality constraint
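
The argmin/min distinction in code, on an illustrative unconstrained problem (scipy.optimize, not course code): `res.x` is \(x^* = \text{argmin}_x f(x)\) and `res.fun` is \(f^* = \min_x f(x)\).

```python
import numpy as np
from scipy.optimize import minimize

# f(x) = (x1 - 1)^2 + (x2 + 2)^2, minimized at x* = (1, -2) with f* = 0.
f = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2
res = minimize(f, x0=np.zeros(2))

x_star = res.x    # argmin_x f(x), the minimizer
f_star = res.fun  # min_x f(x), the minimum value

assert np.allclose(x_star, [1.0, -2.0], atol=1e-4)
assert f_star < 1e-8
```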

6.3 Vector Norms

Symbol Name Formula
\(\|\mathbf{x}\|\) or \(\|\mathbf{x}\|_2\) Euclidean / \(L^2\) norm \(\sqrt{\sum_i x_i^2}\)
\(\|\mathbf{x}\|_1\) \(L^1\) norm / Manhattan \(\sum_i |x_i|\)
\(\|\mathbf{x}\|_p\) \(L^p\) norm \(\left(\sum_i |x_i|^p\right)^{1/p}\)
\(\|\mathbf{x}\|_\infty\) \(L^\infty\) / Chebyshev norm \(\max_i |x_i|\)
\(\|\mathbf{x}\|_A\) \(A\)-weighted norm \(\sqrt{\mathbf{x}^T A \mathbf{x}}\) (for \(A \succ 0\))
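
The vector norms above in NumPy, including the \(A\)-weighted norm for a hypothetical PD weight matrix:

```python
import numpy as np

x = np.array([3.0, -4.0])

l2 = np.linalg.norm(x)            # sqrt(9 + 16) = 5
l1 = np.linalg.norm(x, 1)         # 3 + 4 = 7
linf = np.linalg.norm(x, np.inf)  # max(3, 4) = 4

# A-weighted norm, A positive definite (illustrative diagonal weights).
A = np.array([[2.0, 0.0], [0.0, 1.0]])
xA = np.sqrt(x @ A @ x)           # sqrt(2*9 + 16) = sqrt(34)

assert l2 == 5.0 and l1 == 7.0 and linf == 4.0
assert np.isclose(xA, np.sqrt(34.0))
```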

7 Section 6: Machine Learning Specific Notation

Symbol Meaning Notes
\(\mathcal{L}(\theta)\) or \(L(\theta)\) Loss function Sometimes \(\ell\) for log-likelihood
\(\mathcal{L}(\theta; \mathcal{D})\) Loss given dataset \(\mathcal{D}\) Explicit data dependence
\(\hat{y}\) Predicted value \(\hat{y} = f(\mathbf{x}; \theta)\)
\(\mathbb{1}[\cdot]\) or \(\mathbf{1}[\cdot]\) Indicator function \(\mathbb{1}[A] = 1\) if \(A\) true, 0 otherwise
\(W^{[l]}\) Weight matrix of layer \(l\) Superscript in brackets for layer index
\(b^{[l]}\) Bias vector of layer \(l\)
\(a^{[l]}\) Activation of layer \(l\) \(a^{[l]} = \sigma(W^{[l]}a^{[l-1]} + b^{[l]})\)
\(\odot\) Hadamard (element-wise) product \((A \odot B)_{ij} = A_{ij} B_{ij}\)
\(\sigma(\cdot)\) Sigmoid function \(\sigma(x) = 1/(1+e^{-x})\); also used for std dev
\(D_{KL}(P \| Q)\) KL divergence from \(Q\) to \(P\) \(\sum_x p(x)\log(p(x)/q(x))\)
\(H(P)\) Entropy of \(P\) \(-\sum_x p(x)\log p(x)\)
\(H(P, Q)\) Cross-entropy \(H(P) + D_{KL}(P\|Q)\)
\(I(X; Y)\) Mutual information \(H(X) - H(X|Y)\)
\(\mathcal{D}\) Dataset \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n\)
\(\mathcal{H}\) Hypothesis class Set of candidate functions
\(f_\theta\) Parameterized function Neural network / model with parameters \(\theta\)
\(\phi(\mathbf{x})\) Feature map \(\phi: \mathcal{X} \to \mathcal{F}\); used in kernel methods
\(k(\mathbf{x}, \mathbf{x}')\) Kernel function \(k(\mathbf{x}, \mathbf{x}') = \langle\phi(\mathbf{x}), \phi(\mathbf{x}')\rangle\)
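
The information-theoretic identities in this table can be checked directly for small discrete distributions (illustrative values):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.25, 0.25, 0.5])

H_p = -np.sum(p * np.log(p))        # entropy H(P)
D_kl = np.sum(p * np.log(p / q))    # D_KL(P || Q), always >= 0
H_pq = -np.sum(p * np.log(q))       # cross-entropy H(P, Q)

# Identity from the table: H(P, Q) = H(P) + D_KL(P || Q).
assert np.isclose(H_pq, H_p + D_kl)
assert D_kl >= 0.0

# Sigmoid and Hadamard product from the table.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
assert sigmoid(0.0) == 0.5
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[2.0, 0.5], [1.0, 0.25]])
assert np.allclose(A * B, [[2.0, 1.0], [3.0, 1.0]])  # A ⊙ B, elementwise
```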

8 Section 7: Econometrics Specific Notation

Symbol Meaning Notes
\(\hat{\beta}\) OLS/estimator of \(\beta\) Hat = estimated from data
\(\tilde{\beta}\) Some other estimator of \(\beta\) E.g., GLS, IV, restricted estimator
\(\bar{\beta}\) Infeasible/pseudo estimator Rarely used
\(\beta_0\) True parameter value Population value under the DGP
\(\text{plim}\) Probability limit \(\text{plim}_{n\to\infty}\hat{\beta}\) as \(n \to \infty\)
\(\mathbf{1}_n\) or \(\iota_n\) \(n\)-vector of ones \(\iota_n = (1, 1, \ldots, 1)^T \in \mathbb{R}^n\)
\(M_X\) Annihilator / residual maker \(M_X = I_n - X(X^TX)^{-1}X^T\); \(M_X y = \hat{\varepsilon}\)
\(P_X\) Projection / hat matrix \(P_X = X(X^TX)^{-1}X^T\); \(P_X y = \hat{y}\)
\(H\) Hat matrix Same as \(P_X\); \(H_{ii}\): leverage of observation \(i\)
\(\hat{\varepsilon}\) or \(\hat{u}\) OLS residuals \(\hat{\varepsilon} = y - X\hat{\beta} = M_X y\)
\(\varepsilon\) or \(u\) True disturbances \(y = X\beta + \varepsilon\); unobserved
\(\tilde{x}_{it}\) Within-demeaned variable \(\tilde{x}_{it} = x_{it} - \bar{x}_i\) (panel FE)
\(\bar{x}_i\) Within-group mean \(\bar{x}_i = T^{-1}\sum_t x_{it}\)
\(n\) Number of observations Cross-sectional units
\(T\) Time periods Panel time dimension
\(k\) Number of regressors Columns of \(X\) (including intercept)
\(q\) Number of restrictions In \(F\)-test: \(R\beta = r\) has \(q\) rows
\(Z\) Instrument matrix In IV/2SLS: \(n \times \ell\) matrix, \(\ell \geq k\)
\(\Omega\) Error covariance matrix \(\text{Var}(\varepsilon | X) = \sigma^2 \Omega\) in GLS
\(\hat{\sigma}^2\) Estimated error variance \(\hat{\sigma}^2 = \hat{\varepsilon}^T\hat{\varepsilon}/(n-k)\)
\(\text{SSR}\) Sum of squared residuals \(\hat{\varepsilon}^T\hat{\varepsilon}\)
\(\text{SST}\) Total sum of squares \((y - \bar{y}\iota)^T(y - \bar{y}\iota) = y^TM_\iota y\)
\(\text{SSE}\) Explained sum of squares \(\text{SST} - \text{SSR}\); caution: some texts use SSE for the sum of squared errors (i.e., SSR)
\(R^2\) Coefficient of determination \(1 - \text{SSR}/\text{SST}\)
\(\bar{R}^2\) Adjusted \(R^2\) \(1 - (1-R^2)(n-1)/(n-k)\)
\(\text{AIC}\) Akaike information criterion \(-2\ell + 2k\)
\(\text{BIC}\) Bayesian information criterion \(-2\ell + k\ln(n)\)
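
A minimal OLS simulation (hypothetical DGP, not from the course materials) ties together \(\hat{\beta}\), \(P_X\), \(M_X\), the residuals, \(\hat{\sigma}^2\), and \(R^2\) from this table:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.standard_normal(n)   # y = X beta + eps

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # OLS estimator
P = X @ np.linalg.inv(X.T @ X) @ X.T          # projection (hat) matrix P_X
M = np.eye(n) - P                             # annihilator M_X
e_hat = y - X @ beta_hat                      # residuals

# M_X y gives the residuals; P_X y gives the fitted values.
assert np.allclose(M @ y, e_hat)
assert np.allclose(P @ y, X @ beta_hat)
# P_X and M_X are idempotent and mutually orthogonal.
assert np.allclose(P @ P, P) and np.allclose(P @ M, 0.0, atol=1e-10)

sigma2_hat = (e_hat @ e_hat) / (n - k)        # estimated error variance
SSR = e_hat @ e_hat
SST = (y - y.mean()) @ (y - y.mean())
R2 = 1.0 - SSR / SST
assert 0.0 <= R2 <= 1.0
```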
