---
title: "Mathematical Notation Guide"
subtitle: "A Guide to Reading the Symbols That Make Your Head Spin"
---
## Overview
This guide documents the notation conventions used throughout this course. It is organized by domain so you can quickly locate the type of symbol you are looking for. Where this course follows a specific source, it is noted.
**Primary references:** Hayashi (2000) *Econometrics*; Bishop (2006) *Pattern Recognition and Machine Learning*.
---
## Section 1: Scalars, Vectors, Matrices, and Variables
The typographic convention determines what kind of object a symbol represents. Learn this once — it applies everywhere.
| Convention | Object Type | Examples |
|------------|-------------|---------|
| Lowercase italic | Scalar (number) | $x$, $\alpha$, $n$, $\lambda$, $t$ |
| Lowercase bold | Column vector | $\mathbf{x}$, $\boldsymbol{\beta}$, $\boldsymbol{\mu}$, $\mathbf{w}$ |
| Uppercase italic or bold | Matrix | $X$, $A$, $\Sigma$, $\mathbf{W}$, $\mathbf{H}$ |
| Uppercase italic | Random variable | $X$, $Y$, $Z$, $U$ |
| Lowercase italic | Realization (observed value) | $x$, $y$, $z$ (same letter, lowercase) |
| Calligraphic | Set, hypothesis class, space | $\mathcal{X}$, $\mathcal{H}$, $\mathcal{F}$, $\mathcal{D}$ |
| Blackboard bold | Number set, probability, expectation | $\mathbb{R}$, $\mathbb{E}[X]$, $\mathbb{P}(A)$ |
::: {.callout-warning}
**Ambiguity alert:** In econometrics, $X$ often does double duty as the $n \times k$ data matrix and as a generic random variable. Context determines the interpretation; when the distinction matters, the data matrix is written explicitly as a fixed $n \times k$ matrix.
:::
### Indexing Conventions
| Notation | Meaning |
|----------|---------|
| $x_i$ | The $i$-th element of vector $\mathbf{x}$ |
| $X_{ij}$ or $X_{i,j}$ | Element in row $i$, column $j$ of matrix $X$ |
| $X_{i \cdot}$ or $x_i^T$ | The $i$-th row of $X$ (as a row vector) |
| $X_{\cdot j}$ | The $j$-th column of $X$ (as a column vector) |
| $x_{it}$ | Observation $i$ at time $t$ (panel data) |
| $\beta_j$ | The $j$-th element of parameter vector $\boldsymbol\beta$ |
| $\lambda_i(A)$ | The $i$-th eigenvalue of matrix $A$ |
| $\sigma_i(A)$ | The $i$-th singular value of matrix $A$ |
| $A_{[i:j, k:l]}$ | Submatrix (rows $i$ to $j$, columns $k$ to $l$) |
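
The indexing conventions above map almost directly onto NumPy slicing. A minimal sketch (the example matrices are invented, and note that NumPy is 0-based while the math is 1-based):

```python
import numpy as np

X = np.arange(12).reshape(3, 4)  # a small 3x4 example matrix

elem = X[1, 2]      # X_{ij}: entry in row i, column j (0-based here)
row = X[1, :]       # X_{i.}: the i-th row
col = X[:, 2]       # X_{.j}: the j-th column
sub = X[0:2, 1:3]   # A_{[i:j, k:l]}: Python slices exclude the endpoint

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigvals = np.linalg.eigvalsh(A)                # lambda_i(A), ascending order
singvals = np.linalg.svd(A, compute_uv=False)  # sigma_i(A), descending order
```

Note the ordering conventions: `eigvalsh` returns eigenvalues in ascending order, while singular values come back descending, so $\lambda_i$ and $\sigma_i$ in a paper may not match library output order.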
---
## Section 2: Linear Algebra Notation
### Matrix Operations
| Symbol | Meaning | Notes |
|--------|---------|-------|
| $A^T$ or $A'$ | Transpose | $(A^T)_{ij} = A_{ji}$; Hayashi uses $'$ |
| $A^{-1}$ | Inverse | Exists iff $A$ is square and $\det(A) \neq 0$ |
| $A^{-T}$ or $(A^T)^{-1}$ | Inverse transpose | $(A^{-1})^T = (A^T)^{-1}$ |
| $A^{1/2}$ | Matrix square root | For PSD $A$: symmetric $B$ s.t. $B^2 = A$ |
| $A^{-1/2}$ | Inverse square root | $(A^{1/2})^{-1}$; used in whitening transforms |
| $A^+$ or $A^\dagger$ | Moore-Penrose pseudoinverse | Generalizes inverse to non-square/singular matrices |
| $\text{tr}(A)$ | Trace | $\text{tr}(A) = \sum_i A_{ii} = \sum_i \lambda_i$ |
| $\det(A)$ or $\lvert A\rvert$ | Determinant | $\det(A) = \prod_i \lambda_i$ |
| $\text{rank}(A)$ | Rank | Dimension of column space |
| $\text{diag}(a_1, \ldots, a_n)$ | Diagonal matrix | Square matrix with $a_i$ on diagonal, 0 elsewhere |
| $\text{diag}(A)$ | Diagonal extraction | Column vector of diagonal elements of $A$ |
| $A \otimes B$ | Kronecker product | Block matrix; $(A \otimes B)_{(i-1)q+k,(j-1)r+l} = A_{ij}B_{kl}$ |
| $\text{vec}(A)$ | Vectorization | Stack columns of $A$ into a single vector |
| $\text{vech}(A)$ | Half-vectorization | Stack lower triangle (for symmetric $A$) |
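
Several identities in this table are easy to sanity-check numerically. A NumPy sketch (the random matrices are purely illustrative; `vec` is a helper defined here, not a library function):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))

# tr(A) = sum of eigenvalues; det(A) = product of eigenvalues
eig = np.linalg.eigvals(A)
trace_ok = np.isclose(np.trace(A), eig.sum().real)
det_ok = np.isclose(np.linalg.det(A), np.prod(eig).real)

def vec(M):
    """vec(M): stack the columns of M into one long vector."""
    return M.reshape(-1, order="F")  # column-major ("Fortran") order

# The standard identity vec(A X B) = (B^T kron A) vec(X)
X = rng.standard_normal((3, 2))
B = rng.standard_normal((2, 4))
kron_ok = np.allclose(vec(A @ X @ B), np.kron(B.T, A) @ vec(X))
```

The `vec`/Kronecker identity is the workhorse behind many matrix-calculus derivations (e.g., turning matrix equations into ordinary linear systems).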
### Subspaces
| Symbol | Meaning |
|--------|---------|
| $C(A)$ or $\text{col}(A)$ | Column space (range) of $A$ |
| $N(A)$ or $\ker(A)$ | Null space (kernel) of $A$: $\{x : Ax = 0\}$ |
| $\dim(\mathcal{V})$ | Dimension of subspace $\mathcal{V}$ |
| $\mathcal{V}^\perp$ | Orthogonal complement of $\mathcal{V}$ |
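
In practice the rank and a basis for the null space can both be read off an SVD. A small sketch with a deliberately rank-deficient matrix (example values invented):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])  # rank 1: second row = 2 * first row

U, s, Vh = np.linalg.svd(A)
rank = np.linalg.matrix_rank(A)  # rank(A) = dim C(A)
null_basis = Vh[rank:].T         # orthonormal basis for N(A) = ker(A)
```

By rank–nullity, $\text{rank}(A) + \dim N(A)$ equals the number of columns, which the shapes above confirm.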
### Matrix Norms
| Symbol | Name | Formula |
|--------|------|---------|
| $\Vert A\Vert_F$ | Frobenius norm | $\sqrt{\text{tr}(A^TA)} = \sqrt{\sum_{i,j} A_{ij}^2}$ |
| $\Vert A\Vert_2$ or $\Vert A\Vert$ | Spectral norm (operator 2-norm) | $\sigma_{\max}(A)$, largest singular value |
| $\Vert A\Vert_1$ | Matrix 1-norm | $\max_j \sum_i \lvert A_{ij}\rvert$ (max column sum) |
| $\Vert A\Vert_\infty$ | Matrix $\infty$-norm | $\max_i \sum_j \lvert A_{ij}\rvert$ (max row sum) |
| $\Vert A\Vert_*$ | Nuclear norm | $\sum_i \sigma_i(A)$, sum of singular values |
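
All of these are available through `numpy.linalg.norm` via its `ord` argument. A quick check against the formulas (the example matrix is arbitrary):

```python
import numpy as np

A = np.array([[1.0, -2.0],
              [3.0,  4.0]])

fro = np.linalg.norm(A, "fro")        # Frobenius: sqrt(1 + 4 + 9 + 16)
spec = np.linalg.norm(A, 2)           # spectral norm = sigma_max(A)
one = np.linalg.norm(A, 1)            # max column sum: max(4, 6) = 6
inf_norm = np.linalg.norm(A, np.inf)  # max row sum: max(3, 7) = 7
nuc = np.linalg.norm(A, "nuc")        # nuclear norm = sum of singular values
```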
### Positive Definiteness
| Symbol | Meaning | Condition |
|--------|---------|-----------|
| $A \succ 0$ | Positive definite (PD) | $\mathbf{x}^TA\mathbf{x} > 0$ for all $\mathbf{x} \neq 0$ |
| $A \succeq 0$ | Positive semidefinite (PSD) | $\mathbf{x}^TA\mathbf{x} \geq 0$ for all $\mathbf{x}$ |
| $A \prec 0$ | Negative definite | $\mathbf{x}^TA\mathbf{x} < 0$ for all $\mathbf{x} \neq 0$ |
| $A \preceq 0$ | Negative semidefinite | $\mathbf{x}^TA\mathbf{x} \leq 0$ for all $\mathbf{x}$ |
| $A \succeq B$ | $A - B \succeq 0$ | Used in Gauss-Markov comparisons |
| $\mathbb{S}^n_+$ | PSD cone | Set of all $n \times n$ PSD matrices |
| $\mathbb{S}^n_{++}$ | PD cone | Set of all $n \times n$ PD matrices |
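
A common numerical test for $A \succ 0$ is to check that all eigenvalues of the (symmetric) matrix are positive. A sketch, where `is_pd` and `is_psd` are hypothetical helper names, not library functions:

```python
import numpy as np

def is_pd(A, tol=0.0):
    """True if the symmetric matrix A is positive definite."""
    return bool(np.all(np.linalg.eigvalsh(A) > tol))

def is_psd(A, tol=1e-10):
    """True if the symmetric matrix A is positive semidefinite."""
    return bool(np.all(np.linalg.eigvalsh(A) >= -tol))

A = np.array([[2.0, 1.0], [1.0, 2.0]])  # eigenvalues 1, 3  -> PD
B = np.array([[1.0, 2.0], [2.0, 1.0]])  # eigenvalues -1, 3 -> indefinite

# The Loewner order A >= B means A - B is PSD (as in Gauss-Markov comparisons)
loewner = is_psd(A - B)
```

An equivalent (and faster) check is whether `np.linalg.cholesky(A)` succeeds, since Cholesky factorization exists exactly for PD matrices.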
---
## Section 3: Probability and Statistics Notation
### Basic Probability
| Symbol | Meaning | Notes |
|--------|---------|-------|
| $P(A)$ or $\Pr(A)$ | Probability of event $A$ | $P: \mathcal{F} \to [0,1]$ |
| $p(x)$ or $f(x)$ | Probability mass/density function | PMF (discrete) or PDF (continuous) |
| $F(x)$ | Cumulative distribution function | $F(x) = P(X \leq x)$ |
| $F^{-1}(q)$ | Quantile function | Inverse CDF; the $q$-th quantile |
| $P(A \mid B)$ | Conditional probability | $P(A \mid B) = P(A \cap B)/P(B)$ |
| $p(x \mid y)$ | Conditional density | Density of $X$ given $Y = y$ |
### Moments and Parameters
| Symbol | Meaning | Alternative |
|--------|---------|------------|
| $E[X]$ or $\mathbb{E}[X]$ | Expected value | $\mu$, $\mu_X$ |
| $E[X \mid Y]$ | Conditional expectation | Function of $Y$ |
| $\text{Var}(X)$ | Variance | $\sigma^2$, $\sigma_X^2$ |
| $\text{Cov}(X,Y)$ | Covariance | $\sigma_{XY}$ |
| $\text{Cor}(X,Y)$ | Correlation | $\rho_{XY} = \text{Cov}(X,Y)/(\sigma_X\sigma_Y)$ |
| $M_X(t)$ | Moment generating function | $E[e^{tX}]$ |
| $\varphi_X(t)$ | Characteristic function | $E[e^{itX}]$ |
### Distributional Notation
| Symbol | Meaning |
|--------|---------|
| $X \sim P$ | $X$ is distributed according to distribution $P$ |
| $X \sim N(\mu, \sigma^2)$ | $X$ is normally distributed |
| $X \sim N(\boldsymbol\mu, \Sigma)$ | $X$ is multivariate normal |
| $X \stackrel{d}{=} Y$ | $X$ and $Y$ are equal in distribution |
| $X \stackrel{d}{\to} Y$ | $X$ converges in distribution to $Y$ |
| $X \perp Y$ | $X$ and $Y$ are independent |
| $X \perp Y \mid Z$ | $X$ and $Y$ are conditionally independent given $Z$ |
| $X \mid Y = y$ | The random variable $X$ conditioned on $Y = y$ |
### Key Distributions (Notation Summary)
| Symbol | Distribution |
|--------|-------------|
| $\phi(x)$ | Standard normal PDF: $\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$ |
| $\Phi(x)$ | Standard normal CDF: $\int_{-\infty}^x \phi(t)\,dt$ |
| $N(\mu, \sigma^2)$ | Normal with mean $\mu$, variance $\sigma^2$ |
| $N(\boldsymbol\mu, \Sigma)$ | Multivariate normal |
| $\chi^2(k)$ or $\chi^2_k$ | Chi-squared with $k$ degrees of freedom |
| $t(k)$ or $t_k$ | Student's $t$ with $k$ degrees of freedom |
| $F(m,n)$ or $F_{m,n}$ | $F$-distribution with $m, n$ degrees of freedom |
| $\text{Bernoulli}(p)$ | Bernoulli with success probability $p$ |
| $\text{Bin}(n,p)$ | Binomial |
| $\text{Poisson}(\lambda)$ | Poisson with rate $\lambda$ |
| $\text{Exp}(\lambda)$ | Exponential with rate $\lambda$ |
| $\text{Gamma}(\alpha, \beta)$ | Gamma (shape $\alpha$, rate $\beta$) |
| $\text{Beta}(\alpha, \beta)$ | Beta distribution |
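
The standard normal PDF and CDF from the first two rows can be written with nothing but the Python standard library, using the identity $\Phi(x) = \tfrac{1}{2}\bigl(1 + \operatorname{erf}(x/\sqrt{2})\bigr)$:

```python
import math

def phi(x):
    """Standard normal PDF: exp(-x^2/2) / sqrt(2*pi)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```

Familiar landmarks follow immediately: $\Phi(0) = 0.5$ and $\Phi(1.96) \approx 0.975$, the source of the ubiquitous 1.96 critical value.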
---
## Section 4: Convergence Notation
Essential for asymptotics in econometrics.
| Symbol | Name | Meaning |
|--------|------|---------|
| $X_n \xrightarrow{p} X$ | Convergence in probability | $P(\lvert X_n - X\rvert > \varepsilon) \to 0$ for all $\varepsilon > 0$ |
| $X_n \xrightarrow{d} X$ | Convergence in distribution | $F_{X_n}(x) \to F_X(x)$ at continuity points |
| $X_n \xrightarrow{a.s.} X$ | Almost sure convergence | $P(\lim_{n\to\infty} X_n = X) = 1$ |
| $X_n \xrightarrow{L^2} X$ | Mean square convergence | $E[(X_n - X)^2] \to 0$ |
| $X_n \xrightarrow{L^p} X$ | $L^p$ convergence | $E[\lvert X_n - X\rvert^p] \to 0$ |
| $\text{plim}_{n\to\infty} X_n = X$ | Probability limit | Same as $\xrightarrow{p}$; Hayashi notation |
| $o_p(1)$ | Converges to 0 in prob. | $X_n = o_p(1)$ means $X_n \xrightarrow{p} 0$ |
| $o_p(a_n)$ | Little-o in probability | $X_n / a_n \xrightarrow{p} 0$ |
| $O_p(1)$ | Bounded in probability | $\forall \varepsilon > 0$, $\exists M < \infty$: $\limsup_n P(\lvert X_n\rvert > M) < \varepsilon$ |
| $O_p(a_n)$ | Big-O in probability | $X_n / a_n = O_p(1)$ |
| $\sqrt{n}$-consistent | Rate of convergence | $\sqrt{n}(\hat{\theta} - \theta) = O_p(1)$; standard for MLE/OLS |
**Hierarchy:** a.s. $\Rightarrow$ $p$ $\Rightarrow$ $d$, and $L^p$ $\Rightarrow$ $p$ (so in particular $L^2$ $\Rightarrow$ $p$). Note that almost sure and $L^p$ convergence do not imply each other.
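
These concepts are easiest to internalize by simulation. A quick Monte Carlo sketch: the sample mean of iid $U(0,1)$ draws converges in probability to $\mu = 1/2$, and is $\sqrt{n}$-consistent, so $\sqrt{n}\,\lvert\bar{X}_n - \mu\rvert$ stays bounded while $\lvert\bar{X}_n - \mu\rvert$ shrinks (the seed and sample sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
mu = 0.5  # population mean of U(0, 1)

errors = {}  # |xbar_n - mu|: shrinks toward 0 (consistency)
scaled = {}  # sqrt(n) * |xbar_n - mu|: stays bounded, i.e. O_p(1)
for n in (100, 10_000, 1_000_000):
    x = rng.uniform(0.0, 1.0, size=n)
    err = abs(x.mean() - mu)
    errors[n] = err
    scaled[n] = np.sqrt(n) * err
```

By the CLT, the scaled errors hover around $\sigma = 1/\sqrt{12} \approx 0.29$ regardless of $n$, which is exactly what $O_p(1)$ is capturing.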
---
## Section 5: Calculus and Optimization Notation
### Differential Calculus
| Symbol | Meaning | Notes |
|--------|---------|-------|
| $\nabla f(\mathbf{x})$ | Gradient | Column vector: $(\partial f/\partial x_1, \ldots, \partial f/\partial x_n)^T$ |
| $\nabla^2 f(\mathbf{x})$ or $H_f(\mathbf{x})$ | Hessian matrix | $(H)_{ij} = \partial^2 f / \partial x_i \partial x_j$ |
| $\frac{\partial f}{\partial \mathbf{x}}$ | Jacobian (row gradient) | Row vector $(\partial f/\partial x_1, \ldots, \partial f/\partial x_n)$; convention varies |
| $Df(\mathbf{x})$ | Jacobian (general) | $m \times n$ matrix for $f: \mathbb{R}^n \to \mathbb{R}^m$ |
| $\frac{\partial f}{\partial X}$ | Matrix derivative | Matrix of same shape as $X$ |
| $\frac{d}{dx}$ | Total derivative | For scalar functions of a scalar |
| $\frac{\partial}{\partial x}$ | Partial derivative | Treating all other variables as constant |
| $\dot{x}$ or $x'$ | Time derivative | $dx/dt$ in dynamic systems |
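
The column-vector gradient convention can be checked numerically with central finite differences, here against the standard closed form $\nabla(\mathbf{x}^TA\mathbf{x}) = 2A\mathbf{x}$ for symmetric $A$ (the helper `num_grad` is an illustrative name, not a library function):

```python
import numpy as np

def num_grad(f, x, h=1e-6):
    """Central-difference approximation to the gradient (column convention)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])   # symmetric
f = lambda z: z @ A @ z      # f(x) = x^T A x
x0 = np.array([1.0, -2.0])

g_num = num_grad(f, x0)
g_exact = 2.0 * (A @ x0)     # gradient of x^T A x when A is symmetric
```

This kind of finite-difference check is the standard sanity test for hand-derived gradients before trusting them in an optimizer.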
### Optimization
| Symbol | Meaning |
|--------|---------|
| $\text{argmin}_{x} f(x)$ | Minimizer: $x^* = \text{argmin}_x f(x)$ means $f(x^*) \leq f(x)$ for all feasible $x$ |
| $\text{argmax}_{x} f(x)$ | Maximizer |
| $f^* = \min_x f(x)$ | Minimum value of $f$ |
| $x^*$ | Optimal solution |
| $\mathcal{L}(x, \lambda)$ | Lagrangian, e.g. $f(x) + \lambda^T g(x)$ for constraints $g(x) \leq 0$ (sign conventions vary by source) |
| $\lambda^*$ | Optimal Lagrange multiplier |
| $g(x) \leq 0$ | Inequality constraint |
| $h(x) = 0$ | Equality constraint |
### Vector Norms
| Symbol | Name | Formula |
|--------|------|---------|
| $\Vert\mathbf{x}\Vert$ or $\Vert\mathbf{x}\Vert_2$ | Euclidean / $L^2$ norm | $\sqrt{\sum_i x_i^2}$ |
| $\Vert\mathbf{x}\Vert_1$ | $L^1$ norm / Manhattan | $\sum_i \lvert x_i\rvert$ |
| $\Vert\mathbf{x}\Vert_p$ | $L^p$ norm | $\left(\sum_i \lvert x_i\rvert^p\right)^{1/p}$ |
| $\Vert\mathbf{x}\Vert_\infty$ | $L^\infty$ / Chebyshev norm | $\max_i \lvert x_i\rvert$ |
| $\Vert\mathbf{x}\Vert_A$ | $A$-weighted norm | $\sqrt{\mathbf{x}^T A \mathbf{x}}$ (for $A \succ 0$) |
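
A quick NumPy check of the vector-norm formulas on the classic 3-4-5 example (values invented for illustration):

```python
import numpy as np

x = np.array([3.0, -4.0])

l2 = np.linalg.norm(x)            # Euclidean: sqrt(9 + 16) = 5
l1 = np.linalg.norm(x, 1)         # |3| + |-4| = 7
linf = np.linalg.norm(x, np.inf)  # max(|3|, |-4|) = 4

# A-weighted norm (A must be positive definite)
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
xA = np.sqrt(x @ A @ x)           # sqrt(2*9 + 16) = sqrt(34)
```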
---
## Section 6: Machine Learning Specific Notation
| Symbol | Meaning | Notes |
|--------|---------|-------|
| $\mathcal{L}(\theta)$ or $L(\theta)$ | Loss function | Sometimes $\ell$ for log-likelihood |
| $\mathcal{L}(\theta; \mathcal{D})$ | Loss given dataset $\mathcal{D}$ | Explicit data dependence |
| $\hat{y}$ | Predicted value | $\hat{y} = f(\mathbf{x}; \theta)$ |
| $\mathbb{1}[\cdot]$ or $\mathbf{1}[\cdot]$ | Indicator function | $\mathbb{1}[A] = 1$ if $A$ true, 0 otherwise |
| $W^{[l]}$ | Weight matrix of layer $l$ | Superscript in brackets for layer index |
| $b^{[l]}$ | Bias vector of layer $l$ | |
| $a^{[l]}$ | Activation of layer $l$ | $a^{[l]} = \sigma(W^{[l]}a^{[l-1]} + b^{[l]})$ |
| $\odot$ | Hadamard (element-wise) product | $(A \odot B)_{ij} = A_{ij} B_{ij}$ |
| $\sigma(\cdot)$ | Sigmoid function | $\sigma(x) = 1/(1+e^{-x})$; also used for std dev |
| $D_{KL}(P \,\Vert\, Q)$ | KL divergence of $P$ from $Q$ | $\sum_x p(x)\log(p(x)/q(x))$ |
| $H(P)$ | Entropy of $P$ | $-\sum_x p(x)\log p(x)$ |
| $H(P, Q)$ | Cross-entropy | $H(P) + D_{KL}(P \,\Vert\, Q)$ |
| $I(X; Y)$ | Mutual information | $H(X) - H(X \mid Y)$ |
| $\mathcal{D}$ | Dataset | $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ |
| $\mathcal{H}$ | Hypothesis class | Set of candidate functions |
| $f_\theta$ | Parameterized function | Neural network / model with parameters $\theta$ |
| $\phi(\mathbf{x})$ | Feature map | $\phi: \mathcal{X} \to \mathcal{F}$; used in kernel methods |
| $k(\mathbf{x}, \mathbf{x}')$ | Kernel function | $k(\mathbf{x}, \mathbf{x}') = \langle\phi(\mathbf{x}), \phi(\mathbf{x}')\rangle$ |
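
Several rows of this table are one-liners in NumPy. A sketch of the sigmoid, entropy, KL divergence, and the cross-entropy identity $H(P, Q) = H(P) + D_{KL}(P \,\Vert\, Q)$, using natural logs throughout (the distributions $p$, $q$ are invented examples, assumed to have full support):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-x))

def entropy(p):
    """H(P) = -sum p log p (natural log)."""
    return -np.sum(p * np.log(p))

def kl(p, q):
    """D_KL(P || Q) = sum p log(p/q)."""
    return np.sum(p * np.log(p / q))

def cross_entropy(p, q):
    """H(P, Q) = -sum p log q."""
    return -np.sum(p * np.log(q))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
```

ML libraries often use base-2 or base-$e$ logs interchangeably here; the identities hold either way as long as the base is consistent.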
---
## Section 7: Econometrics Specific Notation
| Symbol | Meaning | Notes |
|--------|---------|-------|
| $\hat{\beta}$ | OLS/estimator of $\beta$ | Hat = estimated from data |
| $\tilde{\beta}$ | Some other estimator of $\beta$ | E.g., GLS, IV, restricted estimator |
| $\bar{\beta}$ | Infeasible/pseudo estimator | Rarely used |
| $\beta_0$ | True parameter value | Population value under the DGP |
| $\text{plim}$ | Probability limit | Limit in probability as $n \to \infty$; for a consistent estimator, $\text{plim}_{n\to\infty}\hat{\beta} = \beta_0$ |
| $\mathbf{1}_n$ or $\iota_n$ | $n$-vector of ones | $\iota_n = (1, 1, \ldots, 1)^T \in \mathbb{R}^n$ |
| $M_X$ | Annihilator / residual maker | $M_X = I_n - X(X^TX)^{-1}X^T$; $M_X y = \hat{\varepsilon}$ |
| $P_X$ | Projection / hat matrix | $P_X = X(X^TX)^{-1}X^T$; $P_X y = \hat{y}$ |
| $H$ | Hat matrix | Same as $P_X$; $H_{ii}$: leverage of observation $i$ |
| $\hat{\varepsilon}$ or $\hat{u}$ | OLS residuals | $\hat{\varepsilon} = y - X\hat{\beta} = M_X y$ |
| $\varepsilon$ or $u$ | True disturbances | $y = X\beta + \varepsilon$; unobserved |
| $\tilde{x}_{it}$ | Within-demeaned variable | $\tilde{x}_{it} = x_{it} - \bar{x}_i$ (panel FE) |
| $\bar{x}_i$ | Within-group mean | $\bar{x}_i = T^{-1}\sum_t x_{it}$ |
| $n$ | Number of observations | Cross-sectional units |
| $T$ | Time periods | Panel time dimension |
| $k$ | Number of regressors | Columns of $X$ (including intercept) |
| $q$ | Number of restrictions | In $F$-test: $R\beta = r$ has $q$ rows |
| $Z$ | Instrument matrix | In IV/2SLS: $n \times \ell$ matrix, $\ell \geq k$ |
| $\Omega$ | Error covariance matrix | $\text{Var}(\varepsilon \mid X) = \sigma^2 \Omega$ in GLS |
| $\hat{\sigma}^2$ | Estimated error variance | $\hat{\sigma}^2 = \hat{\varepsilon}^T\hat{\varepsilon}/(n-k)$ |
| $\text{SSR}$ | Sum of squared residuals | $\hat{\varepsilon}^T\hat{\varepsilon}$ |
| $\text{SST}$ | Total sum of squares | $(y - \bar{y}\iota)^T(y - \bar{y}\iota) = y^TM_\iota y$ |
| $\text{SSE}$ | Explained sum of squares | $\text{SST} - \text{SSR}$ |
| $R^2$ | Coefficient of determination | $1 - \text{SSR}/\text{SST}$ |
| $\bar{R}^2$ | Adjusted $R^2$ | $1 - (1-R^2)(n-1)/(n-k)$ |
| $\text{AIC}$ | Akaike information criterion | $-2\ell + 2k$ |
| $\text{BIC}$ | Bayesian information criterion | $-2\ell + k\ln(n)$ |
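
Much of this table comes together in a few lines of NumPy. A simulated sketch (the data-generating process below is invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])  # incl. intercept
beta = np.array([1.0, 2.0, -0.5])        # invented "true" coefficients
y = X @ beta + rng.standard_normal(n)    # y = X beta + eps

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y             # OLS: (X'X)^{-1} X'y
P = X @ XtX_inv @ X.T                    # P_X: projection / hat matrix
M = np.eye(n) - P                        # M_X: annihilator / residual maker
resid = M @ y                            # OLS residuals, = y - X beta_hat
sigma2_hat = resid @ resid / (n - k)     # estimated error variance
ssr = resid @ resid                      # SSR
sst = np.sum((y - y.mean()) ** 2)        # SST
r2 = 1.0 - ssr / sst                     # R^2
```

The defining algebra is visible directly: $P_X$ is idempotent, $M_X X = 0$ (the annihilator kills the regressors), and $M_X y$ reproduces the residuals. In production code one would use `np.linalg.lstsq` rather than forming $(X^TX)^{-1}$ explicitly.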
---
*Back to [Appendix Overview](index.qmd) | Previous: [Common Proofs](common-proofs.qmd) | Next: [Formula Cheat Sheets](cheat-sheets.qmd)*