Mathematical Notation Guide

A Guide to Reading the Symbols That Make Your Head Spin

1 Overview

This guide documents the notation conventions used throughout this course. It is organized by domain so you can quickly locate the type of symbol you are looking for. Where this course follows a specific source, it is noted.

Primary references: Hayashi (2000) Econometrics; Bishop (2006) Pattern Recognition and Machine Learning.


2 Section 1: Scalars, Vectors, Matrices, and Variables

The typographic convention determines what kind of object a symbol represents. Learn this once — it applies everywhere.

Convention Object Type Examples
Lowercase italic Scalar (number) \(x\), \(\alpha\), \(n\), \(\lambda\), \(t\)
Lowercase bold Column vector \(\mathbf{x}\), \(\boldsymbol{\beta}\), \(\boldsymbol{\mu}\), \(\mathbf{w}\)
Uppercase italic or bold Matrix \(X\), \(A\), \(\Sigma\), \(\mathbf{W}\), \(\mathbf{H}\)
Uppercase italic Random variable \(X\), \(Y\), \(Z\), \(U\)
Lowercase italic Realization (observed value) \(x\), \(y\), \(z\) (same letter, lowercase)
Calligraphic Set, hypothesis class, space \(\mathcal{X}\), \(\mathcal{H}\), \(\mathcal{F}\), \(\mathcal{D}\)
Blackboard bold Number set, probability, expectation \(\mathbb{R}\), \(\mathbb{E}[X]\), \(\mathbb{P}(A)\)
Warning

Ambiguity alert: In econometrics, \(X\) often does double duty as the data matrix and as a random variable. Context determines the interpretation. When the distinction matters, the data matrix is written explicitly as a fixed \(n \times k\) matrix.

2.1 Indexing Conventions

Notation Meaning
\(x_i\) The \(i\)-th element of vector \(\mathbf{x}\)
\(X_{ij}\) or \(X_{i,j}\) Element in row \(i\), column \(j\) of matrix \(X\)
\(X_{i \cdot}\) or \(x_i^T\) The \(i\)-th row of \(X\) (as a row vector)
\(X_{\cdot j}\) The \(j\)-th column of \(X\) (as a column vector)
\(x_{it}\) Observation \(i\) at time \(t\) (panel data)
\(\beta_j\) The \(j\)-th element of parameter vector \(\boldsymbol\beta\)
\(\lambda_i(A)\) The \(i\)-th eigenvalue of matrix \(A\)
\(\sigma_i(A)\) The \(i\)-th singular value of matrix \(A\)
\(A_{[i:j, k:l]}\) Submatrix (rows \(i\) to \(j\), columns \(k\) to \(l\))
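
The indexing conventions above map directly onto NumPy, with one caveat: the math notation here is 1-indexed while NumPy is 0-indexed (and NumPy slices exclude the endpoint). A small sketch with a hypothetical 3 x 4 matrix:

```python
import numpy as np

# Hypothetical 3x4 matrix to illustrate the indexing table above.
# NOTE: numpy is 0-indexed; the math notation in this guide is 1-indexed.
X = np.arange(12, dtype=float).reshape(3, 4)

x_23 = X[1, 2]        # X_{2,3}: row 2, column 3 (math) -> X[1, 2] (numpy)
row_2 = X[1, :]       # X_{2.}: the 2nd row of X
col_3 = X[:, 2]       # X_{.3}: the 3rd column of X
sub = X[0:2, 1:3]     # A_{[1:2, 2:3]}: rows 1-2, cols 2-3 (math, inclusive)

A = np.array([[2.0, 0.0], [0.0, 3.0]])
eigvals = np.linalg.eigvalsh(A)   # lambda_i(A), ascending, for symmetric A
```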

3 Section 2: Linear Algebra Notation

3.1 Matrix Operations

Symbol Meaning Notes
\(A^T\) or \(A'\) Transpose \((A^T)_{ij} = A_{ji}\); Hayashi uses \('\)
\(A^{-1}\) Inverse Exists iff \(A\) is square and \(\det(A) \neq 0\)
\(A^{-T}\) or \((A^T)^{-1}\) Inverse transpose \((A^{-1})^T = (A^T)^{-1}\)
\(A^{1/2}\) Matrix square root For PSD \(A\): the unique PSD \(B\) s.t. \(B^2 = A\)
\(A^{-1/2}\) Inverse square root \((A^{1/2})^{-1}\); used in whitening transforms
\(A^+\) or \(A^\dagger\) Moore-Penrose pseudoinverse Generalizes inverse to non-square/singular matrices
\(\text{tr}(A)\) Trace \(\text{tr}(A) = \sum_i A_{ii} = \sum_i \lambda_i\)
\(\det(A)\) or \(|A|\) Determinant \(\det(A) = \prod_i \lambda_i\)
\(\text{rank}(A)\) Rank Dimension of column space
\(\text{diag}(a_1, \ldots, a_n)\) Diagonal matrix Square matrix with \(a_i\) on diagonal, 0 elsewhere
\(\text{diag}(A)\) Diagonal extraction Column vector of diagonal elements of \(A\)
\(A \otimes B\) Kronecker product Block matrix; for \(B \in \mathbb{R}^{q \times r}\): \((A \otimes B)_{(i-1)q+k,(j-1)r+l} = A_{ij}B_{kl}\)
\(\text{vec}(A)\) Vectorization Stack columns of \(A\) into a single vector
\(\text{vech}(A)\) Half-vectorization Stack lower triangle (for symmetric \(A\))
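
Several identities in this table are easy to verify numerically. A NumPy sketch (illustrative values, not course code):

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric, positive definite
B = np.array([[0.0, 2.0], [5.0, 1.0]])

# Trace equals the sum of eigenvalues; determinant equals their product.
lam = np.linalg.eigvalsh(A)
assert np.isclose(np.trace(A), lam.sum())
assert np.isclose(np.linalg.det(A), lam.prod())

# (A^{-1})^T == (A^T)^{-1}
assert np.allclose(np.linalg.inv(A).T, np.linalg.inv(A.T))

# Matrix square root of a PSD matrix via its eigendecomposition.
w, V = np.linalg.eigh(A)
A_half = V @ np.diag(np.sqrt(w)) @ V.T
assert np.allclose(A_half @ A_half, A)

# vec stacks columns; numpy flattens row-major by default, so use order="F".
vec_B = B.flatten(order="F")   # (0, 5, 2, 1)

# The pseudoinverse recovers the inverse for nonsingular square matrices.
assert np.allclose(np.linalg.pinv(A), np.linalg.inv(A))
```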

3.2 Subspaces

Symbol Meaning
\(C(A)\) or \(\text{col}(A)\) Column space (range) of \(A\)
\(N(A)\) or \(\ker(A)\) Null space (kernel) of \(A\): \(\{x : Ax = 0\}\)
\(\dim(\mathcal{V})\) Dimension of subspace \(\mathcal{V}\)
\(\mathcal{V}^\perp\) Orthogonal complement of \(\mathcal{V}\)
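
The rank-nullity relationship behind this table, \(\text{rank}(A) + \dim N(A) = \#\text{columns}\), can be checked with a small NumPy sketch (the rank-1 matrix is illustrative):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])   # rank 1: second row = 2 * first row

r = np.linalg.matrix_rank(A)      # dim C(A)

# Null space basis from the SVD: right singular vectors beyond the rank.
U, s, Vt = np.linalg.svd(A)
null_basis = Vt[r:].T             # columns span N(A)

assert r + null_basis.shape[1] == A.shape[1]   # rank-nullity
assert np.allclose(A @ null_basis, 0.0)        # A x = 0 on the null space
```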

3.3 Matrix Norms

Symbol Name Formula
\(\|A\|_F\) Frobenius norm \(\sqrt{\text{tr}(A^TA)} = \sqrt{\sum_{i,j} A_{ij}^2}\)
\(\|A\|_2\) or \(\|A\|\) Spectral norm (operator 2-norm) \(\sigma_{\max}(A)\), largest singular value
\(\|A\|_1\) Matrix 1-norm \(\max_j \sum_i |A_{ij}|\) (max column sum)
\(\|A\|_\infty\) Matrix \(\infty\)-norm \(\max_i \sum_j |A_{ij}|\) (max row sum)
\(\|A\|_*\) Nuclear norm \(\sum_i \sigma_i(A)\), sum of singular values
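
Each matrix norm above is available through NumPy's `ord` argument. A quick sketch with an illustrative 2 x 2 matrix:

```python
import numpy as np

A = np.array([[1.0, -2.0], [3.0, 4.0]])

fro = np.linalg.norm(A, "fro")        # sqrt(1 + 4 + 9 + 16) = sqrt(30)
spec = np.linalg.norm(A, 2)           # largest singular value
one = np.linalg.norm(A, 1)            # max column sum: max(1+3, 2+4) = 6
inf = np.linalg.norm(A, np.inf)       # max row sum: max(1+2, 3+4) = 7
nuc = np.linalg.norm(A, "nuc")        # sum of singular values

assert np.isclose(fro, np.sqrt(30.0))
assert np.isclose(one, 6.0) and np.isclose(inf, 7.0)
# The nuclear norm dominates the spectral norm (it sums all singular values).
assert nuc >= spec
```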

3.4 Positive Definiteness

Symbol Meaning Condition
\(A \succ 0\) Positive definite (PD) \(\mathbf{x}^TA\mathbf{x} > 0\) for all \(\mathbf{x} \neq 0\)
\(A \succeq 0\) Positive semidefinite (PSD) \(\mathbf{x}^TA\mathbf{x} \geq 0\) for all \(\mathbf{x}\)
\(A \prec 0\) Negative definite \(\mathbf{x}^TA\mathbf{x} < 0\) for all \(\mathbf{x} \neq 0\)
\(A \preceq 0\) Negative semidefinite \(\mathbf{x}^TA\mathbf{x} \leq 0\) for all \(\mathbf{x}\)
\(A \succeq B\) \(A - B \succeq 0\) Used in Gauss-Markov comparisons
\(\mathbb{S}^n_+\) PSD cone Set of all \(n \times n\) PSD matrices
\(\mathbb{S}^n_{++}\) PD cone Set of all \(n \times n\) PD matrices
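
In practice, \(A \succ 0\) is checked via the eigenvalues (all positive) or, equivalently, by whether a Cholesky factorization succeeds. A sketch with two illustrative matrices:

```python
import numpy as np

A = np.array([[2.0, -1.0], [-1.0, 2.0]])   # PD (eigenvalues 1 and 3)
B = np.array([[1.0, 2.0], [2.0, 1.0]])     # indefinite (eigenvalues 3 and -1)

def is_pd(M):
    """True if symmetric M is positive definite (Cholesky test)."""
    try:
        np.linalg.cholesky(M)
        return True
    except np.linalg.LinAlgError:
        return False

assert is_pd(A) and not is_pd(B)

# The quadratic-form definition: x^T A x > 0 for nonzero x.
rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.standard_normal(2)
    assert x @ A @ x > 0
```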

4 Section 3: Probability and Statistics Notation

4.1 Basic Probability

Symbol Meaning Notes
\(P(A)\) or \(\Pr(A)\) Probability of event \(A\) \(P: \mathcal{F} \to [0,1]\)
\(p(x)\) or \(f(x)\) Probability mass/density function PMF (discrete) or PDF (continuous)
\(F(x)\) Cumulative distribution function \(F(x) = P(X \leq x)\)
\(F^{-1}(q)\) Quantile function Inverse CDF; the \(q\)-th quantile
\(P(A|B)\) Conditional probability \(P(A|B) = P(A \cap B)/P(B)\)
\(p(x|y)\) Conditional density Density of \(X\) given \(Y = y\)
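
The PDF, CDF, and quantile function correspond to `pdf`, `cdf`, and `ppf` in scipy.stats (ppf, "percent point function", is SciPy's name for \(F^{-1}\)). A standard-normal sketch:

```python
from scipy.stats import norm

x = 1.96
pdf_val = norm.pdf(x)   # phi(1.96), the density
cdf_val = norm.cdf(x)   # Phi(1.96), approximately 0.975
q = norm.ppf(0.975)     # F^{-1}(0.975), approximately 1.96

assert abs(cdf_val - 0.975) < 1e-3
assert abs(q - 1.96) < 1e-2   # ppf inverts cdf
```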

4.2 Moments and Parameters

Symbol Meaning Alternative
\(E[X]\) or \(\mathbb{E}[X]\) Expected value \(\mu\), \(\mu_X\)
\(E[X|Y]\) Conditional expectation Function of \(Y\)
\(\text{Var}(X)\) Variance \(\sigma^2\), \(\sigma_X^2\)
\(\text{Cov}(X,Y)\) Covariance \(\sigma_{XY}\)
\(\text{Cor}(X,Y)\) Correlation \(\rho_{XY} = \text{Cov}(X,Y)/(\sigma_X\sigma_Y)\)
\(M_X(t)\) Moment generating function \(E[e^{tX}]\)
\(\varphi_X(t)\) Characteristic function \(E[e^{itX}]\)

4.3 Distributional Notation

Symbol Meaning
\(X \sim P\) \(X\) is distributed according to distribution \(P\)
\(X \sim N(\mu, \sigma^2)\) \(X\) is normally distributed
\(X \sim N(\boldsymbol\mu, \Sigma)\) \(X\) is multivariate normal
\(X \stackrel{d}{=} Y\) \(X\) and \(Y\) are equal in distribution
\(X \stackrel{d}{\to} Y\) \(X\) converges in distribution to \(Y\)
\(X \perp Y\) \(X\) and \(Y\) are independent
\(X \perp Y \mid Z\) \(X\) and \(Y\) are conditionally independent given \(Z\)
\(X \mid Y = y\) The random variable \(X\) conditioned on \(Y = y\)

4.4 Key Distributions (Notation Summary)

Symbol Distribution
\(\phi(x)\) Standard normal PDF: \(\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\)
\(\Phi(x)\) Standard normal CDF: \(\int_{-\infty}^x \phi(t)\,dt\)
\(N(\mu, \sigma^2)\) Normal with mean \(\mu\), variance \(\sigma^2\)
\(N(\boldsymbol\mu, \Sigma)\) Multivariate normal
\(\chi^2(k)\) or \(\chi^2_k\) Chi-squared with \(k\) degrees of freedom
\(t(k)\) or \(t_k\) Student’s \(t\) with \(k\) degrees of freedom
\(F(m,n)\) or \(F_{m,n}\) \(F\)-distribution with \(m, n\) degrees of freedom
\(\text{Bernoulli}(p)\) Bernoulli with success probability \(p\)
\(\text{Bin}(n,p)\) Binomial
\(\text{Poisson}(\lambda)\) Poisson with rate \(\lambda\)
\(\text{Exp}(\lambda)\) Exponential with rate \(\lambda\)
\(\text{Gamma}(\alpha, \beta)\) Gamma (shape \(\alpha\), rate \(\beta\))
\(\text{Beta}(\alpha, \beta)\) Beta distribution

5 Section 4: Convergence Notation

Essential for asymptotics in econometrics.

Symbol Name Meaning
\(X_n \xrightarrow{p} X\) Convergence in probability \(P(|X_n - X| > \varepsilon) \to 0\) for all \(\varepsilon > 0\)
\(X_n \xrightarrow{d} X\) Convergence in distribution \(F_{X_n}(x) \to F_X(x)\) at continuity points
\(X_n \xrightarrow{a.s.} X\) Almost sure convergence \(P(\lim_{n\to\infty} X_n = X) = 1\)
\(X_n \xrightarrow{L^2} X\) Mean square convergence \(E[(X_n - X)^2] \to 0\)
\(X_n \xrightarrow{L^p} X\) \(L^p\) convergence \(E[|X_n - X|^p] \to 0\)
\(\text{plim}_{n\to\infty} X_n = X\) Probability limit Same as \(\xrightarrow{p}\); Hayashi notation
\(o_p(1)\) Converges to 0 in prob. \(X_n = o_p(1)\) means \(X_n \xrightarrow{p} 0\)
\(o_p(a_n)\) Little-o in probability \(X_n / a_n \xrightarrow{p} 0\)
\(O_p(1)\) Bounded in probability \(\forall \varepsilon > 0\), \(\exists M < \infty\): \(\limsup_n P(|X_n| > M) < \varepsilon\)
\(O_p(a_n)\) Big-O in probability \(X_n / a_n = O_p(1)\)
\(\sqrt{n}\)-consistent Rate of convergence \(\sqrt{n}(\hat{\theta} - \theta) = O_p(1)\); standard for MLE/OLS

Hierarchy (stronger to weaker): a.s. \(\Rightarrow\) \(p\) \(\Rightarrow\) \(d\), and \(L^2\) \(\Rightarrow\) \(p\) \(\Rightarrow\) \(d\). Note that a.s. convergence and \(L^2\) convergence do not imply each other.
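
These concepts can be seen by simulation. A Monte Carlo sketch (illustrative iid Exp(1) example, not from the course): the sample mean satisfies \(\bar{X}_n \xrightarrow{p} 1\), and \(\sqrt{n}(\bar{X}_n - 1)\) is \(O_p(1)\) with variance near 1 by the CLT.

```python
import numpy as np

rng = np.random.default_rng(42)
reps = 500

def sample_means(n):
    """reps sample means, each from n iid Exp(1) draws (E[X] = Var(X) = 1)."""
    return rng.exponential(1.0, size=(reps, n)).mean(axis=1)

# Convergence in probability: P(|X_bar_n - 1| > 0.1) shrinks as n grows.
p_small = np.mean(np.abs(sample_means(100) - 1.0) > 0.1)
p_large = np.mean(np.abs(sample_means(10_000) - 1.0) > 0.1)
assert p_large < p_small

# sqrt(n)-scaled deviations stay bounded in probability, variance near 1.
n = 10_000
z = np.sqrt(n) * (sample_means(n) - 1.0)
assert 0.8 < z.var() < 1.2
```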


6 Section 5: Calculus and Optimization Notation

6.1 Differential Calculus

Symbol Meaning Notes
\(\nabla f(\mathbf{x})\) Gradient Column vector: \((\partial f/\partial x_1, \ldots, \partial f/\partial x_n)^T\)
\(\nabla^2 f(\mathbf{x})\) or \(H_f(\mathbf{x})\) Hessian matrix \((H)_{ij} = \partial^2 f / \partial x_i \partial x_j\)
\(\frac{\partial f}{\partial \mathbf{x}}\) Jacobian (row gradient) Row vector \((\partial f/\partial x_1, \ldots, \partial f/\partial x_n)\); convention varies
\(Df(\mathbf{x})\) Jacobian (general) \(m \times n\) matrix for \(f: \mathbb{R}^n \to \mathbb{R}^m\)
\(\frac{\partial f}{\partial X}\) Matrix derivative Matrix of same shape as \(X\)
\(\frac{d}{dx}\) Total derivative For scalar functions of a scalar
\(\frac{\partial}{\partial x}\) Partial derivative Treating all other variables as constant
\(\dot{x}\) or \(x'\) Time derivative \(dx/dt\) in dynamic systems
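
The gradient definition above is exactly what a finite-difference check computes. A sketch for the illustrative quadratic \(f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^T A \mathbf{x}\) with symmetric \(A\), where \(\nabla f = A\mathbf{x}\) and the Hessian is \(A\):

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric, so grad f = A x
f = lambda x: 0.5 * x @ A @ x
x0 = np.array([1.0, -1.0])

grad_analytic = A @ x0   # (2, -1)

# Central finite differences for each partial derivative.
h = 1e-6
grad_fd = np.array([
    (f(x0 + h * e) - f(x0 - h * e)) / (2 * h)
    for e in np.eye(2)
])
assert np.allclose(grad_analytic, grad_fd, atol=1e-4)
```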

6.2 Optimization

Symbol Meaning
\(\text{argmin}_{x} f(x)\) Minimizer: \(x^* = \text{argmin}_x f(x)\) means \(f(x^*) \leq f(x)\) for all feasible \(x\)
\(\text{argmax}_{x} f(x)\) Maximizer
\(f^* = \min_x f(x)\) Minimum value of \(f\)
\(x^*\) Optimal solution
\(\mathcal{L}(x, \lambda)\) Lagrangian: \(f(x) - \lambda^T g(x)\) (sign convention varies by source)
\(\lambda^*\) Optimal Lagrange multiplier
\(g(x) \leq 0\) Inequality constraint
\(h(x) = 0\) Equality constraint
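
The argmin/min distinction in code, on an illustrative unconstrained problem (scipy.optimize, not course code): `res.x` is \(x^* = \text{argmin}_x f(x)\) and `res.fun` is \(f^* = \min_x f(x)\).

```python
import numpy as np
from scipy.optimize import minimize

# f(x) = (x1 - 1)^2 + (x2 + 2)^2, minimized at x* = (1, -2) with f* = 0.
f = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2
res = minimize(f, x0=np.zeros(2))

x_star = res.x    # argmin_x f(x), the minimizer
f_star = res.fun  # min_x f(x), the minimum value

assert np.allclose(x_star, [1.0, -2.0], atol=1e-4)
assert f_star < 1e-8
```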

6.3 Vector Norms

Symbol Name Formula
\(\|\mathbf{x}\|\) or \(\|\mathbf{x}\|_2\) Euclidean / \(L^2\) norm \(\sqrt{\sum_i x_i^2}\)
\(\|\mathbf{x}\|_1\) \(L^1\) norm / Manhattan \(\sum_i |x_i|\)
\(\|\mathbf{x}\|_p\) \(L^p\) norm \(\left(\sum_i |x_i|^p\right)^{1/p}\)
\(\|\mathbf{x}\|_\infty\) \(L^\infty\) / Chebyshev norm \(\max_i |x_i|\)
\(\|\mathbf{x}\|_A\) \(A\)-weighted norm \(\sqrt{\mathbf{x}^T A \mathbf{x}}\) (for \(A \succ 0\))
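
The vector norms above in NumPy, including the \(A\)-weighted norm for a hypothetical PD weight matrix:

```python
import numpy as np

x = np.array([3.0, -4.0])

l2 = np.linalg.norm(x)            # sqrt(9 + 16) = 5
l1 = np.linalg.norm(x, 1)         # 3 + 4 = 7
linf = np.linalg.norm(x, np.inf)  # max(3, 4) = 4

# A-weighted norm, A positive definite (illustrative diagonal weights).
A = np.array([[2.0, 0.0], [0.0, 1.0]])
xA = np.sqrt(x @ A @ x)           # sqrt(2*9 + 16) = sqrt(34)

assert l2 == 5.0 and l1 == 7.0 and linf == 4.0
assert np.isclose(xA, np.sqrt(34.0))
```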

7 Section 6: Machine Learning Specific Notation

Symbol Meaning Notes
\(\mathcal{L}(\theta)\) or \(L(\theta)\) Loss function Sometimes \(\ell\) for log-likelihood
\(\mathcal{L}(\theta; \mathcal{D})\) Loss given dataset \(\mathcal{D}\) Explicit data dependence
\(\hat{y}\) Predicted value \(\hat{y} = f(\mathbf{x}; \theta)\)
\(\mathbb{1}[\cdot]\) or \(\mathbf{1}[\cdot]\) Indicator function \(\mathbb{1}[A] = 1\) if \(A\) true, 0 otherwise
\(W^{[l]}\) Weight matrix of layer \(l\) Superscript in brackets for layer index
\(b^{[l]}\) Bias vector of layer \(l\)
\(a^{[l]}\) Activation of layer \(l\) \(a^{[l]} = \sigma(W^{[l]}a^{[l-1]} + b^{[l]})\)
\(\odot\) Hadamard (element-wise) product \((A \odot B)_{ij} = A_{ij} B_{ij}\)
\(\sigma(\cdot)\) Sigmoid function \(\sigma(x) = 1/(1+e^{-x})\); also used for std dev
\(D_{KL}(P \| Q)\) KL divergence from \(Q\) to \(P\) \(\sum_x p(x)\log(p(x)/q(x))\)
\(H(P)\) Entropy of \(P\) \(-\sum_x p(x)\log p(x)\)
\(H(P, Q)\) Cross-entropy \(H(P) + D_{KL}(P\|Q)\)
\(I(X; Y)\) Mutual information \(H(X) - H(X|Y)\)
\(\mathcal{D}\) Dataset \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n\)
\(\mathcal{H}\) Hypothesis class Set of candidate functions
\(f_\theta\) Parameterized function Neural network / model with parameters \(\theta\)
\(\phi(\mathbf{x})\) Feature map \(\phi: \mathcal{X} \to \mathcal{F}\); used in kernel methods
\(k(\mathbf{x}, \mathbf{x}')\) Kernel function \(k(\mathbf{x}, \mathbf{x}') = \langle\phi(\mathbf{x}), \phi(\mathbf{x}')\rangle\)
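
The information-theoretic identities in this table can be checked directly for small discrete distributions (illustrative values):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.25, 0.25, 0.5])

H_p = -np.sum(p * np.log(p))        # entropy H(P)
D_kl = np.sum(p * np.log(p / q))    # D_KL(P || Q), always >= 0
H_pq = -np.sum(p * np.log(q))       # cross-entropy H(P, Q)

# Identity from the table: H(P, Q) = H(P) + D_KL(P || Q).
assert np.isclose(H_pq, H_p + D_kl)
assert D_kl >= 0.0

# Sigmoid and Hadamard product from the table.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
assert sigmoid(0.0) == 0.5
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[2.0, 0.5], [1.0, 0.25]])
assert np.allclose(A * B, [[2.0, 1.0], [3.0, 1.0]])  # A ⊙ B, elementwise
```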

8 Section 7: Econometrics Specific Notation

Symbol Meaning Notes
\(\hat{\beta}\) OLS/estimator of \(\beta\) Hat = estimated from data
\(\tilde{\beta}\) Some other estimator of \(\beta\) E.g., GLS, IV, restricted estimator
\(\bar{\beta}\) Infeasible/pseudo estimator Rarely used
\(\beta_0\) True parameter value Population value under the DGP
\(\text{plim}\) Probability limit \(\text{plim}_{n\to\infty}\hat{\beta}\) as \(n \to \infty\)
\(\mathbf{1}_n\) or \(\iota_n\) \(n\)-vector of ones \(\iota_n = (1, 1, \ldots, 1)^T \in \mathbb{R}^n\)
\(M_X\) Annihilator / residual maker \(M_X = I_n - X(X^TX)^{-1}X^T\); \(M_X y = \hat{\varepsilon}\)
\(P_X\) Projection / hat matrix \(P_X = X(X^TX)^{-1}X^T\); \(P_X y = \hat{y}\)
\(H\) Hat matrix Same as \(P_X\); \(H_{ii}\): leverage of observation \(i\)
\(\hat{\varepsilon}\) or \(\hat{u}\) OLS residuals \(\hat{\varepsilon} = y - X\hat{\beta} = M_X y\)
\(\varepsilon\) or \(u\) True disturbances \(y = X\beta + \varepsilon\); unobserved
\(\tilde{x}_{it}\) Within-demeaned variable \(\tilde{x}_{it} = x_{it} - \bar{x}_i\) (panel FE)
\(\bar{x}_i\) Within-group mean \(\bar{x}_i = T^{-1}\sum_t x_{it}\)
\(n\) Number of observations Cross-sectional units
\(T\) Time periods Panel time dimension
\(k\) Number of regressors Columns of \(X\) (including intercept)
\(q\) Number of restrictions In \(F\)-test: \(R\beta = r\) has \(q\) rows
\(Z\) Instrument matrix In IV/2SLS: \(n \times \ell\) matrix, \(\ell \geq k\)
\(\Omega\) Error covariance matrix \(\text{Var}(\varepsilon | X) = \sigma^2 \Omega\) in GLS
\(\hat{\sigma}^2\) Estimated error variance \(\hat{\sigma}^2 = \hat{\varepsilon}^T\hat{\varepsilon}/(n-k)\)
\(\text{SSR}\) Sum of squared residuals \(\hat{\varepsilon}^T\hat{\varepsilon}\)
\(\text{SST}\) Total sum of squares \((y - \bar{y}\iota)^T(y - \bar{y}\iota) = y^TM_\iota y\)
\(\text{SSE}\) Explained sum of squares \(\text{SST} - \text{SSR}\); caution: some texts use SSE for the sum of squared errors (i.e., SSR)
\(R^2\) Coefficient of determination \(1 - \text{SSR}/\text{SST}\)
\(\bar{R}^2\) Adjusted \(R^2\) \(1 - (1-R^2)(n-1)/(n-k)\)
\(\text{AIC}\) Akaike information criterion \(-2\ell + 2k\)
\(\text{BIC}\) Bayesian information criterion \(-2\ell + k\ln(n)\)
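
A minimal OLS simulation (hypothetical DGP, not from the course materials) ties together \(\hat{\beta}\), \(P_X\), \(M_X\), the residuals, \(\hat{\sigma}^2\), and \(R^2\) from this table:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.standard_normal(n)   # y = X beta + eps

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # OLS estimator
P = X @ np.linalg.inv(X.T @ X) @ X.T          # projection (hat) matrix P_X
M = np.eye(n) - P                             # annihilator M_X
e_hat = y - X @ beta_hat                      # residuals

# M_X y gives the residuals; P_X y gives the fitted values.
assert np.allclose(M @ y, e_hat)
assert np.allclose(P @ y, X @ beta_hat)
# P_X and M_X are idempotent and mutually orthogonal.
assert np.allclose(P @ P, P) and np.allclose(P @ M, 0.0, atol=1e-10)

sigma2_hat = (e_hat @ e_hat) / (n - k)        # estimated error variance
SSR = e_hat @ e_hat
SST = (y - y.mean()) @ (y - y.mean())
R2 = 1.0 - SSR / SST
assert 0.0 <= R2 <= 1.0
```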
