Important Probability Distributions
Normal, t, Chi-Squared, F, and Friends
1 Why Does This Matter?
All hypothesis testing depends on knowing the distribution of your test statistic. The t-test assumes a t-distribution, the chi-squared test a chi-squared distribution, the F-test an F-distribution. Without this, a p-value is just a number without meaning: you cannot say whether 0.03 is “small” without knowing the reference distribution.
Every time you type summary(lm(...)) in R and see a p-value, behind the scenes R is computing a tail probability of the t-distribution. Every time you run an F-test for joint significance, R uses the F-distribution. Understanding these distributions is not just theory; it is the basis of every number in your regression table.
2 1. Normal Distribution
The normal distribution is the most fundamental distribution in statistics. This is not because nature is always normal, but because the Central Limit Theorem guarantees that sample means are approximately normal for large samples.
A random variable \(X\) is normally distributed with mean \(\mu\) and variance \(\sigma^2\), written \(X \sim N(\mu, \sigma^2)\), if it has PDF:
\[f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad x \in \mathbb{R}\]
Standard Normal: \(Z \sim N(0,1)\) is the special case with \(\mu=0\), \(\sigma^2=1\).
The CDF of the standard normal is denoted \(\Phi(z) = P(Z \leq z) = \int_{-\infty}^z \frac{1}{\sqrt{2\pi}}e^{-t^2/2}dt\).
Standardization: if \(X \sim N(\mu, \sigma^2)\), then \(Z = \frac{X-\mu}{\sigma} \sim N(0,1)\).
2.1 Properties of the Normal Distribution
1. Symmetry: \(f(x) = f(2\mu - x)\); the distribution is symmetric about the mean \(\mu\).
2. 68-95-99.7 Rule: \[P(\mu - \sigma \leq X \leq \mu + \sigma) \approx 0.6827\] \[P(\mu - 2\sigma \leq X \leq \mu + 2\sigma) \approx 0.9545\] \[P(\mu - 3\sigma \leq X \leq \mu + 3\sigma) \approx 0.9973\]
3. Linear Combinations: If \(X_i \sim N(\mu_i, \sigma_i^2)\) are independent, then: \[\sum_{i=1}^n a_i X_i \sim N\left(\sum_{i=1}^n a_i\mu_i,\ \sum_{i=1}^n a_i^2\sigma_i^2\right)\]
This is essential for OLS: \(\hat{\beta} = (X^TX)^{-1}X^Ty\) is a linear combination of \(y\), so if \(\varepsilon \sim N(0, \sigma^2 I)\), then \(\hat{\beta} \sim N(\beta, \sigma^2(X^TX)^{-1})\).
4. MGF: \(M_X(t) = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right)\)
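The 68-95-99.7 coverage probabilities above follow directly from the standard normal CDF and are easy to verify in R:

```r
# Probability of falling within k standard deviations of the mean,
# computed from the standard normal CDF
cover <- function(k) pnorm(k) - pnorm(-k)
round(cover(1), 4)  # 0.6827
round(cover(2), 4)  # 0.9545
round(cover(3), 4)  # 0.9973
```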
2.2 Multivariate Normal
A vector \(\mathbf{X} = (X_1, \ldots, X_k)^T\) has a multivariate normal distribution, written \(\mathbf{X} \sim N(\boldsymbol{\mu}, \Sigma)\), if it has PDF:
\[f(\mathbf{x}) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)\]
where \(\boldsymbol{\mu} \in \mathbb{R}^k\) is the mean vector and \(\Sigma\) is a \(k \times k\) positive definite covariance matrix.
Key Properties of MVN:
- Marginals normal: \(X_i \sim N(\mu_i, \Sigma_{ii})\)
- Conditionals normal: \(X_1 | X_2 = x_2 \sim N(\mu_{1|2}, \Sigma_{1|2})\), where: \[\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)\] \[\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\]
- Linear combinations normal: \(A\mathbf{X} \sim N(A\boldsymbol{\mu}, A\Sigma A^T)\)
- Uncorrelated \(\Rightarrow\) independent (only within the MVN; not true in general!)
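For a concrete bivariate case (the numbers below are chosen purely for illustration), the conditional-distribution formulas reduce to scalar arithmetic:

```r
# X = (X1, X2) ~ N(mu, Sigma); condition on X2 = 1
mu <- c(0, 0)
Sigma <- matrix(c(1.0, 0.8,
                  0.8, 1.0), nrow = 2)
x2 <- 1
mu_cond  <- mu[1] + Sigma[1, 2] / Sigma[2, 2] * (x2 - mu[2])  # 0.8
var_cond <- Sigma[1, 1] - Sigma[1, 2]^2 / Sigma[2, 2]         # 1 - 0.64 = 0.36
c(mean = mu_cond, var = var_cond)
```

Note how conditioning on a correlated component shrinks the variance from 1 to 0.36.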
3 2. Chi-Squared Distribution
The chi-squared distribution arises naturally from squaring standard normals.
If \(Z_1, Z_2, \ldots, Z_k \sim N(0,1)\) iid, then: \[V = \sum_{i=1}^k Z_i^2 \sim \chi^2(k)\]
where \(k\) is the degrees of freedom.
Moments:
- \(E[V] = k\)
- \(\text{Var}(V) = 2k\)
PDF: \(f(v) = \frac{v^{k/2-1}e^{-v/2}}{2^{k/2}\Gamma(k/2)}\) for \(v > 0\).
3.1 Connection to the Sample Variance
This is one of the most important results in inferential statistics:
\[\frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1) \quad \text{if } X_i \sim N(\mu, \sigma^2) \text{ iid}\]
where \(S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2\).
Why \(n-1\) degrees of freedom? Because we lose 1 df when estimating \(\mu\) with \(\bar{X}\).
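A quick simulation sketch (the sample size and parameters below are arbitrary) confirms that \((n-1)S^2/\sigma^2\) behaves like \(\chi^2(n-1)\), with mean \(n-1\) and variance \(2(n-1)\):

```r
set.seed(123)
n <- 10; sigma2 <- 4
stat <- replicate(10000, {
  x <- rnorm(n, mean = 5, sd = sqrt(sigma2))
  (n - 1) * var(x) / sigma2        # should be chi-squared with n-1 = 9 df
})
c(mean = mean(stat), var = var(stat))   # near 9 and 18
```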
3.2 Additive Property
If \(V_1 \sim \chi^2(m)\) and \(V_2 \sim \chi^2(n)\) are independent, then: \[V_1 + V_2 \sim \chi^2(m+n)\]
Application: in regression, SST = SSR + SSE, and under normal errors these sums of squares decompose into chi-squared components.
Used in:
- Goodness-of-fit tests (Pearson chi-squared)
- Tests of independence (contingency tables)
- Lagrange Multiplier (LM) tests
- Likelihood Ratio (LR) tests
- Testing restrictions in econometric models
4 3. Student’s t-Distribution
The t-distribution arises when we do not know \(\sigma^2\) and must estimate it.
If \(Z \sim N(0,1)\) and \(V \sim \chi^2(k)\) are independent, then: \[T = \frac{Z}{\sqrt{V/k}} \sim t(k)\]
where \(k\) is the degrees of freedom.
Moments (for \(k > 2\)):
- \(E[T] = 0\)
- \(\text{Var}(T) = \frac{k}{k-2}\)
The tails are heavier than the normal's; the smaller \(k\), the heavier the tails.
4.1 Convergence to the Normal
\[t(k) \xrightarrow{d} N(0,1) \text{ as } k \to \infty\]
In practice: for \(k > 30\), the t-distribution is already very close to the normal; for \(k > 120\), they are nearly identical.
4.2 Connection to OLS
This is the critical connection to regression:
\[\frac{\hat{\beta}_j - \beta_j}{\hat{s.e.}(\hat{\beta}_j)} \sim t(n-k)\]
under the Gauss-Markov assumptions plus normally distributed errors (\(\varepsilon \sim N(0,\sigma^2 I)\)), where:
- \(n\) = number of observations
- \(k\) = number of parameters (including the intercept)
- \(n-k\) = residual degrees of freedom
Used for:
- Individual coefficient significance tests
- Confidence intervals for means with unknown variance
- Comparing two group means (Welch's t-test)
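As an illustration of the second use, a 95% confidence interval for a mean with unknown variance uses the t critical value rather than 1.96 (the data below are made up):

```r
x <- c(5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2, 4.7)
n <- length(x)
t_crit <- qt(0.975, df = n - 1)    # 2.365 for df = 7, wider than the normal's 1.96
ci <- mean(x) + c(-1, 1) * t_crit * sd(x) / sqrt(n)
ci
```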
5 4. F-Distribution
The F-distribution is the ratio of two independent chi-squared variables, each divided by its degrees of freedom. It is the distribution of F-statistics in regression.
If \(U \sim \chi^2(m)\) and \(V \sim \chi^2(n)\) are independent, then: \[F = \frac{U/m}{V/n} \sim F(m, n)\]
where \(m\) = numerator df and \(n\) = denominator df.
Mean: \(E[F] = \frac{n}{n-2}\) for \(n > 2\) (slightly above 1).
PDF: Right-skewed, defined for \(F > 0\).
5.1 F-test for Joint Hypotheses
For testing \(H_0: R\beta = r\) (linear restrictions) in the regression \(y = X\beta + \varepsilon\):
\[F = \frac{(R\hat{\beta}-r)^T\left[R(X^TX)^{-1}R^T\right]^{-1}(R\hat{\beta}-r)/m}{\hat{\sigma}^2} \sim F(m, n-k)\]
where:
- \(m\) = number of restrictions
- \(n-k\) = residual degrees of freedom
- \(\hat{\sigma}^2 = SSE/(n-k)\) = estimated error variance
5.2 The Relationship Between F and t
If \(T \sim t(n)\), then \(T^2 \sim F(1, n)\).
This means a two-sided t-test is equivalent to an F-test with 1 restriction.
When you test \(H_0: \beta_j = 0\) with a t-statistic, the F-statistic for the same restriction is exactly the square of the t-statistic, and the p-values are identical!
The "overall significance" F-test in a regression summary is the multi-restriction generalization: it tests all slope coefficients at once.
Used in:
- Joint hypothesis tests (\(\beta_2 = \beta_3 = 0\), etc.)
- ANOVA (comparing means across groups)
- Overall significance of a regression
- Chow test (structural break)
- Testing equality of variances
6 5. Other Important Distributions
6.1 Bernoulli and Binomial
Bernoulli: \(X \in \{0, 1\}\), \(P(X=1) = p\). The foundation of logistic regression. \[E[X] = p, \quad \text{Var}(X) = p(1-p)\]
Binomial: \(X = \sum_{i=1}^n X_i\) with \(X_i \sim \text{Bernoulli}(p)\) iid. \[P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}\] \[E[X] = np, \quad \text{Var}(X) = np(1-p)\]
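In R, dbinom and pbinom give these probabilities directly; n = 10 and p = 0.3 below are arbitrary values for illustration:

```r
dbinom(3, size = 10, prob = 0.3)    # P(X = 3) = choose(10, 3) * 0.3^3 * 0.7^7
pbinom(3, size = 10, prob = 0.3)    # P(X <= 3)
n <- 10; p <- 0.3
c(mean = n * p, var = n * p * (1 - p))   # 3 and 2.1
```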
6.2 Poisson Distribution
For count data: the number of events in a fixed interval.
\[P(X=k) = \frac{e^{-\lambda}\lambda^k}{k!}, \quad k = 0, 1, 2, \ldots\]
Key property: \(E[X] = \text{Var}(X) = \lambda\) (mean equals variance — useful for testing overdispersion!).
Used in: count regression models (Poisson regression), queueing theory, spatial point processes.
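The mean-variance equality is easy to check by simulation (the choice λ = 4 is arbitrary):

```r
set.seed(7)
x <- rpois(100000, lambda = 4)   # large Poisson sample
c(mean = mean(x), var = var(x))  # both should be near lambda = 4
```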
6.3 Exponential Distribution
Waiting time between Poisson events.
\[f(x) = \lambda e^{-\lambda x}, \quad x > 0\] \[E[X] = 1/\lambda, \quad \text{Var}(X) = 1/\lambda^2\]
Memoryless property: \(P(X > s+t | X > s) = P(X > t)\); the exponential is the only continuous distribution with this property.
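The memoryless property can be checked numerically with pexp; the rate and the values of s and t below are arbitrary:

```r
lambda <- 2; s <- 1; t <- 0.5
lhs <- pexp(s + t, rate = lambda, lower.tail = FALSE) /
       pexp(s,     rate = lambda, lower.tail = FALSE)   # P(X > s+t | X > s)
rhs <- pexp(t, rate = lambda, lower.tail = FALSE)       # P(X > t)
c(lhs, rhs)   # equal
```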
6.4 Beta Distribution
For modeling probabilities and proportions (values between 0 and 1).
\[f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}, \quad x \in (0,1)\]
Used as the conjugate prior for the binomial probability \(p\) in Bayesian statistics.
- \(\alpha = \beta = 1\): uniform prior
- Large \(\alpha, \beta\): strong prior belief near \(\alpha/(\alpha+\beta)\)
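A sketch of the conjugate update: with a Beta(α, β) prior and k successes in n Bernoulli trials, the posterior is Beta(α + k, β + n − k). The numbers below are illustrative:

```r
a0 <- 2; b0 <- 2          # prior Beta(2, 2)
k <- 7; n <- 10           # data: 7 successes in 10 trials
post_a <- a0 + k          # 9
post_b <- b0 + n - k      # 5
post_mean <- post_a / (post_a + post_b)   # 9/14
post_mean
```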
6.5 Gamma Distribution
A generalization of the exponential: the sum of \(k\) iid exponential variables is Gamma-distributed.
\[f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, \quad x > 0\] \[E[X] = \alpha/\beta, \quad \text{Var}(X) = \alpha/\beta^2\]
Note: \(\chi^2(k) = \text{Gamma}(k/2, 1/2)\) in the shape-rate parameterization. Chi-squared is a special case of the Gamma.
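R's dgamma uses exactly this shape-rate parameterization, so the identity can be checked by comparing the two densities on a grid:

```r
x <- seq(0.5, 10, by = 0.5)
k <- 6
all.equal(dchisq(x, df = k),
          dgamma(x, shape = k / 2, rate = 1 / 2))   # TRUE
```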
6.6 Dirichlet Distribution
A generalization of the Beta to probability vectors \((\pi_1, \ldots, \pi_K)\) with \(\sum \pi_k = 1\).
Used as the conjugate prior for multinomial probabilities in Bayesian statistics; the foundation of Latent Dirichlet Allocation (LDA) in topic modeling.
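Base R has no Dirichlet sampler, but one follows from the Gamma construction: normalizing independent Gamma(α_k, 1) draws yields a Dirichlet(α) vector. The helper name rdirichlet1 below is made up for this sketch:

```r
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)                      # components are positive and sum to 1
}
set.seed(99)
pi_draw <- rdirichlet1(c(2, 3, 5))
sum(pi_draw)                      # 1
```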
7 6. Worked Example 1: Full t-Test
Problem: salaries (in millions of rupiah) for 15 new employees:
28.5, 31.2, 25.8, 33.4, 29.1, 27.6, 30.8, 32.1, 26.4, 28.9, 31.5, 27.3, 29.8, 30.2, 28.7
Test \(H_0: \mu = 28\) vs \(H_1: \mu \neq 28\) at \(\alpha = 0.05\).
Step 1: Compute sample statistics. \[\bar{x} = \frac{1}{15}\sum x_i = \frac{441.3}{15} = 29.42\] \[s^2 = \frac{1}{14}\sum(x_i - \bar{x})^2 = \frac{64.94}{14} = 4.64, \quad s = 2.154\] \[SE = \frac{s}{\sqrt{n}} = \frac{2.154}{\sqrt{15}} = 0.556\]
Step 2: Compute the t-statistic. \[t = \frac{\bar{x} - \mu_0}{SE} = \frac{29.42 - 28}{0.556} = \frac{1.42}{0.556} = 2.554\]
Step 3: Compare with the critical value.
With \(df = n-1 = 14\) and \(\alpha/2 = 0.025\): \(t_{14, 0.025} = 2.145\).
\(|t| = 2.554 > 2.145\), so reject \(H_0\).
Step 4: Compute the p-value. \[p = 2 \times P(T_{14} > 2.554) \approx 2 \times 0.012 = 0.024\]
Conclusion: there is sufficient evidence to reject \(H_0\). The mean salary differs significantly from 28 million (\(t(14) = 2.55\), \(p \approx 0.024\)).
# Data
gaji <- c(28.5, 31.2, 25.8, 33.4, 29.1, 27.6, 30.8, 32.1,
26.4, 28.9, 31.5, 27.3, 29.8, 30.2, 28.7)
# Manual calculation
n <- length(gaji)
x_bar <- mean(gaji)
s <- sd(gaji)
SE <- s / sqrt(n)
t_stat <- (x_bar - 28) / SE
p_value <- 2 * pt(-abs(t_stat), df = n - 1)
cat(sprintf("n=%d, mean=%.3f, s=%.3f, SE=%.3f\n", n, x_bar, s, SE))
cat(sprintf("t-stat=%.3f, p-value=%.4f\n", t_stat, p_value))
# Or just use t.test()
t.test(gaji, mu = 28, alternative = "two.sided")
8 7. Worked Example 2: F-Test for Joint Hypotheses
Problem: regression \(y = \beta_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon\), \(n = 50\).
Test \(H_0: \beta_2 = \beta_3 = 0\) (the two variables are jointly insignificant).
Method 1: Restricted vs. Unrestricted SSE
- Unrestricted model (\(y = \beta_1 + \beta_2 x_2 + \beta_3 x_3\)): \(SSE_U = 245.3\), \(df = 47\)
- Restricted model (\(y = \beta_1\)): \(SSE_R = 312.8\), \(df = 49\)
\[F = \frac{(SSE_R - SSE_U)/m}{SSE_U/(n-k)} = \frac{(312.8 - 245.3)/2}{245.3/47} = \frac{67.5/2}{5.219} = \frac{33.75}{5.219} = 6.47\]
Critical value: \(F_{0.05}(2, 47) \approx 3.20\).
\(F = 6.47 > 3.20\), so reject \(H_0\). At least one of \(\beta_2, \beta_3\) is nonzero.
Method 2: Using R
set.seed(42)
n <- 50
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- 2 + 0.5*x2 + 0.8*x3 + rnorm(n)
# Unrestricted model
model_u <- lm(y ~ x2 + x3)
# Restricted model
model_r <- lm(y ~ 1)
# F-test via anova()
anova(model_r, model_u)
# Or via linearHypothesis in car package
library(car)
linearHypothesis(model_u, c("x2 = 0", "x3 = 0"))
# Manual F-statistic
SSE_U <- sum(resid(model_u)^2)
SSE_R <- sum(resid(model_r)^2)
m <- 2 # restrictions
k <- 3 # parameters in unrestricted (including intercept)
F_stat <- ((SSE_R - SSE_U)/m) / (SSE_U/(n-k))
p_val <- pf(F_stat, df1=m, df2=n-k, lower.tail=FALSE)
cat(sprintf("F = %.3f, p = %.4f\n", F_stat, p_val))
9 8. R Code: Working with Distributions
# ============================================================
# NORMAL DISTRIBUTION
# ============================================================
pnorm(1.96) # P(Z < 1.96) ≈ 0.975
pnorm(1.96) - pnorm(-1.96) # P(-1.96 < Z < 1.96) ≈ 0.95
qnorm(0.975) # 97.5th percentile ≈ 1.96
dnorm(0, mean=0, sd=1) # PDF at 0 = 1/sqrt(2*pi) ≈ 0.399
# Standardization
x <- 115; mu <- 100; sigma <- 15
z <- (x - mu) / sigma # z = 1.0
pnorm(z) # P(X < 115) = P(Z < 1)
# ============================================================
# t-DISTRIBUTION
# ============================================================
pt(-2, df=10) # P(T < -2) with 10 df
qt(0.975, df=10) # Critical value ≈ 2.228
pt(-2, df=10) * 2 # Two-sided p-value
# Comparing t vs normal critical values
cat("df=5:", qt(0.975, df=5), "\n") # 2.571
cat("df=30:", qt(0.975, df=30), "\n") # 2.042
cat("df=120:", qt(0.975, df=120), "\n")# 1.980
cat("Normal:", qnorm(0.975), "\n") # 1.960
# ============================================================
# CHI-SQUARED DISTRIBUTION
# ============================================================
pchisq(9.49, df=4) # P(X < 9.49) with 4 df ≈ 0.95
qchisq(0.95, df=4) # 95th percentile ≈ 9.49
dchisq(4, df=4) # PDF at x=4
# Simulate chi-squared from normals
n_sim <- 10000
set.seed(2024)
z1 <- rnorm(n_sim); z2 <- rnorm(n_sim); z3 <- rnorm(n_sim)
chi_sq_3 <- z1^2 + z2^2 + z3^2
hist(chi_sq_3, probability=TRUE, breaks=50,
main="Simulated Chi-sq(3)", xlab="x")
curve(dchisq(x, df=3), add=TRUE, col="red", lwd=2)
legend("topright", "Theoretical", col="red", lwd=2)
# ============================================================
# F-DISTRIBUTION
# ============================================================
pf(3.89, df1=2, df2=20) # P(F < 3.89) ≈ 0.963
qf(0.95, df1=2, df2=20) # 95th percentile ≈ 3.49
pf(3.89, df1=2, df2=20, lower.tail=FALSE) # Upper tail
# Verify: t^2 ~ F(1, n)
t_val <- 2.5
n_df <- 20
t_sq <- t_val^2
# P(T > 2.5 in two-sided test) should equal P(F > 6.25)
p_t_twosided <- 2 * pt(-abs(t_val), df=n_df)
p_f <- pf(t_sq, df1=1, df2=n_df, lower.tail=FALSE)
cat("p-value from t:", p_t_twosided, "\n") # Same!
cat("p-value from F:", p_f, "\n") # Same!
# ============================================================
# PLOTTING: Visualize how t approaches Normal
# ============================================================
curve(dnorm(x), from=-4, to=4, lwd=2, col="black",
main="t-distribution vs Normal", ylab="Density")
curve(dt(x, df=1), add=TRUE, col="blue", lwd=2, lty=2)
curve(dt(x, df=5), add=TRUE, col="green", lwd=2, lty=3)
curve(dt(x, df=30), add=TRUE, col="orange", lwd=2, lty=4)
legend("topright",
legend=c("N(0,1)", "t(1)", "t(5)", "t(30)"),
col=c("black","blue","green","orange"),
lwd=2, lty=1:4)
10 9. Connections to Real Applications
In spatial econometrics: Moran's I (a test for spatial autocorrelation) is asymptotically \(N(0,1)\) under the null hypothesis. You use it like an ordinary z-test.
In time series: the Ljung-Box test for autocorrelation in residuals uses \(\chi^2(m)\), where \(m\) is the number of lags tested. \(Q = n(n+2)\sum_{k=1}^m \frac{\hat{\rho}_k^2}{n-k} \xrightarrow{d} \chi^2(m)\).
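In R this is Box.test with type = "Ljung-Box". The sketch below feeds it white-noise "residuals", so the statistic should be unremarkable relative to \(\chi^2(10)\):

```r
set.seed(1)
e <- rnorm(200)                                 # white noise: no autocorrelation
bt <- Box.test(e, lag = 10, type = "Ljung-Box")
bt   # Q statistic is compared against chi-squared(10)
```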
In ML (model comparison via hypothesis testing): the Likelihood Ratio test, \(-2\log\Lambda \xrightarrow{d} \chi^2(m)\), is used to compare nested models (for example, logistic regression with and without a feature).
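For nested logistic regressions this LR test is available via anova with test = "Chisq"; the simulated data below (where x2 is truly irrelevant) are purely illustrative:

```r
set.seed(42)
n <- 300
x1 <- rnorm(n); x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x1))     # x2 plays no role in the truth
m0 <- glm(y ~ x1,      family = binomial)      # restricted model
m1 <- glm(y ~ x1 + x2, family = binomial)      # unrestricted model
anova(m0, m1, test = "Chisq")   # deviance difference vs chi-squared(1)
```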
Bottom line: knowing the null distribution is knowing your test. Before interpreting any p-value, ask: "which distribution was used to produce this p-value?"
11 10. Practice Problems
Problem 1: Standard normal probabilities.
- Compute \(P(Z > 1.96)\)
- Compute \(P(-2 < Z < 2)\)
- Compute \(P(Z < -1.645)\)
Answers:
- \(P(Z > 1.96) = 1 - \Phi(1.96) = 1 - 0.975 = 0.025\)
- \(P(-2 < Z < 2) = \Phi(2) - \Phi(-2) = 0.9772 - 0.0228 = 0.9545\)
- \(P(Z < -1.645) = 1 - \Phi(1.645) = 0.05\)
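These can be verified directly with pnorm:

```r
round(pnorm(1.96, lower.tail = FALSE), 4)   # 0.025
round(pnorm(2) - pnorm(-2), 4)              # 0.9545
round(pnorm(-1.645), 4)                     # 0.05
```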
Problem 2: Chi-squared moments.
If \(X \sim \chi^2(5)\):
- Compute \(E[X]\) and \(\text{Var}(X)\)
- Compute \(P(X > 11.07)\) (use R or a table)
- If \(Y \sim \chi^2(3)\) is independent of \(X\), what is the distribution of \(X + Y\)?
Answers:
- \(E[X] = k = 5\); \(\text{Var}(X) = 2k = 10\)
- \(P(X > 11.07) = 1 - \text{pchisq}(11.07, df=5) \approx 0.05\)
- \(X + Y \sim \chi^2(5+3) = \chi^2(8)\) (additive property)
Problem 3: Show \(T^2 \sim F(1,n)\).
Hint: \(T = Z/\sqrt{V/n}\) where \(Z \sim N(0,1)\) and \(V \sim \chi^2(n)\) are independent. Then: \[T^2 = \frac{Z^2}{V/n} = \frac{Z^2/1}{V/n}\]
\(Z^2 \sim \chi^2(1)\) (the square of a standard normal). So \(T^2 = \frac{\chi^2(1)/1}{\chi^2(n)/n} \sim F(1,n)\). QED.
Problem 4: Suppose a regression has \(n=100\) observations and \(k=5\) parameters. You want to test \(H_0: \beta_3 = \beta_4 = \beta_5 = 0\). The unrestricted model has \(SSE_U = 180\) and the restricted model has \(SSE_R = 210\).
- Compute the F-statistic
- What is the distribution under \(H_0\)?
- Is the result significant at \(\alpha = 0.05\)? (\(F_{0.05}(3, 95) \approx 2.70\))
Answer: \[F = \frac{(210-180)/3}{180/95} = \frac{10}{1.895} = 5.28 > 2.70 \Rightarrow \text{Reject } H_0\]
# Verify
pf(5.28, df1=3, df2=95, lower.tail=FALSE) # p ≈ 0.002